Converting (very large) SQL aggregate data into matrix form for output

Overview

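I have a huge set of aggregate data and want to convert it into matrix form and write it out.
- Turned into a matrix, the aggregate data is a "sparse" matrix in which almost every value is zero.
I first implemented this by loading everything into a Python list, but ran out of memory orz..
- Switching to a dictionary solved it!!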

Input data

A set of data stored in PostgreSQL, exported to CSV.
Each line is "row ID, column ID, weight", and entries whose weight is zero are omitted.
Row IDs and column IDs both start at 1.

Data excerpt:

1,7,1
1,12,1
1,13,1
…
2,31,1
2,35,1
2,45,1
2,50,1
…
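
As a minimal sketch of how lines in this format can be parsed (written for Python 3; the function name `read_triplets` and the policy of silently skipping short lines are assumptions made for illustration, not part of the original script):

```python
import csv
import io


def read_triplets(lines):
    """Yield (row_id, col_id, weight) from 'row ID, column ID, weight' CSV lines."""
    for fields in csv.reader(lines):
        if len(fields) < 3:
            # Assumed policy: skip blank or short lines instead of failing.
            continue
        yield int(fields[0]), int(fields[1]), int(fields[2])


# A few lines taken from the excerpt above.
sample = io.StringIO("1,7,1\n1,12,1\n1,13,1\n2,31,1\n")
triplets = list(read_triplets(sample))
print(triplets)
```

Because the function is a generator, it never holds more than one input line at a time; what happens to the triplets afterwards decides the memory cost.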

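Aggregating in Python ( list version ) → failed (ノД`)

I wrote an aggregation script in Python.
- It aggregates into a list and then writes the list out.
It was killed for lack of memory during the very first initialization orz...

Source code excerpt (failed version) below: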

import csv

# ( function call omitted )

def create_matrix(in_filename, out_filename, row_size, col_size):

    # Initialize the array: a header row of column numbers, then one row per
    # row ID holding the row number followed by col_size zeros.
    # This allocates every cell up front, which is where memory runs out.
    print("Create matrix memory...")
    matrix = [list(range(col_size + 1))] + [[y + 1] + [0] * col_size for y in range(row_size)]

    # Load the data
    print("Read matrix data file...")
    with open(in_filename, newline='') as f_in:
        reader = csv.reader(f_in)

        for row_index, row in enumerate(reader):
            # Skip lines that do not have all three fields
            if len(row) < 3:
                print("Warning: Row {0} has too few columns ({1}). skip...".format(row_index + 1, len(row)))
                continue

            object_id = int(row[0])
            infon_id = int(row[1])
            weight = int(row[2])

            matrix[object_id][infon_id] = weight

    # Write the data
    print("Write matrix to file...")
    with open(out_filename, 'w', newline='') as f_out:
        writer = csv.writer(f_out, lineterminator='\n')

        for row in matrix:
            writer.writerow(row)
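
To get a feel for why the initialization alone can exhaust memory, here is a rough estimate of the size of the list of lists (a sketch for Python 3 on CPython; the function name `estimate_dense_bytes` and the sizes 100,000 x 50,000 are hypothetical, since the real dimensions are not given in this article):

```python
import sys


def estimate_dense_bytes(row_size, col_size):
    """Rough lower bound on the memory of a (row_size + 1) x (col_size + 1) list of lists."""
    # Measure one real row; a CPython list stores one pointer per element,
    # so this counts the pointers but not the outer list or the int objects.
    one_row = sys.getsizeof([0] * (col_size + 1))
    return one_row * (row_size + 1)


# Hypothetical dimensions, chosen only for illustration.
rows, cols = 100_000, 50_000
print("{0:.1f} GiB".format(estimate_dense_bytes(rows, cols) / 2**30))
```

With dimensions like these the estimate comes out in the tens of GiB on a 64-bit build, even though almost every cell is zero.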

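Aggregating in Python ( dictionary version ) → success ＼（＾＾）／

I dropped the list initialization and changed the matrix variable to a dictionary.
- Since the zero values are no longer stored, this saves a large amount of memory.
The write speed, however, is another story...

Source code excerpt (working version) below: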

import csv

# ( function call omitted )

def create_matrix(in_filename, out_filename, row_size, col_size):

    # Load the data, keeping only the non-zero cells:
    # matrix[row ID][column ID] = weight
    print("Read matrix data file...")
    matrix = {}
    with open(in_filename, newline='') as f_in:
        reader = csv.reader(f_in)

        for row_index, row in enumerate(reader):
            # Skip lines that do not have all three fields
            if len(row) < 3:
                print("Warning: Row {0} has too few columns ({1}). skip...".format(row_index + 1, len(row)))
                continue

            object_id = int(row[0])
            infon_id = int(row[1])
            weight = int(row[2])

            if object_id not in matrix:
                matrix[object_id] = {}

            matrix[object_id][infon_id] = weight

    # Write the data
    print("Write matrix to file...")
    with open(out_filename, 'w', newline='') as f_out:
        writer = csv.writer(f_out, lineterminator='\n')

        # Header row: column numbers 0 .. col_size
        header = list(range(col_size + 1))
        writer.writerow(header)

        # One output row at a time; cells missing from the dictionary are zero
        for object_id in range(1, row_size + 1):
            row = [object_id]
            for infon_id in range(1, col_size + 1):
                if object_id in matrix and infon_id in matrix[object_id]:
                    row.append(matrix[object_id][infon_id])
                else:
                    row.append(0)
            writer.writerow(row)
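
One possible way to ease the slow write step: the loop above tests every single cell against the dictionary, although almost all of them are zero. The sketch below (Python 3) instead starts each row as all zeros and overwrites only the cells that are actually stored. The helper name `build_row` is an assumption for illustration, and it assumes every column ID lies between 1 and col_size; I have not measured it against the real data.

```python
def build_row(matrix, object_id, col_size):
    """Return one output row: the row ID followed by col_size weights."""
    row = [0] * (col_size + 1)
    row[0] = object_id
    # Touch only the non-zero cells stored for this row,
    # instead of looking up all col_size columns.
    for infon_id, weight in matrix.get(object_id, {}).items():
        row[infon_id] = weight
    return row


matrix = {1: {2: 1, 4: 3}}      # toy data, not taken from the real data set
print(build_row(matrix, 1, 5))  # [1, 0, 1, 0, 3, 0]
print(build_row(matrix, 2, 5))  # [2, 0, 0, 0, 0, 0]
```

Each returned row can be passed straight to `writer.writerow()`, so the rest of the script would stay the same.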

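Output

Line 1 is the header, holding the column numbers.
From line 2 on, the first column is the row number and the remaining columns are the data. Almost all of it is zero.

Data excerpt: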

0,1,2,3,4,5,6,…
1,0,0,0,0,0,0,…
2,0,0,1,0,0,0,…
…
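
Since each output line depends on a single row ID only, the same format could also be produced without keeping the whole matrix in memory. The following is a sketch under one big assumption: the input CSV is sorted by row ID (the input excerpt looks that way, but nothing in this article guarantees it). The function name `stream_matrix` is hypothetical, and the code is written for Python 3.

```python
import csv
import io
from itertools import groupby


def stream_matrix(in_file, out_file, row_size, col_size):
    """Write the matrix CSV while holding only one row in memory."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(range(col_size + 1))  # header: 0, 1, ..., col_size

    triplets = (
        (int(f[0]), int(f[1]), int(f[2]))
        for f in csv.reader(in_file)
        if len(f) >= 3
    )
    next_id = 1
    # ASSUMPTION: input sorted by row ID, so each row ID is one contiguous group.
    for object_id, group in groupby(triplets, key=lambda t: t[0]):
        # Row IDs with no entries at all become all-zero rows.
        for missing_id in range(next_id, object_id):
            writer.writerow([missing_id] + [0] * col_size)
        row = [object_id] + [0] * col_size
        for _, infon_id, weight in group:
            row[infon_id] = weight
        writer.writerow(row)
        next_id = object_id + 1
    # Trailing row IDs after the last one present in the input.
    for missing_id in range(next_id, row_size + 1):
        writer.writerow([missing_id] + [0] * col_size)


src = io.StringIO("1,2,1\n1,4,1\n3,1,1\n")  # toy input, not real data
dst = io.StringIO()
stream_matrix(src, dst, row_size=3, col_size=4)
print(dst.getvalue())
```

If the input is not sorted, this would emit duplicate or misplaced rows, so the dictionary version remains the safer choice in that case.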

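Summary

So the lesson is: "when handling huge data in Python, don't use lists." More precisely, don't allocate a list cell for every zero of a sparse matrix; store only the non-zero values.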

雑食性雑感雑記

A place for organizing knowledge. I rebuild the knowledge I have accumulated into blog articles.
