Python データパイプライン設計パターン

## はじめに大量データを効率的に処理するパイプラインの設計パターン。 ## Generator パターン ```python def read_csv(filepath): with open(filepath) as f: reader = csv.DictReader(f) for row in reader: yield row def transform(rows): for row in rows: row["price"] = int(row["price"]) * 1.1 yield row def load(rows, db): batch = [] for row in rows: batch.append(row) if len(batch) >= 1000: db.bulk_insert(batch) batch = [] if batch: db.bulk_insert(batch) # パイプライン実行 rows = read_csv("data.csv") transformed = transform(rows) load(transformed, db) ``` メモリ効率が良い。100万行でもメモリ消費は一定。

Comments (0)

No comments yet. Be the first to share your thoughts.