The problem with combining read_gbq and Notebooks

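I often load data from BigQuery and then analyze or preprocess it in a Jupyter Notebook. When the code contains read_gbq, every run reloads the same heavy dataset, which takes a long time.

Until now I worked around this by caching the data manually. This time I implement a read_gbq_cache function that caches automatically.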

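Query normalization function

The cache key is tied to the query and the project ID.
Whitespace and line breaks in a query are likely to vary from one run to the next, so the query is normalized before hashing.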

import hashlib
import re

def format_query(q):
    # Collapse line breaks and runs of whitespace into single spaces
    q = re.sub(r"\n", " ", q)
    q = re.sub(r"\s+", " ", q)
    # Trim the leading and trailing space
    q = re.sub(r"^\s|\s$", "", q)
    # Lowercase everything (note: this also lowercases string literals)
    q = q.lower()
    return q

q = """
SELECT
  *
FROM
  omotai_table
"""
project_id = "test"

# Seed string for the hash
seed_str = format_query(q) + " project_id:" + project_id
print(seed_str)
# Output: select * from omotai_table project_id:test

# Hash the seed string
key = hashlib.sha256(seed_str.encode()).hexdigest()
print(key)
# Output: f550019a5f6cbf81fbefcf19066e6f7616387faabb6b69104666f9c7261cb451
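To illustrate why the normalization matters, here is a small sketch that checks whether two differently formatted versions of the same query end up with the same key. The helper name cache_key is an assumption introduced only for this illustration; it builds the same seed string as the example above. The last check shows a side effect worth knowing about: because the whole query is lowercased, two queries that differ only in the case of a string literal also share a key.

```python
import hashlib
import re


def format_query(q):
    # Same normalization as above: collapse whitespace, trim, lowercase.
    q = re.sub(r"\n", " ", q)
    q = re.sub(r"\s+", " ", q)
    q = re.sub(r"^\s|\s$", "", q)
    return q.lower()


def cache_key(q, project_id):
    # Hypothetical helper (not in the original): query + project ID -> hash.
    seed_str = format_query(q) + " project_id:" + project_id
    return hashlib.sha256(seed_str.encode()).hexdigest()


multi_line = """
SELECT
  *
FROM
  omotai_table
"""
one_line = "select * from omotai_table"

# Formatting differences do not change the key.
same_key = cache_key(multi_line, "test") == cache_key(one_line, "test")
print(same_key)  # True

# A different project ID gives a different key.
same_across_projects = cache_key(one_line, "test") == cache_key(one_line, "prod")
print(same_across_projects)  # False

# Caveat: lowercasing also applies to string literals inside the query.
literal_collision = (
    cache_key("SELECT * FROM t WHERE name = 'Tokyo'", "test")
    == cache_key("SELECT * FROM t WHERE name = 'tokyo'", "test")
)
print(literal_collision)  # True
```

If case-sensitive literals matter for your queries, dropping the lower() step is probably the simplest adjustment, at the cost of treating SELECT and select as different queries.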


Final code

Just use read_gbq_cache in place of pd.read_gbq.

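import hashlib
import os
import re

import pandas as pd

def format_query(q):
    # Collapse whitespace, trim, and lowercase so formatting does not change the key
    q = re.sub(r"\n", " ", q)
    q = re.sub(r"\s+", " ", q)
    q = re.sub(r"^\s|\s$", "", q)
    q = q.lower()
    return q

# Note: the default cache_dir is an absolute path at the filesystem root;
# pass e.g. cache_dir="./.gbq_cache" if that location is not writable.
def read_gbq_cache(q, project_id, use_cache=True, cache_dir="/.gbq_cache"):
    # The cache key is derived from the normalized query and the project ID
    seed_str = format_query(q) + project_id
    key = hashlib.sha256(seed_str.encode()).hexdigest()
    save_path = f"{cache_dir}/{key}.pickle"

    os.makedirs(cache_dir, exist_ok=True)

    if not use_cache or not os.path.exists(save_path):
        # Cache miss (or cache disabled): query BigQuery and save the result
        df = pd.read_gbq(q, project_id, progress_bar_type="tqdm_notebook")
        df.to_pickle(save_path)
        print(f"save to {save_path}")
        return df
    else:
        print(f"found cache: {save_path}")
        return pd.read_pickle(save_path)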
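The cache-hit and cache-miss flow can be tried without touching BigQuery. The following is a minimal sketch, not the code above: read_with_cache and fake_loader are hypothetical names, the loader argument stands in for pd.read_gbq, and the standard pickle module replaces the DataFrame pickle methods so that it runs with the standard library alone.

```python
import hashlib
import os
import pickle
import tempfile


def read_with_cache(q, project_id, loader, cache_dir, use_cache=True):
    # Hypothetical variant of read_gbq_cache; `loader` stands in for pd.read_gbq.
    normalized = " ".join(q.split()).lower()  # collapse whitespace, lowercase
    key = hashlib.sha256((normalized + project_id).encode()).hexdigest()
    save_path = os.path.join(cache_dir, f"{key}.pickle")
    os.makedirs(cache_dir, exist_ok=True)

    # Cache hit: return the stored result without calling the loader.
    if use_cache and os.path.exists(save_path):
        with open(save_path, "rb") as f:
            return pickle.load(f)

    # Cache miss (or cache disabled): load, store, and return.
    result = loader(q, project_id)
    with open(save_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_loader(q, project_id):
    # Stand-in for an expensive query; records every real "load".
    calls.append(q)
    return [{"id": 1}, {"id": 2}]


with tempfile.TemporaryDirectory() as tmp:
    first = read_with_cache("SELECT *\nFROM omotai_table", "test", fake_loader, tmp)
    second = read_with_cache("select * from omotai_table", "test", fake_loader, tmp)
    third = read_with_cache(
        "select * from omotai_table", "test", fake_loader, tmp, use_cache=False
    )

print(len(calls))       # 2
print(first == second)  # True
```

The second call is served from the pickle file even though the query is formatted differently, and use_cache=False forces a reload that also refreshes the cached file.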

This scrap was closed on 2021/06/15.
