dbt Python モデルで日本のカレンダーにローカライズされたリファレンステーブルを作成する
dbt Advent Calendar 2022 の 7 日目の記事です。
ビジネス要件の追加・変更で、「インフォメーションマート(データマート)に、日本のカレンダーにローカライズされた集計列を追加する作業」は、分析あるあるだと思います。
- 国民の祝日(元旦、山の日など)
- 第 n 営業日 など
インフォメーションマートを作成・更新するたびに、 SQL の関数を使用して個別に集計列を実装しても良いのですが、ビジネスルールが変更されるたびに、すべてのインフォメーションマートの設定を変更しなければいけません。
どのようなインフォメーションマートにでも使えるような汎用性が高い集計列は、あらかじめリファレンステーブルを作成しておくと、いざ実装するときに join
でまとめて追加することができます。
データ分析基盤の運用がラクになるので、まだリファレンステーブルを取り入れていない方はぜひ取り入れてみてくださいね 👋
さて、話は戻りますが、 DWH アプリケーションの標準の関数だけで、日本のカレンダーにローカライズされたリファレンステーブルを作成することはできません。
通常、何かしらの方法を利用して、Google カレンダーなどのカレンダー情報を DWH にロードして、以下のようなリファレンステーブルを作成します。
date | day_of_week | business_day | is_holiday |
---|---|---|---|
2022-05-01 | 日 | true | |
2022-05-02 | 月 | 1 | false |
2022-05-03 | 火 | true | |
2022-05-04 | 水 | true | |
2022-05-05 | 木 | true | |
2022-05-06 | 金 | 2 | false |
2022-05-07 | 土 | true | |
2022-05-08 | 日 | true | |
2022-05-09 | 月 | 3 | false |
2022-05-10 | 火 | 4 | false |
dbt のバージョン 1.3 以降、dbt Python モデルが使えるようになりました。(Core でも Cloud でも使えます!)
この記事では、dbt run
コマンドが実行されるたびに、「1990 年 1 月 1 日」から「実行された日付」までのリファレンステーブルを以下の環境で作成することをゴールとしています。
- Snowflake × dbt Cloud
- BigQuery(Dataproc クラスタ) × dbt Cloud
Snowflake × dbt Cloud
Snowflake では、デフォルトで dbt Python モデルが利用できる状態になっています 👏
ただし、今回使用する holidays などの外部パッケージを利用したい場合は、Anaconda Python パッケージを画面から有効にする必要があります。
このパッケージを有効にすると、メジャーどころの外部パッケージを pip install
等のコマンドを入力しなくて、すぐに使うことができるのでありがたいです。
全パッケージリストは https://repo.anaconda.com/pkgs/snowflake/ にあります。
手順
-
[Snowflake] 請求と利用規約ページで Anaconda Python パッケージを有効にする(※設定には、ORGADMIN ロールを持つアカウントが必要です)
-
[dbt Cloud] クエリの実行を高速にするために、プライベートプレビュー機能を有効化する(※将来的にデフォルトで有効になる可能性がある)
dbt_project.yml︙ models: use_anonymous_sproc: True
-
[dbt Cloud] dbt Pythom モデルを作成
models フォルダに[python_model_name].py
を作成する -
[dbt Cloud] コードを書き終えたら、
dbt run
コマンドで実行
プログラム
from datetime import datetime
import holidays
import numpy as np
import pandas as pd
def model(dbt, session):
# dbtの設定
dbt.config(
materialized='table',
packages=['holidays']
)
# 日付の設定
jp_holidays = holidays.JP()
start_date = datetime(1990, 1, 1).strftime('%Y-%m-%d')
end_date = datetime.now().strftime('%Y-%m-%d')
# 日付のデータフレームを作成
df = pd.DataFrame({'date': pd.date_range(start_date, end_date, freq='D')})
# 年、年月、四半期、月、日
df['year'] = df['date'].dt.year
df['year_month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.quarter
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# 曜日(漢字)
df['day_of_week'] = df['date'].dt.weekday.map({0: '日', 1: '月', 2: '火', 3: '水', 4: '木', 5: '金', 6: '土'})
# 祝日フラグ
df['is_public_holiday'] = df['date'].apply(lambda x: True if x in jp_holidays else False)
# 祝日の名前
df['public_holiday_name'] = df['date'].apply(lambda x: jp_holidays.get(x) if x in jp_holidays else pd.NA)
# 休日フラグ(土、日、祝、三が日)
df['is_holiday'] = np.where(
(df['is_public_holiday']==True)| \
(df['day_of_week']=='土')| \
(df['day_of_week']=='日')| \
(df['month']==1)&(df['day']==2)| \
(df['month']==1)&(df['day']==3), True, False)
# 第n営業日
df['business_day'] = df.groupby(['year_month', 'is_holiday'])['date'] \
.rank(ascending=True, method='first') \
.mask(df['is_holiday']==True, pd.NA) \
.astype('Int64')
# Pandasのdatetime64[ns]型をSnowflake用のDate型に変換
df['date'] = df['date'].astype(str).apply(lambda x: datetime.strptime(x, '%Y-%m-%d').date())
return df
Snowflake の Python コネクタでは、 model 関数の中でデータを処理して、最終的に pandas.DataFrame
オブジェクトを返します。
詰まったところは、Pandas のデータ型を Snowflake のデータ型に変換するところでした。
BigQuery × dbt Cloud
BigQuery では、2 つの方法で dbt Python モデルを利用することができます。
- Dataproc サーバーレス
- Dataproc クラスタ(今回はこっち)
この 2 つの使い分けは、あらかじめ用意されているパッケージで十分ならサーバーレス、それ以外のパッケージを利用したいなたクラスタ、といったところでしょうか。
Dataproc の外部パッケージ一覧(2022 年 11 月 3 日時点)
import subprocess
print(subprocess.run(['pip', 'freeze'], capture_output=True, text=True).stdout)
# access @ file:///home/conda/feedstock_root/build_artifacts/access_1659357695203/work
# affine @ file:///home/conda/feedstock_root/build_artifacts/affine_1648309059244/work
# aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1663850771634/work
# aiosignal @ file:///home/conda/feedstock_root/build_artifacts/aiosignal_1636093929600/work
# alabaster==0.7.12
# alembic @ file:///home/conda/feedstock_root/build_artifacts/alembic_1657813896088/work
# amply @ file:///home/conda/feedstock_root/build_artifacts/amply_1650975715775/work
# ansiwrap==0.8.4
# anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1652463879716/work/dist
# appdirs @ file:///home/conda/feedstock_root/build_artifacts/appdirs_1603108395799/work
# argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1640817743617/work
# argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1649500309442/work
# arrow @ file:///home/conda/feedstock_root/build_artifacts/arrow_1662382474514/work
# astroid @ file:///home/conda/feedstock_root/build_artifacts/astroid_1663607123174/work
# asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1660605382950/work
# async-generator==1.10
# async-timeout @ file:///home/conda/feedstock_root/build_artifacts/async-timeout_1640026696943/work
# atomicwrites @ file:///home/conda/feedstock_root/build_artifacts/atomicwrites_1657325823582/work
# attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1659291887007/work
# autopep8 @ file:///home/conda/feedstock_root/build_artifacts/autopep8_1615918605177/work
# aws-requests-auth==0.4.2
# Babel @ file:///home/conda/feedstock_root/build_artifacts/babel_1655419414885/work
# backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
# backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work
# bcolz==1.2.1
# beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1649463573192/work
# binaryornot==0.4.4
# black @ file:///home/conda/feedstock_root/build_artifacts/black-recipe_1622561163993/work
# bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1656355450470/work
# blinker @ file:///home/conda/feedstock_root/build_artifacts/blinker_1664823096650/work
# bokeh @ file:///home/conda/feedstock_root/build_artifacts/bokeh_1652969564918/work
# boto3 @ file:///home/conda/feedstock_root/build_artifacts/boto3_1665230905305/work
# botocore @ file:///home/conda/feedstock_root/build_artifacts/botocore_1665741815112/work
# branca @ file:///home/conda/feedstock_root/build_artifacts/branca_1650311966802/work
# brotlipy @ file:///home/conda/feedstock_root/build_artifacts/brotlipy_1648854175163/work
# cachetools==4.2.4
# certifi==2022.9.24
# certipy==0.1.3
# cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1656782821535/work
# chardet @ file:///home/conda/feedstock_root/build_artifacts/chardet_1649184112677/work
# charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1661170624537/work
# click==7.1.2
# click-plugins==1.1.1
# cligj @ file:///home/conda/feedstock_root/build_artifacts/cligj_1633637764473/work
# cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1662587369221/work
# colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1655412516417/work
# conda==4.9.2
# conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1663583601093/work
# confuse @ file:///home/conda/feedstock_root/build_artifacts/confuse_1658019724937/work
# cookiecutter @ file:///home/conda/feedstock_root/build_artifacts/cookiecutter_1643669229020/work
# coverage @ file:///home/conda/feedstock_root/build_artifacts/coverage_1664603892600/work
# cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1665535545125/work
# cycler @ file:///home/conda/feedstock_root/build_artifacts/cycler_1635519461629/work
# Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1659101929163/work
# cytoolz @ file:///home/conda/feedstock_root/build_artifacts/cytoolz_1657553437789/work
# dask @ file:///home/conda/feedstock_root/build_artifacts/dask-core_1607657054678/work
# dataclasses @ file:///home/conda/feedstock_root/build_artifacts/dataclasses_1628958434797/work
# debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1660619017351/work
# decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
# defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work
# deprecation @ file:///home/conda/feedstock_root/build_artifacts/deprecation_1589881437857/work
# descartes==1.1.0
# diff-match-patch @ file:///home/conda/feedstock_root/build_artifacts/diff-match-patch_1594679019945/work
# dill @ file:///home/conda/feedstock_root/build_artifacts/dill_1653058582944/work
# distlib @ file:///home/conda/feedstock_root/build_artifacts/distlib_1657828092311/work
# distributed @ file:///home/conda/feedstock_root/build_artifacts/distributed_1611361822694/work
# docutils @ file:///home/conda/feedstock_root/build_artifacts/docutils_1657104281902/work
# entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
# esda @ file:///home/conda/feedstock_root/build_artifacts/esda_1660931045600/work
# executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1665301981797/work
# fastavro @ file:///home/conda/feedstock_root/build_artifacts/fastavro_1652900770404/work
# fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1663619548554/work/dist
# fastparquet @ file:///home/conda/feedstock_root/build_artifacts/fastparquet_1619039245868/work
# filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1660129891014/work
# findspark @ file:///home/conda/feedstock_root/build_artifacts/findspark_1644599740637/work
# Fiona @ file:///home/conda/feedstock_root/build_artifacts/fiona_1653911984590/work
# flake8 @ file:///home/conda/feedstock_root/build_artifacts/flake8_1601874335748/work
# flit_core @ file:///home/conda/feedstock_root/build_artifacts/flit-core_1645629044586/work/source/flit_core
# folium @ file:///home/conda/feedstock_root/build_artifacts/folium_1665185302619/work
# frozenlist @ file:///home/conda/feedstock_root/build_artifacts/frozenlist_1659556945411/work
# fsspec @ file:///home/conda/feedstock_root/build_artifacts/fsspec_1618579848600/work
# future @ file:///home/conda/feedstock_root/build_artifacts/future_1649010148304/work
# gcsfs @ file:///home/conda/feedstock_root/build_artifacts/gcsfs_1618251324500/work
# GDAL==3.5.0
# geopandas @ file:///home/conda/feedstock_root/build_artifacts/geopandas_1658688831518/work
# giddy @ file:///home/conda/feedstock_root/build_artifacts/giddy_1591802367630/work
# gitdb @ file:///home/conda/feedstock_root/build_artifacts/gitdb_1635085722655/work
# GitPython @ file:///home/conda/feedstock_root/build_artifacts/gitpython_1665480281556/work
# gmpy2 @ file:///home/conda/feedstock_root/build_artifacts/gmpy2_1641732911382/work
# google-api-core==1.33.2
# google-auth==1.35.0
# google-auth-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/google-auth-oauthlib_1663214056440/work
# google-cloud-bigquery==2.6.2
# google-cloud-bigquery-storage==2.1.0
# google-cloud-bigtable==1.6.1
# google-cloud-container==2.3.1
# google-cloud-core==1.7.3
# google-cloud-datacatalog==3.0.0
# google-cloud-dataproc==2.2.0
# google-cloud-datastore==2.1.6
# google-cloud-language==2.0.0
# google-cloud-logging==2.1.1
# google-cloud-monitoring==2.0.1
# google-cloud-pubsub==2.2.0
# google-cloud-redis==2.0.0
# google-cloud-spanner==2.1.1
# google-cloud-speech==2.0.1
# google-cloud-storage==1.35.1
# google-cloud-texttospeech==2.2.0
# google-cloud-translate==3.0.2
# google-cloud-vision==2.0.0
# google-crc32c==1.5.0
# google-resumable-media==1.3.3
# googleapis-common-protos==1.56.4
# greenlet @ file:///home/conda/feedstock_root/build_artifacts/greenlet_1661444185355/work
# grpc-google-iam-v1==0.12.4
# grpcio==1.49.1
# grpcio-status==1.48.2
# HeapDict==1.0.1
# htmlmin==0.1.12
# httplib2 @ file:///home/conda/feedstock_root/build_artifacts/httplib2_1617134439639/work
# idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1593328102638/work
# imagecodecs @ file:///home/conda/feedstock_root/build_artifacts/imagecodecs_1662930206934/work
# ImageHash @ file:///home/conda/feedstock_root/build_artifacts/imagehash_1664371213222/work
# imageio @ file:///home/conda/feedstock_root/build_artifacts/imageio_1663572338894/work
# imagesize @ file:///home/conda/feedstock_root/build_artifacts/imagesize_1656939531508/work
# importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1653252793585/work
# importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1665204935269/work
# inequality==1.0.0
# inflection @ file:///home/conda/feedstock_root/build_artifacts/inflection_1598089801258/work
# iniconfig @ file:///home/conda/feedstock_root/build_artifacts/iniconfig_1603384189793/work
# intervaltree==3.0.2
# ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1655241626755/work
# ipyparallel @ file:///home/conda/feedstock_root/build_artifacts/ipyparallel_1607986704956/work
# ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1662481517711/work
# ipython-genutils==0.2.0
# ipython-sql @ file:///home/conda/feedstock_root/build_artifacts/ipython-sql_1602667917966/work
# ipywidgets @ file:///home/conda/feedstock_root/build_artifacts/ipywidgets_1660942226216/work
# isort @ file:///home/conda/feedstock_root/build_artifacts/isort_1636447814597/work
# jaraco.classes @ file:///home/conda/feedstock_root/build_artifacts/jaraco.classes_1658175614188/work
# jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1635823949331/work
# jeepney @ file:///home/conda/feedstock_root/build_artifacts/jeepney_1649085214306/work
# Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1636510082894/work
# jinja2-time @ file:///home/conda/feedstock_root/build_artifacts/jinja2-time_1646750632133/work
# jmespath @ file:///home/conda/feedstock_root/build_artifacts/jmespath_1655568249366/work
# joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1663332044897/work
# json5 @ file:///home/conda/feedstock_root/build_artifacts/json5_1600692310011/work
# jsonlines @ file:///home/conda/feedstock_root/build_artifacts/jsonlines_1596777080833/work
# jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-meta_1662736858348/work
# jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1649327809992/work
# jupyter-contrib-core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_contrib_core_1657548529421/work
# jupyter-contrib-nbextensions @ file:///home/conda/feedstock_root/build_artifacts/jupyter_contrib_nbextensions_1602805456242/work
# jupyter-core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1652365251650/work
# jupyter-highlight-selected-word @ file:///home/conda/feedstock_root/build_artifacts/jupyter_highlight_selected_word_1638382841351/work
# jupyter-http-over-ws @ file:///home/conda/feedstock_root/build_artifacts/jupyter_http_over_ws_1597332535364/work
# jupyter-latex-envs @ file:///home/conda/feedstock_root/build_artifacts/jupyter_latex_envs_1614808832269/work
# jupyter-nbextensions-configurator @ file:///home/conda/feedstock_root/build_artifacts/jupyter_nbextensions_configurator_1611341108640/work
# jupyter-server @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_1647940913071/work
# jupyter-server-mathjax @ file:///home/conda/feedstock_root/build_artifacts/jupyter-server-mathjax_1657838296256/work
# jupyter-telemetry @ file:///home/conda/feedstock_root/build_artifacts/jupyter_telemetry_1605173804246/work
# jupyterhub @ file:///home/conda/feedstock_root/build_artifacts/jupyterhub-feedstock_1614255305026/work
# jupyterlab @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_1632809509349/work
# jupyterlab-git @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab-git_1620032639379/work
# jupyterlab-pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1649936611996/work
# jupyterlab-widgets @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_widgets_1631590465624/work
# jupyterlab_server @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_server_1665686629850/work
# keyring @ file:///home/conda/feedstock_root/build_artifacts/keyring_1663458642481/work
# kiwisolver @ file:///home/conda/feedstock_root/build_artifacts/kiwisolver_1657953078266/work
# koalas @ file:///home/conda/feedstock_root/build_artifacts/koalas_1605320953654/work
# lazy-object-proxy @ file:///home/conda/feedstock_root/build_artifacts/lazy-object-proxy_1649033177169/work
# libcst==0.4.7
# libpysal @ file:///home/conda/feedstock_root/build_artifacts/libpysal_1646347632001/work
# llvmlite==0.36.0
# locket @ file:///home/conda/feedstock_root/build_artifacts/locket_1650660393415/work
# lxml @ file:///home/conda/feedstock_root/build_artifacts/lxml_1649637471763/work
# Mako @ file:///home/conda/feedstock_root/build_artifacts/mako_1663885412539/work
# mamba @ file:///home/conda/feedstock_root/build_artifacts/mamba_1629310321864/work
# mapclassify @ file:///home/conda/feedstock_root/build_artifacts/mapclassify_1627363180059/work
# Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1651821407140/work
# MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1648737563195/work
# matplotlib @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-suite_1632416634429/work
# matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
# mccabe==0.6.1
# metakernel @ file:///home/conda/feedstock_root/build_artifacts/metakernel_1648594625035/work
# mgwr @ file:///home/conda/feedstock_root/build_artifacts/mgwr_1599671096782/work
# missingno==0.4.2
# mistune @ file:///home/conda/feedstock_root/build_artifacts/mistune_1635844675081/work
# mock @ file:///home/conda/feedstock_root/build_artifacts/mock_1648992799371/work
# more-itertools @ file:///home/conda/feedstock_root/build_artifacts/more-itertools_1660060161633/work
# mpmath @ file:///home/conda/feedstock_root/build_artifacts/mpmath_1612895720168/work
# msgpack @ file:///home/conda/feedstock_root/build_artifacts/msgpack-python_1654260646886/work
# multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1648882420627/work
# munch==2.5.0
# mypy-extensions @ file:///home/conda/feedstock_root/build_artifacts/mypy_extensions_1649013329883/work
# nbclassic @ file:///home/conda/feedstock_root/build_artifacts/nbclassic_1647450696711/work
# nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1646999386773/work
# nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert_1605401836768/work
# nbdime @ file:///home/conda/feedstock_root/build_artifacts/nbdime_1618448032595/work
# nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1665426034066/work
# nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1664684991461/work
# networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1664667048684/work
# nltk @ file:///home/conda/feedstock_root/build_artifacts/nltk_1633093058893/work
# notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1610575313697/work
# notebook-shim @ file:///home/conda/feedstock_root/build_artifacts/notebook-shim_1646330736330/work
# numba @ file:///home/conda/feedstock_root/build_artifacts/numba_1623568544775/work
# numexpr @ file:///home/conda/feedstock_root/build_artifacts/numexpr_1658076426113/work
# numpy==1.23.4
# numpydoc @ file:///home/conda/feedstock_root/build_artifacts/numpydoc_1665273484262/work
# oauth2client==4.1.3
# oauthlib @ file:///home/conda/feedstock_root/build_artifacts/oauthlib_1662766753257/work
# packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1637239678211/work
# pamela==1.0.0
# pandas==1.5.1
# pandas-profiling @ file:///home/conda/feedstock_root/build_artifacts/pandas-profiling_1613839428900/work
# pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work
# papermill @ file:///home/conda/feedstock_root/build_artifacts/papermill_1604950649566/work
# parso==0.7.0
# partd @ file:///home/conda/feedstock_root/build_artifacts/partd_1660316728562/work
# pathspec @ file:///home/conda/feedstock_root/build_artifacts/pathspec_1662130653746/work
# patsy @ file:///home/conda/feedstock_root/build_artifacts/patsy_1665356157073/work
# pexpect==4.8.0
# phik @ file:///home/conda/feedstock_root/build_artifacts/phik_1647910154052/work
# pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602535658641/work
# Pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1660385857893/work
# pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1633981968097/work
# platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1657729053205/work
# pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1648772598846/work
# pointpats @ file:///home/conda/feedstock_root/build_artifacts/pointpats_1659367300621/work
# pooch @ file:///home/conda/feedstock_root/build_artifacts/pooch_1643032624649/work
# portalocker @ file:///home/conda/feedstock_root/build_artifacts/portalocker_1657463956927/work
# poyo==0.5.0
# prettytable @ file:///home/conda/feedstock_root/build_artifacts/prettytable_1662142252009/work
# prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1665692535292/work
# prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1662384672173/work
# proto-plus==1.11.0
# protobuf==3.20.3
# psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1662356143277/work
# ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
# PuLP @ file:///home/conda/feedstock_root/build_artifacts/pulp_1649461391700/work
# pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
# pure-sasl @ file:///home/conda/feedstock_root/build_artifacts/pure-sasl_1631890804823/work
# py @ file:///home/conda/feedstock_root/build_artifacts/py_1636301881863/work
# py4j==0.10.9
# pyarrow==2.0.0
# pyasn1==0.4.8
# pyasn1-modules==0.2.7
# pycodestyle @ file:///home/conda/feedstock_root/build_artifacts/pycodestyle_1589305246696/work
# pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1649384811940/work
# pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
# pycurl==7.45.1
# pydocstyle @ file:///home/conda/feedstock_root/build_artifacts/pydocstyle_1621377123289/work
# pydot @ file:///home/conda/feedstock_root/build_artifacts/pydot_1654961309573/work
# pyflakes==2.2.0
# pygeos @ file:///home/conda/feedstock_root/build_artifacts/pygeos_1649618740943/work
# Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1660666458521/work
# PyHive @ file:///home/conda/feedstock_root/build_artifacts/pyhive_1646707521362/work
# PyJWT @ file:///home/conda/feedstock_root/build_artifacts/pyjwt_1663432570896/work
# pylint @ file:///home/conda/feedstock_root/build_artifacts/pylint_1663619536569/work
# pyls-black @ file:///home/conda/feedstock_root/build_artifacts/pyls-black_1595615126037/work
# pyls-spyder @ file:///home/conda/feedstock_root/build_artifacts/pyls-spyder_1613487177406/work
# pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1665350324128/work
# pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1635267989520/work
# pyproj @ file:///home/conda/feedstock_root/build_artifacts/pyproj_1650803108421/work
# PyQt5==5.12.3
# PyQt5_sip==4.19.18
# PyQtChart==5.12
# PyQtWebEngine==5.12.1
# pyrsistent @ file:///home/conda/feedstock_root/build_artifacts/pyrsistent_1649013354401/work
# pysal @ file:///home/conda/feedstock_root/build_artifacts/pysal_1612819487814/work
# PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
# # Editable install with no version control (pyspark==3.1.3)
# -e /usr/lib/spark/python
# pytest==7.1.3
# pytest-cov @ file:///home/conda/feedstock_root/build_artifacts/pytest-cov_1664412836798/work
# python-dateutil==2.8.2
# python-json-logger @ file:///home/conda/feedstock_root/build_artifacts/python-json-logger_1602545356084/work
# python-jsonrpc-server @ file:///home/conda/feedstock_root/build_artifacts/python-jsonrpc-server_1599827444631/work
# python-language-server @ file:///home/conda/feedstock_root/build_artifacts/python-language-server_1607720213724/work
# python-slugify @ file:///home/conda/feedstock_root/build_artifacts/python-slugify_1651150815876/work
# pytoolconfig @ file:///home/conda/feedstock_root/build_artifacts/pytoolconfig_1659322367429/work
# pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1664798238822/work
# pyu2f @ file:///home/conda/feedstock_root/build_artifacts/pyu2f_1604248910016/work
# PyWavelets @ file:///home/conda/feedstock_root/build_artifacts/pywavelets_1649616412805/work
# pyxdg @ file:///home/conda/feedstock_root/build_artifacts/pyxdg_1654536799286/work
# PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1648757091578/work
# pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1663830520841/work
# QDarkStyle @ file:///home/conda/feedstock_root/build_artifacts/qdarkstyle_1617328841504/work
# qstylizer @ file:///home/conda/feedstock_root/build_artifacts/qstylizer_1662244505808/work/dist/qstylizer-0.2.2-py2.py3-none-any.whl
# QtAwesome @ file:///home/conda/feedstock_root/build_artifacts/qtawesome_1638212842953/work
# qtconsole @ file:///home/conda/feedstock_root/build_artifacts/qtconsole-base_1661698361365/work
# QtPy @ file:///home/conda/feedstock_root/build_artifacts/qtpy_1664834420615/work
# quantecon @ file:///home/conda/feedstock_root/build_artifacts/quantecon_1655746571862/work
# quilt3 @ file:///home/conda/feedstock_root/build_artifacts/quilt3_1647621686355/work/api/python
# rasterio @ file:///home/conda/feedstock_root/build_artifacts/rasterio_1655388667652/work
# rasterstats @ file:///home/conda/feedstock_root/build_artifacts/rasterstats_1658854667468/work
# regex @ file:///home/conda/feedstock_root/build_artifacts/regex_1617644422046/work
# requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1608156231189/work
# requests-futures==1.0.0
# requests-oauthlib @ file:///home/conda/feedstock_root/build_artifacts/requests-oauthlib_1643557462909/work
# rope @ file:///home/conda/feedstock_root/build_artifacts/rope_1659117721617/work
# rsa @ file:///home/conda/feedstock_root/build_artifacts/rsa_1658328885051/work
# Rtree @ file:///home/conda/feedstock_root/build_artifacts/rtree_1637430736605/work
# ruamel-yaml-conda @ file:///home/conda/feedstock_root/build_artifacts/ruamel_yaml_1653464386701/work
# ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1649033203966/work
# ruamel.yaml.clib @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml.clib_1649013063220/work
# rvlib @ file:///home/conda/feedstock_root/build_artifacts/rvlib_1652306299890/work
# s3transfer @ file:///home/conda/feedstock_root/build_artifacts/s3transfer_1654039987929/work
# scikit-image @ file:///home/conda/feedstock_root/build_artifacts/scikit-image_1638363134145/work
# scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1630910537183/work
# scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy_1619561901336/work
# seaborn @ file:///home/conda/feedstock_root/build_artifacts/seaborn-split_1629095986539/work
# SecretStorage @ file:///home/conda/feedstock_root/build_artifacts/secretstorage_1660605880168/work
# segregation @ file:///home/conda/feedstock_root/build_artifacts/segregation_1659363234895/work
# Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1628511208346/work
# Shapely @ file:///home/conda/feedstock_root/build_artifacts/shapely_1651793098501/work
# simplejson @ file:///home/conda/feedstock_root/build_artifacts/simplejson_1649013519645/work
# six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
# smmap @ file:///home/conda/feedstock_root/build_artifacts/smmap_1611376390914/work
# sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1662051266223/work
# snowballstemmer @ file:///home/conda/feedstock_root/build_artifacts/snowballstemmer_1637143057757/work
# snuggs==1.4.7
# sortedcontainers @ file:///home/conda/feedstock_root/build_artifacts/sortedcontainers_1621217038088/work
# soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1658207591808/work
# spaghetti @ file:///home/conda/feedstock_root/build_artifacts/spaghetti_1665544905212/work
# spglm @ file:///home/conda/feedstock_root/build_artifacts/spglm_1599674431915/work
# Sphinx @ file:///home/conda/feedstock_root/build_artifacts/sphinx_1664561477826/work
# sphinxcontrib-applehelp==1.0.2
# sphinxcontrib-devhelp==1.0.2
# sphinxcontrib-htmlhelp @ file:///home/conda/feedstock_root/build_artifacts/sphinxcontrib-htmlhelp_1621704829796/work
# sphinxcontrib-jsmath==1.0.1
# sphinxcontrib-qthelp==1.0.3
# sphinxcontrib-serializinghtml @ file:///home/conda/feedstock_root/build_artifacts/sphinxcontrib-serializinghtml_1649380998999/work
# spint @ file:///home/conda/feedstock_root/build_artifacts/spint_1612437640828/work
# splot @ file:///home/conda/feedstock_root/build_artifacts/splot_1649898658322/work
# spopt @ file:///home/conda/feedstock_root/build_artifacts/spopt_1655150061954/work
# spreg @ file:///home/conda/feedstock_root/build_artifacts/spreg_1624998920023/work
# spvcm==0.3.0
# spyder @ file:///home/conda/feedstock_root/build_artifacts/spyder_1627140945937/work
# spyder-kernels @ file:///home/conda/feedstock_root/build_artifacts/spyder-kernels_1625331173960/work
# spylon==0.3.0
# spylon-kernel==0.4.1
# SQLAlchemy @ file:///home/conda/feedstock_root/build_artifacts/sqlalchemy_1662536019813/work
# sqlparse @ file:///home/conda/feedstock_root/build_artifacts/sqlparse_1663990453245/work
# stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1664126450622/work
# statsmodels @ file:///home/conda/feedstock_root/build_artifacts/statsmodels_1654787099639/work
# sympy @ file:///home/conda/feedstock_root/build_artifacts/sympy_1618015367433/work
# tables @ file:///home/conda/feedstock_root/build_artifacts/pytables_1638208858826/work
# tangled-up-in-unicode @ file:///home/conda/feedstock_root/build_artifacts/tangled-up-in-unicode_1632832610704/work
# tblib @ file:///home/conda/feedstock_root/build_artifacts/tblib_1616261298899/work
# tenacity @ file:///home/conda/feedstock_root/build_artifacts/tenacity_1663807083684/work
# terminado @ file:///home/conda/feedstock_root/build_artifacts/terminado_1664482163879/work
# testpath @ file:///home/conda/feedstock_root/build_artifacts/testpath_1645693042223/work
# text-unidecode==1.3
# textdistance @ file:///home/conda/feedstock_root/build_artifacts/textdistance_1663527496115/work
# textwrap3==0.9.2
# threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1643647933166/work
# three-merge @ file:///home/conda/feedstock_root/build_artifacts/three-merge_1595515817927/work
# thrift @ file:///home/conda/feedstock_root/build_artifacts/thrift_1649442765282/work/lib/py
# thrift-sasl @ file:///home/conda/feedstock_root/build_artifacts/thrift_sasl_1631824374965/work
# tifffile @ file:///home/conda/feedstock_root/build_artifacts/tifffile_1665588749940/work
# tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1637612658783/work
# tobler @ file:///home/conda/feedstock_root/build_artifacts/tobler_1645560589513/work
# toml @ file:///home/conda/feedstock_root/build_artifacts/toml_1604308577558/work
# tomli @ file:///home/conda/feedstock_root/build_artifacts/tomli_1644342247877/work
# tomlkit @ file:///home/conda/feedstock_root/build_artifacts/tomlkit_1664410426671/work
# toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
# tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1648827257044/work
# tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1662214488106/work
# traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1663005918942/work
# typed-ast @ file:///home/conda/feedstock_root/build_artifacts/typed-ast_1653226017431/work
# typing-inspect==0.8.0
# typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1665144421445/work
# ujson @ file:///home/conda/feedstock_root/build_artifacts/ujson_1663334548095/work
# Unidecode @ file:///home/conda/feedstock_root/build_artifacts/unidecode_1664588179651/work
# uritemplate==3.0.1
# urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1603125704209/work
# virtualenv @ file:///home/conda/feedstock_root/build_artifacts/virtualenv_1643238754089/work
# visions @ file:///home/conda/feedstock_root/build_artifacts/visions_1600915384170/work
# watchdog @ file:///home/conda/feedstock_root/build_artifacts/watchdog_1654901050847/work
# wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work
# webencodings==0.5.1
# websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1662334000707/work
# widgetsnbextension @ file:///home/conda/feedstock_root/build_artifacts/widgetsnbextension_1637174139311/work
# wrapt @ file:///home/conda/feedstock_root/build_artifacts/wrapt_1651495238964/work
# wurlitzer @ file:///home/conda/feedstock_root/build_artifacts/wurlitzer_1636132276097/work
# xyzservices @ file:///home/conda/feedstock_root/build_artifacts/xyzservices_1663619671984/work
# yapf @ file:///home/conda/feedstock_root/build_artifacts/yapf_1641487982943/work
# yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1648966513179/work
# zict @ file:///home/conda/feedstock_root/build_artifacts/zict_1651156074437/work
# zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1665306627898/work
今回は、用意されていない holidays パッケージを使用するので、後者のクラスタを作成する方になります。
手順
-
[Google Cloud] Cloud Storage でバケットを作成
-
[Google Cloud] Cloud Dataproc API を有効化
-
[Google Cloud] Dataproc クラスタを作成
$ REGION=<region> $ CLUSTER_NAME=<cluster_name> $ gcloud dataproc clusters create ${CLUSTER_NAME} \ --region ${REGION} \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \ --metadata bigquery-connector-version=1.2.0 \ --metadata spark-bigquery-connector-version=0.21.0 \ --properties '^#^dataproc:pip.packages=holidays==0.16'
クラスタを作成するときに外部パッケージを指定します。
後から他のパッケージを追加したくなった場合は、クラスタを再構築する必要があります(面倒 🥲)。
-
[dbt Cloud] Dataproc クラスタ情報の登録
- BigQuery の接続設定(
Account Settings
->Projects
->Overview
->Connection
)を変更する
- BigQuery の接続設定(
-
[dbt Cloud] dbt Pythom モデルを作成
models フォルダに[python_model_name].py
を作成する -
[dbt Cloud] コードを書き終えたら、
dbt run
コマンドで実行
プログラム
from datetime import datetime
import holidays
import numpy as np
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, BooleanType
from pyspark.sql.functions import when
def model(dbt, session):
# dbtの設定
dbt.config(
materialized='table',
submission_method="cluster"
)
# 日付の設定
jp_holidays = holidays.JP()
start_date = datetime(1990, 1, 1).strftime('%Y-%m-%d')
end_date = datetime.now().strftime('%Y-%m-%d')
# 日付のデータフレームを作成
pdf = pd.DataFrame({'date': pd.date_range(start_date, end_date, freq='D')})
# 年、年月、四半期、月、日
pdf['year'] = pdf['date'].dt.year
pdf['year_month'] = pdf['date'].dt.strftime('%Y-%m')
pdf['quarter'] = pdf['date'].dt.quarter
pdf['month'] = pdf['date'].dt.month
pdf['day'] = pdf['date'].dt.day
# 曜日(漢字)
pdf['day_of_week'] = pdf['date'].dt.weekday.map({0: '日', 1: '月', 2: '火', 3: '水', 4: '木', 5: '金', 6: '土'})
# 祝日フラグ
pdf['is_public_holiday'] = pdf['date'].apply(lambda x: True if x in jp_holidays else False)
# 祝日の名前
pdf['public_holiday_name'] = pdf['date'].apply(lambda x: jp_holidays.get(x) if x in jp_holidays else pd.NA)
# 休日フラグ(土、日、祝、三が日)
pdf['is_holiday'] = np.where(
(pdf['is_public_holiday']==True)| \
(pdf['day_of_week']=='土')| \
(pdf['day_of_week']=='日')| \
(pdf['month']==1)&(pdf['day']==2)| \
(pdf['month']==1)&(pdf['day']==3), True, False)
# 第n営業日
pdf['business_day'] = pdf.groupby(['year_month', 'is_holiday'])['date'] \
.rank(ascending=True, method='first') \
.mask(pdf['is_holiday']==True, -1) \
.astype(int)
# PySpark用のスキーマ(なくても良いが、いい感じに型推論してくれないので設定)
schema = StructType([
StructField("date", DateType(), True),
StructField("year", IntegerType(), True),
StructField("year_month", StringType(), True),
StructField("quarter", IntegerType(), True),
StructField("month", IntegerType(), True),
StructField("day", IntegerType(), True),
StructField("day_of_week", StringType(), True),
StructField("is_public_holiday", BooleanType(), True),
StructField("public_holiday_name", StringType(), True),
StructField("is_holiday", BooleanType(), True),
StructField("business_day", IntegerType(), True),
])
# pandas.DataFrame -> pyspark.DataFrame
sdf = session.createDataFrame(pdf, schema=schema)
# 代理の欠損値(-1)をnullに変換
# pdf -> sdfのとき、IntegerType()がpdfのInt64型(NaNありの整数型)に対応していないため
sdf = sdf.withColumn('business_day', when(sdf.business_day == -1, None).otherwise(sdf.business_day))
return sdf
dbt は、BigQuery と Python の連携に PySpark を利用しているので、model 関数の戻り値には、pyspark.DataFrame
オブジェクトを指定する必要があります。
記事執筆時点(2022 年 11 月)では、 Dataproc クラスタの PySpark のバージョンが 3.1.3
なので、もし 3.2 以上になった場合は、データフレームの変換部分を修正してくださいね。
-
pandas.DataFrame
->pyspark.DataFrame
- 3.2 未満:
session.createDataFrame(df)
- 3.2 + :
df.to_spark()
- 3.2 未満:
-
pyspark.DataFrame
->pandas.DataFrame
- 3.2 未満:
df.toPandas()
- 3.2 + :
df.to_pandas_on_spark()
- 3.2 未満:
PySpark の 3.2 以降のバージョンでは、pandas.DataFrame
のラッパーである、pyspark.pandas.DataFrame
を使用しているので、相互変換の関数が変わるという訳です。
感想
日本のカレンダーにローカライズされたリファレンステーブルを作成すること自体は、 Cloud Functions などでもできていた内容ですが、dbt Python モデルが導入されたことによって、今まで以上にコードの管理がしやすくなったと個人的に感じています。
特に Snowflake で dbt Python モデルを導入するのは比較的簡単だったので、Python が書ける方は積極的に導入してもいいと思います。
それでは 👋
Discussion