dbt入門
はじめに
周回遅れではありますが、モダンデータスタックを始めたいと思います。
モダンデータスタックの中でも、ELT の T の部分は dbt 一択のようなので、dbt 入門してみます。
やってみる
基本は、以下の流れにしたいと思いますが、関連リンクを飛びながらやろうと思います。
uv でプロジェクトを作る
$ uv --version
uv 0.6.10
$ mkdir dbt-tutorial
$ cd dbt-tutorial
$ uv init
Initialized project `dbt-tutorial`
$ uv venv
Using CPython 3.9.20 interpreter at: /usr/bin/python3.9
Creating virtual environment at: .venv
$ tree
.
├── README.md
├── main.py
└── pyproject.toml
def main():
print("Hello from dbt-tutorial!")
if __name__ == "__main__":
main()
[project]
name = "dbt-tutorial"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.9"
dependencies = []
dbt Core をインストールする
DuckDB で試そうと思うので、以下を参考にインストールします。
$ uv add dbt-core dbt-duckdb
Resolved 55 packages in 280ms
Installed 52 packages in 83ms
:
+ zipp==3.21.0
$ tree
.
├── README.md
├── main.py
├── pyproject.toml
└── uv.lock
@@ -4,4 +4,7 @@ version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.9"
-dependencies = []
+dependencies = [
+ "dbt-core>=1.9.4",
+ "dbt-duckdb>=1.9.2",
+]
dbt プロジェクトを作成する
$ uv run dbt --version
/home/ec2-user/work/mds/dbt-tutorial/.venv/lib64/python3.9/site-packages/networkx/utils/backends.py:135: RuntimeWarning: networkx backend defined more than once: nx-loopback
backends.update(_get_backends("networkx.backends"))
Core:
- installed: 1.9.4
- latest: 1.9.4 - Up to date!
Plugins:
- duckdb: 1.9.2 - Up to date!
こんな感じで警告が表示されるので、Python をアップデートします。
$ uv python pin 3.12
Updated `.python-version` from `3.9` -> `3.12`
$ uv run dbt --version
Using CPython 3.12.9
Removed virtual environment at: .venv
Creating virtual environment at: .venv
Installed 52 packages in 34ms
Core:
- installed: 1.9.4
- latest: 1.9.4 - Up to date!
Plugins:
- duckdb: 1.9.2 - Up to date!
init コマンドを使用して jaffle_shop プロジェクトを開始します
$ uv run dbt init jaffle_shop
01:18:45 Running with dbt=1.9.4
01:18:45 [ConfigFolderDirectory]: Unable to parse logging event dictionary. Failed to parse dir field: expected string or bytes-like object, got 'PosixPath'.. Dictionary: {'dir': PosixPath('/home/ec2-user/.dbt')}
01:18:45 Creating dbt configuration folder at
01:18:45
Your new dbt project "jaffle_shop" was created!
For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:
https://docs.getdbt.com/docs/configure-your-profile
One more thing:
Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:
https://community.getdbt.com/
Happy modeling!
01:18:45 Setting up your profile.
Which database would you like to use?
[1] duckdb
(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)
Enter a number: 1
01:19:33 Profile jaffle_shop written to /home/ec2-user/.dbt/profiles.yml using target's sample configuration. Once updated, you'll be able to start developing with dbt.
$ tree
.
├── README.md
├── jaffle_shop
│ ├── README.md
│ ├── analyses
│ ├── dbt_project.yml
│ ├── macros
│ ├── models
│ │ └── example
│ │ ├── my_first_dbt_model.sql
│ │ ├── my_second_dbt_model.sql
│ │ └── schema.yml
│ ├── seeds
│ ├── snapshots
│ └── tests
├── logs
│ └── dbt.log
├── main.py
├── pyproject.toml
└── uv.lock
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'jaffle_shop'
version: '1.0.0'
# This setting configures which "profile" dbt uses for this project.
profile: 'jaffle_shop'
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
jaffle_shop:
# Config indicated by + and applies to all files under models/example/
example:
+materialized: view
コマンドの最後で duckdb を選んだからか、~/.dbt/profiles.yml が生成されていました。
jaffle_shop:
outputs:
dev:
type: duckdb
path: dev.duckdb
threads: 1
prod:
type: duckdb
path: prod.duckdb
threads: 4
target: dev
DuckDB に接続する
$ cd jaffle_shop/
$ uv run dbt debug
01:40:53 Running with dbt=1.9.4
01:40:53 dbt version: 1.9.4
01:40:53 python version: 3.12.9
01:40:53 python path: /home/ec2-user/work/mds/dbt-tutorial/.venv/bin/python
01:40:53 os info: Linux-6.1.128-136.201.amzn2023.x86_64-x86_64-with-glibc2.34
01:40:53 Using profiles dir at /home/ec2-user/.dbt
01:40:53 Using profiles.yml file at /home/ec2-user/.dbt/profiles.yml
01:40:53 Using dbt_project.yml file at /home/ec2-user/work/mds/dbt-tutorial/jaffle_shop/dbt_project.yml
01:40:53 adapter type: duckdb
01:40:53 adapter version: 1.9.2
01:40:53 Configuration:
01:40:53 profiles.yml file [OK found and valid]
01:40:53 dbt_project.yml file [OK found and valid]
01:40:53 Required dependencies:
01:40:53 - git [OK found]
01:40:53 Connection:
01:40:53 database: dev
01:40:53 schema: main
01:40:53 path: dev.duckdb
01:40:53 config_options: None
01:40:53 extensions: None
01:40:53 settings: {}
01:40:53 external_root: .
01:40:53 use_credential_provider: None
01:40:53 attach: None
01:40:53 filesystems: None
01:40:53 remote: None
01:40:53 plugins: None
01:40:53 disable_transactions: False
01:40:53 Registered adapter: duckdb=1.9.2
01:40:53 Connection test: [OK connection ok]
01:40:53 All checks passed!
最初の dbt run
$ uv run dbt run
01:45:20 Running with dbt=1.9.4
01:45:20 Registered adapter: duckdb=1.9.2
01:45:20 Unable to do partial parsing because saved manifest not found. Starting full parse.
01:45:21 Found 2 models, 4 data tests, 426 macros
01:45:21
01:45:21 Concurrency: 1 threads (target='dev')
01:45:21
01:45:21 1 of 2 START sql table model main.my_first_dbt_model ........................... [RUN]
01:45:21 1 of 2 OK created sql table model main.my_first_dbt_model ...................... [OK in 0.05s]
01:45:21 2 of 2 START sql view model main.my_second_dbt_model ........................... [RUN]
01:45:21 2 of 2 OK created sql view model main.my_second_dbt_model ...................... [OK in 0.03s]
01:45:21
01:45:21 Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.16 seconds (0.16s).
01:45:21
01:45:21 Completed successfully
01:45:21
01:45:21 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
最初のモデルを作成する
ここまで来て、DuckDB のモデルが無かったorz。しかも、動かしてデータウェアハウスを見るように書いてあったので…
DuckDB を使えるようにしておく
$ curl https://install.duckdb.org | sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3270 100 3270 0 0 10085 0 --:--:-- --:--:-- --:--:-- 10092
*** DuckDB Linux/MacOS installation script, version 1.2.1 ***
.;odxdl,
.xXXXXXXXXKc
0XXXXXXXXXXXd cooo:
,XXXXXXXXXXXXK OXXXXd
0XXXXXXXXXXXo cooo:
.xXXXXXXXXKc
.;odxdl,
######################################################################## 100.0%
Successfully installed DuckDB binary to /home/ec2-user/.duckdb/cli/1.2.1/duckdb
with a link from /home/ec2-user/.duckdb/cli/latest/duckdb
Hint: Append the following line to your shell profile:
export PATH='/home/ec2-user/.duckdb/cli/latest':$PATH
Also created a symlink from /home/ec2-user/.local/bin/duckdb
to /home/ec2-user/.duckdb/cli/latest/duckdb
To launch DuckDB now, type
/home/ec2-user/.duckdb/cli/latest/duckdb
ということなので、やってみる。さっきの dbt run で dev.duckdb ファイルができていたので
$ ~/.duckdb/cli/latest/duckdb
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open dev.duckdb
D .tables
my_first_dbt_model my_second_dbt_model
D select * from my_first_dbt_model;
┌───────┐
│ id │
│ int32 │
├───────┤
│ 1 │
│ NULL │
└───────┘
D .exit
モデル
DuckDB のケースは無いのですが、他のケースもジャンプしているようなので、もう少し探すと、ここから持ってくることを期待されているようです。
なので、持ってきます。まずは、seeds の下の3つの .csv ファイルをコピーして、dbt seed します。(結局、このリポジトリからコピーするのは、この3つの .csv のみでした)
$ uv run dbt seed
03:56:21 Running with dbt=1.9.4
03:56:21 Registered adapter: duckdb=1.9.2
03:56:21 Found 3 models, 4 data tests, 3 seeds, 426 macros
03:56:22
03:56:22 Concurrency: 1 threads (target='dev')
03:56:22
03:56:22 1 of 3 START seed file main.raw_customers ...................................... [RUN]
03:56:22 1 of 3 OK loaded seed file main.raw_customers .................................. [INSERT 100 in 0.05s]
03:56:22 2 of 3 START seed file main.raw_orders ......................................... [RUN]
03:56:22 2 of 3 OK loaded seed file main.raw_orders ..................................... [INSERT 99 in 0.02s]
03:56:22 3 of 3 START seed file main.raw_payments ....................................... [RUN]
03:56:22 3 of 3 OK loaded seed file main.raw_payments ................................... [INSERT 113 in 0.01s]
03:56:22
03:56:22 Finished running 3 seeds in 0 hours 0 minutes and 0.16 seconds (0.16s).
03:56:22
03:56:22 Completed successfully
03:56:22
03:56:22 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
それから、models/customers.sql を以下の内容で作成します。
with customers as (
select
id as customer_id,
first_name,
last_name
from {{ ref('raw_customers') }}
),
orders as (
select
id as order_id,
user_id as customer_id,
order_date,
status
from {{ ref('raw_orders') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders using (customer_id)
)
select * from final
そして、dbt run します。
$ uv run dbt run
03:59:26 Running with dbt=1.9.4
03:59:26 Registered adapter: duckdb=1.9.2
03:59:26 Found 3 models, 4 data tests, 3 seeds, 426 macros
03:59:26
03:59:26 Concurrency: 1 threads (target='dev')
03:59:26
03:59:26 1 of 3 START sql view model main.customers ..................................... [RUN]
03:59:26 1 of 3 OK created sql view model main.customers ................................ [OK in 0.05s]
03:59:26 2 of 3 START sql table model main.my_first_dbt_model ........................... [RUN]
03:59:26 2 of 3 OK created sql table model main.my_first_dbt_model ...................... [OK in 0.04s]
03:59:26 3 of 3 START sql view model main.my_second_dbt_model ........................... [RUN]
03:59:26 3 of 3 OK created sql view model main.my_second_dbt_model ...................... [OK in 0.02s]
03:59:26
03:59:26 Finished running 1 table model, 2 view models in 0 hours 0 minutes and 0.19 seconds (0.19s).
03:59:26
03:59:26 Completed successfully
03:59:26
03:59:26 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
DuckDB で確認します。
$ ~/.duckdb/cli/latest/duckdb
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open dev.duckdb
D .tables
customers my_second_dbt_model raw_orders
my_first_dbt_model raw_customers raw_payments
D select * from customers;
┌─────────────┬────────────┬───────────┬──────────────────┬────────────────────────┬──────────────────┐
│ customer_id │ first_name │ last_name │ first_order_date │ most_recent_order_date │ number_of_orders │
│ int32 │ varchar │ varchar │ date │ date │ int64 │
├─────────────┼────────────┼───────────┼──────────────────┼────────────────────────┼──────────────────┤
│ 1 │ Michael │ P. │ 2018-01-01 │ 2018-02-10 │ 2 │
│ 2 │ Shawn │ M. │ 2018-01-11 │ 2018-01-11 │ 1 │
│ 3 │ Kathleen │ P. │ 2018-01-02 │ 2018-03-11 │ 3 │
│ 6 │ Sarah │ R. │ 2018-02-19 │ 2018-02-19 │ 1 │
│ 7 │ Martin │ M. │ 2018-01-14 │ 2018-01-14 │ 1 │
│ 8 │ Frank │ R. │ 2018-01-29 │ 2018-03-12 │ 2 │
│ 9 │ Jennifer │ F. │ 2018-03-17 │ 2018-03-17 │ 1 │
│ 11 │ Fred │ S. │ 2018-03-23 │ 2018-03-23 │ 1 │
│ 12 │ Amy │ D. │ 2018-03-03 │ 2018-03-03 │ 1 │
│ 13 │ Kathleen │ M. │ 2018-03-07 │ 2018-03-07 │ 1 │
│ 16 │ Amanda │ H. │ 2018-02-02 │ 2018-02-02 │ 1 │
│ 18 │ Johnny │ K. │ 2018-02-27 │ 2018-02-27 │ 1 │
│ 19 │ Virginia │ F. │ 2018-03-16 │ 2018-03-16 │ 1 │
│ 20 │ Anna │ A. │ 2018-01-23 │ 2018-01-23 │ 1 │
│ 21 │ Willie │ H. │ 2018-03-28 │ 2018-03-28 │ 1 │
│ 22 │ Sean │ H. │ 2018-01-26 │ 2018-03-01 │ 3 │
│ 25 │ Victor │ H. │ 2018-01-17 │ 2018-03-20 │ 2 │
│ 26 │ Aaron │ R. │ 2018-02-11 │ 2018-03-08 │ 2 │
│ 27 │ Benjamin │ B. │ 2018-02-21 │ 2018-04-04 │ 2 │
│ 28 │ Lisa │ W. │ 2018-02-04 │ 2018-02-04 │ 1 │
│ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │
│ 60 │ Norma │ W. │ NULL │ NULL │ 0 │
│ 61 │ Timothy │ R. │ NULL │ NULL │ 0 │
│ 62 │ Elizabeth │ P. │ NULL │ NULL │ 0 │
│ 65 │ Brenda │ W. │ NULL │ NULL │ 0 │
│ 67 │ Michael │ H. │ NULL │ NULL │ 0 │
│ 72 │ Kathryn │ O. │ NULL │ NULL │ 0 │
│ 73 │ Alan │ B. │ NULL │ NULL │ 0 │
│ 74 │ Harry │ A. │ NULL │ NULL │ 0 │
│ 75 │ Andrea │ H. │ NULL │ NULL │ 0 │
│ 77 │ Anne │ W. │ NULL │ NULL │ 0 │
│ 78 │ Harry │ H. │ NULL │ NULL │ 0 │
│ 81 │ Shirley │ H. │ NULL │ NULL │ 0 │
│ 82 │ Arthur │ D. │ NULL │ NULL │ 0 │
│ 83 │ Virginia │ R. │ NULL │ NULL │ 0 │
│ 87 │ Phillip │ B. │ NULL │ NULL │ 0 │
│ 95 │ Lisa │ P. │ NULL │ NULL │ 0 │
│ 96 │ Jacqueline │ A. │ NULL │ NULL │ 0 │
│ 97 │ Shirley │ D. │ NULL │ NULL │ 0 │
│ 98 │ Nicole │ M. │ NULL │ NULL │ 0 │
│ 100 │ Jean │ M. │ NULL │ NULL │ 0 │
├─────────────┴────────────┴───────────┴──────────────────┴────────────────────────┴──────────────────┤
│ 100 rows (40 shown) 6 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘
D select view_name from duckdb_views;
┌─────────────────────┐
│ view_name │
│ varchar │
├─────────────────────┤
│ customers │
│ my_second_dbt_model │
└─────────────────────┘
D .exit
前振りで、customers が view であることを確認して、次へ進みます。
モデルのマテリアライズの方法を変更する
@@ -31,6 +31,7 @@
# files using the `{{ config(...) }}` macro.
models:
jaffle_shop:
+ +materialized: table
# Config indicated by + and applies to all files under models/example/
example:
+materialized: view
として、uv run dbt run
すると、customers は、table になり、
@@ -1,3 +1,9 @@
+{{
+ config(
+ materialized='view'
+ )
+}}
+
with customers as (
select
として、uv run dbt run
すると、customers は、view になります。
サンプルモデルを削除する
$ rm -rf models/example/
@@ -32,6 +32,3 @@
models:
jaffle_shop:
+materialized: table
- # Config indicated by + and applies to all files under models/example/
- example:
- +materialized: view
他のモデルの上にモデルを構築する
以下を models/customers.sql からコピーして、新規ファイルとして作成します。
select
id as customer_id,
first_name,
last_name
from {{ ref('raw_customers') }}
select
id as order_id,
user_id as customer_id,
order_date,
status
from {{ ref('raw_orders') }}
上で切り出した箇所を参照するように、customers.sql を変更します。
@@ -1,29 +1,12 @@
-{{
- config(
- materialized='view'
- )
-}}
-
with customers as (
- select
- id as customer_id,
- first_name,
- last_name
-
- from {{ ref('raw_customers') }}
+ select * from {{ ref('stg_customers') }}
),
orders as (
- select
- id as order_id,
- user_id as customer_id,
- order_date,
- status
-
- from {{ ref('raw_orders') }}
+ select * from {{ ref('stg_orders') }}
),
uv run dbt run
して、確認しておきます。
モデルにテストを追加する
以下のファイルを追加します。
version: 2
models:
- name: customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
uv run dbt test
を実行する。
$ uv run dbt test
04:50:53 Running with dbt=1.9.4
04:50:53 Registered adapter: duckdb=1.9.2
04:50:53 Found 3 models, 3 seeds, 9 data tests, 426 macros
04:50:53
04:50:53 Concurrency: 1 threads (target='dev')
04:50:53
04:50:53 1 of 9 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [RUN]
04:50:54 1 of 9 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [PASS in 0.03s]
04:50:54 2 of 9 START test not_null_customers_customer_id ............................... [RUN]
04:50:54 2 of 9 PASS not_null_customers_customer_id ..................................... [PASS in 0.01s]
04:50:54 3 of 9 START test not_null_stg_customers_customer_id ........................... [RUN]
04:50:54 3 of 9 PASS not_null_stg_customers_customer_id ................................. [PASS in 0.01s]
04:50:54 4 of 9 START test not_null_stg_orders_customer_id .............................. [RUN]
04:50:54 4 of 9 PASS not_null_stg_orders_customer_id .................................... [PASS in 0.01s]
04:50:54 5 of 9 START test not_null_stg_orders_order_id ................................. [RUN]
04:50:54 5 of 9 PASS not_null_stg_orders_order_id ....................................... [PASS in 0.01s]
04:50:54 6 of 9 START test relationships_stg_orders_customer_id__customer_id__ref_stg_customers_ [RUN]
04:50:54 6 of 9 PASS relationships_stg_orders_customer_id__customer_id__ref_stg_customers_ [PASS in 0.01s]
04:50:54 7 of 9 START test unique_customers_customer_id ................................. [RUN]
04:50:54 7 of 9 PASS unique_customers_customer_id ....................................... [PASS in 0.01s]
04:50:54 8 of 9 START test unique_stg_customers_customer_id ............................. [RUN]
04:50:54 8 of 9 PASS unique_stg_customers_customer_id ................................... [PASS in 0.01s]
04:50:54 9 of 9 START test unique_stg_orders_order_id ................................... [RUN]
04:50:54 9 of 9 PASS unique_stg_orders_order_id ......................................... [PASS in 0.01s]
04:50:54
04:50:54 Finished running 9 data tests in 0 hours 0 minutes and 0.18 seconds (0.18s).
04:50:54
04:50:54 Completed successfully
04:50:54
04:50:54 Done. PASS=9 WARN=0 ERROR=0 SKIP=0 TOTAL=9
モデルを文章化する
以下のように models/schema.yml を変更します。
@@ -2,22 +2,30 @@ version: 2
models:
- name: customers
+ description: One record per customer
columns:
- name: customer_id
+ description: Primary key
tests:
- unique
- not_null
+ - name: first_order_date
+ description: NULL when a customer has not yet placed an order.
- name: stg_customers
+ description: This model cleans up customer data
columns:
- name: customer_id
+ description: Primary key
tests:
- unique
- not_null
- name: stg_orders
+ description: This model cleans up order data
columns:
- name: order_id
+ description: Primary key
tests:
- unique
- not_null
uv run dbt docs serve
して確認することになっていますが、以下のようにダンマリでした。
uv run dbt docs generate
は、確認できました。
$ wget localhost:8080
--2025-04-06 05:05:05-- http://localhost:8080/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response...
ジョブをスケジュールする
こちらは、dbt Cloud が対象のようなので、省略します。
おわりに
dbt は、ドキュメントの質も高く、量も多くて、素晴らしいのですが、簡単に始めるチュートリアルがわかりにくかったので、DuckDB を使用して、スタンドアロン環境で試す手順をまとめておきます。
Discussion