Trying out the semantic layer with dbt Core
I don't know how to define a semantic layer in dbt, so I'll work through it following https://docs.getdbt.com/docs/build/sl-getting-started
Prerequisites
❯ dbt --version
Core:
- installed: 1.7.8
- latest: 1.7.8 - Up to date!
Plugins:
- duckdb: 1.7.2 - Up to date!
Set up MetricFlow
❯ pip install metricflow
❯ pip install dbt-metricflow
❯ pip install "dbt-metricflow[duckdb]"
❯ mf --version
mf, version 0.204.0
Setting up the dbt project
Set DBT_PROFILES_DIR in the environment first (it should point at the directory containing profiles.yml, not at the file itself):
export DBT_PROFILES_DIR=.dbt
Here is dbt_project.yml:
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "jaffle_shop"
version: "1.0.0"
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: "duckdb-jaffle"
require-dbt-version: ">=1.6.0rc2"
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["jaffle-data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"
vars:
  truncate_timespan_to: "{{ current_timestamp() }}"
  "dbt_date:time_zone": "America/Los_Angeles"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  jaffle_shop:
    staging:
      +materialized: view
    marts:
      +materialized: table
And .dbt/profiles.yml:
duckdb-jaffle:
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb
      threads: 1
    prod:
      type: duckdb
      path: prod.duckdb
      threads: 4
  target: dev
Connectivity check
❯ dbt debug
16:52:59 Running with dbt=1.7.8
16:52:59 dbt version: 1.7.8
16:52:59 python version: 3.10.13
16:52:59 python path: /Users/ahayashi/dev/jaffle-sl-template/.env/bin/python
16:52:59 os info: macOS-13.4.1-x86_64-i386-64bit
16:53:00 Using profiles dir at .dbt
16:53:00 Using profiles.yml file at .dbt/profiles.yml
16:53:00 Using dbt_project.yml file at /Users/ahayashi/dev/jaffle-sl-template/dbt_project.yml
16:53:00 adapter type: duckdb
16:53:00 adapter version: 1.7.2
16:53:00 Configuration:
16:53:00 profiles.yml file [OK found and valid]
16:53:00 dbt_project.yml file [OK found and valid]
16:53:00 Required dependencies:
16:53:00 - git [OK found]
16:53:00 Connection:
16:53:00 database: dev
16:53:00 schema: main
16:53:00 path: dev.duckdb
16:53:00 config_options: None
16:53:00 extensions: None
16:53:00 settings: None
16:53:00 external_root: .
16:53:00 use_credential_provider: None
16:53:00 attach: None
16:53:00 filesystems: None
16:53:00 remote: None
16:53:00 plugins: None
16:53:00 disable_transactions: False
16:53:00 Registered adapter: duckdb=1.7.2
16:53:00 Connection test: [OK connection ok]
16:53:00 All checks passed!
Let's run through the mf commands one by one.
mf tutorial
❯ mf tutorial
To begin building and querying metrics, you must define semantic models and
metric configuration files in your dbt project. dbt will use these files to generate a
semantic manifest artifact, which MetricFlow will use to create a semantic graph for querying.
As part of this tutorial, we will generate the following files to help you get started:
📜 model files -> /Users/Rindrics/dev/jaffle-sl-template/models/sample_model
🌱 seed files -> /Users/Rindrics/dev/jaffle-sl-template/jaffle-data/sample_seed
✅ semantic manifest json file -> /Users/Rindrics/dev/jaffle-sl-template/target/semantic_manifest.json
Continue and generate the files? [y/N]: y
✔ 📜 Sample files has been generated.
🤓 Please run the following steps,
1. Verify that your adapter credentials are correct in `profiles.yml`
2. Add time spine model to the models directory (https://docs.getdbt.com/docs/build/metricflow-time-spine)
3. Run `dbt seed`, check to see that the steps related to countries, transactions, customers are passing.
4. Try validating your data model: `mf validate-configs`
5. Check out your metrics: `mf list metrics`
6. Check out dimensions for your metric `mf list dimensions --metrics transactions`
7. Query your first metric: `mf query --metrics transactions --group-bys metric_time --order metric_time`
8. Show the SQL MetricFlow generates:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain`
9. Visualize the plan:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain --display-plans`
* This only works if you have graphviz installed - see README.
10. Add another dimension:
`mf query --metrics transactions --group-bys metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-bys metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-bys metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
💡 Run `mf tutorial --msg` to see this message again without executing everything else
I looked at the generated target/semantic_manifest.json; it doesn't look like something you would write by hand.
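What you do author by hand are semantic model and metric YAML files under models/, from which the manifest is compiled. A minimal sketch of their shape (illustrative names, not this project's actual definitions):
semantic_models:
  - name: transactions
    model: ref('transactions')  # the dbt model this semantic model reads from
    defaults:
      agg_time_dimension: ds
    entities:
      - name: transaction
        type: primary
        expr: transaction_id
    dimensions:
      - name: ds
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: transactions
        agg: sum
        expr: "1"
metrics:
  - name: transactions
    label: Transactions
    type: simple
    type_params:
      measure: transactions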
As for step 2 ("Add time spine model to the models directory"), I confirmed that code with the following content already exists at models/marts/metricflow_time_spine.sql:
-- metricflow_time_spine.sql
with
days as (
--for BQ adapters use "DATE('01/01/2000','mm/dd/yyyy')"
{{ dbt_date.get_base_dates(n_dateparts=365*10, datepart="day") }}
),
cast_to_date as (
select
cast(date_day as date) as date_day
from days
)
select * from cast_to_date
However, step 3 ("Run dbt seed, check to see that the steps related to countries, transactions, customers are passing") failed out of the box with a name collision on the customers model:
❯ dbt seed
18:31:02 Running with dbt=1.7.8
18:31:03 Registered adapter: duckdb=1.7.2
18:31:03 Unable to do partial parsing because profile has changed
18:31:03 Unable to do partial parsing because a project config has changed
18:31:04 Encountered an error:
Compilation Error
dbt found two models with the name "customers".
Since these resources have the same name, dbt will be unable to find the correct resource
when looking for ref("customers").
To fix this, change the name of one of these resources:
- model.jaffle_shop.customers (models/sample_model/customers.sql)
- model.jaffle_shop.customers (models/marts/customer360/customers.sql)
I weighed which one to delete and settled on the marts/ side, for these reasons:
- git diff showed that models/sample_model had been newly generated, i.e. created by the mf tutorial run
- being an artifact of the mf command rather than the dbt command, it presumably does not assume the configuration of the current dbt project (https://github.com/dbt-labs/jaffle-sl-template)
After deleting models/marts/ and running dbt seed again, I got an error saying the time spine model is missing:
❯ dbt seed
18:52:24 Running with dbt=1.7.8
18:52:24 Registered adapter: duckdb=1.7.2
18:52:24 Encountered an error:
Parsing Error
The semantic layer requires a 'metricflow_time_spine' model in the project, but none was found. Guidance on creating this model can be found on our docs site (https://docs.getdbt.com/docs/build/metricflow-time-spine)
So I saved the same content as models/marts/metricflow_time_spine.sql to models/sample_model/metricflow_time_spine.sql.
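Since models/marts/ was deleted from the working tree but the deletion not yet committed, one way to do this is restoring the file from git (a sketch under that assumption):
❯ git show HEAD:models/marts/metricflow_time_spine.sql > models/sample_model/metricflow_time_spine.sql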
Running dbt seed again, it succeeded this time:
❯ dbt seed
18:38:58 Running with dbt=1.7.8
18:38:58 Registered adapter: duckdb=1.7.2
18:38:58 Unable to do partial parsing because profile has changed
18:38:58 Unable to do partial parsing because a project config has changed
18:39:01 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.jaffle_shop.marts
18:39:01 Found 10 models, 25 seeds, 12 tests, 18 sources, 0 exposures, 11 metrics, 615 macros, 0 groups, 5 semantic models
18:39:01
18:39:01 Concurrency: 1 threads (target='dev')
18:39:01
18:39:01 1 of 25 START seed file main.countries_seed .................................... [RUN]
18:39:01 1 of 25 OK loaded seed file main.countries_seed ................................ [INSERT 6 in 0.13s]
18:39:01 2 of 25 START seed file main.customers_seed .................................... [RUN]
18:39:01 2 of 25 OK loaded seed file main.customers_seed ................................ [INSERT 4 in 0.06s]
18:39:01 3 of 25 START seed file main.home_page_table ................................... [RUN]
18:39:02 3 of 25 OK loaded seed file main.home_page_table ............................... [INSERT 90400 in 1.20s]
18:39:02 4 of 25 START seed file main.olist_closed_deals_dataset ........................ [RUN]
18:39:02 4 of 25 OK loaded seed file main.olist_closed_deals_dataset .................... [INSERT 842 in 0.13s]
18:39:02 5 of 25 START seed file main.olist_customers_dataset ........................... [RUN]
18:39:04 5 of 25 OK loaded seed file main.olist_customers_dataset ....................... [INSERT 99441 in 1.76s]
18:39:04 6 of 25 START seed file main.olist_geolocation_dataset ......................... [RUN]
18:39:33 6 of 25 OK loaded seed file main.olist_geolocation_dataset ..................... [INSERT 1000163 in 29.19s]
18:39:33 7 of 25 START seed file main.olist_marketing_qualified_leads_dataset ........... [RUN]
18:39:34 7 of 25 OK loaded seed file main.olist_marketing_qualified_leads_dataset ....... [INSERT 8000 in 0.42s]
18:39:34 8 of 25 START seed file main.olist_order_items_dataset ......................... [RUN]
18:39:37 8 of 25 OK loaded seed file main.olist_order_items_dataset ..................... [INSERT 112650 in 3.32s]
18:39:37 9 of 25 START seed file main.olist_order_payments_dataset ...................... [RUN]
18:39:40 9 of 25 OK loaded seed file main.olist_order_payments_dataset .................. [INSERT 103886 in 3.08s]
18:39:40 10 of 25 START seed file main.olist_order_reviews_dataset ...................... [RUN]
18:39:47 10 of 25 OK loaded seed file main.olist_order_reviews_dataset .................. [INSERT 100000 in 6.70s]
18:39:47 11 of 25 START seed file main.olist_orders_dataset ............................. [RUN]
18:39:58 11 of 25 OK loaded seed file main.olist_orders_dataset ......................... [INSERT 99441 in 11.25s]
18:39:58 12 of 25 START seed file main.olist_products_dataset ........................... [RUN]
18:40:00 12 of 25 OK loaded seed file main.olist_products_dataset ....................... [INSERT 32951 in 2.02s]
18:40:00 13 of 25 START seed file main.olist_sellers_dataset ............................ [RUN]
18:40:00 13 of 25 OK loaded seed file main.olist_sellers_dataset ........................ [INSERT 3095 in 0.11s]
18:40:00 14 of 25 START seed file main.payment_confirmation_table ....................... [RUN]
18:40:00 14 of 25 OK loaded seed file main.payment_confirmation_table ................... [INSERT 452 in 0.06s]
18:40:00 15 of 25 START seed file main.payment_page_table ............................... [RUN]
18:40:00 15 of 25 OK loaded seed file main.payment_page_table ........................... [INSERT 6030 in 0.12s]
18:40:00 16 of 25 START seed file main.product_category_name_translation ................ [RUN]
18:40:00 16 of 25 OK loaded seed file main.product_category_name_translation ............ [INSERT 71 in 0.05s]
18:40:00 17 of 25 START seed file main.raw_customers .................................... [RUN]
18:40:01 17 of 25 OK loaded seed file main.raw_customers ................................ [INSERT 939 in 0.06s]
18:40:01 18 of 25 START seed file main.raw_items ........................................ [RUN]
18:40:01 18 of 25 OK loaded seed file main.raw_items .................................... [INSERT 95368 in 0.74s]
18:40:01 19 of 25 START seed file main.raw_orders ....................................... [RUN]
18:40:05 19 of 25 OK loaded seed file main.raw_orders ................................... [INSERT 59652 in 3.73s]
18:40:05 20 of 25 START seed file main.raw_products ..................................... [RUN]
18:40:05 20 of 25 OK loaded seed file main.raw_products ................................. [INSERT 10 in 0.05s]
18:40:05 21 of 25 START seed file main.raw_stores ....................................... [RUN]
18:40:05 21 of 25 OK loaded seed file main.raw_stores ................................... [INSERT 5 in 0.06s]
18:40:05 22 of 25 START seed file main.raw_supplies ..................................... [RUN]
18:40:07 22 of 25 OK loaded seed file main.raw_supplies ................................. [INSERT 65 in 0.05s]
18:40:07 23 of 25 START seed file main.search_page_table ................................ [RUN]
18:40:07 23 of 25 OK loaded seed file main.search_page_table ............................ [INSERT 45200 in 0.57s]
18:40:07 24 of 25 START seed file main.transactions_seed ................................ [RUN]
18:40:07 24 of 25 OK loaded seed file main.transactions_seed ............................ [INSERT 50 in 0.06s]
18:40:07 25 of 25 START seed file main.user_table ....................................... [RUN]
18:40:10 25 of 25 OK loaded seed file main.user_table ................................... [INSERT 90400 in 2.94s]
18:40:10
18:40:10 Finished running 25 seeds in 0 hours 1 minutes and 9.73 seconds (69.73s).
18:40:10
18:40:10 Completed successfully
18:40:10
18:40:10 Done. PASS=25 WARN=0 ERROR=0 SKIP=0 TOTAL=25
mf validate-configs
A flood of errors came out. An excerpt:
• ERROR: with semantic model `countries` - Failed to query dimensions in data warehouse for semantic model `countries`
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name countries does not exist!
Did you mean "countries_seed"?
LINE 6: FROM "dev"."main"."countries" countries_src_6...
^
The error is reported by the data warehouse. I'm in the middle of mf tutorial now, but before that I had run the dbt seed described in the README.md of https://github.com/dbt-labs/jaffle-sl-template, so the data loaded back then may be out of sync with the current models.
If that hypothesis is right, it should suffice to reduce the warehouse to only the tables derived from jaffle-data/sample_seed/, which mf tutorial generates. So let's drop the tables that appear to come from the other seed data.
The current state of the warehouse:
❯ duckdb dev.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────────────────────┐
│ name │
│ varchar │
├─────────────────────────────────────────┤
│ countries_seed │
│ customers │
│ customers_seed │
│ home_page_table │
│ olist_closed_deals │
│ olist_closed_deals_dataset │
│ olist_customers_dataset │
│ olist_geolocation_dataset │
│ olist_marketing_qualified_leads │
│ olist_marketing_qualified_leads_dataset │
│ olist_order_items_dataset │
│ olist_order_payments_dataset │
│ olist_order_reviews_dataset │
│ olist_orders │
│ olist_orders_dataset │
│ olist_products_dataset │
│ olist_seller │
│ olist_sellers_dataset │
│ order_items │
│ orders │
│ payment_confirmation_table │
│ payment_page_table │
│ product_category_name_translation │
│ raw_customers │
│ raw_items │
│ raw_orders │
│ raw_products │
│ raw_stores │
│ raw_supplies │
│ search_page_table │
│ stg_customers │
│ stg_home_page │
│ stg_locations │
│ stg_order_items │
│ stg_orders │
│ stg_payment_confirmation │
│ stg_payment_page │
│ stg_products │
│ stg_search_page │
│ stg_supplies │
│ stg_users │
│ transactions_seed │
│ user_table │
├─────────────────────────────────────────┤
│ 43 rows │
└─────────────────────────────────────────┘
D
❯ tree jaffle-data
jaffle-data
├── ecommerce
│ ├── home_page_table.csv
│ ├── olist_closed_deals_dataset.csv
│ ├── olist_customers_dataset.csv
│ ├── olist_geolocation_dataset.csv
│ ├── olist_marketing_qualified_leads_dataset.csv
│ ├── olist_order_items_dataset.csv
│ ├── olist_order_payments_dataset.csv
│ ├── olist_order_reviews_dataset.csv
│ ├── olist_orders_dataset.csv
│ ├── olist_products_dataset.csv
│ ├── olist_sellers_dataset.csv
│ ├── payment_confirmation_table.csv
│ ├── payment_page_table.csv
│ ├── product_category_name_translation.csv
│ ├── search_page_table.csv
│ └── user_table.csv
├── raw_customers.csv
├── raw_items.csv
├── raw_orders.csv
├── raw_products.csv
├── raw_stores.csv
├── raw_supplies.csv
└── sample_seed
├── countries_seed.csv
├── customers_seed.csv
└── transactions_seed.csv
2 directories, 25 files
So the goal is a state where only the following three tables exist:
- countries_seed
- customers_seed
- transactions_seed
Run the following SQL:
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS home_page_table;
DROP TABLE IF EXISTS olist_closed_deals;
DROP TABLE IF EXISTS olist_closed_deals_dataset;
DROP TABLE IF EXISTS olist_customers_dataset;
DROP TABLE IF EXISTS olist_geolocation_dataset;
DROP TABLE IF EXISTS olist_marketing_qualified_leads;
DROP TABLE IF EXISTS olist_marketing_qualified_leads_dataset;
DROP TABLE IF EXISTS olist_order_items_dataset;
DROP TABLE IF EXISTS olist_order_payments_dataset;
DROP TABLE IF EXISTS olist_order_reviews_dataset;
DROP TABLE IF EXISTS olist_orders;
DROP TABLE IF EXISTS olist_orders_dataset;
DROP TABLE IF EXISTS olist_products_dataset;
DROP TABLE IF EXISTS olist_seller;
DROP TABLE IF EXISTS olist_sellers_dataset;
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS payment_confirmation_table;
DROP TABLE IF EXISTS payment_page_table;
DROP TABLE IF EXISTS product_category_name_translation;
DROP TABLE IF EXISTS raw_customers;
DROP TABLE IF EXISTS raw_items;
DROP TABLE IF EXISTS raw_orders;
DROP TABLE IF EXISTS raw_products;
DROP TABLE IF EXISTS raw_stores;
DROP TABLE IF EXISTS raw_supplies;
DROP TABLE IF EXISTS search_page_table;
DROP TABLE IF EXISTS stg_customers;
DROP TABLE IF EXISTS stg_home_page;
DROP TABLE IF EXISTS stg_locations;
DROP TABLE IF EXISTS stg_order_items;
DROP TABLE IF EXISTS stg_orders;
DROP TABLE IF EXISTS stg_payment_confirmation;
DROP TABLE IF EXISTS stg_payment_page;
DROP TABLE IF EXISTS stg_products;
DROP TABLE IF EXISTS stg_search_page;
DROP TABLE IF EXISTS stg_supplies;
DROP TABLE IF EXISTS stg_users;
DROP TABLE IF EXISTS user_table;
Some objects are still left; apparently some of them were views.
D show tables;
┌───────────────────┐
│ name │
│ varchar │
├───────────────────┤
│ countries_seed │
│ customers_seed │
│ stg_customers │
│ stg_locations │
│ stg_order_items │
│ stg_orders │
│ stg_products │
│ stg_supplies │
│ transactions_seed │
└───────────────────┘
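Which of the leftovers are views rather than tables can be checked via information_schema (a quick check; table_type reads 'BASE TABLE' for tables and 'VIEW' for views):
D select table_name, table_type from information_schema.tables order by table_name;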
Run the following SQL:
DROP VIEW IF EXISTS stg_customers;
DROP VIEW IF EXISTS stg_locations;
DROP VIEW IF EXISTS stg_order_items;
DROP VIEW IF EXISTS stg_orders;
DROP VIEW IF EXISTS stg_products;
DROP VIEW IF EXISTS stg_supplies;
Now it's in the intended state:
D show tables;
┌───────────────────┐
│ name │
│ varchar │
├───────────────────┤
│ countries_seed │
│ customers_seed │
│ transactions_seed │
└───────────────────┘
D
Let's run mf validate-configs again:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating semantic models against data warehouse (ERRORS: 5, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating dimensions against data warehouse (ERRORS: 49, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 13, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating measures against data warehouse (ERRORS: 12, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating metrics against data warehouse (ERRORS: 11, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with semantic model `stg_products` - Unable to access semantic model `stg_products` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name stg_products does not exist!
Did you mean "temp.pg_catalog.pg_proc"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."stg_products"
^
• ERROR: with semantic model `locations` - Unable to access semantic model `locations` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name stg_locations does not exist!
Did you mean "pg_catalog.pg_class"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."stg_locations"
Errors remain, but the stg_ prefix suggests these are the models under models/staging/ that originally shipped with https://github.com/dbt-labs/jaffle-sl-template. Let's remove those models and rerun:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating semantic models against data warehouse (ERRORS: 5, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating dimensions against data warehouse (ERRORS: 49, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 13, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating measures against data warehouse (ERRORS: 12, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating metrics against data warehouse (ERRORS: 11, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with semantic model `stg_products` - Unable to access semantic model `stg_products` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name stg_products does not exist!
Did you mean "temp.pg_catalog.pg_proc"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."stg_products"
Still failing with `Unable to access semantic model stg_products in data warehouse`. Something is still trying to reference the semantic model stg_products, and I want that to stop happening.
❯ ls target
compiled graph_summary.json partial_parse.msgpack run_results.json
graph.gpickle manifest.json run semantic_manifest.json
target/semantic_manifest.json looks suspicious. My guess is that running mf tutorial generates semantic_manifest.json from the model definitions as they exist at that moment. Let's delete semantic_manifest.json and regenerate it.
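Concretely:
❯ rm target/semantic_manifest.json
Then rerun mf tutorial: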
❯ mf tutorial
To begin building and querying metrics, you must define semantic models and
metric configuration files in your dbt project. dbt will use these files to generate a
semantic manifest artifact, which MetricFlow will use to create a semantic graph for querying.
As part of this tutorial, we will generate the following files to help you get started:
📜 model files -> /Users/Rindrics/dev/jaffle-sl-template/models/sample_model
🌱 seed files -> /Users/Rindrics/dev/jaffle-sl-template/jaffle-data/sample_seed
✅ semantic manifest json file -> /Users/Rindrics/dev/jaffle-sl-template/target/semantic_manifest.json
Continue and generate the files? [y/N]: y
There are existing files in the paths above, would you like to overwrite them? [y/N]: y
✔ 📜 Sample files has been generated.
🤓 Please run the following steps,
1. Verify that your adapter credentials are correct in `profiles.yml`
2. Add time spine model to the models directory (https://docs.getdbt.com/docs/build/metricflow-time-spine)
3. Run `dbt seed`, check to see that the steps related to countries, transactions, customers are passing.
4. Try validating your data model: `mf validate-configs`
5. Check out your metrics: `mf list metrics`
6. Check out dimensions for your metric `mf list dimensions --metrics transactions`
7. Query your first metric: `mf query --metrics transactions --group-bys metric_time --order metric_time`
8. Show the SQL MetricFlow generates:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain`
9. Visualize the plan:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain --display-plans`
* This only works if you have graphviz installed - see README.
10. Add another dimension:
`mf query --metrics transactions --group-bys metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-bys metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-bys metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
💡 Run `mf tutorial --msg` to see this message again without executing everything else
The overwrite wiped out the time spine model, so I recreated it and reran mf validate-configs:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating semantic models against data warehouse (ERRORS: 3, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating dimensions against data warehouse (ERRORS: 29, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 9, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating measures against data warehouse (ERRORS: 10, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating metrics against data warehouse (ERRORS: 11, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with semantic model `customers` - Unable to access semantic model `customers` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name customers does not exist!
Did you mean "raw_customers"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."customers"
Still an error: the warehouse has no table customers. Since customers exists under models/sample_model/, running dbt run should create it:
❯ dbt run
19:58:43 Running with dbt=1.7.8
19:58:44 Registered adapter: duckdb=1.7.2
19:58:44 Unable to do partial parsing because a project config has changed
19:58:46 Found 4 models, 25 seeds, 3 sources, 0 exposures, 11 metrics, 615 macros, 0 groups, 3 semantic models
19:58:46
19:58:46 Concurrency: 1 threads (target='dev')
19:58:46
19:58:46 1 of 4 START sql view model main.countries ..................................... [RUN]
19:58:47 1 of 4 OK created sql view model main.countries ................................ [OK in 0.13s]
19:58:47 2 of 4 START sql view model main.customers ..................................... [RUN]
19:58:47 2 of 4 OK created sql view model main.customers ................................ [OK in 0.07s]
19:58:47 3 of 4 START sql view model main.metricflow_time_spine ......................... [RUN]
19:58:47 3 of 4 ERROR creating sql view model main.metricflow_time_spine ................ [ERROR in 0.04s]
19:58:47 4 of 4 START sql view model main.transactions .................................. [RUN]
19:58:47 4 of 4 OK created sql view model main.transactions ............................. [OK in 0.06s]
19:58:47
19:58:47 Finished running 4 view models in 0 hours 0 minutes and 0.48 seconds (0.48s).
19:58:47
19:58:47 Completed with 1 error and 0 warnings:
19:58:47
19:58:47 Runtime Error in model metricflow_time_spine (models/sample_model/metricflow_time_spine.sql)
Binder Error: No function matches the given name and argument types '+(DATE, TIMESTAMP)'. You might need to add explicit type casts.
Candidate functions:
+(TINYINT) -> TINYINT
+(TINYINT, TINYINT) -> TINYINT
+(SMALLINT) -> SMALLINT
+(SMALLINT, SMALLINT) -> SMALLINT
+(INTEGER) -> INTEGER
+(INTEGER, INTEGER) -> INTEGER
+(BIGINT) -> BIGINT
+(BIGINT, BIGINT) -> BIGINT
+(HUGEINT) -> HUGEINT
+(HUGEINT, HUGEINT) -> HUGEINT
+(FLOAT) -> FLOAT
+(FLOAT, FLOAT) -> FLOAT
+(DOUBLE) -> DOUBLE
+(DOUBLE, DOUBLE) -> DOUBLE
+(DECIMAL) -> DECIMAL
+(DECIMAL, DECIMAL) -> DECIMAL
+(UTINYINT) -> UTINYINT
+(UTINYINT, UTINYINT) -> UTINYINT
+(USMALLINT) -> USMALLINT
+(USMALLINT, USMALLINT) -> USMALLINT
+(UINTEGER) -> UINTEGER
+(UINTEGER, UINTEGER) -> UINTEGER
+(UBIGINT) -> UBIGINT
+(UBIGINT, UBIGINT) -> UBIGINT
+(UHUGEINT) -> UHUGEINT
+(UHUGEINT, UHUGEINT) -> UHUGEINT
+(DATE, INTEGER) -> DATE
+(INTEGER, DATE) -> DATE
+(INTERVAL, INTERVAL) -> INTERVAL
+(DATE, INTERVAL) -> TIMESTAMP
+(INTERVAL, DATE) -> TIMESTAMP
+(TIME, INTERVAL) -> TIME
+(INTERVAL, TIME) -> TIME
+(TIMESTAMP, INTERVAL) -> TIMESTAMP
+(INTERVAL, TIMESTAMP) -> TIMESTAMP
+(TIME WITH TIME ZONE, INTERVAL) -> TIME WITH TIME ZONE
+(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE
+(TIME, DATE) -> TIMESTAMP
+(DATE, TIME) -> TIMESTAMP
+(TIME WITH TIME ZONE, DATE) -> TIMESTAMP WITH TIME ZONE
+(DATE, TIME WITH TIME ZONE) -> TIMESTAMP WITH TIME ZONE
+(ANY[], ANY[]) -> ANY[]
+(TIMESTAMP WITH TIME ZONE, INTERVAL) -> TIMESTAMP WITH TIME ZONE
+(INTERVAL, TIMESTAMP WITH TIME ZONE) -> TIMESTAMP WITH TIME ZONE
LINE 9: ) as date) + ((interval '1 day') * (-3650))
^
19:58:47
19:58:47 Done. PASS=3 WARN=0 ERROR=1 SKIP=0 TOTAL=4
The time spine model fails with a type mismatch:
Error: No function matches the given name and argument types '+(DATE, TIMESTAMP)'. You might need to add explicit type casts.
For what it's worth, mf validate-configs now passes:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated semantic models against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated dimensions against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated entities against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated measures against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated metrics against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
Next, fix the type mismatch in the time spine model:
Error: No function matches the given name and argument types '+(DATE, TIMESTAMP)'. You might need to add explicit type casts.
This looks like the same issue as https://github.com/duckdb/dbt-duckdb/issues/346. It appears to be fixed by https://github.com/duckdb/dbt-duckdb/pull/347, but that fix has not been released yet, so I reinstalled dbt-duckdb pinned explicitly to the commit from that PR:
❯ pip install --pre --no-cache-dir git+https://github.com/duckdb/dbt-duckdb.git@b960956
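As an aside, if pinning an unreleased commit were not acceptable, the time spine could instead be rewritten without dbt_date so the DATE + TIMESTAMP addition never happens. A sketch assuming DuckDB's generate_series table function (hypothetical; the pinned install above is what actually fixed it here):
-- models/sample_model/metricflow_time_spine.sql (hypothetical DuckDB-only variant)
select
  cast(generate_series as date) as date_day
from generate_series(
  timestamp '2014-01-01',
  timestamp '2023-12-31',
  interval 1 day
)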
With that, dbt run passes:
❯ dbt run
22:25:03 Running with dbt=1.7.8
22:25:03 Registered adapter: duckdb=1.7.2
22:25:04 Found 4 models, 25 seeds, 3 sources, 0 exposures, 11 metrics, 615 macros, 0 groups, 3 semantic models
22:25:04
22:25:04 Concurrency: 1 threads (target='dev')
22:25:04
22:25:04 1 of 4 START sql view model main.countries ..................................... [RUN]
22:25:04 1 of 4 OK created sql view model main.countries ................................ [OK in 0.13s]
22:25:04 2 of 4 START sql view model main.customers ..................................... [RUN]
22:25:04 2 of 4 OK created sql view model main.customers ................................ [OK in 0.06s]
22:25:04 3 of 4 START sql view model main.metricflow_time_spine ......................... [RUN]
22:25:04 3 of 4 OK created sql view model main.metricflow_time_spine .................... [OK in 0.15s]
22:25:04 4 of 4 START sql view model main.transactions .................................. [RUN]
22:25:04 4 of 4 OK created sql view model main.transactions ............................. [OK in 0.07s]
22:25:04
22:25:04 Finished running 4 view models in 0 hours 0 minutes and 0.64 seconds (0.64s).
22:25:04
22:25:04 Completed successfully
22:25:04
22:25:04 Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4
mf list metrics
❯ mf list metrics
✔ 🌱 We've found 11 metrics.
The list below shows metrics in the format of "metric_name: list of available dimensions"
• new_customers: country__region, customer__customer_country, customer__ds, metric_time
• quick_buy_transactions: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• quick_buy_amount_usd: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• cancellations: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• transactions: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• alterations: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• transaction_amount: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• cancellation_rate: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• revenue_usd: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• cancellations_mx: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• transaction_usd_na: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
Cross-checking the database contents shows that the metrics have no physical counterpart in the DB:
❯ duckdb dev.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────────────────────┐
│ name │
│ varchar │
├─────────────────────────────────────────┤
│ countries │
│ countries_seed │
│ customers │
│ customers_seed │
│ home_page_table │
│ metricflow_time_spine │
│ olist_closed_deals_dataset │
│ olist_customers_dataset │
│ olist_geolocation_dataset │
│ olist_marketing_qualified_leads_dataset │
│ olist_order_items_dataset │
│ olist_order_payments_dataset │
│ olist_order_reviews_dataset │
│ olist_orders_dataset │
│ olist_products_dataset │
│ olist_sellers_dataset │
│ payment_confirmation_table │
│ payment_page_table │
│ product_category_name_translation │
│ raw_customers │
│ raw_items │
│ raw_orders │
│ raw_products │
│ raw_stores │
│ raw_supplies │
│ search_page_table │
│ transactions │
│ transactions_seed │
│ user_table │
├─────────────────────────────────────────┤
│ 29 rows │
└─────────────────────────────────────────┘
D
Presumably, SQL is generated and executed each time a metric is referenced.
mf list dimensions --metrics METRIC_NAME
❯ mf list dimensions --metrics new_customers
✔ 🌱 We've found 4 common dimensions for metrics ['new_customers'].
• country__region
• customer__customer_country
• customer__ds
• metric_time
This result was already visible in the mf list metrics output:
❯ mf list metrics
✔ 🌱 We've found 11 metrics.
The list below shows metrics in the format of "metric_name: list of available dimensions"
It should be more useful for metrics whose dimension list was truncated there, like this one:
• transaction_usd_na: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
❯ mf list dimensions --metrics transaction_usd_na
✔ 🌱 We've found 7 common dimensions for metrics ['transaction_usd_na'].
• customer__country__region
• customer__customer_country
• customer__ds
• metric_time
• transaction__ds
• transaction__is_large
• transaction__quick_buy_transaction
mf query --metrics transactions --group-by metric_time --order metric_time
(Aside)
Initially the tutorial said to run mf query --metrics transactions --group-bys metric_time --order metric_time, but that failed as written. The --group-bys option had been renamed to --group-by, and I was apparently on a version whose tutorial message didn't yet reflect the change.
❯ mf --version
mf, version 0.204.0
The version with the corrected message had not been released yet, so I updated with:
❯ pip install --pre --no-cache-dir git+https://github.com/dbt-labs/metricflow.git@v0.204.1.dev5
❯ mf --version
mf, version 0.204.1.dev5
The --group-bys in the message is now --group-by:
❯ mf tutorial --msg
🤓 Please run the following steps,
1. Verify that your adapter credentials are correct in `profiles.yml`
2. Add time spine model to the models directory (https://docs.getdbt.com/docs/build/metricflow-time-spine)
3. Run `dbt seed`, check to see that the steps related to countries, transactions, customers are passing.
4. Try validating your data model: `mf validate-configs`
5. Check out your metrics: `mf list metrics`
6. Check out dimensions for your metric `mf list dimensions --metrics transactions`
7. Query your first metric: `mf query --metrics transactions --group-by metric_time --order metric_time`
8. Show the SQL MetricFlow generates:
`mf query --metrics transactions --group-by metric_time --order metric_time --explain`
9. Visualize the plan:
`mf query --metrics transactions --group-by metric_time --order metric_time --explain --display-plans`
* This only works if you have graphviz installed - see README.
10. Add another dimension:
`mf query --metrics transactions --group-by metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-by metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-by metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
Let's run mf query --metrics transactions --group-by metric_time --order metric_time --explain:
❯ mf query --metrics transactions --group-by metric_time --order metric_time --explain
✔ Success 🦄 - query completed after 0.28 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
metric_time__day
, SUM(transactions) AS transactions
FROM (
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, 1 AS transactions
FROM "dev"."main"."transactions" transactions_src_10000
) subq_2
GROUP BY
metric_time__day
ORDER BY metric_time__day
...Oh, this is impressive. So this SQL runs every time transactions is referenced.
❯ mf query --metrics transactions --group-by metric_time --order metric_time
✔ Success 🦄 - query completed after 0.31 seconds
| metric_time__day | transactions |
|:-------------------|---------------:|
| 2022-03-07 | 2 |
| 2022-03-08 | 2 |
| 2022-03-09 | 1 |
| 2022-03-10 | 2 |
| 2022-03-11 | 1 |
| 2022-03-12 | 1 |
| 2022-03-13 | 1 |
| 2022-03-14 | 1 |
| 2022-03-15 | 3 |
| 2022-03-16 | 2 |
| 2022-03-17 | 1 |
| 2022-03-18 | 1 |
| 2022-03-21 | 3 |
| 2022-03-22 | 3 |
| 2022-03-23 | 2 |
| 2022-03-25 | 3 |
| 2022-03-26 | 2 |
| 2022-03-27 | 2 |
| 2022-03-28 | 1 |
| 2022-03-29 | 2 |
| 2022-03-30 | 4 |
| 2022-03-31 | 5 |
| 2022-04-01 | 1 |
| 2022-04-02 | 2 |
| 2022-04-03 | 1 |
| 2022-04-04 | 1 |
...I see.
❯ mf query --metrics transactions --group-by metric_time --order metric_time --explain --show-dataflow-plan
✔ Success 🦄 - query completed after 0.29 seconds
🔎 Generated Dataflow Plan + SQL (remove --explain to see data):
-- Metric Dataflow Plan:
-- <DataflowPlan>
-- <WriteToResultDataframeNode>
-- <!-- description = 'Write to Dataframe' -->
-- <!-- node_id = NodeId(id_str='wrd_1') -->
-- <OrderByLimitNode>
-- <!-- description = "Order By ['metric_time__day']" -->
-- <!-- node_id = NodeId(id_str='obl_1') -->
-- <!-- order_by_spec = -->
-- <!-- OrderBySpec( -->
-- <!-- instance_spec=TimeDimensionSpec(element_name='metric_time', time_granularity=DAY), -->
-- <!-- descending=False, -->
-- <!-- ) -->
-- <!-- limit = 'None' -->
-- <ComputeMetricsNode>
-- <!-- description = 'Compute Metrics via Expressions' -->
-- <!-- node_id = NodeId(id_str='cm_1') -->
-- <!-- metric_spec = MetricSpec(element_name='transactions') -->
-- <AggregateMeasuresNode>
-- <!-- description = 'Aggregate Measures' -->
-- <!-- node_id = NodeId(id_str='am_1') -->
-- <FilterElementsNode>
-- <!-- description = "Pass Only Elements: ['transactions', 'metric_time__day']" -->
-- <!-- node_id = NodeId(id_str='pfe_1') -->
-- <!-- include_spec = MeasureSpec(element_name='transactions') -->
-- <!-- include_spec = TimeDimensionSpec(element_name='metric_time', time_granularity=DAY) -->
-- <!-- distinct = False -->
-- <MetricTimeDimensionTransformNode>
-- <!-- description = "Metric Time Dimension 'ds'" -->
-- <!-- node_id = NodeId(id_str='sma_0') -->
-- <!-- aggregation_time_dimension = 'ds' -->
-- <ReadSqlSourceNode>
-- <!-- description = "Read From SemanticModelDataSet('transactions')" -->
-- <!-- node_id = NodeId(id_str='rss_0') -->
-- <!-- data_set = SemanticModelDataSet('transactions') -->
-- </ReadSqlSourceNode>
-- </MetricTimeDimensionTransformNode>
-- </FilterElementsNode>
-- </AggregateMeasuresNode>
-- </ComputeMetricsNode>
-- </OrderByLimitNode>
-- </WriteToResultDataframeNode>
-- </DataflowPlan>
SELECT
metric_time__day
, SUM(transactions) AS transactions
FROM (
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, 1 AS transactions
FROM "dev"."main"."transactions" transactions_src_10000
) subq_2
GROUP BY
metric_time__day
ORDER BY metric_time__day
I had never heard of a "dataflow plan" before.
The mf tutorial message continues with more queries to try:
10. Add another dimension:
`mf query --metrics transactions --group-by metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-by metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-by metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
Looking these over, the point is that you can query the DWH through an API without thinking about SQL at all. That would indeed make queries from a BI frontend much easier.
Since up to here I ended up doing the MetricFlow tutorial rather than https://github.com/dbt-labs/jaffle-sl-template itself, let's clean up:
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
Would you like to remove all the sample files? [y/N]: y
✔ 🗑️ Sample files has been removed.
However, even after following the documentation of https://github.com/dbt-labs/jaffle-sl-template and running
dbt deps
dbt seed
dbt build --exclude path:jaffle-data
I got the following:
❯ mf validate-configs
‼️ Warning: A new version of the MetricFlow CLI is available.
💡 Please update to version 0.205.0, released 2024-03-01 20:41:26 by running:
$ pip install --upgrade metricflow
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✖ Breaking issues found when checking semantics of built manifest (ERRORS: 8, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window_with_filter` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window_with_filter` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For
Let's start by investigating this one:
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
Let's look at the manifest:
{
"name": "mql_to_seller_conversion_rate_base",
"description": "Conversion rate from MQL to seller",
"type": "conversion",
"type_params": {
"measure": null,
"numerator": null,
"denominator": null,
"expr": null,
"window": null,
"grain_to_date": null,
"metrics": [],
"conversion_type_params": {
"base_measure": {
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
"conversion_measure": {
"name": "sellers",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
"entity": "mql",
"calculation": "conversion_rate",
"window": null,
"constant_properties": null
},
"input_measures": [
{
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
{
"name": "sellers",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
}
]
},
"filter": null,
"metadata": null,
"label": "MQL to Seller Conversion Rate"
},
My guess is that the problem lies in this part:
"conversion_type_params": {
"base_measure": {
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
The message says 'For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM', yet the snippet above has no field resembling an agg type.
Since the same message appears for several metrics, the problem may lie in what they reference, so I searched for the mqls node in the tree:
{
"name": "mqls",
"description": "",
"type": "simple",
"type_params": {
"measure": {
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": true,
"fill_nulls_with": 0
},
"numerator": null,
"denominator": null,
"expr": null,
"window": null,
"grain_to_date": null,
"metrics": [],
"conversion_type_params": null,
"input_measures": [
{
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": true,
"fill_nulls_with": 0
}
]
},
"filter": null,
"metadata": null,
"label": "MQls"
},
No agg-type information here either. Next, let's look at the model metadata:
semantic_models:
  - name: olist_mqls
    description: "Table showing all seller contacts for an ecommerce marketplace"
    model: ref('olist_marketing_qualified_leads')
    defaults:
      agg_time_dimension: ds
    entities:
      - name: mql
        type: primary
        expr: mql_id
      - name: landing_page
        type: foreign
        expr: landing_page_id
    dimensions:
      - name: ds
        expr: first_contact_date
        type: time
        type_params:
          time_granularity: day
      - name: origin
        type: categorical
    measures:
      - name: mqls
        agg: count
        expr: mql_id
The measure mqls has agg: count; could this be the culprit?
I don't really understand measures to begin with, so let's consult https://docs.getdbt.com/docs/build/measures:
dbt supports the following aggregations: sum, max, min, avg, median, count_distinct, and sum_boolean.
So the current agg: count doesn't even seem to be a supported aggregation; is that really OK? I also can't tell whether the spot I'm looking at is related to the problem at hand. To find out, I changed agg: count to agg: count_distinct and reran build & validate-configs.
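For reference, the edited measure (only agg changed from the definition above):
measures:
  - name: mqls
    agg: count_distinct
    expr: mql_id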
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✖ Breaking issues found when checking semantics of built manifest (ERRORS: 4, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window_with_filter` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
The error count went down. Unless I'm looking in entirely the wrong place, the validate-configs message and the measures documentation contradict each other about agg: count. Is one of them wrong?
I still don't understand what's actually going on, so let's dig deeper by inspecting the query for one of the flagged metrics, mql_to_seller_conversion_rate_1_week_window:
❯ mf query --metrics mql_to_seller_conversion_rate_1_week_window --group-by metric_time --order metric_time --explain
✔ Success 🦄 - query completed after 0.55 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
metric_time__day
, CAST(sellers AS DOUBLE) / CAST(NULLIF(mqls, 0) AS DOUBLE) AS mql_to_seller_conversion_rate_1_week_window
FROM (
SELECT
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day) AS metric_time__day
, MAX(subq_3.mqls) AS mqls
, COALESCE(MAX(subq_13.sellers), 0) AS sellers
FROM (
SELECT
metric_time__day
, SUM(mqls) AS mqls
FROM (
SELECT
DATE_TRUNC('day', first_contact_date) AS metric_time__day
, CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END AS mqls
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
) subq_2
GROUP BY
metric_time__day
) subq_3
FULL OUTER JOIN (
SELECT
metric_time__day
, SUM(sellers) AS sellers
FROM (
SELECT DISTINCT
first_value(subq_6.mqls) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mqls
, first_value(subq_6.ds__day) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ds__day
, first_value(subq_6.metric_time__day) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS metric_time__day
, first_value(subq_6.mql) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mql
, subq_9.mf_internal_uuid AS mf_internal_uuid
, subq_9.sellers AS sellers
FROM (
SELECT
DATE_TRUNC('day', first_contact_date) AS ds__day
, DATE_TRUNC('day', first_contact_date) AS metric_time__day
, mql_id AS mql
, CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END AS mqls
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
) subq_6
INNER JOIN (
SELECT
DATE_TRUNC('day', won_date) AS won_date__day
, mql_id AS mql
, CASE WHEN seller_id IS NOT NULL THEN 1 ELSE 0 END AS sellers
, GEN_RANDOM_UUID() AS mf_internal_uuid
FROM "dev"."main"."olist_closed_deals" olist_closed_deals_src_10000
) subq_9
ON
(
subq_6.mql = subq_9.mql
) AND (
(
subq_6.ds__day <= subq_9.won_date__day
) AND (
subq_6.ds__day > subq_9.won_date__day - INTERVAL 1 week
)
)
) subq_10
GROUP BY
metric_time__day
) subq_13
ON
subq_3.metric_time__day = subq_13.metric_time__day
GROUP BY
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day)
) subq_14
ORDER BY metric_time__day
Comparing this with the error message (ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM), there is indeed a spot where mqls is SUMmed:
SELECT
metric_time__day
, SUM(mqls) AS mqls
The definition of the mqls measure is:
measures:
  - name: mqls
    agg: count
    expr: mql_id
Because it counts mql_id, MetricFlow compiles the measure to a per-row CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END and then SUMs it, as visible in the SQL above, so the resulting aggregation type is SUM.
As an experiment, I changed agg: count to agg: count_distinct (in line with https://docs.getdbt.com/docs/build/measures), rebuilt the model, and displayed the metric's query again:
✔ Success 🦄 - query completed after 0.51 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
metric_time__day
, CAST(sellers AS DOUBLE) / CAST(NULLIF(mqls, 0) AS DOUBLE) AS mql_to_seller_conversion_rate_1_week_window
FROM (
SELECT
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day) AS metric_time__day
, MAX(subq_3.mqls) AS mqls
, COALESCE(MAX(subq_13.sellers), 0) AS sellers
FROM (
SELECT
DATE_TRUNC('day', first_contact_date) AS metric_time__day
, COUNT(DISTINCT mql_id) AS mqls
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
GROUP BY
DATE_TRUNC('day', first_contact_date)
) subq_3
FULL OUTER JOIN (
SELECT
metric_time__day
, SUM(sellers) AS sellers
FROM (
SELECT DISTINCT
first_value(olist_mqls_src_10000.mql_id) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mqls
, first_value(DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date)) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ds__day
, first_value(DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date)) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS metric_time__day
, first_value(olist_mqls_src_10000.mql_id) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mql
, subq_9.mf_internal_uuid AS mf_internal_uuid
, subq_9.sellers AS sellers
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
INNER JOIN (
SELECT
DATE_TRUNC('day', won_date) AS won_date__day
, mql_id AS mql
, CASE WHEN seller_id IS NOT NULL THEN 1 ELSE 0 END AS sellers
, GEN_RANDOM_UUID() AS mf_internal_uuid
FROM "dev"."main"."olist_closed_deals" olist_closed_deals_src_10000
) subq_9
ON
(
olist_mqls_src_10000.mql_id = subq_9.mql
) AND (
(
DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) <= subq_9.won_date__day
) AND (
DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) > subq_9.won_date__day - INTERVAL 1 week
)
)
) subq_10
GROUP BY
metric_time__day
) subq_13
ON
subq_3.metric_time__day = subq_13.metric_time__day
GROUP BY
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day)
) subq_14
ORDER BY metric_time__day
The part that used to be `, SUM(mqls) AS mqls` is now `, COUNT(DISTINCT mql_id) AS mqls`.
This investigation shows that the original error (ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM) was raised against the aggregation in the generated SQL, not against the literal agg: count in the measure definition. So both the https://docs.getdbt.com/docs/build/measures documentation and the mf validate-configs message are correct; it's the measure definitions in https://github.com/dbt-labs/jaffle-sl-template that are wrong.
The remaining errors all read `sellers is agg type: AggregationType.SUM`, so they can be fixed the same way:
measures:
  - name: sellers
    agg: count_distinct
    expr: seller_id
    create_metric: true
Running mf validate-configs again shows all the measure errors resolved. A new error appeared, though, so let's tackle that next:
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 2, FUTURE_ERRORS: 0, WARNINGS: 0)
Investigating, the cause was that the column landing_page_id, which the semantic model defines as an entity, does not exist in the seed data jaffle-data/ecommerce/olist_closed_deals_dataset.csv. Simply adding a landing_page_id column fixed it.
With that, mf validate-configs passes:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated semantic models against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated dimensions against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated entities against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated measures against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated metrics against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)