Trying out the semantic layer with dbt Core
I don't know how to define a semantic layer in dbt, so I'll work through it following https://docs.getdbt.com/docs/build/sl-getting-started
Prerequisites
❯ dbt --version
Core:
- installed: 1.7.8
- latest: 1.7.8 - Up to date!
Plugins:
- duckdb: 1.7.2 - Up to date!
Set up MetricFlow
❯ pip install metricflow
❯ pip install dbt-metricflow
❯ pip install "dbt-metricflow[duckdb]"
❯ mf --version
mf, version 0.204.0
Setting up the dbt project
Set DBT_PROFILES_DIR in the environment first (it should point at the directory containing profiles.yml, not at the file itself):
export DBT_PROFILES_DIR=.dbt
Here is dbt_project.yml:
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "jaffle_shop"
version: "1.0.0"
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: "duckdb-jaffle"
require-dbt-version: ">=1.6.0rc2"
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["jaffle-data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"
vars:
  truncate_timespan_to: "{{ current_timestamp() }}"
  "dbt_date:time_zone": "America/Los_Angeles"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  jaffle_shop:
    staging:
      +materialized: view
    marts:
      +materialized: table
And .dbt/profiles.yml:
duckdb-jaffle:
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb
      threads: 1
    prod:
      type: duckdb
      path: prod.duckdb
      threads: 4
  target: dev
Connectivity check
❯ dbt debug
16:52:59 Running with dbt=1.7.8
16:52:59 dbt version: 1.7.8
16:52:59 python version: 3.10.13
16:52:59 python path: /Users/ahayashi/dev/jaffle-sl-template/.env/bin/python
16:52:59 os info: macOS-13.4.1-x86_64-i386-64bit
16:53:00 Using profiles dir at .dbt
16:53:00 Using profiles.yml file at .dbt/profiles.yml
16:53:00 Using dbt_project.yml file at /Users/ahayashi/dev/jaffle-sl-template/dbt_project.yml
16:53:00 adapter type: duckdb
16:53:00 adapter version: 1.7.2
16:53:00 Configuration:
16:53:00 profiles.yml file [OK found and valid]
16:53:00 dbt_project.yml file [OK found and valid]
16:53:00 Required dependencies:
16:53:00 - git [OK found]
16:53:00 Connection:
16:53:00 database: dev
16:53:00 schema: main
16:53:00 path: dev.duckdb
16:53:00 config_options: None
16:53:00 extensions: None
16:53:00 settings: None
16:53:00 external_root: .
16:53:00 use_credential_provider: None
16:53:00 attach: None
16:53:00 filesystems: None
16:53:00 remote: None
16:53:00 plugins: None
16:53:00 disable_transactions: False
16:53:00 Registered adapter: duckdb=1.7.2
16:53:00 Connection test: [OK connection ok]
16:53:00 All checks passed!
Let's run through the mf commands one by one.
mf tutorial
❯ mf tutorial
To begin building and querying metrics, you must define semantic models and
metric configuration files in your dbt project. dbt will use these files to generate a
semantic manifest artifact, which MetricFlow will use to create a semantic graph for querying.
As part of this tutorial, we will generate the following files to help you get started:
📜 model files -> /Users/Rindrics/dev/jaffle-sl-template/models/sample_model
🌱 seed files -> /Users/Rindrics/dev/jaffle-sl-template/jaffle-data/sample_seed
✅ semantic manifest json file -> /Users/Rindrics/dev/jaffle-sl-template/target/semantic_manifest.json
Continue and generate the files? [y/N]: y
✔ 📜 Sample files has been generated.
🤓 Please run the following steps,
1. Verify that your adapter credentials are correct in `profiles.yml`
2. Add time spine model to the models directory (https://docs.getdbt.com/docs/build/metricflow-time-spine)
3. Run `dbt seed`, check to see that the steps related to countries, transactions, customers are passing.
4. Try validating your data model: `mf validate-configs`
5. Check out your metrics: `mf list metrics`
6. Check out dimensions for your metric `mf list dimensions --metrics transactions`
7. Query your first metric: `mf query --metrics transactions --group-bys metric_time --order metric_time`
8. Show the SQL MetricFlow generates:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain`
9. Visualize the plan:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain --display-plans`
* This only works if you have graphviz installed - see README.
10. Add another dimension:
`mf query --metrics transactions --group-bys metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-bys metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-bys metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
💡 Run `mf tutorial --msg` to see this message again without executing everything else
I looked at the generated target/semantic_manifest.json; it doesn't look like something you would write by hand.
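What you do author by hand are semantic model and metric YAML files under models/, from which the manifest is compiled. A minimal sketch of their shape (illustrative names, not this project's actual definitions):
semantic_models:
  - name: transactions
    model: ref('transactions')  # the dbt model this semantic model reads from
    defaults:
      agg_time_dimension: ds
    entities:
      - name: transaction
        type: primary
        expr: transaction_id
    dimensions:
      - name: ds
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: transactions
        agg: sum
        expr: "1"
metrics:
  - name: transactions
    label: Transactions
    type: simple
    type_params:
      measure: transactions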
As for step 2 ("Add time spine model to the models directory"), I confirmed that code with the following content already exists at models/marts/metricflow_time_spine.sql:
-- metricflow_time_spine.sql
with
days as (
--for BQ adapters use "DATE('01/01/2000','mm/dd/yyyy')"
{{ dbt_date.get_base_dates(n_dateparts=365*10, datepart="day") }}
),
cast_to_date as (
select
cast(date_day as date) as date_day
from days
)
select * from cast_to_date
However, step 3 ("Run dbt seed, check to see that the steps related to countries, transactions, customers are passing") failed out of the box with a name collision on the customers model:
❯ dbt seed
18:31:02 Running with dbt=1.7.8
18:31:03 Registered adapter: duckdb=1.7.2
18:31:03 Unable to do partial parsing because profile has changed
18:31:03 Unable to do partial parsing because a project config has changed
18:31:04 Encountered an error:
Compilation Error
dbt found two models with the name "customers".
Since these resources have the same name, dbt will be unable to find the correct resource
when looking for ref("customers").
To fix this, change the name of one of these resources:
- model.jaffle_shop.customers (models/sample_model/customers.sql)
- model.jaffle_shop.customers (models/marts/customer360/customers.sql)
I weighed which one to delete and settled on the marts/ side, for these reasons:
- git diff showed that models/sample_model had been newly generated, i.e. created by the mf tutorial run
- being an artifact of the mf command rather than the dbt command, it presumably does not assume the configuration of the current dbt project (https://github.com/dbt-labs/jaffle-sl-template)
After deleting models/marts/ and running dbt seed again, I got an error saying the time spine model is missing:
❯ dbt seed
18:52:24 Running with dbt=1.7.8
18:52:24 Registered adapter: duckdb=1.7.2
18:52:24 Encountered an error:
Parsing Error
The semantic layer requires a 'metricflow_time_spine' model in the project, but none was found. Guidance on creating this model can be found on our docs site (https://docs.getdbt.com/docs/build/metricflow-time-spine)
So I saved the same content as models/marts/metricflow_time_spine.sql to models/sample_model/metricflow_time_spine.sql.
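Since models/marts/ was deleted from the working tree but the deletion not yet committed, one way to do this is restoring the file from git (a sketch under that assumption):
❯ git show HEAD:models/marts/metricflow_time_spine.sql > models/sample_model/metricflow_time_spine.sql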
Running dbt seed again, it succeeded this time:
❯ dbt seed
18:38:58 Running with dbt=1.7.8
18:38:58 Registered adapter: duckdb=1.7.2
18:38:58 Unable to do partial parsing because profile has changed
18:38:58 Unable to do partial parsing because a project config has changed
18:39:01 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.jaffle_shop.marts
18:39:01 Found 10 models, 25 seeds, 12 tests, 18 sources, 0 exposures, 11 metrics, 615 macros, 0 groups, 5 semantic models
18:39:01
18:39:01 Concurrency: 1 threads (target='dev')
18:39:01
18:39:01 1 of 25 START seed file main.countries_seed .................................... [RUN]
18:39:01 1 of 25 OK loaded seed file main.countries_seed ................................ [INSERT 6 in 0.13s]
18:39:01 2 of 25 START seed file main.customers_seed .................................... [RUN]
18:39:01 2 of 25 OK loaded seed file main.customers_seed ................................ [INSERT 4 in 0.06s]
18:39:01 3 of 25 START seed file main.home_page_table ................................... [RUN]
18:39:02 3 of 25 OK loaded seed file main.home_page_table ............................... [INSERT 90400 in 1.20s]
18:39:02 4 of 25 START seed file main.olist_closed_deals_dataset ........................ [RUN]
18:39:02 4 of 25 OK loaded seed file main.olist_closed_deals_dataset .................... [INSERT 842 in 0.13s]
18:39:02 5 of 25 START seed file main.olist_customers_dataset ........................... [RUN]
18:39:04 5 of 25 OK loaded seed file main.olist_customers_dataset ....................... [INSERT 99441 in 1.76s]
18:39:04 6 of 25 START seed file main.olist_geolocation_dataset ......................... [RUN]
18:39:33 6 of 25 OK loaded seed file main.olist_geolocation_dataset ..................... [INSERT 1000163 in 29.19s]
18:39:33 7 of 25 START seed file main.olist_marketing_qualified_leads_dataset ........... [RUN]
18:39:34 7 of 25 OK loaded seed file main.olist_marketing_qualified_leads_dataset ....... [INSERT 8000 in 0.42s]
18:39:34 8 of 25 START seed file main.olist_order_items_dataset ......................... [RUN]
18:39:37 8 of 25 OK loaded seed file main.olist_order_items_dataset ..................... [INSERT 112650 in 3.32s]
18:39:37 9 of 25 START seed file main.olist_order_payments_dataset ...................... [RUN]
18:39:40 9 of 25 OK loaded seed file main.olist_order_payments_dataset .................. [INSERT 103886 in 3.08s]
18:39:40 10 of 25 START seed file main.olist_order_reviews_dataset ...................... [RUN]
18:39:47 10 of 25 OK loaded seed file main.olist_order_reviews_dataset .................. [INSERT 100000 in 6.70s]
18:39:47 11 of 25 START seed file main.olist_orders_dataset ............................. [RUN]
18:39:58 11 of 25 OK loaded seed file main.olist_orders_dataset ......................... [INSERT 99441 in 11.25s]
18:39:58 12 of 25 START seed file main.olist_products_dataset ........................... [RUN]
18:40:00 12 of 25 OK loaded seed file main.olist_products_dataset ....................... [INSERT 32951 in 2.02s]
18:40:00 13 of 25 START seed file main.olist_sellers_dataset ............................ [RUN]
18:40:00 13 of 25 OK loaded seed file main.olist_sellers_dataset ........................ [INSERT 3095 in 0.11s]
18:40:00 14 of 25 START seed file main.payment_confirmation_table ....................... [RUN]
18:40:00 14 of 25 OK loaded seed file main.payment_confirmation_table ................... [INSERT 452 in 0.06s]
18:40:00 15 of 25 START seed file main.payment_page_table ............................... [RUN]
18:40:00 15 of 25 OK loaded seed file main.payment_page_table ........................... [INSERT 6030 in 0.12s]
18:40:00 16 of 25 START seed file main.product_category_name_translation ................ [RUN]
18:40:00 16 of 25 OK loaded seed file main.product_category_name_translation ............ [INSERT 71 in 0.05s]
18:40:00 17 of 25 START seed file main.raw_customers .................................... [RUN]
18:40:01 17 of 25 OK loaded seed file main.raw_customers ................................ [INSERT 939 in 0.06s]
18:40:01 18 of 25 START seed file main.raw_items ........................................ [RUN]
18:40:01 18 of 25 OK loaded seed file main.raw_items .................................... [INSERT 95368 in 0.74s]
18:40:01 19 of 25 START seed file main.raw_orders ....................................... [RUN]
18:40:05 19 of 25 OK loaded seed file main.raw_orders ................................... [INSERT 59652 in 3.73s]
18:40:05 20 of 25 START seed file main.raw_products ..................................... [RUN]
18:40:05 20 of 25 OK loaded seed file main.raw_products ................................. [INSERT 10 in 0.05s]
18:40:05 21 of 25 START seed file main.raw_stores ....................................... [RUN]
18:40:05 21 of 25 OK loaded seed file main.raw_stores ................................... [INSERT 5 in 0.06s]
18:40:05 22 of 25 START seed file main.raw_supplies ..................................... [RUN]
18:40:07 22 of 25 OK loaded seed file main.raw_supplies ................................. [INSERT 65 in 0.05s]
18:40:07 23 of 25 START seed file main.search_page_table ................................ [RUN]
18:40:07 23 of 25 OK loaded seed file main.search_page_table ............................ [INSERT 45200 in 0.57s]
18:40:07 24 of 25 START seed file main.transactions_seed ................................ [RUN]
18:40:07 24 of 25 OK loaded seed file main.transactions_seed ............................ [INSERT 50 in 0.06s]
18:40:07 25 of 25 START seed file main.user_table ....................................... [RUN]
18:40:10 25 of 25 OK loaded seed file main.user_table ................................... [INSERT 90400 in 2.94s]
18:40:10
18:40:10 Finished running 25 seeds in 0 hours 1 minutes and 9.73 seconds (69.73s).
18:40:10
18:40:10 Completed successfully
18:40:10
18:40:10 Done. PASS=25 WARN=0 ERROR=0 SKIP=0 TOTAL=25
mf validate-configs
A flood of errors came out. An excerpt:
• ERROR: with semantic model `countries` - Failed to query dimensions in data warehouse for semantic model `countries`
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name countries does not exist!
Did you mean "countries_seed"?
LINE 6: FROM "dev"."main"."countries" countries_src_6...
^
The error is reported by the data warehouse. I'm in the middle of mf tutorial now, but before that I had run the dbt seed described in the README.md of https://github.com/dbt-labs/jaffle-sl-template, so the data loaded back then may be out of sync with the current models.
If that hypothesis is right, it should suffice to reduce the warehouse to only the tables derived from jaffle-data/sample_seed/, which mf tutorial generates. So let's drop the tables that appear to come from the other seed data.
The current state of the warehouse:
❯ duckdb dev.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────────────────────┐
│ name │
│ varchar │
├─────────────────────────────────────────┤
│ countries_seed │
│ customers │
│ customers_seed │
│ home_page_table │
│ olist_closed_deals │
│ olist_closed_deals_dataset │
│ olist_customers_dataset │
│ olist_geolocation_dataset │
│ olist_marketing_qualified_leads │
│ olist_marketing_qualified_leads_dataset │
│ olist_order_items_dataset │
│ olist_order_payments_dataset │
│ olist_order_reviews_dataset │
│ olist_orders │
│ olist_orders_dataset │
│ olist_products_dataset │
│ olist_seller │
│ olist_sellers_dataset │
│ order_items │
│ orders │
│ payment_confirmation_table │
│ payment_page_table │
│ product_category_name_translation │
│ raw_customers │
│ raw_items │
│ raw_orders │
│ raw_products │
│ raw_stores │
│ raw_supplies │
│ search_page_table │
│ stg_customers │
│ stg_home_page │
│ stg_locations │
│ stg_order_items │
│ stg_orders │
│ stg_payment_confirmation │
│ stg_payment_page │
│ stg_products │
│ stg_search_page │
│ stg_supplies │
│ stg_users │
│ transactions_seed │
│ user_table │
├─────────────────────────────────────────┤
│ 43 rows │
└─────────────────────────────────────────┘
D
❯ tree jaffle-data
jaffle-data
├── ecommerce
│ ├── home_page_table.csv
│ ├── olist_closed_deals_dataset.csv
│ ├── olist_customers_dataset.csv
│ ├── olist_geolocation_dataset.csv
│ ├── olist_marketing_qualified_leads_dataset.csv
│ ├── olist_order_items_dataset.csv
│ ├── olist_order_payments_dataset.csv
│ ├── olist_order_reviews_dataset.csv
│ ├── olist_orders_dataset.csv
│ ├── olist_products_dataset.csv
│ ├── olist_sellers_dataset.csv
│ ├── payment_confirmation_table.csv
│ ├── payment_page_table.csv
│ ├── product_category_name_translation.csv
│ ├── search_page_table.csv
│ └── user_table.csv
├── raw_customers.csv
├── raw_items.csv
├── raw_orders.csv
├── raw_products.csv
├── raw_stores.csv
├── raw_supplies.csv
└── sample_seed
├── countries_seed.csv
├── customers_seed.csv
└── transactions_seed.csv
2 directories, 25 files
So the goal is a state where only the following three tables exist:
- countries_seed
- customers_seed
- transactions_seed
Run the following SQL:
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS home_page_table;
DROP TABLE IF EXISTS olist_closed_deals;
DROP TABLE IF EXISTS olist_closed_deals_dataset;
DROP TABLE IF EXISTS olist_customers_dataset;
DROP TABLE IF EXISTS olist_geolocation_dataset;
DROP TABLE IF EXISTS olist_marketing_qualified_leads;
DROP TABLE IF EXISTS olist_marketing_qualified_leads_dataset;
DROP TABLE IF EXISTS olist_order_items_dataset;
DROP TABLE IF EXISTS olist_order_payments_dataset;
DROP TABLE IF EXISTS olist_order_reviews_dataset;
DROP TABLE IF EXISTS olist_orders;
DROP TABLE IF EXISTS olist_orders_dataset;
DROP TABLE IF EXISTS olist_products_dataset;
DROP TABLE IF EXISTS olist_seller;
DROP TABLE IF EXISTS olist_sellers_dataset;
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS payment_confirmation_table;
DROP TABLE IF EXISTS payment_page_table;
DROP TABLE IF EXISTS product_category_name_translation;
DROP TABLE IF EXISTS raw_customers;
DROP TABLE IF EXISTS raw_items;
DROP TABLE IF EXISTS raw_orders;
DROP TABLE IF EXISTS raw_products;
DROP TABLE IF EXISTS raw_stores;
DROP TABLE IF EXISTS raw_supplies;
DROP TABLE IF EXISTS search_page_table;
DROP TABLE IF EXISTS stg_customers;
DROP TABLE IF EXISTS stg_home_page;
DROP TABLE IF EXISTS stg_locations;
DROP TABLE IF EXISTS stg_order_items;
DROP TABLE IF EXISTS stg_orders;
DROP TABLE IF EXISTS stg_payment_confirmation;
DROP TABLE IF EXISTS stg_payment_page;
DROP TABLE IF EXISTS stg_products;
DROP TABLE IF EXISTS stg_search_page;
DROP TABLE IF EXISTS stg_supplies;
DROP TABLE IF EXISTS stg_users;
DROP TABLE IF EXISTS user_table;
Some objects are still left; apparently some of them were views.
D show tables;
┌───────────────────┐
│ name │
│ varchar │
├───────────────────┤
│ countries_seed │
│ customers_seed │
│ stg_customers │
│ stg_locations │
│ stg_order_items │
│ stg_orders │
│ stg_products │
│ stg_supplies │
│ transactions_seed │
└───────────────────┘
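Which of the leftovers are views rather than tables can be checked via information_schema (a quick check; table_type reads 'BASE TABLE' for tables and 'VIEW' for views):
D select table_name, table_type from information_schema.tables order by table_name;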
Run the following SQL:
DROP VIEW IF EXISTS stg_customers;
DROP VIEW IF EXISTS stg_locations;
DROP VIEW IF EXISTS stg_order_items;
DROP VIEW IF EXISTS stg_orders;
DROP VIEW IF EXISTS stg_products;
DROP VIEW IF EXISTS stg_supplies;
Now it's in the intended state:
D show tables;
┌───────────────────┐
│ name │
│ varchar │
├───────────────────┤
│ countries_seed │
│ customers_seed │
│ transactions_seed │
└───────────────────┘
D
Let's run mf validate-configs again:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating semantic models against data warehouse (ERRORS: 5, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating dimensions against data warehouse (ERRORS: 49, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 13, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating measures against data warehouse (ERRORS: 12, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating metrics against data warehouse (ERRORS: 11, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with semantic model `stg_products` - Unable to access semantic model `stg_products` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name stg_products does not exist!
Did you mean "temp.pg_catalog.pg_proc"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."stg_products"
^
• ERROR: with semantic model `locations` - Unable to access semantic model `locations` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name stg_locations does not exist!
Did you mean "pg_catalog.pg_class"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."stg_locations"
Errors remain, but the stg_ prefix suggests these are the models under models/staging/ that originally shipped with https://github.com/dbt-labs/jaffle-sl-template. Let's remove those models and rerun:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating semantic models against data warehouse (ERRORS: 5, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating dimensions against data warehouse (ERRORS: 49, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 13, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating measures against data warehouse (ERRORS: 12, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating metrics against data warehouse (ERRORS: 11, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with semantic model `stg_products` - Unable to access semantic model `stg_products` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name stg_products does not exist!
Did you mean "temp.pg_catalog.pg_proc"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."stg_products"
Still failing with `Unable to access semantic model stg_products in data warehouse`. Something is still trying to reference the semantic model stg_products, and I want that to stop happening.
❯ ls target
compiled graph_summary.json partial_parse.msgpack run_results.json
graph.gpickle manifest.json run semantic_manifest.json
target/semantic_manifest.json looks suspicious. My guess is that running mf tutorial generates semantic_manifest.json from the model definitions as they exist at that moment. Let's delete semantic_manifest.json and regenerate it.
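Concretely:
❯ rm target/semantic_manifest.json
Then rerun mf tutorial: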
❯ mf tutorial
To begin building and querying metrics, you must define semantic models and
metric configuration files in your dbt project. dbt will use these files to generate a
semantic manifest artifact, which MetricFlow will use to create a semantic graph for querying.
As part of this tutorial, we will generate the following files to help you get started:
📜 model files -> /Users/Rindrics/dev/jaffle-sl-template/models/sample_model
🌱 seed files -> /Users/Rindrics/dev/jaffle-sl-template/jaffle-data/sample_seed
✅ semantic manifest json file -> /Users/Rindrics/dev/jaffle-sl-template/target/semantic_manifest.json
Continue and generate the files? [y/N]: y
There are existing files in the paths above, would you like to overwrite them? [y/N]: y
✔ 📜 Sample files has been generated.
🤓 Please run the following steps,
1. Verify that your adapter credentials are correct in `profiles.yml`
2. Add time spine model to the models directory (https://docs.getdbt.com/docs/build/metricflow-time-spine)
3. Run `dbt seed`, check to see that the steps related to countries, transactions, customers are passing.
4. Try validating your data model: `mf validate-configs`
5. Check out your metrics: `mf list metrics`
6. Check out dimensions for your metric `mf list dimensions --metrics transactions`
7. Query your first metric: `mf query --metrics transactions --group-bys metric_time --order metric_time`
8. Show the SQL MetricFlow generates:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain`
9. Visualize the plan:
`mf query --metrics transactions --group-bys metric_time --order metric_time --explain --display-plans`
* This only works if you have graphviz installed - see README.
10. Add another dimension:
`mf query --metrics transactions --group-bys metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-bys metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-bys metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
💡 Run `mf tutorial --msg` to see this message again without executing everything else
The overwrite wiped out the time spine model, so I recreated it and reran mf validate-configs:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating semantic models against data warehouse (ERRORS: 3, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating dimensions against data warehouse (ERRORS: 29, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 9, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating measures against data warehouse (ERRORS: 10, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when validating metrics against data warehouse (ERRORS: 11, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with semantic model `customers` - Unable to access semantic model `customers` in data warehouse
Received following error from data warehouse:
Runtime Error
Catalog Error: Table with name customers does not exist!
Did you mean "raw_customers"?
LINE 1: EXPLAIN SELECT * FROM "dev"."main"."customers"
Still an error: the warehouse has no table customers. Since customers exists under models/sample_model/, running dbt run should create it:
❯ dbt run
19:58:43 Running with dbt=1.7.8
19:58:44 Registered adapter: duckdb=1.7.2
19:58:44 Unable to do partial parsing because a project config has changed
19:58:46 Found 4 models, 25 seeds, 3 sources, 0 exposures, 11 metrics, 615 macros, 0 groups, 3 semantic models
19:58:46
19:58:46 Concurrency: 1 threads (target='dev')
19:58:46
19:58:46 1 of 4 START sql view model main.countries ..................................... [RUN]
19:58:47 1 of 4 OK created sql view model main.countries ................................ [OK in 0.13s]
19:58:47 2 of 4 START sql view model main.customers ..................................... [RUN]
19:58:47 2 of 4 OK created sql view model main.customers ................................ [OK in 0.07s]
19:58:47 3 of 4 START sql view model main.metricflow_time_spine ......................... [RUN]
19:58:47 3 of 4 ERROR creating sql view model main.metricflow_time_spine ................ [ERROR in 0.04s]
19:58:47 4 of 4 START sql view model main.transactions .................................. [RUN]
19:58:47 4 of 4 OK created sql view model main.transactions ............................. [OK in 0.06s]
19:58:47
19:58:47 Finished running 4 view models in 0 hours 0 minutes and 0.48 seconds (0.48s).
19:58:47
19:58:47 Completed with 1 error and 0 warnings:
19:58:47
19:58:47 Runtime Error in model metricflow_time_spine (models/sample_model/metricflow_time_spine.sql)
Binder Error: No function matches the given name and argument types '+(DATE, TIMESTAMP)'. You might need to add explicit type casts.
Candidate functions:
+(TINYINT) -> TINYINT
+(TINYINT, TINYINT) -> TINYINT
+(SMALLINT) -> SMALLINT
+(SMALLINT, SMALLINT) -> SMALLINT
+(INTEGER) -> INTEGER
+(INTEGER, INTEGER) -> INTEGER
+(BIGINT) -> BIGINT
+(BIGINT, BIGINT) -> BIGINT
+(HUGEINT) -> HUGEINT
+(HUGEINT, HUGEINT) -> HUGEINT
+(FLOAT) -> FLOAT
+(FLOAT, FLOAT) -> FLOAT
+(DOUBLE) -> DOUBLE
+(DOUBLE, DOUBLE) -> DOUBLE
+(DECIMAL) -> DECIMAL
+(DECIMAL, DECIMAL) -> DECIMAL
+(UTINYINT) -> UTINYINT
+(UTINYINT, UTINYINT) -> UTINYINT
+(USMALLINT) -> USMALLINT
+(USMALLINT, USMALLINT) -> USMALLINT
+(UINTEGER) -> UINTEGER
+(UINTEGER, UINTEGER) -> UINTEGER
+(UBIGINT) -> UBIGINT
+(UBIGINT, UBIGINT) -> UBIGINT
+(UHUGEINT) -> UHUGEINT
+(UHUGEINT, UHUGEINT) -> UHUGEINT
+(DATE, INTEGER) -> DATE
+(INTEGER, DATE) -> DATE
+(INTERVAL, INTERVAL) -> INTERVAL
+(DATE, INTERVAL) -> TIMESTAMP
+(INTERVAL, DATE) -> TIMESTAMP
+(TIME, INTERVAL) -> TIME
+(INTERVAL, TIME) -> TIME
+(TIMESTAMP, INTERVAL) -> TIMESTAMP
+(INTERVAL, TIMESTAMP) -> TIMESTAMP
+(TIME WITH TIME ZONE, INTERVAL) -> TIME WITH TIME ZONE
+(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE
+(TIME, DATE) -> TIMESTAMP
+(DATE, TIME) -> TIMESTAMP
+(TIME WITH TIME ZONE, DATE) -> TIMESTAMP WITH TIME ZONE
+(DATE, TIME WITH TIME ZONE) -> TIMESTAMP WITH TIME ZONE
+(ANY[], ANY[]) -> ANY[]
+(TIMESTAMP WITH TIME ZONE, INTERVAL) -> TIMESTAMP WITH TIME ZONE
+(INTERVAL, TIMESTAMP WITH TIME ZONE) -> TIMESTAMP WITH TIME ZONE
LINE 9: ) as date) + ((interval '1 day') * (-3650))
^
19:58:47
19:58:47 Done. PASS=3 WARN=0 ERROR=1 SKIP=0 TOTAL=4
The time spine model fails with a type mismatch:
Error: No function matches the given name and argument types '+(DATE, TIMESTAMP)'. You might need to add explicit type casts.
For what it's worth, mf validate-configs now passes:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated semantic models against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated dimensions against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated entities against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated measures against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated metrics against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
Next, fix the type mismatch in the time spine model:
Error: No function matches the given name and argument types '+(DATE, TIMESTAMP)'. You might need to add explicit type casts.
This looks like the same issue as https://github.com/duckdb/dbt-duckdb/issues/346. It appears to be fixed by https://github.com/duckdb/dbt-duckdb/pull/347, but that fix has not been released yet, so I reinstalled dbt-duckdb pinned explicitly to the commit from that PR:
❯ pip install --pre --no-cache-dir git+https://github.com/duckdb/dbt-duckdb.git@b960956
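As an aside, if pinning an unreleased commit were not acceptable, the time spine could instead be rewritten without dbt_date so the DATE + TIMESTAMP addition never happens. A sketch assuming DuckDB's generate_series table function (hypothetical; the pinned install above is what actually fixed it here):
-- models/sample_model/metricflow_time_spine.sql (hypothetical DuckDB-only variant)
select
  cast(generate_series as date) as date_day
from generate_series(
  timestamp '2014-01-01',
  timestamp '2023-12-31',
  interval 1 day
)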
With that, dbt run passes:
❯ dbt run
22:25:03 Running with dbt=1.7.8
22:25:03 Registered adapter: duckdb=1.7.2
22:25:04 Found 4 models, 25 seeds, 3 sources, 0 exposures, 11 metrics, 615 macros, 0 groups, 3 semantic models
22:25:04
22:25:04 Concurrency: 1 threads (target='dev')
22:25:04
22:25:04 1 of 4 START sql view model main.countries ..................................... [RUN]
22:25:04 1 of 4 OK created sql view model main.countries ................................ [OK in 0.13s]
22:25:04 2 of 4 START sql view model main.customers ..................................... [RUN]
22:25:04 2 of 4 OK created sql view model main.customers ................................ [OK in 0.06s]
22:25:04 3 of 4 START sql view model main.metricflow_time_spine ......................... [RUN]
22:25:04 3 of 4 OK created sql view model main.metricflow_time_spine .................... [OK in 0.15s]
22:25:04 4 of 4 START sql view model main.transactions .................................. [RUN]
22:25:04 4 of 4 OK created sql view model main.transactions ............................. [OK in 0.07s]
22:25:04
22:25:04 Finished running 4 view models in 0 hours 0 minutes and 0.64 seconds (0.64s).
22:25:04
22:25:04 Completed successfully
22:25:04
22:25:04 Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4
mf list metrics
❯ mf list metrics
✔ 🌱 We've found 11 metrics.
The list below shows metrics in the format of "metric_name: list of available dimensions"
• new_customers: country__region, customer__customer_country, customer__ds, metric_time
• quick_buy_transactions: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• quick_buy_amount_usd: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• cancellations: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• transactions: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• alterations: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• transaction_amount: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• cancellation_rate: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• revenue_usd: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• cancellations_mx: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
• transaction_usd_na: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
Cross-checking the database contents shows that the metrics have no physical counterpart in the DB:
❯ duckdb dev.duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D show tables;
┌─────────────────────────────────────────┐
│ name │
│ varchar │
├─────────────────────────────────────────┤
│ countries │
│ countries_seed │
│ customers │
│ customers_seed │
│ home_page_table │
│ metricflow_time_spine │
│ olist_closed_deals_dataset │
│ olist_customers_dataset │
│ olist_geolocation_dataset │
│ olist_marketing_qualified_leads_dataset │
│ olist_order_items_dataset │
│ olist_order_payments_dataset │
│ olist_order_reviews_dataset │
│ olist_orders_dataset │
│ olist_products_dataset │
│ olist_sellers_dataset │
│ payment_confirmation_table │
│ payment_page_table │
│ product_category_name_translation │
│ raw_customers │
│ raw_items │
│ raw_orders │
│ raw_products │
│ raw_stores │
│ raw_supplies │
│ search_page_table │
│ transactions │
│ transactions_seed │
│ user_table │
├─────────────────────────────────────────┤
│ 29 rows │
└─────────────────────────────────────────┘
D
Presumably, SQL is generated and executed each time a metric is referenced.
mf list dimensions --metrics METRIC_NAME
❯ mf list dimensions --metrics new_customers
✔ 🌱 We've found 4 common dimensions for metrics ['new_customers'].
• country__region
• customer__customer_country
• customer__ds
• metric_time
This result was already visible in the mf list metrics output:
❯ mf list metrics
✔ 🌱 We've found 11 metrics.
The list below shows metrics in the format of "metric_name: list of available dimensions"
It should be more useful for metrics whose dimension list was truncated there, like this one:
• transaction_usd_na: customer__country__region, customer__customer_country, customer__ds, metric_time, transaction__ds and 2 more
❯ mf list dimensions --metrics transaction_usd_na
✔ 🌱 We've found 7 common dimensions for metrics ['transaction_usd_na'].
• customer__country__region
• customer__customer_country
• customer__ds
• metric_time
• transaction__ds
• transaction__is_large
• transaction__quick_buy_transaction
mf query --metrics transactions --group-by metric_time --order metric_time
(Aside)
Initially the tutorial said to run mf query --metrics transactions --group-bys metric_time --order metric_time, but that failed as written. The --group-bys option had been renamed to --group-by, and I was apparently on a version whose tutorial message didn't yet reflect the change.
❯ mf --version
mf, version 0.204.0
The version with the corrected message had not been released yet, so I updated with:
❯ pip install --pre --no-cache-dir git+https://github.com/dbt-labs/metricflow.git@v0.204.1.dev5
❯ mf --version
mf, version 0.204.1.dev5
The --group-bys in the message is now --group-by:
❯ mf tutorial --msg
🤓 Please run the following steps,
1. Verify that your adapter credentials are correct in `profiles.yml`
2. Add time spine model to the models directory (https://docs.getdbt.com/docs/build/metricflow-time-spine)
3. Run `dbt seed`, check to see that the steps related to countries, transactions, customers are passing.
4. Try validating your data model: `mf validate-configs`
5. Check out your metrics: `mf list metrics`
6. Check out dimensions for your metric `mf list dimensions --metrics transactions`
7. Query your first metric: `mf query --metrics transactions --group-by metric_time --order metric_time`
8. Show the SQL MetricFlow generates:
`mf query --metrics transactions --group-by metric_time --order metric_time --explain`
9. Visualize the plan:
`mf query --metrics transactions --group-by metric_time --order metric_time --explain --display-plans`
* This only works if you have graphviz installed - see README.
10. Add another dimension:
`mf query --metrics transactions --group-by metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-by metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-by metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
Let's run mf query --metrics transactions --group-by metric_time --order metric_time --explain:
❯ mf query --metrics transactions --group-by metric_time --order metric_time --explain
✔ Success 🦄 - query completed after 0.28 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
metric_time__day
, SUM(transactions) AS transactions
FROM (
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, 1 AS transactions
FROM "dev"."main"."transactions" transactions_src_10000
) subq_2
GROUP BY
metric_time__day
ORDER BY metric_time__day
...Oh, this is impressive. So this SQL runs every time transactions is referenced.
❯ mf query --metrics transactions --group-by metric_time --order metric_time
✔ Success 🦄 - query completed after 0.31 seconds
| metric_time__day | transactions |
|:-------------------|---------------:|
| 2022-03-07 | 2 |
| 2022-03-08 | 2 |
| 2022-03-09 | 1 |
| 2022-03-10 | 2 |
| 2022-03-11 | 1 |
| 2022-03-12 | 1 |
| 2022-03-13 | 1 |
| 2022-03-14 | 1 |
| 2022-03-15 | 3 |
| 2022-03-16 | 2 |
| 2022-03-17 | 1 |
| 2022-03-18 | 1 |
| 2022-03-21 | 3 |
| 2022-03-22 | 3 |
| 2022-03-23 | 2 |
| 2022-03-25 | 3 |
| 2022-03-26 | 2 |
| 2022-03-27 | 2 |
| 2022-03-28 | 1 |
| 2022-03-29 | 2 |
| 2022-03-30 | 4 |
| 2022-03-31 | 5 |
| 2022-04-01 | 1 |
| 2022-04-02 | 2 |
| 2022-04-03 | 1 |
| 2022-04-04 | 1 |
...I see.
❯ mf query --metrics transactions --group-by metric_time --order metric_time --explain --show-dataflow-plan
✔ Success 🦄 - query completed after 0.29 seconds
🔎 Generated Dataflow Plan + SQL (remove --explain to see data):
-- Metric Dataflow Plan:
-- <DataflowPlan>
-- <WriteToResultDataframeNode>
-- <!-- description = 'Write to Dataframe' -->
-- <!-- node_id = NodeId(id_str='wrd_1') -->
-- <OrderByLimitNode>
-- <!-- description = "Order By ['metric_time__day']" -->
-- <!-- node_id = NodeId(id_str='obl_1') -->
-- <!-- order_by_spec = -->
-- <!-- OrderBySpec( -->
-- <!-- instance_spec=TimeDimensionSpec(element_name='metric_time', time_granularity=DAY), -->
-- <!-- descending=False, -->
-- <!-- ) -->
-- <!-- limit = 'None' -->
-- <ComputeMetricsNode>
-- <!-- description = 'Compute Metrics via Expressions' -->
-- <!-- node_id = NodeId(id_str='cm_1') -->
-- <!-- metric_spec = MetricSpec(element_name='transactions') -->
-- <AggregateMeasuresNode>
-- <!-- description = 'Aggregate Measures' -->
-- <!-- node_id = NodeId(id_str='am_1') -->
-- <FilterElementsNode>
-- <!-- description = "Pass Only Elements: ['transactions', 'metric_time__day']" -->
-- <!-- node_id = NodeId(id_str='pfe_1') -->
-- <!-- include_spec = MeasureSpec(element_name='transactions') -->
-- <!-- include_spec = TimeDimensionSpec(element_name='metric_time', time_granularity=DAY) -->
-- <!-- distinct = False -->
-- <MetricTimeDimensionTransformNode>
-- <!-- description = "Metric Time Dimension 'ds'" -->
-- <!-- node_id = NodeId(id_str='sma_0') -->
-- <!-- aggregation_time_dimension = 'ds' -->
-- <ReadSqlSourceNode>
-- <!-- description = "Read From SemanticModelDataSet('transactions')" -->
-- <!-- node_id = NodeId(id_str='rss_0') -->
-- <!-- data_set = SemanticModelDataSet('transactions') -->
-- </ReadSqlSourceNode>
-- </MetricTimeDimensionTransformNode>
-- </FilterElementsNode>
-- </AggregateMeasuresNode>
-- </ComputeMetricsNode>
-- </OrderByLimitNode>
-- </WriteToResultDataframeNode>
-- </DataflowPlan>
SELECT
metric_time__day
, SUM(transactions) AS transactions
FROM (
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, 1 AS transactions
FROM "dev"."main"."transactions" transactions_src_10000
) subq_2
GROUP BY
metric_time__day
ORDER BY metric_time__day
I had never heard of a "dataflow plan" before.
The mf tutorial message continues with more queries to try:
10. Add another dimension:
`mf query --metrics transactions --group-by metric_time,customer__customer_country --order metric_time`
11. Add a coarser time granularity:
`mf query --metrics transactions --group-by metric_time__week --order metric_time__week`
12. Try a more complicated query: mf query --metrics transactions,transaction_usd_na --group-by metric_time,is_large --order metric_time --start-time 2022-03-20 --end-time 2022-04-01.
Looking these over, the point is that you can query the DWH through an API without thinking about SQL at all. That would indeed make queries from a BI frontend much easier.
Since up to here I ended up doing the MetricFlow tutorial rather than https://github.com/dbt-labs/jaffle-sl-template itself, let's clean up:
13. When you're done with the tutorial, run mf tutorial --clean to delete sample models and seeds.
Would you like to remove all the sample files? [y/N]: y
✔ 🗑️ Sample files has been removed.
However, even after following the documentation of https://github.com/dbt-labs/jaffle-sl-template and running
dbt deps
dbt seed
dbt build --exclude path:jaffle-data
I got the following:
❯ mf validate-configs
‼️ Warning: A new version of the MetricFlow CLI is available.
💡 Please update to version 0.205.0, released 2024-03-01 20:41:26 by running:
$ pip install --upgrade metricflow
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✖ Breaking issues found when checking semantics of built manifest (ERRORS: 8, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window_with_filter` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window_with_filter` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For
Let's start by investigating this one:
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM
Let's look at the manifest:
{
"name": "mql_to_seller_conversion_rate_base",
"description": "Conversion rate from MQL to seller",
"type": "conversion",
"type_params": {
"measure": null,
"numerator": null,
"denominator": null,
"expr": null,
"window": null,
"grain_to_date": null,
"metrics": [],
"conversion_type_params": {
"base_measure": {
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
"conversion_measure": {
"name": "sellers",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
"entity": "mql",
"calculation": "conversion_rate",
"window": null,
"constant_properties": null
},
"input_measures": [
{
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
{
"name": "sellers",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
}
]
},
"filter": null,
"metadata": null,
"label": "MQL to Seller Conversion Rate"
},
My guess is that the problem lies in this part:
"conversion_type_params": {
"base_measure": {
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": false,
"fill_nulls_with": null
},
The message says 'For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM', yet the snippet above has no field resembling an agg type.
Since the same message appears for several metrics, the problem may lie in what they reference, so I searched for the mqls node in the tree:
{
"name": "mqls",
"description": "",
"type": "simple",
"type_params": {
"measure": {
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": true,
"fill_nulls_with": 0
},
"numerator": null,
"denominator": null,
"expr": null,
"window": null,
"grain_to_date": null,
"metrics": [],
"conversion_type_params": null,
"input_measures": [
{
"name": "mqls",
"filter": null,
"alias": null,
"join_to_timespine": true,
"fill_nulls_with": 0
}
]
},
"filter": null,
"metadata": null,
"label": "MQls"
},
No agg-type information here either. Next, let's look at the model metadata:
semantic_models:
  - name: olist_mqls
    description: "Table showing all seller contacts for an ecommerce marketplace"
    model: ref('olist_marketing_qualified_leads')
    defaults:
      agg_time_dimension: ds
    entities:
      - name: mql
        type: primary
        expr: mql_id
      - name: landing_page
        type: foreign
        expr: landing_page_id
    dimensions:
      - name: ds
        expr: first_contact_date
        type: time
        type_params:
          time_granularity: day
      - name: origin
        type: categorical
    measures:
      - name: mqls
        agg: count
        expr: mql_id
The measure mqls has agg: count; could this be the culprit?
I don't really understand measures to begin with, so let's consult https://docs.getdbt.com/docs/build/measures:
dbt supports the following aggregations: sum, max, min, avg, median, count_distinct, and sum_boolean.
So the current agg: count doesn't even seem to be a supported aggregation; is that really OK? I also can't tell whether the spot I'm looking at is related to the problem at hand. To find out, I changed agg: count to agg: count_distinct and reran build & validate-configs.
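For reference, the edited measure (only agg changed from the definition above):
measures:
  - name: mqls
    agg: count_distinct
    expr: mql_id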
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✖ Breaking issues found when checking semantics of built manifest (ERRORS: 4, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_7_day_window_with_filter` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
• ERROR: with metric `mql_to_seller_conversion_rate_base` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: sellers is agg type: AggregationType.SUM
The error count went down. Unless I'm looking in entirely the wrong place, the validate-configs message and the measures documentation contradict each other about agg: count. Is one of them wrong?
I still don't understand what's actually going on, so let's dig deeper by inspecting the query for one of the flagged metrics, mql_to_seller_conversion_rate_1_week_window:
❯ mf query --metrics mql_to_seller_conversion_rate_1_week_window --group-by metric_time --order metric_time --explain
✔ Success 🦄 - query completed after 0.55 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
metric_time__day
, CAST(sellers AS DOUBLE) / CAST(NULLIF(mqls, 0) AS DOUBLE) AS mql_to_seller_conversion_rate_1_week_window
FROM (
SELECT
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day) AS metric_time__day
, MAX(subq_3.mqls) AS mqls
, COALESCE(MAX(subq_13.sellers), 0) AS sellers
FROM (
SELECT
metric_time__day
, SUM(mqls) AS mqls
FROM (
SELECT
DATE_TRUNC('day', first_contact_date) AS metric_time__day
, CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END AS mqls
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
) subq_2
GROUP BY
metric_time__day
) subq_3
FULL OUTER JOIN (
SELECT
metric_time__day
, SUM(sellers) AS sellers
FROM (
SELECT DISTINCT
first_value(subq_6.mqls) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mqls
, first_value(subq_6.ds__day) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ds__day
, first_value(subq_6.metric_time__day) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS metric_time__day
, first_value(subq_6.mql) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY subq_6.ds__day DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mql
, subq_9.mf_internal_uuid AS mf_internal_uuid
, subq_9.sellers AS sellers
FROM (
SELECT
DATE_TRUNC('day', first_contact_date) AS ds__day
, DATE_TRUNC('day', first_contact_date) AS metric_time__day
, mql_id AS mql
, CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END AS mqls
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
) subq_6
INNER JOIN (
SELECT
DATE_TRUNC('day', won_date) AS won_date__day
, mql_id AS mql
, CASE WHEN seller_id IS NOT NULL THEN 1 ELSE 0 END AS sellers
, GEN_RANDOM_UUID() AS mf_internal_uuid
FROM "dev"."main"."olist_closed_deals" olist_closed_deals_src_10000
) subq_9
ON
(
subq_6.mql = subq_9.mql
) AND (
(
subq_6.ds__day <= subq_9.won_date__day
) AND (
subq_6.ds__day > subq_9.won_date__day - INTERVAL 1 week
)
)
) subq_10
GROUP BY
metric_time__day
) subq_13
ON
subq_3.metric_time__day = subq_13.metric_time__day
GROUP BY
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day)
) subq_14
ORDER BY metric_time__day
Comparing this with the error message (ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM), there is indeed a spot where mqls is SUMmed:
SELECT
metric_time__day
, SUM(mqls) AS mqls
The definition of the mqls measure is:
measures:
  - name: mqls
    agg: count
    expr: mql_id
Because it counts mql_id, MetricFlow compiles the measure to a per-row CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END and then SUMs it, as visible in the SQL above, so the resulting aggregation type is SUM.
As an experiment, I changed agg: count to agg: count_distinct (in line with https://docs.getdbt.com/docs/build/measures), rebuilt the model, and displayed the metric's query again:
✔ Success 🦄 - query completed after 0.51 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
metric_time__day
, CAST(sellers AS DOUBLE) / CAST(NULLIF(mqls, 0) AS DOUBLE) AS mql_to_seller_conversion_rate_1_week_window
FROM (
SELECT
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day) AS metric_time__day
, MAX(subq_3.mqls) AS mqls
, COALESCE(MAX(subq_13.sellers), 0) AS sellers
FROM (
SELECT
DATE_TRUNC('day', first_contact_date) AS metric_time__day
, COUNT(DISTINCT mql_id) AS mqls
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
GROUP BY
DATE_TRUNC('day', first_contact_date)
) subq_3
FULL OUTER JOIN (
SELECT
metric_time__day
, SUM(sellers) AS sellers
FROM (
SELECT DISTINCT
first_value(olist_mqls_src_10000.mql_id) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mqls
, first_value(DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date)) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ds__day
, first_value(DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date)) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS metric_time__day
, first_value(olist_mqls_src_10000.mql_id) OVER (PARTITION BY subq_9.mql, subq_9.won_date__day, subq_9.mf_internal_uuid ORDER BY DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS mql
, subq_9.mf_internal_uuid AS mf_internal_uuid
, subq_9.sellers AS sellers
FROM "dev"."main"."olist_marketing_qualified_leads" olist_mqls_src_10000
INNER JOIN (
SELECT
DATE_TRUNC('day', won_date) AS won_date__day
, mql_id AS mql
, CASE WHEN seller_id IS NOT NULL THEN 1 ELSE 0 END AS sellers
, GEN_RANDOM_UUID() AS mf_internal_uuid
FROM "dev"."main"."olist_closed_deals" olist_closed_deals_src_10000
) subq_9
ON
(
olist_mqls_src_10000.mql_id = subq_9.mql
) AND (
(
DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) <= subq_9.won_date__day
) AND (
DATE_TRUNC('day', olist_mqls_src_10000.first_contact_date) > subq_9.won_date__day - INTERVAL 1 week
)
)
) subq_10
GROUP BY
metric_time__day
) subq_13
ON
subq_3.metric_time__day = subq_13.metric_time__day
GROUP BY
COALESCE(subq_3.metric_time__day, subq_13.metric_time__day)
) subq_14
ORDER BY metric_time__day
The part that used to be `, SUM(mqls) AS mqls` is now `, COUNT(DISTINCT mql_id) AS mqls`.
This investigation shows that the original error (ERROR: with metric `mql_to_seller_conversion_rate_1_week_window` - For conversion metrics, the measure must be COUNT/SUM(1)/COUNT_DISTINCT. Measure: mqls is agg type: AggregationType.SUM) was raised against the aggregation in the generated SQL, not against the literal agg: count in the measure definition. So both the https://docs.getdbt.com/docs/build/measures documentation and the mf validate-configs message are correct; it's the measure definitions in https://github.com/dbt-labs/jaffle-sl-template that are wrong.
The remaining errors all read `sellers is agg type: AggregationType.SUM`, so they can be fixed the same way:
measures:
  - name: sellers
    agg: count_distinct
    expr: seller_id
    create_metric: true
Running mf validate-configs again shows all the measure errors resolved. A new error appeared, though, so let's tackle that next:
✖ Breaking issues found when validating entities against data warehouse (ERRORS: 2, FUTURE_ERRORS: 0, WARNINGS: 0)
Investigating, the cause was that the column landing_page_id, which the semantic model defines as an entity, does not exist in the seed data jaffle-data/ecommerce/olist_closed_deals_dataset.csv. Simply adding a landing_page_id column fixed it.
With that, mf validate-configs passes:
❯ mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully parsed manifest from dbt project
✔ 🎉 Successfully validated the semantics of built manifest (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated semantic models against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated dimensions against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated entities against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated measures against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✔ 🎉 Successfully validated metrics against data warehouse (ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)