
【dbt Docs】Building a dbt Project - Seeds

Published on 2022/03/14

dbt

tech

Seeds

Seed configurations
Seed properties
dbt seed command

Getting started

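Seeds are CSV files in your dbt project (typically in the seeds directory) that dbt loads into your data warehouse with the dbt seed command. Seeds can be referenced in downstream models with the ref function.

Because these CSV files live in your dbt repository, they are version controlled and code reviewable. Seeds are best suited to static data that changes infrequently.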

Good use cases for seeds:

A list of mappings of country codes to country names
A list of test emails to exclude from analysis
A list of employee account IDs

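Poor use cases for seeds:

Loading raw data that has been exported to CSV
Any kind of production data containing sensitive information, for example personally identifiable information (PII) or passwords.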

Example

To load a seed file in your dbt project:

Place a .csv file in the seeds directory, e.g. seeds/country_codes.csv
seeds/country_codes.csv
```
country_code,country_name
US,United States
CA,Canada
GB,United Kingdom
...
```

Run dbt seed. A new table is created in your warehouse, named after the file: country_codes in the example above.

$ dbt seed

Found 2 models, 3 tests, 0 archives, 0 analyses, 53 macros, 0 operations, 1 seed file

14:46:15 | Concurrency: 1 threads (target='dev')
14:46:15 |
14:46:15 | 1 of 1 START seed file    analytics.country_codes........................... [RUN]
14:46:15 | 1 of 1 OK loaded seed file analytics.country_codes....................... [INSERT 3 in 0.01s]
14:46:16 |
14:46:16 | Finished running 1 seed in 0.14s.

Completed successfully

Done. PASS=1 ERROR=0 SKIP=0 TOTAL=1

Refer to the seed in downstream models using the ref function.

select * from {{ ref('country_codes') }}

Configuring seeds

Seeds are configured in dbt_project.yml. See https://docs.getdbt.com/reference/seed-configs for the available configurations.
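As a rough illustration of what seed configuration in dbt_project.yml can look like, the sketch below sets two configs for every seed in a project and one for a single seed. The project name jaffle_shop, the schema name seed_data, and the tag are assumptions chosen for illustration, not values from the dbt docs page summarized here.

```yaml
# dbt_project.yml (illustrative sketch; the names below are assumptions)
seeds:
  jaffle_shop:              # assumed project name; must match `name:` in dbt_project.yml
    +schema: seed_data      # hypothetical custom schema for every seed in the project
    +quote_columns: false   # do not quote column names when creating the seed table
    country_codes:          # applies only to seeds/country_codes.csv
      +tags: ["reference"]  # hypothetical tag
```

Note that with dbt's default schema naming, a custom schema is appended to the target schema rather than replacing it.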

Documenting and testing seeds

Seeds are documented and tested by declaring properties. For seed properties, see https://docs.getdbt.com/reference/seed-properties

FAQs

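Can I use seeds to load raw data?

Do not use seeds to load raw data (for example, large CSV exports from a production database).
Because seeds are version controlled, they are best suited to files that contain business-specific logic, for example a list of country codes or user IDs of employees.

Loading CSVs with dbt's seed functionality is not performant for large files. Consider using a different tool to load these CSVs into your data warehouse.

Can I store my seeds in a directory other than the `seeds` directory in my project?

By default, dbt expects seed files to be in the seeds directory. You can change this with the seed-paths config.
dbt_project.yml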
```
seed-paths: ["custom_seeds"]
```
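The columns of my seed changed, and now I get an error when running the `seed` command. What should I do?

If you change the columns of a seed, dbt seed may fail with a Database Error.
To work around this, run dbt seed --full-refresh, which drops and recreates the seed table.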

How do I test and document seeds?

Declare properties for the seed in a .yml file, as follows:

seeds/schema.yml

version: 2

seeds:
  - name: country_codes
    description: A mapping of two letter country codes to country names
    columns:
      - name: country_code
        tests:
          - unique
          - not_null
      - name: country_name
        tests:
          - unique
          - not_null

How do I set the data type of a column in a seed?

Use the column_types config in your project configuration file:

dbt_project.yml

seeds:
  jaffle_shop: # you must include the project name
    warehouse_locations:
      +column_types:
        zipcode: varchar(5)

How do I run models downstream of a seed?
```
$ dbt run --select country_codes+
```
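How do I preserve leading zeros in a seed? (zero padding)
If you need to preserve leading zeros (for example in a zip code or mobile number):
1. v0.16.0 onwards: Include the leading zeros in your seed file, and use the column_types configuration with a varchar data type of the correct length.
2. Prior to v0.16.0: Use a downstream model to pad the leading zeros using SQL, for example: lpad(zipcode, 5, '0')

How do I build one seed at a time?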
```
$ dbt seed --select country_codes
```
Do hooks run with seeds?
Yes, hooks can be used with seeds:
- pre-hook and post-hook
- on-run-start and on-run-end hooks

These can be configured in dbt_project.yml.
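
A minimal sketch of how hooks for seeds might be declared in dbt_project.yml. The project name jaffle_shop and the role reporter are hypothetical, and the grant statements assume a warehouse that accepts this syntax (for example Snowflake); adapt them to your own database.

```yaml
# dbt_project.yml (illustrative sketch; project name and role are hypothetical)
seeds:
  jaffle_shop:
    # runs after each seed in the project is loaded; {{ this }} is the seed's table
    +post-hook:
      - "grant select on {{ this }} to role reporter"

# runs once at the end of dbt seed (and of other commands such as dbt run)
on-run-end:
  - "grant usage on schema {{ target.schema }} to role reporter"
```

A pre-hook is declared the same way with +pre-hook, and on-run-start mirrors on-run-end at the start of the invocation.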