🔗

dlt入門

2025/04/14に公開

はじめに

前回の dbt に続いて、dlt です。dbt は、data build tool。dlt は、data load tool。
ま、やってみます。

やってみる

インストール

uv でプロジェクトを作って、インストールします。

$ uv init
Initialized project `dlt-getting-started`
$ uv venv
Using CPython 3.12.9
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ uv add dlt[duckdb]
Resolved 50 packages in 566ms
Prepared 30 packages in 164ms
Installed 43 packages in 46ms
:
 + urllib3==2.3.0
$ uv run dlt --version
dlt 1.9.0

REST API

uv のワークスペースのメンバーとしてプロジェクトを作成します。

$ uv init rest-api
Adding `rest-api` as member of workspace `/home/ec2-user/work/mds/dlt-getting-started`
Initialized project `rest-api` at `/home/ec2-user/work/mds/dlt-getting-started/rest-api`
$ cd rest-api
$ uv venv
Using CPython 3.12.9
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ uv run dlt --version
dlt 1.9.0

dlt init します。途中の質問は Y で続行します。

$ uv run dlt init rest_api duckdb
Creating a new pipeline with the dlt core source rest_api (Generic API Source)
NOTE: Beginning with dlt 1.0.0, the source rest_api will no longer be copied from the verified sources repo but imported from dlt.sources. You can provide the --eject flag to revert to the old behavior.
Do you want to proceed? [Y/n]: 

Your new pipeline rest_api is ready to be customized!
* Review and change how dlt loads your data in rest_api_pipeline.py
* Add credentials for duckdb and other secrets in ./.dlt/secrets.toml
* Add the required dependencies to pyproject.toml:
  dlt[duckdb]>=1.9.0
  If the dlt dependency is already added, make sure you install the extra for duckdb to it
  If you are using poetry you may issue the following command:
  poetry add dlt -E duckdb

* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
$ tree
.
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── .gitignore
├── README.md
├── main.py
├── pyproject.toml
└── rest_api_pipeline.py

パイプラインの実行

ドキュメントと異なり、requirements.txt はできていないので、出力されたとおり、pyproject.toml に依存関係を追加します。

pyproject.toml
@@ -4,4 +4,6 @@
 description = "Add your description here"
 readme = "README.md"
 requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+    dlt[duckdb]>=1.9.0
+]

このまま動かすと github の rate 制限でエラーになるため、github をコメントアウトして実行します。

rest_api_pipeline.py
@@ -149,5 +149,5 @@
 if __name__ == "__main__":
-    load_github()
+    # load_github()
     load_pokemon()
$ uv run rest_api_pipeline.py 
2025-04-06 08:11:11,321|[ERROR]|34180|140455832667968|dlt|utils.py|check_connection:21|Error checking connection: The following resources could not be found in source rest_api: {'not_existing_endpoint'}. Available resources are: {'location', 'pokemon', 'berry'}
2025-04-06 08:11:11,538|[WARNING]|34180|140455832667968|dlt|client.py|detect_paginator:312|Fallback paginator used: SinglePagePaginator at 7fbe5bb2aa50. Please provide right paginator manually.
Pipeline rest_api_pokemon load step completed in 0.17 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_api_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/rest-api/rest_api_pokemon.duckdb location to store data
Load package 1743927071.383116 is LOADED and contains no failed jobs

エラーや警告が出ているが、以下のとおり、1302 行が DuckDB に格納されました。

$ ~/.duckdb/cli/latest/duckdb 
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open rest_api_pokemon.duckdb
D .tables
_dlt_loads           _dlt_version         location           
_dlt_pipeline_state  berry                pokemon            
D select * from rest_api_data.pokemon;
┌─────────────────────────────┬──────────────────────────────────────────┬───────────────────┬────────────────┐
│            name             │                   url                    │   _dlt_load_id    │    _dlt_id     │
│           varchar           │                 varchar                  │      varchar      │    varchar     │
├─────────────────────────────┼──────────────────────────────────────────┼───────────────────┼────────────────┤
│ bulbasaur                   │ https://pokeapi.co/api/v2/pokemon/1/     │ 1743927071.383116 │ x5YRTG/PnZDqLA │
│ ivysaur                     │ https://pokeapi.co/api/v2/pokemon/2/     │ 1743927071.383116 │ aOkyEKUQqT5I+g │
│ venusaur                    │ https://pokeapi.co/api/v2/pokemon/3/     │ 1743927071.383116 │ LSFAjHFmu3nksw │
│ charmander                  │ https://pokeapi.co/api/v2/pokemon/4/     │ 1743927071.383116 │ 9LH0ak/5YqWQ/Q │
│ charmeleon                  │ https://pokeapi.co/api/v2/pokemon/5/     │ 1743927071.383116 │ KbgLYyCcZmWBZg │
│ charizard                   │ https://pokeapi.co/api/v2/pokemon/6/     │ 1743927071.383116 │ HC8UYN6tXMi4LQ │
│ squirtle                    │ https://pokeapi.co/api/v2/pokemon/7/     │ 1743927071.383116 │ Jp9dJ61hn5neEQ │
│ wartortle                   │ https://pokeapi.co/api/v2/pokemon/8/     │ 1743927071.383116 │ 0b15UoCBQMu20Q │
│ blastoise                   │ https://pokeapi.co/api/v2/pokemon/9/     │ 1743927071.383116 │ q3XR4Zx3UiWTPQ │
│ caterpie                    │ https://pokeapi.co/api/v2/pokemon/10/    │ 1743927071.383116 │ h7Hs47VCSQpGaQ │
│ metapod                     │ https://pokeapi.co/api/v2/pokemon/11/    │ 1743927071.383116 │ Xib1OanrtXydsw │
│ butterfree                  │ https://pokeapi.co/api/v2/pokemon/12/    │ 1743927071.383116 │ wk79bEDFV7pgQQ │
│ weedle                      │ https://pokeapi.co/api/v2/pokemon/13/    │ 1743927071.383116 │ u8HsCI6QhFUtrw │
│ kakuna                      │ https://pokeapi.co/api/v2/pokemon/14/    │ 1743927071.383116 │ +6rbvqAFsFDTvw │
│ beedrill                    │ https://pokeapi.co/api/v2/pokemon/15/    │ 1743927071.383116 │ /txiUsIf0Kd/vQ │
│ pidgey                      │ https://pokeapi.co/api/v2/pokemon/16/    │ 1743927071.383116 │ j5wfFIfo1ZCVFQ │
│ pidgeotto                   │ https://pokeapi.co/api/v2/pokemon/17/    │ 1743927071.383116 │ rwCzjLbfmGyHrw │
│ pidgeot                     │ https://pokeapi.co/api/v2/pokemon/18/    │ 1743927071.383116 │ Q25KkTxI0cKF/w │
│ rattata                     │ https://pokeapi.co/api/v2/pokemon/19/    │ 1743927071.383116 │ 37qoBEd+yA2azQ │
│ raticate                    │ https://pokeapi.co/api/v2/pokemon/20/    │ 1743927071.383116 │ 9JZo2lO/tw7zow │
│    ·                        │                   ·                      │         ·         │       ·        │
│    ·                        │                   ·                      │         ·         │       ·        │
│    ·                        │                   ·                      │         ·         │       ·        │
│ tatsugiri-droopy            │ https://pokeapi.co/api/v2/pokemon/10258/ │ 1743927071.383116 │ glDbU62GwBf+Xw │
│ tatsugiri-stretchy          │ https://pokeapi.co/api/v2/pokemon/10259/ │ 1743927071.383116 │ L6/XzsIRwmSYCQ │
│ squawkabilly-blue-plumage   │ https://pokeapi.co/api/v2/pokemon/10260/ │ 1743927071.383116 │ ci2F5LvfTJG+AA │
│ squawkabilly-yellow-plumage │ https://pokeapi.co/api/v2/pokemon/10261/ │ 1743927071.383116 │ t7nsdzb8iq2Yig │
│ squawkabilly-white-plumage  │ https://pokeapi.co/api/v2/pokemon/10262/ │ 1743927071.383116 │ dhQevhXet2Ivng │
│ gimmighoul-roaming          │ https://pokeapi.co/api/v2/pokemon/10263/ │ 1743927071.383116 │ kvk7X9K0cgAPyQ │
│ koraidon-limited-build      │ https://pokeapi.co/api/v2/pokemon/10264/ │ 1743927071.383116 │ syWzCkYGTmxW4w │
│ koraidon-sprinting-build    │ https://pokeapi.co/api/v2/pokemon/10265/ │ 1743927071.383116 │ QLRg03EXUlxBSQ │
│ koraidon-swimming-build     │ https://pokeapi.co/api/v2/pokemon/10266/ │ 1743927071.383116 │ KgMSLmMgVUEYng │
│ koraidon-gliding-build      │ https://pokeapi.co/api/v2/pokemon/10267/ │ 1743927071.383116 │ Gr2Q9K9iteTejw │
│ miraidon-low-power-mode     │ https://pokeapi.co/api/v2/pokemon/10268/ │ 1743927071.383116 │ bIkyu2abvRVECA │
│ miraidon-drive-mode         │ https://pokeapi.co/api/v2/pokemon/10269/ │ 1743927071.383116 │ UfBO+4Mfxw293w │
│ miraidon-aquatic-mode       │ https://pokeapi.co/api/v2/pokemon/10270/ │ 1743927071.383116 │ Mr5sNDwmjyv70g │
│ miraidon-glide-mode         │ https://pokeapi.co/api/v2/pokemon/10271/ │ 1743927071.383116 │ P7HIoBde+8ytgg │
│ ursaluna-bloodmoon          │ https://pokeapi.co/api/v2/pokemon/10272/ │ 1743927071.383116 │ w70EJa8XFs6Rgw │
│ ogerpon-wellspring-mask     │ https://pokeapi.co/api/v2/pokemon/10273/ │ 1743927071.383116 │ 2Z7lu3PHgNQPYw │
│ ogerpon-hearthflame-mask    │ https://pokeapi.co/api/v2/pokemon/10274/ │ 1743927071.383116 │ pZWubGPwSSsKxQ │
│ ogerpon-cornerstone-mask    │ https://pokeapi.co/api/v2/pokemon/10275/ │ 1743927071.383116 │ 6YNuuUZuJVz3Cg │
│ terapagos-terastal          │ https://pokeapi.co/api/v2/pokemon/10276/ │ 1743927071.383116 │ +5y8clVXDqV7eQ │
│ terapagos-stellar           │ https://pokeapi.co/api/v2/pokemon/10277/ │ 1743927071.383116 │ K9TmDZyyMsyynA │
├─────────────────────────────┴──────────────────────────────────────────┴───────────────────┴────────────────┤
│ 1302 rows (40 shown)                                                                              4 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D .exit

データの調査

$ uv add streamlit
Resolved 72 packages in 388ms
Prepared 13 packages in 977ms
Installed 21 packages in 38ms
:
 + watchdog==6.0.0
$ uv run dlt pipeline rest_api_pokemon show
Found pipeline rest_api_pokemon in /home/ec2-user/.dlt/pipelines

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.48.149.30:8501
  External URL: http://18.178.73.138:8501

データの置換

rest_api_pipeline.py
@@ -125,6 +125,7 @@
                         "limit": 1000,
                     },
                 },
+                "write_disposition": "replace", # Setting the write disposition to `replace
             },
             "resources": [
                 "pokemon",

データのマージ

rest_api_pipeline.py
@@ -125,10 +125,20 @@ def load_pokemon() -> None:
                         "limit": 1000,
                     },
                 },
-                "write_disposition": "replace", # Setting the write disposition to `replace
+                # For the `berry` and `location` resources, we keep
+                # the `replace` write disposition
+                "write_disposition": "replace",
             },
             "resources": [
-                "pokemon",
+                # We create a specific configuration for the `pokemon` resource
+                # using a dictionary instead of a string to configure
+                # the primary key and write disposition
+                {
+                    "name": "pokemon",
+                    "primary_key": "name",
+                    "write_disposition": "merge",
+                },
+                # The `berry` and `location` resources will use the default
                 "berry",
                 "location",
             ],
$ uv run rest_api_pipeline.py 
2025-04-07 05:37:43,399|[ERROR]|31535|140519944869696|dlt|utils.py|check_connection:21|Error checking connection: The following resources could not be found in source rest_api: {'not_existing_endpoint'}. Available resources are: {'location', 'pokemon', 'berry'}
2025-04-07 05:37:43,602|[WARNING]|31535|140519944869696|dlt|client.py|detect_paginator:312|Fallback paginator used: SinglePagePaginator at 7fcd30ed1f10. Please provide right paginator manually.
Pipeline rest_api_pokemon load step completed in 0.24 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_api_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/rest-api/rest_api_pokemon.duckdb location to store data
Load package 1744004263.47209 is LOADED and contains no failed jobs
$ ~/.duckdb/cli/latest/duckdb 
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open rest_api_pokemon.duckdb
D select database_name, schema_name, table_name from duckdb_tables;
┌──────────────────┬───────────────────────┬─────────────────────┐
│  database_name   │      schema_name      │     table_name      │
│     varchar      │        varchar        │       varchar       │
├──────────────────┼───────────────────────┼─────────────────────┤
│ rest_api_pokemon │ rest_api_data         │ berry               │
│ rest_api_pokemon │ rest_api_data         │ location            │
│ rest_api_pokemon │ rest_api_data         │ pokemon             │
│ rest_api_pokemon │ rest_api_data         │ _dlt_loads          │
│ rest_api_pokemon │ rest_api_data         │ _dlt_pipeline_state │
│ rest_api_pokemon │ rest_api_data         │ _dlt_version        │
│ rest_api_pokemon │ rest_api_data_staging │ pokemon             │
│ rest_api_pokemon │ rest_api_data_staging │ _dlt_version        │
└──────────────────┴───────────────────────┴─────────────────────┘
D .exit

データのインクリメンタルなロード

github は rate 制限でエラーになるため、スキップします。

SQL DB

同様に、uv のワークスペースのメンバーとしてプロジェクトを作成します。
dlt init します。途中の質問は Y で続行します。

$ uv run dlt init sql_database duckdb
Creating a new pipeline with the dlt core source sql_database (Source that loads tables form any SQLAlchemy supported database, supports batching requests and incremental loads.)
NOTE: Beginning with dlt 1.0.0, the source sql_database will no longer be copied from the verified sources repo but imported from dlt.sources. You can provide the --eject flag to revert to the old behavior.
Do you want to proceed? [Y/n]: y

Your new pipeline sql_database is ready to be customized!
* Review and change how dlt loads your data in sql_database_pipeline.py
* Add credentials for duckdb and other secrets in ./.dlt/secrets.toml
* Add the required dependencies to pyproject.toml:
  dlt[duckdb,sql-database]>=1.9.0
  If the dlt dependency is already added, make sure you install the extra for duckdb to it
  If you are using poetry you may issue the following command:
  poetry add dlt -E duckdb

* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
$ tree
.
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── .gitignore
├── README.md
├── main.py
├── pyproject.toml
└── sql_database_pipeline.py

同様に、出力されたとおり、pyproject.toml に依存関係を追加します。

pyproject.toml
@@ -4,4 +4,6 @@
 description = "Add your description here"
 readme = "README.md"
 requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+    "dlt[duckdb,sql-database]>=1.9.0"
+]

パイプラインスクリプトを構成する

sql_database_pipline.py
@@ -344,8 +344,26 @@
     info = pipeline.run(sql_alchemy_source)
     print(info)
 
+def load_tables_family_and_genome():
+
+    # Create a dlt source that will load tables "family" and "genome"
+    source = sql_database().with_resources("family", "genome")
+
+    # Create a dlt pipeline object
+    pipeline = dlt.pipeline(
+        pipeline_name="sql_to_duckdb_pipeline", # Custom name for the pipeline
+        destination="duckdb", # dlt destination to which the data will be loaded
+        dataset_name="sql_to_duckdb_pipeline_data" # Custom name for the dataset created in the destination
+    )
+
+    # Run the pipeline
+    load_info = pipeline.run(source)
+
+    # Pretty print load information
+    print(load_info)
 
 if __name__ == "__main__":
+    load_tables_family_and_genome()
     # Load selected tables with different settings
     # load_select_tables_from_database()
 
@@ -356,7 +374,7 @@
     # select_with_end_value_and_row_order()
 
     # Load tables with the standalone table resource
-    load_standalone_table_resource()
+    # load_standalone_table_resource()
 
     # Load all tables from the database.
     # Warning: The sample database is very large

資格情報を追加する

.dlt/secrets.toml
[sources.sql_database.credentials]
drivername = "mysql+pymysql" # database+dialect
database = "Rfam"
password = ""
username = "rfamro"
host = "mysql-rfam-public.ebi.ac.uk"
port = 4497

依存関係をインストールする

最初にやってしまったので、pymysql だけ追加しておきます。

$ uv add pymysql
Resolved 76 packages in 268ms
Prepared 3 packages in 69ms
Installed 3 packages in 8ms
 + greenlet==3.1.1
 + pymysql==1.1.1
 + sqlalchemy==2.0.40

結局、pyproject.toml の差分は、

pyproject.toml
@@ -4,4 +4,7 @@
 description = "Add your description here"
 readme = "README.md"
 requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+    "dlt[duckdb,sql-database]>=1.9.0",
+    "pymysql>=1.1.1",
+]

パイプラインを実行する

$ uv run sql_database_pipeline.py 
Pipeline sql_to_duckdb_pipeline load step completed in 4.34 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744006598.0197978 is LOADED and contains no failed jobs
$ ~/.duckdb/cli/latest/duckdb 
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open sql_to_duckdb_pipeline.duckdb
D select * from sql_to_duckdb_pipeline_data.genome;
┌─────────────┬─────────────────┬──────────────────┬─────────┬───┬──────────────────────┬──────────────────────┬────────────────────┬────────────────┐
│    upid     │  assembly_acc   │ assembly_version │ wgs_acc │ … │       created        │       updated        │    _dlt_load_id    │    _dlt_id     │
│   varchar   │     varchar     │      int64       │ varchar │   │ timestamp with tim…  │ timestamp with tim…  │      varchar       │    varchar     │
├─────────────┼─────────────────┼──────────────────┼─────────┼───┼──────────────────────┼──────────────────────┼────────────────────┼────────────────┤
│ RG000000001 │ NULL            │             NULL │ NULL    │ … │ 2017-06-06 15:11:0…  │ 2020-04-23 11:46:0…  │ 1744006598.0197978 │ RAoK+ODZamk3UQ │
│ RG000000002 │ NULL            │             NULL │ NULL    │ … │ 2017-06-06 15:11:0…  │ 2020-04-23 11:46:0…  │ 1744006598.0197978 │ drTISYKdQQ9B2g │
│ RG000000003 │ NULL            │             NULL │ NULL    │ … │ 2017-06-06 15:11:1…  │ 2020-04-23 11:47:1…  │ 1744006598.0197978 │ PpqHdR+rXcfIrw │
│ RG000000004 │ NULL            │             NULL │ NULL    │ … │ 2017-06-06 15:11:1…  │ 2020-04-23 11:46:0…  │ 1744006598.0197978 │ Gx2w3Lx18MhAJw │
│ RG000000005 │ NULL            │             NULL │ NULL    │ … │ 2017-06-06 15:11:2…  │ 2020-04-23 11:46:2…  │ 1744006598.0197978 │ drDFcqU8/hAPAw │
│ RG000000006 │ NULL            │             NULL │ NULL    │ … │ 2017-06-06 15:11:2…  │ 2020-04-23 11:52:3…  │ 1744006598.0197978 │ JqAi5zqERM4xyQ │
│ RG000000007 │ GCA_000413255.1 │                1 │ NULL    │ … │ 2017-06-06 15:11:2…  │ 2020-04-23 11:44:5…  │ 1744006598.0197978 │ kf3xKgFBwCRY2w │
│ RG000000009 │ GCF_000836845.1 │                1 │ NULL    │ … │ 2017-06-06 15:11:4…  │ 2020-04-23 11:49:5…  │ 1744006598.0197978 │ zfrLC0X2XNk8iQ │
│ RG000000010 │ GCF_000840125.1 │                1 │ NULL    │ … │ 2017-06-06 15:11:4…  │ 2020-04-23 11:46:2…  │ 1744006598.0197978 │ 77rxN5om2PJxRw │
│ RG000000011 │ GCF_000847605.1 │                1 │ NULL    │ … │ 2017-06-06 15:11:5…  │ 2020-04-23 11:46:0…  │ 1744006598.0197978 │ 2FwYLNks8oC1IQ │
│ RG000000012 │ GCF_000847825.1 │                1 │ NULL    │ … │ 2017-06-06 15:11:5…  │ 2020-04-23 11:46:3…  │ 1744006598.0197978 │ ffZy/QDTO/AS4Q │
│ RG000000013 │ GCF_000849085.1 │                1 │ NULL    │ … │ 2017-06-06 15:12:0…  │ 2020-04-23 11:47:2…  │ 1744006598.0197978 │ hgJPzGytJjMW7Q │
│ RG000000016 │ GCF_000849285.2 │                2 │ NULL    │ … │ 2017-06-06 15:12:1…  │ 2020-04-23 11:48:0…  │ 1744006598.0197978 │ +ClNxwioiVyKAA │
│ RG000000019 │ GCF_000853665.1 │                1 │ NULL    │ … │ 2017-06-06 15:12:2…  │ 2020-04-23 11:46:0…  │ 1744006598.0197978 │ pkPzLyG19KqYoQ │
│ RG000000020 │ GCF_000854685.1 │                1 │ NULL    │ … │ 2017-06-06 15:12:3…  │ 2020-04-23 11:49:5…  │ 1744006598.0197978 │ RcAEzAhDMJZhYA │
│ RG000000021 │ GCF_000854865.1 │                1 │ NULL    │ … │ 2017-06-06 15:12:3…  │ 2020-04-23 11:46:0…  │ 1744006598.0197978 │ gs+pOektB6J8IA │
│ RG000000025 │ GCF_000849985.1 │                1 │ NULL    │ … │ 2017-06-06 15:12:5…  │ 2020-04-23 11:46:1…  │ 1744006598.0197978 │ lN2m551byT+kww │
│ RG000000026 │ GCF_000852745.1 │                1 │ NULL    │ … │ 2017-06-06 15:12:5…  │ 2020-04-23 11:48:0…  │ 1744006598.0197978 │ RO2pMMDgrpUP+w │
│ RG000000027 │ NC_038861.1     │                1 │ NULL    │ … │ 2020-05-06 14:56:5…  │ 2024-09-10 04:53:1…  │ 1744006598.0197978 │ 39PRFerp+sByMw │
│ RG000000028 │ NC_032730.1     │                1 │ NULL    │ … │ 2020-05-06 14:56:5…  │ 2024-09-10 04:56:1…  │ 1744006598.0197978 │ v1HJFZlwxL/w+A │
│      ·      │      ·          │                · │  ·      │ · │          ·           │          ·           │         ·          │       ·        │
│      ·      │      ·          │                · │  ·      │ · │          ·           │          ·           │         ·          │       ·        │
│      ·      │      ·          │                · │  ·      │ · │          ·           │          ·           │         ·          │       ·        │
│ UP001269074 │ GCA_031320445.1 │                1 │ NULL    │ … │ 2024-06-27 09:23:0…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 3N9L7w5vHw7Daw │
│ UP001269135 │ GCA_031297675.1 │                1 │ NULL    │ … │ 2024-06-28 19:29:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ R4AgPHTwF01Kuw │
│ UP001269161 │ GCA_032416515.1 │                1 │ NULL    │ … │ 2024-06-27 09:21:3…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ PpzQ8eKDQ2ZyjQ │
│ UP001269216 │ GCA_031314795.1 │                1 │ NULL    │ … │ 2024-06-27 09:22:0…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ z9Fpycw+xob8TQ │
│ UP001269251 │ GCA_032423685.1 │                1 │ NULL    │ … │ 2024-06-28 19:33:2…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 7rBXZviyfFCniA │
│ UP001269254 │ GCA_031249865.1 │                1 │ NULL    │ … │ 2024-06-27 11:07:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ gHS9qUlQJ/w60g │
│ UP001269288 │ GCA_031298415.1 │                1 │ NULL    │ … │ 2024-06-28 19:30:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 8UiqwxmwRImv1A │
│ UP001269304 │ GCA_031579545.1 │                1 │ NULL    │ … │ 2024-06-27 09:20:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ dnLPvoyA6qaoUA │
│ UP001269323 │ GCA_031315845.1 │                1 │ NULL    │ … │ 2024-06-27 09:23:4…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ aTc+gmY5ZiDT2g │
│ UP001269365 │ GCA_031297925.1 │                1 │ NULL    │ … │ 2024-06-27 09:19:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 4OsdbYElmRpmJg │
│ UP001269389 │ GCA_031315295.1 │                1 │ NULL    │ … │ 2024-06-27 09:23:1…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ xDaVRqSqQkNP8Q │
│ UP001269398 │ GCA_031311295.1 │                1 │ NULL    │ … │ 2024-06-27 09:26:3…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ Cb0ZVbpstoqOSg │
│ UP001269477 │ GCA_031345405.1 │                1 │ NULL    │ … │ 2024-06-27 09:19:2…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ kCVeFFc3kJsTzA │
│ UP001269491 │ GCA_031313775.1 │                1 │ NULL    │ … │ 2024-06-28 19:33:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 4jniuAfE2ukDJg │
│ UP001295947 │ GCA_963583325.1 │                1 │ NULL    │ … │ 2024-06-27 09:26:4…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 5cMRssA3IYvIuw │
│ UP001295980 │ GCA_963583275.1 │                1 │ NULL    │ … │ 2024-06-27 09:26:3…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ 0uw4c/GtNPHWKw │
│ UP001296009 │ GCA_963583145.1 │                1 │ NULL    │ … │ 2024-06-28 19:29:4…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ v4LNPAv7TBgHXg │
│ UP001296230 │ GCA_963583065.1 │                1 │ NULL    │ … │ 2024-06-27 09:19:4…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ a3Xa1ncobXmLeg │
│ UP001296237 │ GCA_963583445.1 │                1 │ NULL    │ … │ 2024-06-28 19:34:5…  │ 2024-09-02 20:32:5…  │ 1744006598.0197978 │ Tu6c6BRCZM8ORg │
│ x           │ NULL            │             NULL │ NULL    │ … │ 2017-05-15 11:09:5…  │ 2020-04-23 11:53:1…  │ 1744006598.0197978 │ 0tjpW6cUvc+WbA │
├─────────────┴─────────────────┴──────────────────┴─────────┴───┴──────────────────────┴──────────────────────┴────────────────────┴────────────────┤
│ 32169 rows (40 shown)                                                                                                         24 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D select * from sql_to_duckdb_pipeline_data.family;
┌──────────┬───────────────────┬───────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬────────────────────┬────────────────┐
│ rfam_acc │      rfam_id      │ auto_wiki │     description      │ … │       created        │       updated        │    _dlt_load_id    │    _dlt_id     │
│ varchar  │      varchar      │   int64   │       varchar        │   │ timestamp with tim…  │ timestamp with tim…  │      varchar       │    varchar     │
├──────────┼───────────────────┼───────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼────────────────────┼────────────────┤
│ RF00001  │ 5S_rRNA           │      1302 │ 5S ribosomal RNA     │ … │ 2013-10-03 20:41:4…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ OClMfuGcB5Wrvg │
│ RF00002  │ 5_8S_rRNA         │      1303 │ 5.8S ribosomal RNA   │ … │ 2013-10-03 20:47:0…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ qZDwAYdIO/camw │
│ RF00003  │ U1                │      1304 │ U1 spliceosomal RNA  │ … │ 2013-10-03 20:57:1…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ K+Wl36xNJTWIqQ │
│ RF00004  │ U2                │      1305 │ U2 spliceosomal RNA  │ … │ 2013-10-03 20:58:3…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ UAk9TdSlUWIC8A │
│ RF00005  │ tRNA              │      1306 │ tRNA                 │ … │ 2013-10-03 21:00:2…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ rXe0H2YAW/nzGA │
│ RF00006  │ Vault             │      1307 │ Vault RNA            │ … │ 2013-10-03 22:04:0…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ Jxk7lzwmRwCDsg │
│ RF00007  │ U12               │      1308 │ U12 minor spliceos…  │ … │ 2013-10-03 22:04:0…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ PDH7f4fn2cjg5g │
│ RF00008  │ Hammerhead_3      │      1309 │ Hammerhead ribozym…  │ … │ 2013-10-03 22:04:1…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ IDS8muklHuRsjg │
│ RF00009  │ RNaseP_nuc        │      1310 │ Nuclear RNase P      │ … │ 2013-10-03 22:04:1…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ 5k/trucon72JVA │
│ RF00010  │ RNaseP_bact_a     │      2441 │ Bacterial RNase P …  │ … │ 2013-10-03 22:04:2…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ /xTM/2cIzDyN5A │
│ RF00011  │ RNaseP_bact_b     │      2441 │ Bacterial RNase P …  │ … │ 2013-10-03 22:04:5…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ AeKbJ2sqX1ARyQ │
│ RF00012  │ U3                │      1312 │ Small nucleolar RN…  │ … │ 2013-10-03 22:04:5…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ HKKoiog1fgcjOw │
│ RF00013  │ 6S                │      2461 │ 6S / SsrS RNA        │ … │ 2013-10-03 22:05:0…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ +DEnYRXhvYO2oA │
│ RF00014  │ DsrA              │      1237 │ DsrA RNA             │ … │ 2013-02-01 11:56:1…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ A42TH1HH7R4/qQ │
│ RF00015  │ U4                │      1314 │ U4 spliceosomal RNA  │ … │ 2013-10-03 22:05:2…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ xj8cgaEmGDGM0Q │
│ RF00016  │ SNORD14           │      1242 │ Small nucleolar RN…  │ … │ 2013-02-01 11:56:2…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ auFI4SGhh2eBzA │
│ RF00017  │ Metazoa_SRP       │      1315 │ Metazoan signal re…  │ … │ 2013-10-03 22:07:5…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ E+gtShJ+PfZwNw │
│ RF00018  │ CsrB              │      2460 │ CsrB/RsmB RNA family │ … │ 2013-10-03 23:07:2…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ je4IWfYmNIftNQ │
│ RF00019  │ Y_RNA             │      1317 │ Y RNA                │ … │ 2013-10-03 23:07:3…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ Z6xD4ZC8vhXooA │
│ RF00020  │ U5                │      1318 │ U5 spliceosomal RNA  │ … │ 2013-10-03 23:08:4…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ I3QJhHgD/ybezg │
│    ·     │ ·                 │        ·  │          ·           │ · │          ·           │          ·           │         ·          │       ·        │
│    ·     │ ·                 │        ·  │          ·           │ · │          ·           │          ·           │         ·          │       ·        │
│    ·     │ ·                 │        ·  │          ·           │ · │          ·           │          ·           │         ·          │       ·        │
│ RF04290  │ mir-1197          │      1287 │ mir-1197 microRNA …  │ … │ 2023-10-23 23:12:4…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ G8z6tURzn78SYw │
│ RF04291  │ mir-368           │      1287 │ mir-368 microRNA p…  │ … │ 2023-10-23 23:47:3…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ uQXJEqWwI+PK9g │
│ RF04292  │ mir-379           │      1287 │ mir-379 microRNA p…  │ … │ 2023-10-24 00:00:1…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ KusLrBG8LTGjqw │
│ RF04293  │ mir-889           │      1287 │ mir-889 microRNA p…  │ … │ 2023-10-24 00:14:3…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ m0bCDPXGDZ666Q │
│ RF04294  │ mir-3578          │      1287 │ mir-3578 microRNA …  │ … │ 2023-10-24 00:22:2…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ fyLYP9F1uRsdYg │
│ RF04295  │ mir-329           │      1287 │ mir-329 microRNA p…  │ … │ 2023-10-24 00:58:2…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ D+3PjIw6tlulXg │
│ RF04296  │ mir-485           │      1287 │ mir-485 microRNA p…  │ … │ 2023-10-24 01:09:4…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ b+gnt7raOGOc0g │
│ RF04297  │ mir-35_2          │      1287 │ mir-35_2 microRNA …  │ … │ 2023-10-24 10:55:2…  │ 2024-09-10 04:51:1…  │ 1744006598.0197978 │ PkTj21tQpmaRpw │
│ RF04298  │ mir-36_2          │      1287 │ mir-36_2 microRNA …  │ … │ 2023-10-24 10:56:5…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ JwaN+kyG6J70fw │
│ RF04299  │ MIR814            │      1287 │ MIR814 microRNA pr…  │ … │ 2023-10-24 12:12:1…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ 8nxpH2YvaTgk/Q │
│ RF04300  │ mir-39            │      1287 │ mir-39 microRNA pr…  │ … │ 2023-10-24 15:20:3…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ ITyPpnYuaU/ZHw │
│ RF04301  │ mir-200           │      1287 │ mir-200 microRNA p…  │ … │ 2023-11-21 12:46:0…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ vIfA9jEzkihQpQ │
│ RF04302  │ mir-506           │      2235 │ mir-506 microRNA p…  │ … │ 2023-11-27 13:46:2…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ aiEjYQDfbKQJDw │
│ RF04303  │ MIR162_2          │      1287 │ MIR162_2 microRNA …  │ … │ 2023-11-28 13:04:3…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ YbHE+qxzGCVpGA │
│ RF04305  │ HCV_SL1412        │      2418 │ Hepatitis C virus …  │ … │ 2024-04-08 15:25:1…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ ZrN6XN9KvBkTlw │
│ RF04306  │ HCV_SL8001        │      2418 │ Hepatitis C virus …  │ … │ 2024-04-08 15:35:2…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ 6qKRAp9LtyvY4g │
│ RF04307  │ HCV_SL8670        │      2418 │ Hepatitis C virus …  │ … │ 2024-04-08 16:11:3…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ zIyQtZBl7OmFqw │
│ RF04308  │ HCV_SL2531-SL2549 │      2418 │ Hepatitis C virus …  │ … │ 2024-04-10 15:16:5…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ jLvZqnyvNdqK0w │
│ RF04309  │ HCV_J7880         │      2418 │ Hepatitis C virus …  │ … │ 2024-06-17 13:59:0…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ L7T1KucJBTXoMQ │
│ RF04310  │ nqrA-II           │      2768 │ nqrA-II ncRNA motif  │ … │ 2024-06-25 14:43:4…  │ 2024-09-09 21:15:1…  │ 1744006598.0197978 │ YExDc03iO5RNSA │
├──────────┴───────────────────┴───────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴────────────────────┴────────────────┤
│ 4178 rows (40 shown)                                                                                                               37 columns (8 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D .exit

データを探索する

$ uv add streamlit
Resolved 76 packages in 387ms
Audited 67 packages in 0.02ms
$ uv run dlt pipeline sql_to_duckdb_pipeline show
Found pipeline sql_to_duckdb_pipeline in /home/ec2-user/.dlt/pipelines

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.48.149.30:8501
  External URL: http://18.178.73.138:8501

置換でロード

sql_database_pipline.py
@@ -357,7 +357,7 @@
     )
 
     # Run the pipeline
-    load_info = pipeline.run(source)
+    load_info = pipeline.run(source, write_disposition="replace") # Set write_disposition to load the data with "replace"
 
     # Pretty print load information
     print(load_info)
$ uv run sql_database_pipeline.py 
Pipeline sql_to_duckdb_pipeline load step completed in 4.47 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744007697.087969 is LOADED and contains no failed jobs

マージでロード

sql_database_pipline.py
@@ -349,6 +349,10 @@
     # Create a dlt source that will load tables "family" and "genome"
     source = sql_database().with_resources("family", "genome")
 
+    # specify different loading strategy for each resource using apply_hints
+    source.family.apply_hints(write_disposition="merge", primary_key="rfam_id") # merge table "family" on column "rfam_id"
+    source.genome.apply_hints(write_disposition="merge", primary_key="upid") # merge table "genome" on column "upid"
+
     # Create a dlt pipeline object
     pipeline = dlt.pipeline(
         pipeline_name="sql_to_duckdb_pipeline", # Custom name for the pipeline
@@ -357,7 +361,7 @@
     )
 
     # Run the pipeline
-    load_info = pipeline.run(source, write_disposition="replace") # Set write_disposition to load the data with "replace"
+    load_info = pipeline.run(source)
 
     # Pretty print load information
     print(load_info)
$ uv run sql_database_pipeline.py 
Pipeline sql_to_duckdb_pipeline load step completed in 4.84 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744007926.344812 is LOADED and contains no failed jobs

インクリメンタルにロード

sql_database_pipline.py
@@ -349,9 +349,8 @@ 
     # Create a dlt source that will load tables "family" and "genome"
     source = sql_database().with_resources("family", "genome")
 
-    # specify different loading strategy for each resource using apply_hints
-    source.family.apply_hints(write_disposition="merge", primary_key="rfam_id") # merge table "family" on column "rfam_id"
-    source.genome.apply_hints(write_disposition="merge", primary_key="upid") # merge table "genome" on column "upid"
+    # only load rows whose "updated" value is greater than the last pipeline run
+    source.family.apply_hints(incremental=dlt.sources.incremental("updated"))
 
     # Create a dlt pipeline object
     pipeline = dlt.pipeline(
$ uv run sql_database_pipeline.py 
2025-04-07 06:42:45,784|[WARNING]|36348|140550876481344|dlt|__init__.py|_check_duplicate_cursor_threshold:591|Large number of records (832) sharing the same value of cursor field 'updated'. This can happen if the cursor field has a low resolution (e.g., only stores dates without times), causing many records to share the same cursor value. Consider using a cursor column with higher resolution to reduce the deduplication state size.
Pipeline sql_to_duckdb_pipeline load step completed in 4.71 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744008155.7218933 is LOADED and contains no failed jobs

File System

同様に、uv のワークスペースのメンバーとしてプロジェクトを作成します。
dlt init します。途中の質問は Y で続行します。

$ uv run dlt init filesystem duckdb
Creating a new pipeline with the dlt core source filesystem (Reads files in s3, gs or azure buckets using fsspec and provides convenience resources for chunked reading of various file formats)
NOTE: Beginning with dlt 1.0.0, the source filesystem will no longer be copied from the verified sources repo but imported from dlt.sources. You can provide the --eject flag to revert to the old behavior.
Do you want to proceed? [Y/n]: y

Your new pipeline filesystem is ready to be customized!
* Review and change how dlt loads your data in filesystem_pipeline.py
* Add credentials for duckdb and other secrets in ./.dlt/secrets.toml
* Add the required dependencies to pyproject.toml:
  dlt[duckdb,filesystem]>=1.9.0
  If the dlt dependency is already added, make sure you install the extra for duckdb to it
  If you are using poetry you may issue the following command:
  poetry add dlt -E duckdb

* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
$ tree
.
├── .dlt
│   ├── config.toml
│   └── secrets.toml
├── .gitignore
├── README.md
├── filesystem_pipeline.py
├── main.py
└── pyproject.toml

同様に、出力されたとおり、pyproject.toml に依存関係を追加します。

pyproject.toml
@@ -4,4 +4,6 @@
 description = "Add your description here"
 readme = "README.md"
 requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+    "dlt[duckdb,filesystem]>=1.9.0"
+]

パイプラインの作成

filesystem_pipeline.py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
reader = (files | read_csv()).with_name("encounters")
pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")

info = pipeline.run(reader)
print(info)

ファイルシステムソースの設定

.dlt/secrets.toml
[sources.filesystem.credentials]
client_email = "public-access@dlthub-sandbox.iam.gserviceaccount.com"
project_id = "dlthub-sandbox"
private_key = "-----BEGIN PRIVATE KEY-----\nMIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQDGWsVHJRjliojx\nTo+j1qu+x8PzC5ZHZrMx6e8OD6tO8uxMyl65ByW/4FZkVXkS4SF/UYPigGN+rel4\nFmySTbP9orva4t3Pk1B9YSvQMB7V5IktmTIW9Wmdmn5Al8Owb1RehgIidm1EX/Z9\nLr09oLpO6+jUu9RIP2Lf2mVQ6tvkgl7UOdpdGACSNGzRiZgVZDOaDIgH0Tl4UWmK\n6iPxhwZy9YC2B1beLB/NU+F6DUykrEpBzCFQTqFoTUcuDAEvuvpU9JrU2iBMiOGw\nuP3TYSiudhBjmauEUWaMiqWAgFeX5ft1vc7/QWLdI//SAjaiTAu6pTer29Q0b6/5\niGh0jRXpAgMBAAECggEAL8G9C9MXunRvYkH6/YR7F1T7jbH1fb1xWYwsXWNSaJC+\nagKzabMZ2KfHxSJ7IxuHOCNFMKyex+pRcvNbMqJ4upGKzzmeFBMw5u8VYGulkPQU\nPyFKWRK/Wg3PZffkSr+TPargKrH+vt6n9x3gvEzNbqEIDugmRTrVsHXhvOi/BrYc\nWhppHSVQidWZi5KVwDEPJjDQiHEcYI/vfIy1WhZ8VuPAaE5nMZ1m7gTdeaWWKIAj\n/p2ZkLgRdCY8vNkfaNDAxDbvH+CMuTtOw55GydzsYYiofANS6xZ8CedGkYaGi82f\nqGdLghX61Sg3UAb5SI36T/9XbuCpTf3B/SMV20ew8QKBgQDm2yUxL71UqI/xK9LS\nHWnqfHpKmHZ+U9yLvp3v79tM8XueSRKBTJJ4H+UvVQrXlypT7cUEE+sGpTrCcDGL\nm8irtdUmMvdi7AnRBgmWdYKig/kgajLOUrjXqFt/BcFgqMyTfzqPt3xdp6F3rSEK\nHE6PQ8I3pJ0BJOSJRa6Iw2VH1QKBgQDb9WbVFjYwTIKJOV4J2plTK581H8PI9FSt\nUASXcoMTixybegk8beGdwfm2TkyF/UMzCvHfuaUhf+S0GS5Zk31Wkmh1YbmFU4Q9\nm9K/3eoaqF7CohpigB0wJw4HfqNh6Qt+nICOMCv++gw7+/UwfV72dCqr0lpzfX5F\nAsez8igTxQKBgDsq/axOnQr+rO3WGpGJwmS8BKfrzarxGXyjnV0qr51X4yQdfGWx\nV3T8T8RC2qWI8+tQ7IbwB/PLE3VURg6PHe6MixXgSDGNZ7KwBnMOqS23/3kEXwMs\nhn2Xg+PZeMeqW8yN9ldxYqmqViMTN32c5bGoXzXdtfPeHcjlGCerVOEFAoGADVPi\nRjkRUX3hTvVF6Gzxa2OyQuLI1y1O0C2QCakrngyI0Dblxl6WFBwDyHMYGepNnxMj\nsr2p7sy0C+GWuGDCcHNwluQz/Ish8SW28F8+5xyamUp/NMa0fg1vwS6AMdeQFbzf\n4T2z/MAj66KJqcV+8on5Z+3YAzVwaDgR56pdmU0CgYBo2KWcNWAhZ1Qa6sNrITLV\nGlxg6tWP3OredZrmKb1kj5Tk0V+EwVN+HnKzMalv6yyyK7SWq1Z6rvCye37vy27q\nD7xfuz0c0H+48uWJpdLcsxpTioopsRPayiVDKlHSe/Qa+MEjAG3ded5TJiC+5iSw\nxWJ51y0wpme0LWgzzoLbRw==\n-----END PRIVATE KEY-----\n"
.dlt/config.toml
[sources.filesystem]
bucket_url="gs://filesystem-tutorial"

パイプラインの実行

このままいくと死ぬので、gs の依存関係を追加します。

$ uv add dlt[gs]
Resolved 106 packages in 83ms
Prepared 16 packages in 54ms
Installed 16 packages in 19ms
:
 + rsa==4.9
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 3.29 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744009708.2871656 is LOADED and contains no failed jobs

データの調査

これまでと一緒ですが、

$ uv add streamlit
Resolved 106 packages in 46ms
Audited 93 packages in 0.05ms
$ uv run dlt pipeline hospital_data_pipeline show
Found pipeline hospital_data_pipeline in /home/ec2-user/.dlt/pipelines

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.48.149.30:8501
  External URL: http://18.178.73.138:8501

ロードデータの追加、置換、マージ

ページにある merge の場合

filesystem_pipeline.py
@@ -3,7 +3,8 @@ 
 
 files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
 reader = (files | read_csv()).with_name("encounters")
+reader.apply_hints(primary_key="id")
 pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
 
-info = pipeline.run(reader)
+info = pipeline.run(reader, write_disposition="merge")
 print(info)
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 3.52 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744010182.1334121 is LOADED and contains no failed jobs

データをインクリメンタルにロードする

ここは、2段階で実行します。

filesystem_pipeline.py
@@ -2,6 +2,7 @@ 
 from dlt.sources.filesystem import filesystem, read_csv
 
 files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
+files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
 reader = (files | read_csv()).with_name("encounters")
 reader.apply_hints(primary_key="id")
 pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 3.47 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744011013.538883 is LOADED and contains no failed jobs

1回実行してから、以下の修正を行います。

filesystem_pipeline.py
@@ -4,7 +4,7 @@ 
 files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
 files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
 reader = (files | read_csv()).with_name("encounters")
-reader.apply_hints(primary_key="id")
+reader.apply_hints(primary_key="id", incremental=dlt.sources.incremental("STOP"))
 pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
 
 info = pipeline.run(reader, write_disposition="merge")
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 0.07 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744011132.9863658 is LOADED and contains no failed jobs

以降は、サンプルデータを調達しないといけないようなので、スキップします。

おわりに

いかがでしたでしょうか。
これまでのデータ連携では、まず、オリジナルのデータを 中立的な立場 で読み込んで、それを加工して、格納するという ETL の処理をイメージしていましたが、dlt では、読むと同時に書き込み先に書き込めるように自動で整えて、どんどん書いていけるのが新鮮でした。

後続の dbt が、標準化、中間形式、最終のビジネス形式のように加工していくので、dlt は、準備に時間をかけずに、そのままアップするという思想なんだと思います。ELT は、ETL の T が後ろに行っただけというのは甘かったようです。T が後ろに行くことによって、E と L の処理も変わっていく。テーブル一つ一つ、名前を確認して、スキーマ定義して、とかではなく、データソースにつないだら、まとめてドカンと機械的にコピーしましょうと理解しました。

GitHubで編集を提案
株式会社ROBONの技術ブログ

Discussion