Getting Started with dlt
Introduction
Following on from dbt last time, this time it's dlt. dbt is the data build tool; dlt is the data load tool.
Let's give it a try.
Trying It Out
Installation
Create a project with uv and install dlt.
$ uv init
Initialized project `dlt-getting-started`
$ uv venv
Using CPython 3.12.9
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ uv add dlt[duckdb]
Resolved 50 packages in 566ms
Prepared 30 packages in 164ms
Installed 43 packages in 46ms
:
+ urllib3==2.3.0
$ uv run dlt --version
dlt 1.9.0
REST API
Create the project as a member of the uv workspace.
$ uv init rest-api
Adding `rest-api` as member of workspace `/home/ec2-user/work/mds/dlt-getting-started`
Initialized project `rest-api` at `/home/ec2-user/work/mds/dlt-getting-started/rest-api`
$ cd rest-api
$ uv venv
Using CPython 3.12.9
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ uv run dlt --version
dlt 1.9.0
Run dlt init. Answer Y to the prompt to continue.
$ uv run dlt init rest_api duckdb
Creating a new pipeline with the dlt core source rest_api (Generic API Source)
NOTE: Beginning with dlt 1.0.0, the source rest_api will no longer be copied from the verified sources repo but imported from dlt.sources. You can provide the --eject flag to revert to the old behavior.
Do you want to proceed? [Y/n]:
Your new pipeline rest_api is ready to be customized!
* Review and change how dlt loads your data in rest_api_pipeline.py
* Add credentials for duckdb and other secrets in ./.dlt/secrets.toml
* Add the required dependencies to pyproject.toml:
dlt[duckdb]>=1.9.0
If the dlt dependency is already added, make sure you install the extra for duckdb to it
If you are using poetry you may issue the following command:
poetry add dlt -E duckdb
* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
$ tree
.
├── .dlt
│ ├── config.toml
│ └── secrets.toml
├── .gitignore
├── README.md
├── main.py
├── pyproject.toml
└── rest_api_pipeline.py
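For reference, the load_pokemon() part of the generated rest_api_pipeline.py looks roughly like this (abridged from the dlt template; the github source is omitted):
import dlt
from dlt.sources.rest_api import rest_api_source, check_connection

def load_pokemon() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="rest_api_pokemon",
        destination="duckdb",
        dataset_name="rest_api_data",
    )

    pokemon_source = rest_api_source(
        {
            "client": {
                "base_url": "https://pokeapi.co/api/v2/",
            },
            "resource_defaults": {
                "endpoint": {
                    "params": {
                        "limit": 1000,
                    },
                },
            },
            "resources": [
                "pokemon",
                "berry",
                "location",
            ],
        }
    )

    # The template also probes connectivity against a deliberately
    # non-existing endpoint, which is what produces the ERROR line
    # in the run log below
    check_connection(pokemon_source, "not_existing_endpoint")

    load_info = pipeline.run(pokemon_source)
    print(load_info)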
Running the pipeline
Unlike the documentation, no requirements.txt was generated, so add the dependency to pyproject.toml as the output instructs.
@@ -4,4 +4,6 @@
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+ "dlt[duckdb]>=1.9.0"
+]
Running it as-is hits GitHub's rate limit and errors out, so comment out the github load and run it.
@@ -149,5 +149,5 @@
if __name__ == "__main__":
- load_github()
+ # load_github()
load_pokemon()
$ uv run rest_api_pipeline.py
2025-04-06 08:11:11,321|[ERROR]|34180|140455832667968|dlt|utils.py|check_connection:21|Error checking connection: The following resources could not be found in source rest_api: {'not_existing_endpoint'}. Available resources are: {'location', 'pokemon', 'berry'}
2025-04-06 08:11:11,538|[WARNING]|34180|140455832667968|dlt|client.py|detect_paginator:312|Fallback paginator used: SinglePagePaginator at 7fbe5bb2aa50. Please provide right paginator manually.
Pipeline rest_api_pokemon load step completed in 0.17 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_api_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/rest-api/rest_api_pokemon.duckdb location to store data
Load package 1743927071.383116 is LOADED and contains no failed jobs
An error and a warning show up, but as shown below, 1302 rows were loaded into DuckDB.
$ ~/.duckdb/cli/latest/duckdb
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open rest_api_pokemon.duckdb
D .tables
_dlt_loads _dlt_version location
_dlt_pipeline_state berry pokemon
D select * from rest_api_data.pokemon;
┌─────────────────────────────┬──────────────────────────────────────────┬───────────────────┬────────────────┐
│ name │ url │ _dlt_load_id │ _dlt_id │
│ varchar │ varchar │ varchar │ varchar │
├─────────────────────────────┼──────────────────────────────────────────┼───────────────────┼────────────────┤
│ bulbasaur │ https://pokeapi.co/api/v2/pokemon/1/ │ 1743927071.383116 │ x5YRTG/PnZDqLA │
│ ivysaur │ https://pokeapi.co/api/v2/pokemon/2/ │ 1743927071.383116 │ aOkyEKUQqT5I+g │
│ venusaur │ https://pokeapi.co/api/v2/pokemon/3/ │ 1743927071.383116 │ LSFAjHFmu3nksw │
│ charmander │ https://pokeapi.co/api/v2/pokemon/4/ │ 1743927071.383116 │ 9LH0ak/5YqWQ/Q │
│ charmeleon │ https://pokeapi.co/api/v2/pokemon/5/ │ 1743927071.383116 │ KbgLYyCcZmWBZg │
│ charizard │ https://pokeapi.co/api/v2/pokemon/6/ │ 1743927071.383116 │ HC8UYN6tXMi4LQ │
│ squirtle │ https://pokeapi.co/api/v2/pokemon/7/ │ 1743927071.383116 │ Jp9dJ61hn5neEQ │
│ wartortle │ https://pokeapi.co/api/v2/pokemon/8/ │ 1743927071.383116 │ 0b15UoCBQMu20Q │
│ blastoise │ https://pokeapi.co/api/v2/pokemon/9/ │ 1743927071.383116 │ q3XR4Zx3UiWTPQ │
│ caterpie │ https://pokeapi.co/api/v2/pokemon/10/ │ 1743927071.383116 │ h7Hs47VCSQpGaQ │
│ metapod │ https://pokeapi.co/api/v2/pokemon/11/ │ 1743927071.383116 │ Xib1OanrtXydsw │
│ butterfree │ https://pokeapi.co/api/v2/pokemon/12/ │ 1743927071.383116 │ wk79bEDFV7pgQQ │
│ weedle │ https://pokeapi.co/api/v2/pokemon/13/ │ 1743927071.383116 │ u8HsCI6QhFUtrw │
│ kakuna │ https://pokeapi.co/api/v2/pokemon/14/ │ 1743927071.383116 │ +6rbvqAFsFDTvw │
│ beedrill │ https://pokeapi.co/api/v2/pokemon/15/ │ 1743927071.383116 │ /txiUsIf0Kd/vQ │
│ pidgey │ https://pokeapi.co/api/v2/pokemon/16/ │ 1743927071.383116 │ j5wfFIfo1ZCVFQ │
│ pidgeotto │ https://pokeapi.co/api/v2/pokemon/17/ │ 1743927071.383116 │ rwCzjLbfmGyHrw │
│ pidgeot │ https://pokeapi.co/api/v2/pokemon/18/ │ 1743927071.383116 │ Q25KkTxI0cKF/w │
│ rattata │ https://pokeapi.co/api/v2/pokemon/19/ │ 1743927071.383116 │ 37qoBEd+yA2azQ │
│ raticate │ https://pokeapi.co/api/v2/pokemon/20/ │ 1743927071.383116 │ 9JZo2lO/tw7zow │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ tatsugiri-droopy │ https://pokeapi.co/api/v2/pokemon/10258/ │ 1743927071.383116 │ glDbU62GwBf+Xw │
│ tatsugiri-stretchy │ https://pokeapi.co/api/v2/pokemon/10259/ │ 1743927071.383116 │ L6/XzsIRwmSYCQ │
│ squawkabilly-blue-plumage │ https://pokeapi.co/api/v2/pokemon/10260/ │ 1743927071.383116 │ ci2F5LvfTJG+AA │
│ squawkabilly-yellow-plumage │ https://pokeapi.co/api/v2/pokemon/10261/ │ 1743927071.383116 │ t7nsdzb8iq2Yig │
│ squawkabilly-white-plumage │ https://pokeapi.co/api/v2/pokemon/10262/ │ 1743927071.383116 │ dhQevhXet2Ivng │
│ gimmighoul-roaming │ https://pokeapi.co/api/v2/pokemon/10263/ │ 1743927071.383116 │ kvk7X9K0cgAPyQ │
│ koraidon-limited-build │ https://pokeapi.co/api/v2/pokemon/10264/ │ 1743927071.383116 │ syWzCkYGTmxW4w │
│ koraidon-sprinting-build │ https://pokeapi.co/api/v2/pokemon/10265/ │ 1743927071.383116 │ QLRg03EXUlxBSQ │
│ koraidon-swimming-build │ https://pokeapi.co/api/v2/pokemon/10266/ │ 1743927071.383116 │ KgMSLmMgVUEYng │
│ koraidon-gliding-build │ https://pokeapi.co/api/v2/pokemon/10267/ │ 1743927071.383116 │ Gr2Q9K9iteTejw │
│ miraidon-low-power-mode │ https://pokeapi.co/api/v2/pokemon/10268/ │ 1743927071.383116 │ bIkyu2abvRVECA │
│ miraidon-drive-mode │ https://pokeapi.co/api/v2/pokemon/10269/ │ 1743927071.383116 │ UfBO+4Mfxw293w │
│ miraidon-aquatic-mode │ https://pokeapi.co/api/v2/pokemon/10270/ │ 1743927071.383116 │ Mr5sNDwmjyv70g │
│ miraidon-glide-mode │ https://pokeapi.co/api/v2/pokemon/10271/ │ 1743927071.383116 │ P7HIoBde+8ytgg │
│ ursaluna-bloodmoon │ https://pokeapi.co/api/v2/pokemon/10272/ │ 1743927071.383116 │ w70EJa8XFs6Rgw │
│ ogerpon-wellspring-mask │ https://pokeapi.co/api/v2/pokemon/10273/ │ 1743927071.383116 │ 2Z7lu3PHgNQPYw │
│ ogerpon-hearthflame-mask │ https://pokeapi.co/api/v2/pokemon/10274/ │ 1743927071.383116 │ pZWubGPwSSsKxQ │
│ ogerpon-cornerstone-mask │ https://pokeapi.co/api/v2/pokemon/10275/ │ 1743927071.383116 │ 6YNuuUZuJVz3Cg │
│ terapagos-terastal │ https://pokeapi.co/api/v2/pokemon/10276/ │ 1743927071.383116 │ +5y8clVXDqV7eQ │
│ terapagos-stellar │ https://pokeapi.co/api/v2/pokemon/10277/ │ 1743927071.383116 │ K9TmDZyyMsyynA │
├─────────────────────────────┴──────────────────────────────────────────┴───────────────────┴────────────────┤
│ 1302 rows (40 shown) 4 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D .exit
Exploring the data
$ uv add streamlit
Resolved 72 packages in 388ms
Prepared 13 packages in 977ms
Installed 21 packages in 38ms
:
+ watchdog==6.0.0
$ uv run dlt pipeline rest_api_pokemon show
Found pipeline rest_api_pokemon in /home/ec2-user/.dlt/pipelines
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://10.48.149.30:8501
External URL: http://18.178.73.138:8501
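Besides the Streamlit app, recent dlt versions can read the loaded tables back in Python via pipeline.dataset(); a minimal sketch (assumes pandas is installed):
import dlt

# Re-attach to the existing pipeline by name
pipeline = dlt.pipeline(
    pipeline_name="rest_api_pokemon",
    destination="duckdb",
    dataset_name="rest_api_data",
)
# Read the pokemon table back as a pandas DataFrame
print(pipeline.dataset().pokemon.df().head())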
Replacing the data
Setting "write_disposition": "replace" at the resource_defaults level applies it to all of the source's resources.
@@ -125,6 +125,7 @@
"limit": 1000,
},
},
+ "write_disposition": "replace", # Setting the write disposition to `replace
},
"resources": [
"pokemon",
Merging the data
@@ -125,10 +125,20 @@ def load_pokemon() -> None:
"limit": 1000,
},
},
- "write_disposition": "replace", # Setting the write disposition to `replace
+ # For the `berry` and `location` resources, we keep
+ # the `replace` write disposition
+ "write_disposition": "replace",
},
"resources": [
- "pokemon",
+ # We create a specific configuration for the `pokemon` resource
+ # using a dictionary instead of a string to configure
+ # the primary key and write disposition
+ {
+ "name": "pokemon",
+ "primary_key": "name",
+ "write_disposition": "merge",
+ },
+ # The `berry` and `location` resources will use the default
"berry",
"location",
],
$ uv run rest_api_pipeline.py
2025-04-07 05:37:43,399|[ERROR]|31535|140519944869696|dlt|utils.py|check_connection:21|Error checking connection: The following resources could not be found in source rest_api: {'not_existing_endpoint'}. Available resources are: {'location', 'pokemon', 'berry'}
2025-04-07 05:37:43,602|[WARNING]|31535|140519944869696|dlt|client.py|detect_paginator:312|Fallback paginator used: SinglePagePaginator at 7fcd30ed1f10. Please provide right paginator manually.
Pipeline rest_api_pokemon load step completed in 0.24 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_api_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/rest-api/rest_api_pokemon.duckdb location to store data
Load package 1744004263.47209 is LOADED and contains no failed jobs
$ ~/.duckdb/cli/latest/duckdb
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open rest_api_pokemon.duckdb
D select database_name, schema_name, table_name from duckdb_tables;
┌──────────────────┬───────────────────────┬─────────────────────┐
│ database_name │ schema_name │ table_name │
│ varchar │ varchar │ varchar │
├──────────────────┼───────────────────────┼─────────────────────┤
│ rest_api_pokemon │ rest_api_data │ berry │
│ rest_api_pokemon │ rest_api_data │ location │
│ rest_api_pokemon │ rest_api_data │ pokemon │
│ rest_api_pokemon │ rest_api_data │ _dlt_loads │
│ rest_api_pokemon │ rest_api_data │ _dlt_pipeline_state │
│ rest_api_pokemon │ rest_api_data │ _dlt_version │
│ rest_api_pokemon │ rest_api_data_staging │ pokemon │
│ rest_api_pokemon │ rest_api_data_staging │ _dlt_version │
└──────────────────┴───────────────────────┴─────────────────────┘
D .exit
Loading data incrementally
The GitHub example errors out on the rate limit, so I'll skip it.
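For reference, the rest_api source handles incremental loading by marking a query parameter as an incremental cursor. A sketch of that pattern against the GitHub API, following the dlt docs (untested here because of the rate limit; the pipeline and dataset names are my own):
import dlt
from dlt.sources.rest_api import rest_api_source

github_source = rest_api_source(
    {
        "client": {"base_url": "https://api.github.com/repos/dlt-hub/dlt/"},
        "resources": [
            {
                "name": "issues",
                "endpoint": {
                    "path": "issues",
                    "params": {
                        # "since" becomes an incremental cursor: only issues
                        # updated after the last run's high-water mark are fetched
                        "since": {
                            "type": "incremental",
                            "cursor_path": "updated_at",
                            "initial_value": "2024-01-25T00:00:00Z",
                        },
                    },
                },
            },
        ],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="rest_api_github_incremental",
    destination="duckdb",
    dataset_name="github_data",
)
print(pipeline.run(github_source))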
SQL DB
As before, create the project as a member of the uv workspace.
Run dlt init. Answer Y to the prompt to continue.
$ uv run dlt init sql_database duckdb
Creating a new pipeline with the dlt core source sql_database (Source that loads tables form any SQLAlchemy supported database, supports batching requests and incremental loads.)
NOTE: Beginning with dlt 1.0.0, the source sql_database will no longer be copied from the verified sources repo but imported from dlt.sources. You can provide the --eject flag to revert to the old behavior.
Do you want to proceed? [Y/n]: y
Your new pipeline sql_database is ready to be customized!
* Review and change how dlt loads your data in sql_database_pipeline.py
* Add credentials for duckdb and other secrets in ./.dlt/secrets.toml
* Add the required dependencies to pyproject.toml:
dlt[duckdb,sql-database]>=1.9.0
If the dlt dependency is already added, make sure you install the extra for duckdb to it
If you are using poetry you may issue the following command:
poetry add dlt -E duckdb
* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
$ tree
.
├── .dlt
│ ├── config.toml
│ └── secrets.toml
├── .gitignore
├── README.md
├── main.py
├── pyproject.toml
└── sql_database_pipeline.py
As before, add the dependency to pyproject.toml as the output instructs.
@@ -4,4 +4,6 @@
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+ "dlt[duckdb,sql-database]>=1.9.0"
+]
Configuring the pipeline script
@@ -344,8 +344,26 @@
info = pipeline.run(sql_alchemy_source)
print(info)
+def load_tables_family_and_genome():
+
+ # Create a dlt source that will load tables "family" and "genome"
+ source = sql_database().with_resources("family", "genome")
+
+ # Create a dlt pipeline object
+ pipeline = dlt.pipeline(
+ pipeline_name="sql_to_duckdb_pipeline", # Custom name for the pipeline
+ destination="duckdb", # dlt destination to which the data will be loaded
+ dataset_name="sql_to_duckdb_pipeline_data" # Custom name for the dataset created in the destination
+ )
+
+ # Run the pipeline
+ load_info = pipeline.run(source)
+
+ # Pretty print load information
+ print(load_info)
if __name__ == "__main__":
+ load_tables_family_and_genome()
# Load selected tables with different settings
# load_select_tables_from_database()
@@ -356,7 +374,7 @@
# select_with_end_value_and_row_order()
# Load tables with the standalone table resource
- load_standalone_table_resource()
+ # load_standalone_table_resource()
# Load all tables from the database.
# Warning: The sample database is very large
Adding credentials
Add the following to .dlt/secrets.toml:
[sources.sql_database.credentials]
drivername = "mysql+pymysql" # database+dialect
database = "Rfam"
password = ""
username = "rfamro"
host = "mysql-rfam-public.ebi.ac.uk"
port = 4497
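Alternatively, the same credentials can be written as a single connection string, either in secrets.toml or passed directly in code; a sketch:
from dlt.sources.sql_database import sql_database

# Equivalent to the secrets.toml entries above
source = sql_database(
    credentials="mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
).with_resources("family", "genome")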
Installing dependencies
Since I already installed the dependencies at the start, I only add pymysql here.
$ uv add pymysql
Resolved 76 packages in 268ms
Prepared 3 packages in 69ms
Installed 3 packages in 8ms
+ greenlet==3.1.1
+ pymysql==1.1.1
+ sqlalchemy==2.0.40
In the end, the diff to pyproject.toml is:
@@ -4,4 +4,7 @@
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+ "dlt[duckdb,sql-database]>=1.9.0",
+ "pymysql>=1.1.1",
+]
Running the pipeline
$ uv run sql_database_pipeline.py
Pipeline sql_to_duckdb_pipeline load step completed in 4.34 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744006598.0197978 is LOADED and contains no failed jobs
$ ~/.duckdb/cli/latest/duckdb
v1.2.1 8e52ec4395
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .open sql_to_duckdb_pipeline.duckdb
D select * from sql_to_duckdb_pipeline_data.genome;
┌─────────────┬─────────────────┬──────────────────┬─────────┬───┬──────────────────────┬──────────────────────┬────────────────────┬────────────────┐
│ upid │ assembly_acc │ assembly_version │ wgs_acc │ … │ created │ updated │ _dlt_load_id │ _dlt_id │
│ varchar │ varchar │ int64 │ varchar │ │ timestamp with tim… │ timestamp with tim… │ varchar │ varchar │
├─────────────┼─────────────────┼──────────────────┼─────────┼───┼──────────────────────┼──────────────────────┼────────────────────┼────────────────┤
│ RG000000001 │ NULL │ NULL │ NULL │ … │ 2017-06-06 15:11:0… │ 2020-04-23 11:46:0… │ 1744006598.0197978 │ RAoK+ODZamk3UQ │
│ RG000000002 │ NULL │ NULL │ NULL │ … │ 2017-06-06 15:11:0… │ 2020-04-23 11:46:0… │ 1744006598.0197978 │ drTISYKdQQ9B2g │
│ RG000000003 │ NULL │ NULL │ NULL │ … │ 2017-06-06 15:11:1… │ 2020-04-23 11:47:1… │ 1744006598.0197978 │ PpqHdR+rXcfIrw │
│ RG000000004 │ NULL │ NULL │ NULL │ … │ 2017-06-06 15:11:1… │ 2020-04-23 11:46:0… │ 1744006598.0197978 │ Gx2w3Lx18MhAJw │
│ RG000000005 │ NULL │ NULL │ NULL │ … │ 2017-06-06 15:11:2… │ 2020-04-23 11:46:2… │ 1744006598.0197978 │ drDFcqU8/hAPAw │
│ RG000000006 │ NULL │ NULL │ NULL │ … │ 2017-06-06 15:11:2… │ 2020-04-23 11:52:3… │ 1744006598.0197978 │ JqAi5zqERM4xyQ │
│ RG000000007 │ GCA_000413255.1 │ 1 │ NULL │ … │ 2017-06-06 15:11:2… │ 2020-04-23 11:44:5… │ 1744006598.0197978 │ kf3xKgFBwCRY2w │
│ RG000000009 │ GCF_000836845.1 │ 1 │ NULL │ … │ 2017-06-06 15:11:4… │ 2020-04-23 11:49:5… │ 1744006598.0197978 │ zfrLC0X2XNk8iQ │
│ RG000000010 │ GCF_000840125.1 │ 1 │ NULL │ … │ 2017-06-06 15:11:4… │ 2020-04-23 11:46:2… │ 1744006598.0197978 │ 77rxN5om2PJxRw │
│ RG000000011 │ GCF_000847605.1 │ 1 │ NULL │ … │ 2017-06-06 15:11:5… │ 2020-04-23 11:46:0… │ 1744006598.0197978 │ 2FwYLNks8oC1IQ │
│ RG000000012 │ GCF_000847825.1 │ 1 │ NULL │ … │ 2017-06-06 15:11:5… │ 2020-04-23 11:46:3… │ 1744006598.0197978 │ ffZy/QDTO/AS4Q │
│ RG000000013 │ GCF_000849085.1 │ 1 │ NULL │ … │ 2017-06-06 15:12:0… │ 2020-04-23 11:47:2… │ 1744006598.0197978 │ hgJPzGytJjMW7Q │
│ RG000000016 │ GCF_000849285.2 │ 2 │ NULL │ … │ 2017-06-06 15:12:1… │ 2020-04-23 11:48:0… │ 1744006598.0197978 │ +ClNxwioiVyKAA │
│ RG000000019 │ GCF_000853665.1 │ 1 │ NULL │ … │ 2017-06-06 15:12:2… │ 2020-04-23 11:46:0… │ 1744006598.0197978 │ pkPzLyG19KqYoQ │
│ RG000000020 │ GCF_000854685.1 │ 1 │ NULL │ … │ 2017-06-06 15:12:3… │ 2020-04-23 11:49:5… │ 1744006598.0197978 │ RcAEzAhDMJZhYA │
│ RG000000021 │ GCF_000854865.1 │ 1 │ NULL │ … │ 2017-06-06 15:12:3… │ 2020-04-23 11:46:0… │ 1744006598.0197978 │ gs+pOektB6J8IA │
│ RG000000025 │ GCF_000849985.1 │ 1 │ NULL │ … │ 2017-06-06 15:12:5… │ 2020-04-23 11:46:1… │ 1744006598.0197978 │ lN2m551byT+kww │
│ RG000000026 │ GCF_000852745.1 │ 1 │ NULL │ … │ 2017-06-06 15:12:5… │ 2020-04-23 11:48:0… │ 1744006598.0197978 │ RO2pMMDgrpUP+w │
│ RG000000027 │ NC_038861.1 │ 1 │ NULL │ … │ 2020-05-06 14:56:5… │ 2024-09-10 04:53:1… │ 1744006598.0197978 │ 39PRFerp+sByMw │
│ RG000000028 │ NC_032730.1 │ 1 │ NULL │ … │ 2020-05-06 14:56:5… │ 2024-09-10 04:56:1… │ 1744006598.0197978 │ v1HJFZlwxL/w+A │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ UP001269074 │ GCA_031320445.1 │ 1 │ NULL │ … │ 2024-06-27 09:23:0… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 3N9L7w5vHw7Daw │
│ UP001269135 │ GCA_031297675.1 │ 1 │ NULL │ … │ 2024-06-28 19:29:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ R4AgPHTwF01Kuw │
│ UP001269161 │ GCA_032416515.1 │ 1 │ NULL │ … │ 2024-06-27 09:21:3… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ PpzQ8eKDQ2ZyjQ │
│ UP001269216 │ GCA_031314795.1 │ 1 │ NULL │ … │ 2024-06-27 09:22:0… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ z9Fpycw+xob8TQ │
│ UP001269251 │ GCA_032423685.1 │ 1 │ NULL │ … │ 2024-06-28 19:33:2… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 7rBXZviyfFCniA │
│ UP001269254 │ GCA_031249865.1 │ 1 │ NULL │ … │ 2024-06-27 11:07:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ gHS9qUlQJ/w60g │
│ UP001269288 │ GCA_031298415.1 │ 1 │ NULL │ … │ 2024-06-28 19:30:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 8UiqwxmwRImv1A │
│ UP001269304 │ GCA_031579545.1 │ 1 │ NULL │ … │ 2024-06-27 09:20:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ dnLPvoyA6qaoUA │
│ UP001269323 │ GCA_031315845.1 │ 1 │ NULL │ … │ 2024-06-27 09:23:4… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ aTc+gmY5ZiDT2g │
│ UP001269365 │ GCA_031297925.1 │ 1 │ NULL │ … │ 2024-06-27 09:19:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 4OsdbYElmRpmJg │
│ UP001269389 │ GCA_031315295.1 │ 1 │ NULL │ … │ 2024-06-27 09:23:1… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ xDaVRqSqQkNP8Q │
│ UP001269398 │ GCA_031311295.1 │ 1 │ NULL │ … │ 2024-06-27 09:26:3… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ Cb0ZVbpstoqOSg │
│ UP001269477 │ GCA_031345405.1 │ 1 │ NULL │ … │ 2024-06-27 09:19:2… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ kCVeFFc3kJsTzA │
│ UP001269491 │ GCA_031313775.1 │ 1 │ NULL │ … │ 2024-06-28 19:33:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 4jniuAfE2ukDJg │
│ UP001295947 │ GCA_963583325.1 │ 1 │ NULL │ … │ 2024-06-27 09:26:4… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 5cMRssA3IYvIuw │
│ UP001295980 │ GCA_963583275.1 │ 1 │ NULL │ … │ 2024-06-27 09:26:3… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ 0uw4c/GtNPHWKw │
│ UP001296009 │ GCA_963583145.1 │ 1 │ NULL │ … │ 2024-06-28 19:29:4… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ v4LNPAv7TBgHXg │
│ UP001296230 │ GCA_963583065.1 │ 1 │ NULL │ … │ 2024-06-27 09:19:4… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ a3Xa1ncobXmLeg │
│ UP001296237 │ GCA_963583445.1 │ 1 │ NULL │ … │ 2024-06-28 19:34:5… │ 2024-09-02 20:32:5… │ 1744006598.0197978 │ Tu6c6BRCZM8ORg │
│ x │ NULL │ NULL │ NULL │ … │ 2017-05-15 11:09:5… │ 2020-04-23 11:53:1… │ 1744006598.0197978 │ 0tjpW6cUvc+WbA │
├─────────────┴─────────────────┴──────────────────┴─────────┴───┴──────────────────────┴──────────────────────┴────────────────────┴────────────────┤
│ 32169 rows (40 shown) 24 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D select * from sql_to_duckdb_pipeline_data.family;
┌──────────┬───────────────────┬───────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┬────────────────────┬────────────────┐
│ rfam_acc │ rfam_id │ auto_wiki │ description │ … │ created │ updated │ _dlt_load_id │ _dlt_id │
│ varchar │ varchar │ int64 │ varchar │ │ timestamp with tim… │ timestamp with tim… │ varchar │ varchar │
├──────────┼───────────────────┼───────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┼────────────────────┼────────────────┤
│ RF00001 │ 5S_rRNA │ 1302 │ 5S ribosomal RNA │ … │ 2013-10-03 20:41:4… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ OClMfuGcB5Wrvg │
│ RF00002 │ 5_8S_rRNA │ 1303 │ 5.8S ribosomal RNA │ … │ 2013-10-03 20:47:0… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ qZDwAYdIO/camw │
│ RF00003 │ U1 │ 1304 │ U1 spliceosomal RNA │ … │ 2013-10-03 20:57:1… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ K+Wl36xNJTWIqQ │
│ RF00004 │ U2 │ 1305 │ U2 spliceosomal RNA │ … │ 2013-10-03 20:58:3… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ UAk9TdSlUWIC8A │
│ RF00005 │ tRNA │ 1306 │ tRNA │ … │ 2013-10-03 21:00:2… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ rXe0H2YAW/nzGA │
│ RF00006 │ Vault │ 1307 │ Vault RNA │ … │ 2013-10-03 22:04:0… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ Jxk7lzwmRwCDsg │
│ RF00007 │ U12 │ 1308 │ U12 minor spliceos… │ … │ 2013-10-03 22:04:0… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ PDH7f4fn2cjg5g │
│ RF00008 │ Hammerhead_3 │ 1309 │ Hammerhead ribozym… │ … │ 2013-10-03 22:04:1… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ IDS8muklHuRsjg │
│ RF00009 │ RNaseP_nuc │ 1310 │ Nuclear RNase P │ … │ 2013-10-03 22:04:1… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ 5k/trucon72JVA │
│ RF00010 │ RNaseP_bact_a │ 2441 │ Bacterial RNase P … │ … │ 2013-10-03 22:04:2… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ /xTM/2cIzDyN5A │
│ RF00011 │ RNaseP_bact_b │ 2441 │ Bacterial RNase P … │ … │ 2013-10-03 22:04:5… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ AeKbJ2sqX1ARyQ │
│ RF00012 │ U3 │ 1312 │ Small nucleolar RN… │ … │ 2013-10-03 22:04:5… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ HKKoiog1fgcjOw │
│ RF00013 │ 6S │ 2461 │ 6S / SsrS RNA │ … │ 2013-10-03 22:05:0… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ +DEnYRXhvYO2oA │
│ RF00014 │ DsrA │ 1237 │ DsrA RNA │ … │ 2013-02-01 11:56:1… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ A42TH1HH7R4/qQ │
│ RF00015 │ U4 │ 1314 │ U4 spliceosomal RNA │ … │ 2013-10-03 22:05:2… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ xj8cgaEmGDGM0Q │
│ RF00016 │ SNORD14 │ 1242 │ Small nucleolar RN… │ … │ 2013-02-01 11:56:2… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ auFI4SGhh2eBzA │
│ RF00017 │ Metazoa_SRP │ 1315 │ Metazoan signal re… │ … │ 2013-10-03 22:07:5… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ E+gtShJ+PfZwNw │
│ RF00018 │ CsrB │ 2460 │ CsrB/RsmB RNA family │ … │ 2013-10-03 23:07:2… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ je4IWfYmNIftNQ │
│ RF00019 │ Y_RNA │ 1317 │ Y RNA │ … │ 2013-10-03 23:07:3… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ Z6xD4ZC8vhXooA │
│ RF00020 │ U5 │ 1318 │ U5 spliceosomal RNA │ … │ 2013-10-03 23:08:4… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ I3QJhHgD/ybezg │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ RF04290 │ mir-1197 │ 1287 │ mir-1197 microRNA … │ … │ 2023-10-23 23:12:4… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ G8z6tURzn78SYw │
│ RF04291 │ mir-368 │ 1287 │ mir-368 microRNA p… │ … │ 2023-10-23 23:47:3… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ uQXJEqWwI+PK9g │
│ RF04292 │ mir-379 │ 1287 │ mir-379 microRNA p… │ … │ 2023-10-24 00:00:1… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ KusLrBG8LTGjqw │
│ RF04293 │ mir-889 │ 1287 │ mir-889 microRNA p… │ … │ 2023-10-24 00:14:3… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ m0bCDPXGDZ666Q │
│ RF04294 │ mir-3578 │ 1287 │ mir-3578 microRNA … │ … │ 2023-10-24 00:22:2… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ fyLYP9F1uRsdYg │
│ RF04295 │ mir-329 │ 1287 │ mir-329 microRNA p… │ … │ 2023-10-24 00:58:2… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ D+3PjIw6tlulXg │
│ RF04296 │ mir-485 │ 1287 │ mir-485 microRNA p… │ … │ 2023-10-24 01:09:4… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ b+gnt7raOGOc0g │
│ RF04297 │ mir-35_2 │ 1287 │ mir-35_2 microRNA … │ … │ 2023-10-24 10:55:2… │ 2024-09-10 04:51:1… │ 1744006598.0197978 │ PkTj21tQpmaRpw │
│ RF04298 │ mir-36_2 │ 1287 │ mir-36_2 microRNA … │ … │ 2023-10-24 10:56:5… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ JwaN+kyG6J70fw │
│ RF04299 │ MIR814 │ 1287 │ MIR814 microRNA pr… │ … │ 2023-10-24 12:12:1… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ 8nxpH2YvaTgk/Q │
│ RF04300 │ mir-39 │ 1287 │ mir-39 microRNA pr… │ … │ 2023-10-24 15:20:3… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ ITyPpnYuaU/ZHw │
│ RF04301 │ mir-200 │ 1287 │ mir-200 microRNA p… │ … │ 2023-11-21 12:46:0… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ vIfA9jEzkihQpQ │
│ RF04302 │ mir-506 │ 2235 │ mir-506 microRNA p… │ … │ 2023-11-27 13:46:2… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ aiEjYQDfbKQJDw │
│ RF04303 │ MIR162_2 │ 1287 │ MIR162_2 microRNA … │ … │ 2023-11-28 13:04:3… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ YbHE+qxzGCVpGA │
│ RF04305 │ HCV_SL1412 │ 2418 │ Hepatitis C virus … │ … │ 2024-04-08 15:25:1… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ ZrN6XN9KvBkTlw │
│ RF04306 │ HCV_SL8001 │ 2418 │ Hepatitis C virus … │ … │ 2024-04-08 15:35:2… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ 6qKRAp9LtyvY4g │
│ RF04307 │ HCV_SL8670 │ 2418 │ Hepatitis C virus … │ … │ 2024-04-08 16:11:3… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ zIyQtZBl7OmFqw │
│ RF04308 │ HCV_SL2531-SL2549 │ 2418 │ Hepatitis C virus … │ … │ 2024-04-10 15:16:5… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ jLvZqnyvNdqK0w │
│ RF04309 │ HCV_J7880 │ 2418 │ Hepatitis C virus … │ … │ 2024-06-17 13:59:0… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ L7T1KucJBTXoMQ │
│ RF04310 │ nqrA-II │ 2768 │ nqrA-II ncRNA motif │ … │ 2024-06-25 14:43:4… │ 2024-09-09 21:15:1… │ 1744006598.0197978 │ YExDc03iO5RNSA │
├──────────┴───────────────────┴───────────┴──────────────────────┴───┴──────────────────────┴──────────────────────┴────────────────────┴────────────────┤
│ 4178 rows (40 shown) 37 columns (8 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D .exit
Exploring the data
$ uv add streamlit
Resolved 76 packages in 387ms
Audited 67 packages in 0.02ms
$ uv run dlt pipeline sql_to_duckdb_pipeline show
Found pipeline sql_to_duckdb_pipeline in /home/ec2-user/.dlt/pipelines
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://10.48.149.30:8501
External URL: http://18.178.73.138:8501
Loading with replace
@@ -357,7 +357,7 @@
)
# Run the pipeline
- load_info = pipeline.run(source)
+ load_info = pipeline.run(source, write_disposition="replace") # Set write_disposition to load the data with "replace"
# Pretty print load information
print(load_info)
$ uv run sql_database_pipeline.py
Pipeline sql_to_duckdb_pipeline load step completed in 4.47 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744007697.087969 is LOADED and contains no failed jobs
Loading with merge
@@ -349,6 +349,10 @@
# Create a dlt source that will load tables "family" and "genome"
source = sql_database().with_resources("family", "genome")
+ # specify different loading strategy for each resource using apply_hints
+ source.family.apply_hints(write_disposition="merge", primary_key="rfam_id") # merge table "family" on column "rfam_id"
+ source.genome.apply_hints(write_disposition="merge", primary_key="upid") # merge table "genome" on column "upid"
+
# Create a dlt pipeline object
pipeline = dlt.pipeline(
pipeline_name="sql_to_duckdb_pipeline", # Custom name for the pipeline
@@ -357,7 +361,7 @@
)
# Run the pipeline
- load_info = pipeline.run(source, write_disposition="replace") # Set write_disposition to load the data with "replace"
+ load_info = pipeline.run(source)
# Pretty print load information
print(load_info)
$ uv run sql_database_pipeline.py
Pipeline sql_to_duckdb_pipeline load step completed in 4.84 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744007926.344812 is LOADED and contains no failed jobs
Loading incrementally
@@ -349,9 +349,8 @@
# Create a dlt source that will load tables "family" and "genome"
source = sql_database().with_resources("family", "genome")
- # specify different loading strategy for each resource using apply_hints
- source.family.apply_hints(write_disposition="merge", primary_key="rfam_id") # merge table "family" on column "rfam_id"
- source.genome.apply_hints(write_disposition="merge", primary_key="upid") # merge table "genome" on column "upid"
+ # only load rows whose "updated" value is greater than the last pipeline run
+ source.family.apply_hints(incremental=dlt.sources.incremental("updated"))
# Create a dlt pipeline object
pipeline = dlt.pipeline(
$ uv run sql_database_pipeline.py
2025-04-07 06:42:45,784|[WARNING]|36348|140550876481344|dlt|__init__.py|_check_duplicate_cursor_threshold:591|Large number of records (832) sharing the same value of cursor field 'updated'. This can happen if the cursor field has a low resolution (e.g., only stores dates without times), causing many records to share the same cursor value. Consider using a cursor column with higher resolution to reduce the deduplication state size.
Pipeline sql_to_duckdb_pipeline load step completed in 4.71 seconds
1 load package(s) were loaded to destination duckdb and into dataset sql_to_duckdb_pipeline_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/sql-db/sql_to_duckdb_pipeline.duckdb location to store data
Load package 1744008155.7218933 is LOADED and contains no failed jobs
File System
As before, create the project as a member of the uv workspace.
Run dlt init. Answer Y to the prompt to continue.
$ uv run dlt init filesystem duckdb
Creating a new pipeline with the dlt core source filesystem (Reads files in s3, gs or azure buckets using fsspec and provides convenience resources for chunked reading of various file formats)
NOTE: Beginning with dlt 1.0.0, the source filesystem will no longer be copied from the verified sources repo but imported from dlt.sources. You can provide the --eject flag to revert to the old behavior.
Do you want to proceed? [Y/n]: y
Your new pipeline filesystem is ready to be customized!
* Review and change how dlt loads your data in filesystem_pipeline.py
* Add credentials for duckdb and other secrets in ./.dlt/secrets.toml
* Add the required dependencies to pyproject.toml:
dlt[duckdb,filesystem]>=1.9.0
If the dlt dependency is already added, make sure you install the extra for duckdb to it
If you are using poetry you may issue the following command:
poetry add dlt -E duckdb
* Read https://dlthub.com/docs/walkthroughs/create-a-pipeline for more information
$ tree
.
├── .dlt
│ ├── config.toml
│ └── secrets.toml
├── .gitignore
├── README.md
├── filesystem_pipeline.py
├── main.py
└── pyproject.toml
As before, add the dependency to pyproject.toml as the output instructs.
@@ -4,4 +4,6 @@
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+ "dlt[duckdb,filesystem]>=1.9.0"
+]
Creating the pipeline
filesystem() lists the files matching the glob, read_csv() parses them, and with_name("encounters") sets the destination table name.
import dlt
from dlt.sources.filesystem import filesystem, read_csv
files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
reader = (files | read_csv()).with_name("encounters")
pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
info = pipeline.run(reader)
print(info)
Configuring the filesystem source
Put the tutorial's public sandbox credentials into .dlt/secrets.toml:
[sources.filesystem.credentials]
client_email = "public-access@dlthub-sandbox.iam.gserviceaccount.com"
project_id = "dlthub-sandbox"
private_key = "-----BEGIN PRIVATE KEY-----\nMIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQDGWsVHJRjliojx\nTo+j1qu+x8PzC5ZHZrMx6e8OD6tO8uxMyl65ByW/4FZkVXkS4SF/UYPigGN+rel4\nFmySTbP9orva4t3Pk1B9YSvQMB7V5IktmTIW9Wmdmn5Al8Owb1RehgIidm1EX/Z9\nLr09oLpO6+jUu9RIP2Lf2mVQ6tvkgl7UOdpdGACSNGzRiZgVZDOaDIgH0Tl4UWmK\n6iPxhwZy9YC2B1beLB/NU+F6DUykrEpBzCFQTqFoTUcuDAEvuvpU9JrU2iBMiOGw\nuP3TYSiudhBjmauEUWaMiqWAgFeX5ft1vc7/QWLdI//SAjaiTAu6pTer29Q0b6/5\niGh0jRXpAgMBAAECggEAL8G9C9MXunRvYkH6/YR7F1T7jbH1fb1xWYwsXWNSaJC+\nagKzabMZ2KfHxSJ7IxuHOCNFMKyex+pRcvNbMqJ4upGKzzmeFBMw5u8VYGulkPQU\nPyFKWRK/Wg3PZffkSr+TPargKrH+vt6n9x3gvEzNbqEIDugmRTrVsHXhvOi/BrYc\nWhppHSVQidWZi5KVwDEPJjDQiHEcYI/vfIy1WhZ8VuPAaE5nMZ1m7gTdeaWWKIAj\n/p2ZkLgRdCY8vNkfaNDAxDbvH+CMuTtOw55GydzsYYiofANS6xZ8CedGkYaGi82f\nqGdLghX61Sg3UAb5SI36T/9XbuCpTf3B/SMV20ew8QKBgQDm2yUxL71UqI/xK9LS\nHWnqfHpKmHZ+U9yLvp3v79tM8XueSRKBTJJ4H+UvVQrXlypT7cUEE+sGpTrCcDGL\nm8irtdUmMvdi7AnRBgmWdYKig/kgajLOUrjXqFt/BcFgqMyTfzqPt3xdp6F3rSEK\nHE6PQ8I3pJ0BJOSJRa6Iw2VH1QKBgQDb9WbVFjYwTIKJOV4J2plTK581H8PI9FSt\nUASXcoMTixybegk8beGdwfm2TkyF/UMzCvHfuaUhf+S0GS5Zk31Wkmh1YbmFU4Q9\nm9K/3eoaqF7CohpigB0wJw4HfqNh6Qt+nICOMCv++gw7+/UwfV72dCqr0lpzfX5F\nAsez8igTxQKBgDsq/axOnQr+rO3WGpGJwmS8BKfrzarxGXyjnV0qr51X4yQdfGWx\nV3T8T8RC2qWI8+tQ7IbwB/PLE3VURg6PHe6MixXgSDGNZ7KwBnMOqS23/3kEXwMs\nhn2Xg+PZeMeqW8yN9ldxYqmqViMTN32c5bGoXzXdtfPeHcjlGCerVOEFAoGADVPi\nRjkRUX3hTvVF6Gzxa2OyQuLI1y1O0C2QCakrngyI0Dblxl6WFBwDyHMYGepNnxMj\nsr2p7sy0C+GWuGDCcHNwluQz/Ish8SW28F8+5xyamUp/NMa0fg1vwS6AMdeQFbzf\n4T2z/MAj66KJqcV+8on5Z+3YAzVwaDgR56pdmU0CgYBo2KWcNWAhZ1Qa6sNrITLV\nGlxg6tWP3OredZrmKb1kj5Tk0V+EwVN+HnKzMalv6yyyK7SWq1Z6rvCye37vy27q\nD7xfuz0c0H+48uWJpdLcsxpTioopsRPayiVDKlHSe/Qa+MEjAG3ded5TJiC+5iSw\nxWJ51y0wpme0LWgzzoLbRw==\n-----END PRIVATE KEY-----\n"
[sources.filesystem]
bucket_url="gs://filesystem-tutorial"
Running the pipeline
Running it as-is will blow up, so add the gs dependency first.
$ uv add dlt[gs]
Resolved 106 packages in 83ms
Prepared 16 packages in 54ms
Installed 16 packages in 19ms
:
+ rsa==4.9
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 3.29 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744009708.2871656 is LOADED and contains no failed jobs
Exploring the data
Same as before:
$ uv add streamlit
Resolved 106 packages in 46ms
Audited 93 packages in 0.05ms
$ uv run dlt pipeline hospital_data_pipeline show
Found pipeline hospital_data_pipeline in /home/ec2-user/.dlt/pipelines
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://10.48.149.30:8501
External URL: http://18.178.73.138:8501
Appending, replacing, and merging loaded data
For the merge case described on the page:
@@ -3,7 +3,8 @@
files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
reader = (files | read_csv()).with_name("encounters")
+reader.apply_hints(primary_key="id")
pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
-info = pipeline.run(reader)
+info = pipeline.run(reader, write_disposition="merge")
print(info)
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 3.52 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744010182.1334121 is LOADED and contains no failed jobs
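With merge, a re-run shouldn't grow the row count. To confirm, you could count the rows in DuckDB (a sketch; I didn't capture the output):
$ ~/.duckdb/cli/latest/duckdb
D .open hospital_data_pipeline.duckdb
D select count(*) from hospital_data.encounters;
D .exit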
Loading data incrementally
This part runs in two steps. First, add an incremental hint on the files' modification_date so that only new or changed files are processed:
@@ -2,6 +2,7 @@
from dlt.sources.filesystem import filesystem, read_csv
files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
+files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
reader = (files | read_csv()).with_name("encounters")
reader.apply_hints(primary_key="id")
pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 3.47 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744011013.538883 is LOADED and contains no failed jobs
After running it once, make the following change, which adds an incremental cursor on the records themselves (the encounters' STOP column) so only new rows are merged:
@@ -4,7 +4,7 @@
files = filesystem(bucket_url="gs://filesystem-tutorial", file_glob="encounters*.csv")
files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
reader = (files | read_csv()).with_name("encounters")
-reader.apply_hints(primary_key="id")
+reader.apply_hints(primary_key="id", incremental=dlt.sources.incremental("STOP"))
pipeline = dlt.pipeline(pipeline_name="hospital_data_pipeline", dataset_name="hospital_data", destination="duckdb")
info = pipeline.run(reader, write_disposition="merge")
$ uv run python filesystem_pipeline.py
Pipeline hospital_data_pipeline load step completed in 0.07 seconds
1 load package(s) were loaded to destination duckdb and into dataset hospital_data
The duckdb destination used duckdb:////home/ec2-user/work/mds/dlt-getting-started/filesystem/hospital_data_pipeline.duckdb location to store data
Load package 1744011132.9863658 is LOADED and contains no failed jobs
The second run finishes almost instantly, presumably because the incremental hints find nothing new to load. The sections beyond this seem to require procuring sample data, so I'll skip them.
Closing Thoughts
So, how was it?
Up to now, my mental image of data integration was ETL: read the original data from a neutral standpoint, transform it, then store it. What felt fresh about dlt is that it automatically shapes the data for the destination as it reads, so you can just keep writing it out.
Since dbt takes over downstream and transforms the data into standardized, intermediate, and finally business-facing forms, dlt's philosophy seems to be to upload the data as-is without spending time on preparation. Thinking of ELT as just ETL with the T moved to the end turns out to be naive: when the T moves back, the E and L change too. Rather than going table by table, checking names and defining schemas, the idea as I understand it is to connect to the data source and mechanically copy everything over in one big go.