Livebook vs. Dbt + DuckDB: Which is useful for trivial ETL processes in SES?
What is SES
Why Livebook, DuckDB (with/without Dbt), or both can be useful in SES
- These workers work on a lot of tedious tasks
- Most of the tasks are ETL (Extract, Transform, and Load)
- Elixir is especially good at this, while Dbt is created specifically for it
- Utilizing one of these tools can boost productivity of the industry
Alternatives
Interactive notebooks
- Livebook
- Jupyter
Data transformation
- Dbt
Portable databases
- DuckDB
- Sqlite
Comparison: Ease of installation
Clearly, DuckDB wins because it's a single static executable that can be installed to anywhere on the system.
Livebook
- Livebook is distributed as a standalone Windows application, so it is easy to install
- There is also a Docker image
- Livebook requires installation of Elixir packages during runtime (mostly using
mix
). Behind a corporate proxy, this can require extra manual workarounds by the user.- For example, to install explorer, you have to download one of the nifs from the Releases page and put it in a designated directory
Dbt and DuckDB
- DuckDB is distributed as a compressed archive containing a single standalone executable, so it is easy to install
- dbt-core and adapters are just Python packages, so they are easy to install
Comparison: What they can do
Livebook wins because Elixir is a general-purpose programming language and Livebook is a well-designed programming environment for it.
However, Livebook usually requires manual intervention, which can be a limitation for quite a few use cases. Livebook can be considered a more programming-oriented alternative to Microsoft Excel.
SQL is more ubiquitous than Elixir, so DuckDB should be easier to learn than Livebook for most users.
Dbt and DuckDB basically accept only tabular data, so you may have to implement some preprocessing or develop a custom dbt adapter. Livebook is useful for preprocessing. Livebook + DuckDB is a good option.
Other options for preprocessing data
For simple processing, you don't need Elixir. Since Elixir is less popular than mainstream programming languages (e.g. TypeScript and Python) and Livebook is harder to set up, it's better to consider other options especially if workers on your team are relatively low-skilled.
Companies want to outsource the task of preprocessing, which is more tedious and uninteresting, to external workers. Those workers probably don't know Elixir.
Extract data from HTML
- TypeScript: Bun (TypeScript runtime) + htmlparser2 (streaming HTML parser) + fast-csv (streaming CSV parser/serializer) + iconv-lite (character encoding conversion)
Data integration
After all, it will be suboptimal for those workers to perform preprocessing on Livebook on their client computers.
while it may be possible to run tools such as grep, sed, and awk to analyze millions or billions of log lines on a source system, this is likely slow and inefficient. A more efficient data integration technique is to drive data into a centralized system that is purpose-built for analyzing big data, and then run your data integration analytics jobs on that central system.