Closed7

Livebook vs. Dbt + DuckDB: Which is useful for trivial ETL processes in SES?

Akira KomamuraAkira Komamura

Why Livebook, DuckDB (with/without Dbt), or both can be useful in SES

  • These workers work on a lot of tedious tasks
  • Most of the tasks are ETL (Extract, Transform, and Load)
  • Elixir is especially good at this, while Dbt is created specifically for it
  • Utilizing one of these tools can boost productivity of the industry
Akira KomamuraAkira Komamura

Comparison: Ease of installation

Clearly, DuckDB wins because it's a single static executable that can be installed to anywhere on the system.

Livebook

  • Livebook is distributed as a standalone Windows application, so it is easy to install
  • There is also a Docker image
  • Livebook requires installation of Elixir packages during runtime (mostly using mix). Behind a corporate proxy, this can require extra manual workarounds by the user.
    • For example, to install explorer, you have to download one of the nifs from the Releases page and put it in a designated directory

Dbt and DuckDB

  • DuckDB is distributed as a compressed archive containing a single standalone executable, so it is easy to install
  • dbt-core and adapters are just Python packages, so they are easy to install
Akira KomamuraAkira Komamura

Comparison: What they can do

Livebook wins because Elixir is a general-purpose programming language and Livebook is a well-designed programming environment for it.

However, Livebook usually requires manual intervention, which can be a limitation for quite a few use cases. Livebook can be considered a more programming-oriented alternative to Microsoft Excel.

SQL is more ubiquitous than Elixir, so DuckDB should be easier to learn than Livebook for most users.

Dbt and DuckDB basically accept only tabular data, so you may have to implement some preprocessing or develop a custom dbt adapter. Livebook is useful for preprocessing. Livebook + DuckDB is a good option.

Akira KomamuraAkira Komamura

Other options for preprocessing data

For simple processing, you don't need Elixir. Since Elixir is less popular than mainstream programming languages (e.g. TypeScript and Python) and Livebook is harder to set up, it's better to consider other options especially if workers on your team are relatively low-skilled.

https://twitter.com/BigDataBorat/status/306596352991830016

Companies want to outsource the task of preprocessing, which is more tedious and uninteresting, to external workers. Those workers probably don't know Elixir.

Extract data from HTML

  • TypeScript: Bun (TypeScript runtime) + htmlparser2 (streaming HTML parser) + fast-csv (streaming CSV parser/serializer) + iconv-lite (character encoding conversion)
Akira KomamuraAkira Komamura

Data integration

After all, it will be suboptimal for those workers to perform preprocessing on Livebook on their client computers.

while it may be possible to run tools such as grep, sed, and awk to analyze millions or billions of log lines on a source system, this is likely slow and inefficient. A more efficient data integration technique is to drive data into a centralized system that is purpose-built for analyzing big data, and then run your data integration analytics jobs on that central system.

https://airbyte.com/blog/data-integration#analyze-data-faster-with-dedicated-technology

このスクラップは2023/12/30にクローズされました