
Introduction to Polars: A Blazingly Fast DataFrame Library


Introduction

Recently, I have been studying Polars as it seemed quite interesting. In this article, I would like to introduce the characteristics of Polars while organizing what I've learned.

What is Polars?

https://www.pola.rs/
https://github.com/pola-rs/polars/

Polars is a "blazingly fast" DataFrame library usable from Rust and Python[1]—essentially, a library for data analysis. The name is a play on words: "pandas" (panda) versus "Polars" (polar bear).
The core is implemented in Rust, and a Python interface is provided on top of it. The build from Rust to a Python package uses maturin (PyO3).

Environment

The OS, language, and library versions at the time of writing are as follows. Only parts that seem highly relevant have been extracted.

Ubuntu 22.04
Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0] on linux
 maturin                       0.13.7
 pandas                        1.5.2
 polars                        0.15.2
 pyarrow                       11.0.0.dev141
cargo 1.67.0-nightly (eb5d35917 2022-11-17)
rustc 1.67.0-nightly (ff8c8dfbe 2022-11-22)

Note that Polars is still under development and is updated daily. (While I was writing this article, the version was updated to 0.15.6!)

Background of Speedup

Data Format

https://arrow.apache.org/overview/
The library is implemented based on Apache Arrow, a column-oriented in-memory data format. More precisely, it is implemented using a crate called Arrow2 in the Apache Arrow format. As a result, the data processing itself is inherently fast. Descriptions of the actually defined data types can be found in the User Guide.

Lazy API

In Polars, for queries that manipulate DataFrames, you can distinguish between "Lazy" (deferred) and "Eager" (immediate) evaluation timing. By choosing Lazy, Polars can plan parallelization and optimization across queries, achieving further speed improvements.

Here are some examples[2].

read_csv and scan_csv
>>> import polars as pl
>>> pl.Config.set_tbl_rows(2)  # To omit row display
<class 'polars.cfg.Config'>
>>> pl.read_csv("PS.csv", skip_rows=290)
shape: (33796, 287)
┌───────┬──────────┬──────────┬───────────┬─────┬─────────┬──────────┬───────────┬──────────────┐
│ rowid ┆ pl_name  ┆ hostname ┆ pl_letter ┆ ... ┆ st_nrvc ┆ st_nspec ┆ pl_nespec ┆ pl_ntranspec │
│ ---   ┆ ---      ┆ ---      ┆ ---       ┆     ┆ ---     ┆ ---      ┆ ---       ┆ ---          │
│ i64   ┆ str      ┆ str      ┆ str       ┆     ┆ i64     ┆ i64      ┆ i64       ┆ i64          │
╞═══════╪══════════╪══════════╪═══════════╪═════╪═════════╪══════════╪═══════════╪══════════════╡
│ 1     ┆ 11 Com b ┆ 11 Com   ┆ b         ┆ ... ┆ 2       ┆ 0        ┆ 0         ┆ 0            │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...   ┆ ...      ┆ ...      ┆ ...       ┆ ... ┆ ...     ┆ ...      ┆ ...       ┆ ...          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 33796 ┆ xi Aql b ┆ xi Aql   ┆ b         ┆ ... ┆ 1       ┆ 0        ┆ 0         ┆ 0            │
└───────┴──────────┴──────────┴───────────┴─────┴─────────┴──────────┴───────────┴──────────────┘


>>> (df := pl.scan_csv("PS.csv", skip_rows=290))
<polars.LazyFrame object at 0x7F99301331C0>
>>> df.collect()
shape: (33796, 287)
┌───────┬──────────┬──────────┬───────────┬─────┬─────────┬──────────┬───────────┬──────────────┐
│ rowid ┆ pl_name  ┆ hostname ┆ pl_letter ┆ ... ┆ st_nrvc ┆ st_nspec ┆ pl_nespec ┆ pl_ntranspec │
│ ---   ┆ ---      ┆ ---      ┆ ---       ┆     ┆ ---     ┆ ---      ┆ ---       ┆ ---          │
│ i64   ┆ str      ┆ str      ┆ str       ┆     ┆ i64     ┆ i64      ┆ i64       ┆ i64          │
╞═══════╪══════════╪══════════╪═══════════╪═════╪═════════╪══════════╪═══════════╪══════════════╡
│ 1     ┆ 11 Com b ┆ 11 Com   ┆ b         ┆ ... ┆ 2       ┆ 0        ┆ 0         ┆ 0            │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...   ┆ ...      ┆ ...      ┆ ...       ┆ ... ┆ ...     ┆ ...      ┆ ...       ┆ ...          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 33796 ┆ xi Aql b ┆ xi Aql   ┆ b         ┆ ... ┆ 1       ┆ 0        ┆ 0         ┆ 0            │
└───────┴──────────┴──────────┴───────────┴─────┴─────────┴──────────┴───────────┴──────────────┘

While read_csv returns the result of reading the CSV eagerly, scan_csv returns a LazyFrame object at the time of the call. The subsequent collect causes the reading process to be evaluated lazily.

I will add a little more processing to check the effects of optimization and parallelization.

lazy_example.py
import polars as pl


def main() -> None:
    radius_by_method = (
        pl.scan_csv("PS.csv", skip_rows=290)
        .groupby("discoverymethod")
        .agg([
            pl.count(),
            pl.col("pl_rade").mean().alias("Planet Radius [Earth Radius]"),
            pl.col("pl_rade").max().alias("Max. Planet Radius"),
            pl.col("pl_rade").min().alias("Min. Planet Radius"),
        ])
        .sort("count", reverse=True)
        .limit(4)
    )
    print(radius_by_method.collect())


if __name__ == "__main__":
    main()

We group exoplanets by their observation method and calculate the average, maximum, and minimum planet radii[3].

In the case of Eager

Only scan_csv has been changed to read_csv.

eager_example.py
import polars as pl


def main() -> None:
    radius_by_method = (
        pl.read_csv("PS.csv", skip_rows=290)
        .groupby("discoverymethod")
        .agg([
            pl.count(),
            pl.col("pl_rade").mean().alias("Planet Radius [Earth Radius]"),
            pl.col("pl_rade").max().alias("Max. Planet Radius"),
            pl.col("pl_rade").min().alias("Min. Planet Radius"),
        ])
        .sort("count", reverse=True)
        .limit(4)
    )
    print(radius_by_method)


if __name__ == "__main__":
    main()
Execution results and execution time (lazy)
$ time python lazy_example.py 
shape: (4, 5)
┌───────────────────────────┬───────┬──────────────────────────────┬────────────────────┬────────────────────┐
│ discoverymethod           ┆ count ┆ Planet Radius [Earth Radius] ┆ Max. Planet Radius ┆ Min. Planet Radius │
│ ---                       ┆ ---   ┆ ---                          ┆ ---                ┆ ---                │
│ str                       ┆ u32   ┆ f64                          ┆ f64                ┆ f64                │
╞═══════════════════════════╪═══════╪══════════════════════════════╪════════════════════╪════════════════════╡
│ Transit                   ┆ 30802 ┆ 5.111434                     ┆ 3791.05            ┆ 0.27               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Radial Velocity           ┆ 2270  ┆ 7.211939                     ┆ 16.264             ┆ 1.211              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Microlensing              ┆ 430   ┆ null                         ┆ null               ┆ null               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Transit Timing Variations ┆ 120   ┆ 2.206051                     ┆ 3.01               ┆ 0.37               │
└───────────────────────────┴───────┴──────────────────────────────┴────────────────────┴────────────────────┘

real	0m0.793s
user	0m0.784s
sys	0m0.040s
Execution results and execution time (eager)
$ time python eager_example.py 
shape: (4, 5)
┌───────────────────────────┬───────┬──────────────────────────────┬────────────────────┬────────────────────┐
│ discoverymethod           ┆ count ┆ Planet Radius [Earth Radius] ┆ Max. Planet Radius ┆ Min. Planet Radius │
│ ---                       ┆ ---   ┆ ---                          ┆ ---                ┆ ---                │
│ str                       ┆ u32   ┆ f64                          ┆ f64                ┆ f64                │
╞═══════════════════════════╪═══════╪══════════════════════════════╪════════════════════╪════════════════════╡
│ Transit                   ┆ 30802 ┆ 5.111434                     ┆ 3791.05            ┆ 0.27               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Radial Velocity           ┆ 2270  ┆ 7.211939                     ┆ 16.264             ┆ 1.211              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Microlensing              ┆ 430   ┆ null                         ┆ null               ┆ null               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Transit Timing Variations ┆ 120   ┆ 2.206051                     ┆ 3.01               ┆ 0.37               │
└───────────────────────────┴───────┴──────────────────────────────┴────────────────────┴────────────────────┘

real	0m0.916s
user	0m9.997s
sys	0m0.184s

While the inherent parallelization performance is so good that there is almost no difference in real time, there is a large difference in user time. I understand this to be the effect of optimization during the Lazy execution.

The show_graph method can be used to visualize the query plan of a LazyFrame.

radius_by_method.show_graph(optimized=True, show=False, output_path="optimized.png")
radius_by_method.show_graph(optimized=False, show=False, output_path="no_optimized.png")

With optimization

Without optimization
The top is the query plan with optimization, and the bottom is without. The difference in the CSV SCAN node is easy to see. With optimization it shows "π 2/287;", meaning only the 2 of the 287 columns required by this query are read. Without optimization it shows "π */287;", meaning all 287 columns are read.

Reference: Converting from Eager to Lazy

You can switch from Eager to Lazy by using the lazy method.

@@ -6,2 +6,3 @@
         pl.read_csv("PS.csv", skip_rows=290)
+        .lazy()
         .groupby("discoverymethod")
@@ -16,3 +17,3 @@
     )
-    print(radius_by_method)
+    print(radius_by_method.collect())

Additional Notes on Optimization

https://github.com/pola-rs/polars/blob/e172db7ace210f03740297ec1480042d9daa4719/polars/polars-lazy/polars-plan/src/logical_plan/optimizer/mod.rs#L52

By looking inside optimize, you can trace various optimizations for the query plan. Here is an overview of each optimization rule based on my current understanding.

  • PredicatePushDown
    When there are multiple filter operations, it combines them to be performed during data loading. Here is an example:
>>> import polars as pl
>>> qp = (
...     pl.scan_csv("PS.csv", skip_rows=290)
...     .filter(pl.col("pl_rade") > 10)
...     .filter(pl.col("pl_masse") < 10)
...     .select([
...             "rowid", "pl_name", "hostname", "discoverymethod", "pl_rade", "pl_masse"
...     ])
... )
>>> qp.collect()
shape: (1, 6)
┌───────┬─────────────┬───────────┬─────────────────┬─────────┬──────────┐
│ rowid ┆ pl_name     ┆ hostname  ┆ discoverymethod ┆ pl_rade ┆ pl_masse │
│ ---   ┆ ---         ┆ ---       ┆ ---             ┆ ---     ┆ ---      │
│ i64   ┆ str         ┆ str       ┆ str             ┆ f64     ┆ f64      │
╞═══════╪═════════════╪═══════════╪═════════════════╪═════════╪══════════╡
│ 25028 ┆ Kepler-51 d ┆ Kepler-51 ┆ Transit         ┆ 11.8    ┆ 6.2      │
└───────┴─────────────┴───────────┴─────────────────┴─────────┴──────────┘
>>> qp.show_graph(optimized=False, show=False, output_path="no_optimized.png")
>>> qp.show_graph(optimized=True, show=False, output_path="optimized.png")

Without optimization

With optimization

  • ProjectionPushdown
    Optimizes column reading during data scans, etc., by only reading the columns required by the query plan. I believe the example given in the Lazy API section also falls under this.

  • TypeCoercionRule
    Defines rules for type casting when columns with different data types are compared.

  • SimplifyBooleanRule
    Simplifies boolean operations whose results are trivial, such as "x AND false" or "x AND true", before the actual calculation.

  • SimplifyExprRule
    Simplifies the query operations side, such as commutative operations or null operations, before the actual calculation.

Out of Core

Optimization also saves memory, but if the data you're handling still doesn't fit in RAM, you can specify the streaming option as collect(streaming=True), according to this issue. However, it seems this is still an alpha version feature (as per the API reference).

Others

Regarding parallelization and memory efficiency, the technical article written by Ritchie, the author of Polars, is also helpful. (Please note that the certificate might be expired, so browse at your own risk...)

https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/

Differences from pandas

The user guide also has a page describing the differences for those coming from pandas[4], so I will introduce some excerpts other than the parallelization and optimization explained so far.

https://pola-rs.github.io/polars-book/user-guide/coming_from_pandas.html

About indices

In Polars, DataFrames do not have indices. Since the Lazy API is fundamental, my understanding is that maintaining indices would hinder parallelization. Consequently, there are no methods equivalent to pandas' iloc or loc. (To be precise, they are implemented, though deprecated, in Eager mode only. Reference)
→ The user guide's description was revised around the end of January 2023, as it could confuse users. Indices are no longer described as "deprecated"; instead, the guide states they can be effective in some cases. (Reference. However, since you cannot benefit from lazy evaluation, this remains a trade-off against optimization and parallelization.)

For actual data selection, use the select or filter methods.

Examples of select and filter
>>> df.select(["discoverymethod"]).first().collect()
shape: (1, 1)
┌─────────────────┐
│ discoverymethod │
│ ---             │
│ str             │
╞═════════════════╡
│ Radial Velocity │
└─────────────────┘
>>> df.filter(pl.col("pl_rade") > 100.0).first().collect()
shape: (1, 287)
┌───────┬───────────────┬─────────────┬───────────┬─────┬─────────┬──────────┬───────────┬──────────────┐
│ rowid ┆ pl_name       ┆ hostname    ┆ pl_letter ┆ ... ┆ st_nrvc ┆ st_nspec ┆ pl_nespec ┆ pl_ntranspec │
│ ---   ┆ ---           ┆ ---         ┆ ---       ┆     ┆ ---     ┆ ---      ┆ ---       ┆ ---          │
│ i64   ┆ str           ┆ str         ┆ str       ┆     ┆ i64     ┆ i64      ┆ i64       ┆ i64          │
╞═══════╪═══════════════╪═════════════╪═══════════╪═════╪═════════╪══════════╪═══════════╪══════════════╡
│ 14325 ┆ Kepler-1698 b ┆ Kepler-1698 ┆ b         ┆ ... ┆ 0       ┆ 0        ┆ 0         ┆ 0            │
└───────┴───────────────┴─────────────┴───────────┴─────┴─────────┴──────────┴───────────┴──────────────┘
  • rowid is a data column included in the original CSV.

About missing values

In Polars, since data types are based on the Apache Arrow format, missing values are represented as null values.

>>> df = pl.DataFrame({"value": [1, None, 3]})
>>> df
shape: (3, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ null  │
├╌╌╌╌╌╌╌┤
│ 3     │
└───────┘
>>> df.mean()
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 2.0   │
└───────┘

While you can intentionally use NaN, it's important to note that NaN is not treated as a missing value, so care is needed when performing operations like calculating the mean. (It's also worth noting that np.nan is of float type.)

>>> import numpy as np
>>> df = pl.DataFrame({"value": [1, np.nan, 3]})
>>> df
shape: (3, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
├╌╌╌╌╌╌╌┤
│ NaN   │
├╌╌╌╌╌╌╌┤
│ 3.0   │
└───────┘
>>> df.mean()
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ NaN   │
└───────┘
>>> df.fill_nan(None).mean()
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 2.0   │
└───────┘

Appendix: Library Structure

https://github.com/pola-rs/polars/
Regarding the polars repository, it can be broadly divided into the Core part implemented in Rust and the interface part implemented in Python. The former resides under the polars directory, and the latter under the py-polars directory. Each directory has a Makefile, so running something like make test will execute everything from the build to the tests. For the Python side, the environment is set up using venv, so after building, you can easily perform debugging by executing source venv/bin/activate in the py-polars directory.

Example of source code tracing

Tracing the source code using the example of to_dot(), which is called within show_graph(), results in the following flow:

Note that there is a PyLazyFrame class that bridges the LazyFrame class on the Python side and the LazyFrame class on the Rust side.

If you are aiming to actually contribute, please also take a look at CONTRIBUTING.md.

Footnotes
  1. It can technically be used with Node.js as well, but judging by the repository's update frequency and issues, it seems the main maintainers are not experts in TS/JS and leave this area to community contributions. ↩︎

  2. The CSV data used in the examples is an exoplanet data catalog provided by the NASA Exoplanet Archive. (61.3MB) ↩︎

  3. In the catalog, measurement data is provided per paper per row, so there may be multiple rows of data for the same planet, and some columns may be null. Therefore, the physical accuracy of the results is not guaranteed here. This is presented as an example of Polars operations. ↩︎

  4. Personally, I don't really think the use cases for pandas and Polars compete that much. I believe the official user guide also created the page with the intention of widening the entry point by taking advantage of pandas' name recognition. ↩︎
