Introduction to Polars: A Blazingly Fast DataFrame Library
Introduction
Recently, I have been studying Polars as it seemed quite interesting. In this article, I would like to introduce the characteristics of Polars while organizing what I've learned.
What is Polars?
Polars is a "blazingly fast" DataFrame library that can be used from Rust and Python[1]; essentially, a library for data analysis. The name is a play on words: pandas (panda bear) versus Polars (polar bear).
The core is implemented in Rust, with Python exposed as an interface on top of it. The build from the Rust source into a Python package uses maturin (PyO3).
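As a rough sketch of what that build step looks like (assuming a local checkout of the polars repository; exact flags may differ between versions):

```shell
# Install the build tool that bridges Rust crates to Python wheels via PyO3
pip install maturin

# From the py-polars directory of a polars checkout:
# compile the Rust core and install the package into the active virtualenv
maturin develop --release
```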
Environment
The OS, language, and library versions at the time of writing are as follows. Only parts that seem highly relevant have been extracted.
Ubuntu 22.04
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
maturin 0.13.7
pandas 1.5.2
polars 0.15.2
pyarrow 11.0.0.dev141
cargo 1.67.0-nightly (eb5d35917 2022-11-17)
rustc 1.67.0-nightly (ff8c8dfbe 2022-11-22)
Note that Polars is still under development and is updated daily. (While I was writing this article, the version was updated to 0.15.6!)
Background of Speedup
Data Format
The library is implemented based on Apache Arrow, a column-oriented in-memory data format. More precisely, it is implemented using a crate called Arrow2 in the Apache Arrow format. As a result, the data processing itself is inherently fast. Descriptions of the actually defined data types can be found in the User Guide.
Lazy API
In Polars, for queries that manipulate DataFrames, you can distinguish between "Lazy" (deferred) and "Eager" (immediate) evaluation timing. By choosing Lazy, Polars can plan parallelization and optimization across queries, achieving further speed improvements.
Here are some examples[2].
>>> import polars as pl
>>> pl.Config.set_tbl_rows(2) # To omit row display
<class 'polars.cfg.Config'>
>>> pl.read_csv("PS.csv", skip_rows=290)
shape: (33796, 287)
┌───────┬──────────┬──────────┬───────────┬─────┬─────────┬──────────┬───────────┬──────────────┐
│ rowid ┆ pl_name ┆ hostname ┆ pl_letter ┆ ... ┆ st_nrvc ┆ st_nspec ┆ pl_nespec ┆ pl_ntranspec │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪══════════╪══════════╪═══════════╪═════╪═════════╪══════════╪═══════════╪══════════════╡
│ 1 ┆ 11 Com b ┆ 11 Com ┆ b ┆ ... ┆ 2 ┆ 0 ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 33796 ┆ xi Aql b ┆ xi Aql ┆ b ┆ ... ┆ 1 ┆ 0 ┆ 0 ┆ 0 │
└───────┴──────────┴──────────┴───────────┴─────┴─────────┴──────────┴───────────┴──────────────┘
>>> (df := pl.scan_csv("PS.csv", skip_rows=290))
<polars.LazyFrame object at 0x7F99301331C0>
>>> df.collect()
shape: (33796, 287)
┌───────┬──────────┬──────────┬───────────┬─────┬─────────┬──────────┬───────────┬──────────────┐
│ rowid ┆ pl_name ┆ hostname ┆ pl_letter ┆ ... ┆ st_nrvc ┆ st_nspec ┆ pl_nespec ┆ pl_ntranspec │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪══════════╪══════════╪═══════════╪═════╪═════════╪══════════╪═══════════╪══════════════╡
│ 1 ┆ 11 Com b ┆ 11 Com ┆ b ┆ ... ┆ 2 ┆ 0 ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 33796 ┆ xi Aql b ┆ xi Aql ┆ b ┆ ... ┆ 1 ┆ 0 ┆ 0 ┆ 0 │
└───────┴──────────┴──────────┴───────────┴─────┴─────────┴──────────┴───────────┴──────────────┘
While read_csv reads the CSV eagerly and returns the result, scan_csv returns a LazyFrame object at the time of the call; the subsequent collect is what actually triggers the deferred read.
I will add a little more processing to check the effects of optimization and parallelization.
import polars as pl


def main() -> None:
    radius_by_method = (
        pl.scan_csv("PS.csv", skip_rows=290)
        .groupby("discoverymethod")
        .agg([
            pl.count(),
            pl.col("pl_rade").mean().alias("Planet Radius [Earth Radius]"),
            pl.col("pl_rade").max().alias("Max. Planet Radius"),
            pl.col("pl_rade").min().alias("Min. Planet Radius"),
        ])
        .sort("count", reverse=True)
        .limit(4)
    )
    print(radius_by_method.collect())


if __name__ == "__main__":
    main()
We group exoplanets by their observation method and calculate the average, maximum, and minimum planet radii[3].
In the case of Eager
Only scan_csv has been changed to read_csv.
import polars as pl


def main() -> None:
    radius_by_method = (
        pl.read_csv("PS.csv", skip_rows=290)
        .groupby("discoverymethod")
        .agg([
            pl.count(),
            pl.col("pl_rade").mean().alias("Planet Radius [Earth Radius]"),
            pl.col("pl_rade").max().alias("Max. Planet Radius"),
            pl.col("pl_rade").min().alias("Min. Planet Radius"),
        ])
        .sort("count", reverse=True)
        .limit(4)
    )
    print(radius_by_method)


if __name__ == "__main__":
    main()
$ time python lazy_example.py
shape: (4, 5)
┌───────────────────────────┬───────┬──────────────────────────────┬────────────────────┬────────────────────┐
│ discoverymethod ┆ count ┆ Planet Radius [Earth Radius] ┆ Max. Planet Radius ┆ Min. Planet Radius │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════════════════╪═══════╪══════════════════════════════╪════════════════════╪════════════════════╡
│ Transit ┆ 30802 ┆ 5.111434 ┆ 3791.05 ┆ 0.27 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Radial Velocity ┆ 2270 ┆ 7.211939 ┆ 16.264 ┆ 1.211 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Microlensing ┆ 430 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Transit Timing Variations ┆ 120 ┆ 2.206051 ┆ 3.01 ┆ 0.37 │
└───────────────────────────┴───────┴──────────────────────────────┴────────────────────┴────────────────────┘
real 0m0.793s
user 0m0.784s
sys 0m0.040s
$ time python eager_example.py
shape: (4, 5)
┌───────────────────────────┬───────┬──────────────────────────────┬────────────────────┬────────────────────┐
│ discoverymethod ┆ count ┆ Planet Radius [Earth Radius] ┆ Max. Planet Radius ┆ Min. Planet Radius │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════════════════╪═══════╪══════════════════════════════╪════════════════════╪════════════════════╡
│ Transit ┆ 30802 ┆ 5.111434 ┆ 3791.05 ┆ 0.27 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Radial Velocity ┆ 2270 ┆ 7.211939 ┆ 16.264 ┆ 1.211 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Microlensing ┆ 430 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Transit Timing Variations ┆ 120 ┆ 2.206051 ┆ 3.01 ┆ 0.37 │
└───────────────────────────┴───────┴──────────────────────────────┴────────────────────┴────────────────────┘
real 0m0.916s
user 0m9.997s
sys 0m0.184s
While the parallelization is good enough that there is almost no difference in real time, there is a large difference in user time (total CPU time across cores). I take this to be the effect of the optimization applied during lazy execution.
The show_graph method can be used to visualize the query plan of a LazyFrame.
radius_by_method.show_graph(optimized=True, show=False, output_path="optimized.png")
radius_by_method.show_graph(optimized=False, show=False, output_path="no_optimized.png")

(Figure: query plan with optimization)

(Figure: query plan without optimization)
The top is the query plan with optimization, and the bottom is without. The difference in the CSV SCAN node is easy to see: with optimization, only the columns the query actually needs are projected at scan time, whereas without optimization all 287 columns are read.
Reference: Converting from Eager to Lazy
You can switch from Eager to Lazy by using the lazy method.
@@ -6,2 +6,3 @@
pl.read_csv("PS.csv", skip_rows=290)
+ .lazy()
.groupby("discoverymethod")
@@ -16,3 +17,3 @@
)
- print(radius_by_method)
+ print(radius_by_method.collect())
Additional Notes on Optimization
By looking inside optimize, you can trace various optimizations for the query plan. Here is an overview of each optimization rule based on my current understanding.
- PredicatePushDown: when there are multiple filter operations, it combines them so that they are applied during data loading. Here is an example:
>>> import polars as pl
>>> qp = (
... pl.scan_csv("PS.csv", skip_rows=290)
... .filter(pl.col("pl_rade") > 10)
... .filter(pl.col("pl_masse") < 10)
... .select([
... "rowid", "pl_name", "hostname", "discoverymethod", "pl_rade", "pl_masse"
... ])
... )
>>> qp.collect()
shape: (1, 6)
┌───────┬─────────────┬───────────┬─────────────────┬─────────┬──────────┐
│ rowid ┆ pl_name ┆ hostname ┆ discoverymethod ┆ pl_rade ┆ pl_masse │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ f64 ┆ f64 │
╞═══════╪═════════════╪═══════════╪═════════════════╪═════════╪══════════╡
│ 25028 ┆ Kepler-51 d ┆ Kepler-51 ┆ Transit ┆ 11.8 ┆ 6.2 │
└───────┴─────────────┴───────────┴─────────────────┴─────────┴──────────┘
>>> qp.show_graph(optimized=False, show=False, output_path="no_optimized.png")
>>> qp.show_graph(optimized=True, show=False, output_path="optimized.png")

(Figure: query plan without optimization)

(Figure: query plan with optimization)
- ProjectionPushdown: optimizes column reading during data scans and similar operations by reading only the columns the query plan requires. I believe the example in the Lazy API section also falls under this.
- TypeCoercionRule: defines rules for type casting when columns with different data types are compared.
- SimplifyBooleanRule: simplifies boolean operations whose results are trivial, such as x and false or x and true, before the actual calculation.
- SimplifyExprRule: simplifies the query-expression side, such as commutative operations or operations on null, before the actual calculation.
Out of Core
Optimization also saves memory, but if the data you're handling still doesn't fit in RAM, you can pass the streaming option as collect(streaming=True), according to this issue. However, this is still an alpha-stage feature (per the API reference).
Others
Regarding parallelization and memory efficiency, the technical article written by Ritchie, the author of Polars, is also helpful. (Please note that the certificate might be expired, so browse at your own risk...)
Differences from pandas
The user guide also has a page describing the differences for those coming from pandas[4], so I will introduce some excerpts other than the parallelization and optimization explained so far.
About indices
In Polars, DataFrames do not have indices. Since the Lazy API is fundamental, my understanding is that maintaining an index would hinder parallelization. Consequently, there are no methods equivalent to pandas' iloc or loc. (To be precise, indexing is implemented in Eager mode only, though it was marked deprecated at the time. Reference)
→ The description in the user guide was corrected around the end of January 2023, as it might cause confusion. Indexing is no longer described as "deprecated"; instead, the guide states that it can be effective in some cases, although, since it cannot benefit from lazy evaluation, it remains a trade-off against optimization and parallelization. Reference
For actual data selection, use the select or filter methods.
>>> df.select(["discoverymethod"]).first().collect()
shape: (1, 1)
┌─────────────────┐
│ discoverymethod │
│ --- │
│ str │
╞═════════════════╡
│ Radial Velocity │
└─────────────────┘
>>> df.filter(pl.col("pl_rade") > 100.0).first().collect()
shape: (1, 287)
┌───────┬───────────────┬─────────────┬───────────┬─────┬─────────┬──────────┬───────────┬──────────────┐
│ rowid ┆ pl_name ┆ hostname ┆ pl_letter ┆ ... ┆ st_nrvc ┆ st_nspec ┆ pl_nespec ┆ pl_ntranspec │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════════════╪═════════════╪═══════════╪═════╪═════════╪══════════╪═══════════╪══════════════╡
│ 14325 ┆ Kepler-1698 b ┆ Kepler-1698 ┆ b ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└───────┴───────────────┴─────────────┴───────────┴─────┴─────────┴──────────┴───────────┴──────────────┘
- rowid is a data column included in the original CSV.
About missing values
In Polars, since data types are based on the Apache Arrow format, missing values are represented as null values.
>>> df = pl.DataFrame({"value": [1, None, 3]})
>>> df
shape: (3, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌┤
│ 3 │
└───────┘
>>> df.mean()
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 2.0 │
└───────┘
While you can intentionally use NaN, it's important to note that NaN is not treated as a missing value, so care is needed when performing operations like calculating the mean. (It's also worth noting that np.nan is of float type.)
>>> import numpy as np
>>> df = pl.DataFrame({"value": [1, np.nan, 3]})
>>> df
shape: (3, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
├╌╌╌╌╌╌╌┤
│ NaN │
├╌╌╌╌╌╌╌┤
│ 3.0 │
└───────┘
>>> df.mean()
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ NaN │
└───────┘
>>> df.fill_nan(None).mean()
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 2.0 │
└───────┘
Appendix: Library Structure
Regarding the polars repository, it can be broadly divided into the Core part implemented in Rust and the interface part implemented in Python. The former resides under the polars directory, and the latter under the py-polars directory. Each directory has a Makefile, so running something like make test will execute everything from the build to the tests. For the Python side, the environment is set up using venv, so after building, you can easily perform debugging by executing source venv/bin/activate in the py-polars directory.
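The workflow described above, as a rough sketch (directory names reflect the repository layout at the time of writing and may have changed since):

```shell
git clone https://github.com/pola-rs/polars.git
cd polars/py-polars

# Build the extension and run the Python-side test suite
make test

# Enter the venv the Makefile set up, for interactive debugging
source venv/bin/activate
python -c "import polars; print(polars.__version__)"
```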
Example of source code tracing
Tracing the source code using the example of to_dot(), which is called within show_graph(), results in the following flow:
Note that there is a PyLazyFrame class that bridges the LazyFrame class on the Python side and the LazyFrame class on the Rust side.
If you are aiming to actually contribute, please also take a look at CONTRIBUTING.md.
[1] It can technically be used with Node.js as well, but judging by the repository's update frequency and issues, it seems the main maintainers are not experts in TS/JS and leave this area to community contributions.
[2] The CSV data used in the examples is an exoplanet data catalog provided by the NASA Exoplanet Archive. (61.3MB)
[3] In the catalog, measurement data is provided per paper per row, so there may be multiple rows for the same planet and some columns may be null. The physical accuracy of the results is therefore not guaranteed; this is presented only as an example of Polars operations.
[4] Personally, I don't really think the use cases for pandas and Polars compete that much. I believe the official user guide created the page to widen the entry point by taking advantage of pandas' name recognition.