Comparing query execution time between pandas and polars

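I compared the execution time of the following pandas and polars queries.
I could not write a pandas query that does exactly the same processing, so the measurements also include differences that come from how each query is written.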

pandas query

# Query 1
pd.concat(
    [
        train_pandas.groupby("session")["aid"].transform(lambda x: x),
        train_pandas.groupby("session")["aid"]
        .transform(lambda x: x.shift(-1))
        .rename("aid_next"),
    ],
    axis=1,
).dropna()

polars query

# Query 2
train_pairs = (
    pl.concat([train, test])
    .groupby("session")
    .agg([pl.col("aid"), pl.col("aid").shift(-1).alias("aid_next")])
    .explode(["aid", "aid_next"])
    .drop_nulls()
)[["aid", "aid_next"]]
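To make explicit what both queries are meant to compute, here is a minimal plain-Python sketch of the intended result: for each session, every pair of consecutive `aid` values. The helper name `consecutive_pairs` and the toy events are hypothetical and only serve as a reference for checking the DataFrame queries on small inputs.

```python
from collections import defaultdict


def consecutive_pairs(rows):
    """Return (aid, aid_next) pairs for consecutive events within each session.

    `rows` is an iterable of (session, aid) tuples in event order.
    """
    by_session = defaultdict(list)
    for session, aid in rows:
        by_session[session].append(aid)

    pairs = []
    for aids in by_session.values():
        # zip(aids, aids[1:]) pairs each aid with the next one in the same
        # session; the last aid has no successor, which corresponds to the
        # null row removed by dropna() / drop_nulls().
        pairs.extend(zip(aids, aids[1:]))
    return pairs


# Hypothetical toy data: session 3 has a single event and yields no pair.
events = [(1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (3, 30)]
pairs = consecutive_pairs(events)
print(pairs)
```

Note that the polars query above does not request a stable group order, so when comparing against this reference it is safer to compare the pairs as a sorted list or a multiset.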

Query execution time comparison

N	pandas [s]	polars [s]	pandas/polars
2000000	1.95	0.0641	30.421217
20000000	27.00	0.3910	69.053708
216716096	720.00	7.5600	95.238095

bilzard

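Incidentally, a pandas query that concatenates the columns group by group was slow (probably because the overhead of calling concat for every group is large).

Number of rows: 2M
Query 1: 2.69 sec
Query 3: 13.8 sec

# Query 3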
def get_consecutive_aids(x):
    return pd.concat([x["aid"], x["aid"].shift(-1).rename("aid_next")], axis=1)


train_pandas.groupby("session").apply(get_consecutive_aids).reset_index().drop(
    "level_1", axis=1
).dropna()
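The comparison between Query 1 and Query 3 can be reproduced at small scale with a sketch like the one below. The helper `time_query`, the toy data, and the function names are assumptions made for illustration, not the code used for the numbers above. It also passes `group_keys=False` and selects `[["aid"]]` before `apply`, so the result keeps the original row index across pandas versions and no `reset_index` is needed.

```python
import time

import numpy as np
import pandas as pd


def time_query(fn, repeats=3):
    """Return (best wall-clock seconds, last result) over `repeats` runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


# Hypothetical toy data: 5,000 rows spread over at most 500 sessions.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "session": np.sort(rng.integers(0, 500, 5_000)),
        "aid": rng.integers(0, 1_000, 5_000),
    }
)


def query1_style(df):
    # Same shape as Query 1: two transforms, one concat for the whole frame.
    grouped = df.groupby("session")["aid"]
    return pd.concat(
        [
            grouped.transform(lambda x: x),
            grouped.transform(lambda x: x.shift(-1)).rename("aid_next"),
        ],
        axis=1,
    ).dropna()


def query3_style(df):
    # Same idea as Query 3: one concat per group inside apply.
    def get_consecutive_aids(x):
        return pd.concat([x["aid"], x["aid"].shift(-1).rename("aid_next")], axis=1)

    return (
        df.groupby("session", group_keys=False)[["aid"]]
        .apply(get_consecutive_aids)
        .dropna()
    )


t1, pairs1 = time_query(lambda: query1_style(toy))
t3, pairs3 = time_query(lambda: query3_style(toy))
print(f"Query 1 style: {t1:.4f} s, Query 3 style: {t3:.4f} s")
```

Both functions should return the same pairs; only the timing differs. The absolute numbers on toy data will not match the 2M-row measurements.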

bilzard

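[Addendum] I reran the benchmark on a Mac with the latest pandas and a revised query, and the difference turned out to be far less drastic.
pandas==1.5.1, polars==0.14.31
Does recent pandas support multiprocessing?
Even so, polars is about twice as fast.

I wonder whether the result also changes with things like the pyarrow version?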
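Since the results may depend on library versions, it can help to record them next to each measurement. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are assumptions for illustration.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "polars", "pyarrow", "numpy")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}=={version}")
```

Storing this output together with the timing table makes it easier to compare runs across environments later.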

# pandas==1.5.1
pd.concat(
    [pd.concat([df["b"], df.groupby("a")["b"].shift(-1).rename("b_next")], axis=1)],
    axis=1,
).dropna()

# polars==0.14.31
pl_df.groupby("a").agg([pl.col("b"), pl.col("b").shift(-1).alias("b_next")]).explode(
    ["b", "b_next"]
).drop_nulls()[["b", "b_next"]]

N	pandas [s]	polars [s]	pandas/polars
20000000	0.618	0.354	1.745763
50000000	1.680	0.765	2.196078
100000000	5.020	2.220	2.261261
200000000	11.200	4.970	2.253521

bilzard

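The pandas query was redundant, so I measured again.
Perhaps because I changed how the data is generated, polars is faster than in the example above.
polars is roughly 2 to 4 times faster than pandas.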

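import numpy as np
import pandas as pd
import polars as pl

N = 200_000_000
df = (
    pd.DataFrame(
        {
            # N // 500 distinct keys, i.e. about 500 rows per group on average
            "a": np.random.randint(0, N // 500, N).astype("uint32"),
            "b": np.random.randint(0, N // 100, N).astype("uint32"),
        }
    )
    .sort_values("a")
    .reset_index(drop=True)
)
# Assumption: the polars frame used below is built from the same data.
pl_df = pl.from_pandas(df)
df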

%%time
pd.concat([df["b"], df.groupby("a")["b"].shift(-1).rename("b_next")], axis=1).dropna()

%%time
pl_df.groupby("a").agg([pl.col("b"), pl.col("b").shift(-1).alias("b_next")]).explode(
    ["b", "b_next"]
).drop_nulls()[["b", "b_next"]]

N	pandas [s]	polars [s]	pandas/polars
20000000	0.694	0.390	1.779487
50000000	1.650	0.415	3.975904
100000000	4.100	1.380	2.971014
200000000	10.800	3.590	3.008357

bilzard

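I did the same thing on Ubuntu with the versions aligned.
The query and the data generation method are the same as in the Mac environment [1].
The results are almost the same as on the Mac. Execution time grows roughly linearly with N, and polars is about 3 times faster than pandas.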

pandas==1.5.2
polars==0.14.31

N	pandas [s]	polars [s]	pandas/polars
20000000	0.547	0.202	2.707921
50000000	1.380	0.463	2.980562
100000000	2.770	0.948	2.921941
200000000	5.500	1.930	2.849741
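The "roughly linear" observation can be checked from the Ubuntu table by fitting a line in log-log space: if time grows as N^k, the slope of log(time) against log(N) is k, and k close to 1 means linear scaling. This is a small sketch using the numbers from the table above; the function name `scaling_exponent` is an assumption.

```python
import numpy as np

# Values copied from the Ubuntu table above.
n_rows = np.array([20_000_000, 50_000_000, 100_000_000, 200_000_000], dtype=float)
pandas_time = np.array([0.547, 1.380, 2.770, 5.500])
polars_time = np.array([0.202, 0.463, 0.948, 1.930])


def scaling_exponent(n, t):
    """Least-squares slope of log(t) against log(n)."""
    slope, _intercept = np.polyfit(np.log(n), np.log(t), 1)
    return slope


print("pandas exponent:", scaling_exponent(n_rows, pandas_time))
print("polars exponent:", scaling_exponent(n_rows, polars_time))
print("pandas/polars:", pandas_time / polars_time)
```

Both exponents come out close to 1 for this table, which is consistent with linear scaling over the measured range. With only four points per library, this is a sanity check rather than a precise estimate.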

Footnotes

[1] https://zenn.dev/link/comments/f46352dc2302cc

bilzard

Measured again with a different data cardinality.
Under these conditions pandas and polars were nearly on par.

N = 50_000_000
df = (
    pd.DataFrame(
        {
            "a": np.random.randint(0, N // 20, N).astype("uint32"),
            "b": np.random.randint(0, N // 50, N).astype("uint32"),
        }
    )
    .sort_values("a")
    .reset_index(drop=True)
)

N	pandas [s]	polars [s]	pandas/polars
20000000	0.288	0.263	1.095057
50000000	1.510	1.410	1.070922
100000000	2.990	2.770	1.079422
200000000	5.960	5.780	1.031142
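Compared with the earlier setup, this one has far more groups for the same N (N // 20 keys instead of N // 500, so about 20 rows per group instead of about 500). To vary cardinality systematically, the generator can be parameterized. The sketch below is an assumption-laden helper, not the code behind the table: the name `make_frame` and its parameters are hypothetical, and it uses a seeded `np.random.default_rng` instead of `np.random.randint` so runs are repeatable.

```python
import numpy as np
import pandas as pd


def make_frame(n_rows, n_groups, n_values, seed=0):
    """Build a frame sorted by key `a` with at most `n_groups` distinct keys."""
    rng = np.random.default_rng(seed)
    return (
        pd.DataFrame(
            {
                "a": rng.integers(0, n_groups, n_rows).astype("uint32"),
                "b": rng.integers(0, n_values, n_rows).astype("uint32"),
            }
        )
        .sort_values("a")
        .reset_index(drop=True)
    )


# Small example: 10,000 rows, about 20 rows per group on average.
small = make_frame(n_rows=10_000, n_groups=500, n_values=200)
print(small["a"].nunique(), "groups,", len(small), "rows")
```

Calling it with `n_groups=n_rows // 500` or `n_groups=n_rows // 20` corresponds to the two group densities used in this thread.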

bilzard

Comparison when varying the number of groups (N=50_000_000).
As the number of groups grows, pandas becomes the faster of the two.

n_groups	pandas [s]	polars [s]	pandas/polars
1000000	1.380760	0.864030	1.598048
2000000	1.425849	1.201947	1.186282
5000000	1.581673	2.175279	0.727113
10000000	1.843535	3.818709	0.482764

source code: https://gist.github.com/bilzard/ba58612ecc17dfb6ddd051503ce55f62

bilzard

The difference in execution speed depending on the number of groups may be caused by the difference between the two queries.

bilzard

Results measured with pandas 1.3.5. It is slightly slower than pandas 1.5.1, but the execution times are broadly the same.

n_groups	pandas [s]	polars [s]	pandas/polars
1000000	1.641448	0.845856	1.940576
2000000	1.733902	1.187401	1.460250
5000000	1.892741	2.183766	0.866733
10000000	2.248044	3.763220	0.597372

source: https://gist.github.com/bilzard/6f7ad80b6eb3a6011e7919a10824c37d

bilzard

In the end, it seems the pandas query used in the very first measurement was simply too poorly written.

This scrap was closed on 2022/11/23.