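I couldn't find a place that collects the pitfalls where Polars and Pandas offer a similar operation but with subtly different behavior, so I'm keeping notes here for my own reference.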

Ordering guarantees when running groupby

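Pandas returns the groups in a deterministic order when running groupby (sorted by the group key by default, since sort=True; in order of first appearance with sort=False), whereas Polars does not guarantee any group order by default.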

Pandas

import pandas as pd

df = pd.DataFrame(
    {
        "a": ["a", "b", "a", "b", "c"],
        "b": [1, 2, 1, 3, 3],
    }
)
df.groupby("a").sum()

Output

a	b
a	2
b	5
c	3

Version: 1.5.3

Polars

import polars as pl

df = pl.DataFrame(
    {
        "a": ["a", "b", "a", "b", "c"],
        "b": [1, 2, 1, 3, 3],
    }
)
df.groupby("a").sum()

Output

a	b
b	5
c	3
a	2

Note that Polars can also return the groups in input order if you pass maintain_order=True (although, according to the official documentation, this is slower than the default).

df.groupby("a", maintain_order=True).sum().to_pandas()

Output

a	b
a	2
b	5
c	3

Version: 0.17.3
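
In the Pandas example above, the keys happen to appear in alphabetical order, so sorting by key and keeping first-appearance order give the same result. The following is a minimal sketch with made-up data (not from the original example) that separates the two cases on the Pandas side.

```python
import pandas as pd

# Keys deliberately appear in non-alphabetical order: "c", then "a", then "b".
df = pd.DataFrame(
    {
        "a": ["c", "a", "c", "b", "a"],
        "b": [3, 1, 3, 2, 1],
    }
)

# Default (sort=True): groups come back sorted by key.
sorted_keys = df.groupby("a").sum().index.tolist()

# sort=False: groups come back in order of first appearance.
appearance_keys = df.groupby("a", sort=False).sum().index.tolist()

print(sorted_keys)      # ['a', 'b', 'c']
print(appearance_keys)  # ['c', 'a', 'b']
```

If the goal is to match what Polars returns with maintain_order=True, sort=False is probably the closer Pandas setting.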

chimuichimu

Whether count includes missing values

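Pandas' pandas.DataFrame.count does not count missing values such as None or NaN, whereas Polars' polars.Expr.count does count them in the version tested here (0.17.3). As far as I know, later Polars releases changed Expr.count to skip nulls, so it is worth checking the behavior of the version you have installed.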

Pandas

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["a", "b", "a", "b", "c"],
        "b": [None, 2, 1, 3, np.nan],
    }
)
df.count()

Output

a	b
5	3

Version: 1.5.3

Polars

import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        "a": ["a", "b", "a", "b", "c"],
        "b": [None, 2, 1, 3, np.nan],
    }
)
df.select(pl.all().count())

Output

a	b
5	5

Version: 0.17.3
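
To avoid depending on either library's definition of count, the two numbers can be computed explicitly. This is a minimal sketch on the Pandas side, reusing the data from the example above; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["a", "b", "a", "b", "c"],
        "b": [None, 2, 1, 3, np.nan],
    }
)

non_missing = int(df["b"].count())   # skips None/NaN
total_rows = len(df)                 # counts every row, missing or not
missing = int(df["b"].isna().sum())  # number of missing values

print(non_missing, total_rows, missing)  # 3 5 2
```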

Behavior when a cast encounters out-of-range values

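When casting a numeric column in Pandas, values outside the range of the target type do not raise an error (the values are silently corrupted instead). Polars, on the other hand, raises an error.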

Pandas

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 1, 3, 99999]
    }
)
df["a"] = df["a"].astype(np.int8)
df

Output

index	a
0	1
1	2
2	1
3	3
4	-97

Version: 1.5.3

Polars

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 1, 3, 99999]
    }
)
df.with_columns(pl.col("a").cast(pl.Int8))

Output
The following error is raised:

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
<ipython-input-129-3e17b524f8c7> in <cell line: 6>()
      4     }
      5 )
----> 6 df.with_columns(pl.col("a").cast(pl.Int8))

1 frames
/usr/local/lib/python3.9/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1604             streaming,
   1605         )
-> 1606         return wrap_df(ldf.collect())
   1607 
   1608     def sink_parquet(

ComputeError: strict conversion from `i64` to `i8` failed for value(s) [99999]; if you were trying to cast Utf8 to temporal dtypes, consider using `strptime`

Version: 0.17.3
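
If you want Pandas to fail loudly like Polars does, one option is to check the value range before casting. The sketch below assumes an integer column; checked_astype is a hypothetical helper written for this note, not a Pandas API.

```python
import numpy as np
import pandas as pd


def checked_astype(series: pd.Series, dtype) -> pd.Series:
    """Cast an integer Series, raising instead of silently wrapping (hypothetical helper)."""
    info = np.iinfo(dtype)  # min/max representable values of the target integer type
    out_of_range = series[(series < info.min) | (series > info.max)]
    if not out_of_range.empty:
        raise OverflowError(
            f"values {out_of_range.tolist()} do not fit in {np.dtype(dtype).name}"
        )
    return series.astype(dtype)


s = pd.Series([1, 2, 1, 3, 99999])
try:
    checked_astype(s, np.int8)
    raised = False
except OverflowError as exc:
    raised = True
    print(exc)

# Values that fit are cast as usual.
ok = checked_astype(pd.Series([1, 2, 3]), np.int8)
print(ok.dtype)  # int8
```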

Column names after joining DataFrames that share a column name

When joining DataFrames that share a non-key column name, Pandas produces {column}_x and {column}_y, whereas Polars produces {column} and {column}_right.

Pandas

import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "col": [1, 2, 3],
    }
)
df.merge(df, on="id")

Output

index	id	col_x	col_y
0	a	1	1
1	b	2	2
2	c	3	3

Version: 1.5.3

Polars

import polars as pl

df = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "col": [1, 2, 3],
    }
)
df.join(df, on="id")

Output

id	col	col_right
a	1	1
b	2	2
c	3	3

Version: 0.17.3
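
If code downstream expects the Polars-style names, the suffixes argument of merge can reproduce them in Pandas. A minimal sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "col": [1, 2, 3],
    }
)

# Empty suffix for the left side, "_right" for the right side,
# which mirrors the naming Polars uses by default.
merged = df.merge(df, on="id", suffixes=("", "_right"))

print(merged.columns.tolist())  # ['id', 'col', 'col_right']
```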

Behavior when joining DataFrames on key columns of different types

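When joining DataFrames on key columns whose types differ, Pandas performs the join anyway and, in the example below, the resulting key takes the column type of the left (calling) DataFrame, whereas Polars raises an error.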

Pandas

import numpy as np
import pandas as pd

# pandas merge
df1 = pd.DataFrame(
    {
        "key": [1, 2, 3],
    },
    dtype = np.int8
)
df2 = pd.DataFrame(
    {
        "key": [1, 2],
    },
    dtype = np.int16
)

df = df1.merge(df2, on="key")
display(df)
print(df["key"].dtypes)

Output

index	key
0	1
1	2

int8

Version: 1.5.3

Polars

import polars as pl

# polars join
df1 = pl.DataFrame(
    {
        "key": [1, 2, 3],
    },
    schema={
        "key": pl.Int8
    }
)
df2 = pl.DataFrame(
    {
        "key": [1, 2],
    },
    schema={
        "key": pl.Int16
    }
)

df = df1.join(df2, on="key")
display(df)
print(df.dtypes)

Output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
<ipython-input-25-d3fee5d54aeb> in <cell line: 19>()
     17 )
     18 
---> 19 df = df1.join(df2, on="key")
     20 display(df)
     21 print(df.dtypes)

1 frames
/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
   1704             _eager,
   1705         )
-> 1706         return wrap_df(ldf.collect())
   1707 
   1708     @overload

ComputeError: datatypes of join keys don't match - `key`: i8 on left does not match `key`: i16 on right

Version: 0.20.2
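
One way to make both libraries behave the same is to align the key dtypes explicitly before joining. The sketch below shows this on the Pandas side with the same data as above; df2_aligned is a name introduced here for illustration. In Polars the analogous step would presumably be casting the key with pl.col("key").cast(...) inside with_columns before calling join.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3]}, dtype=np.int8)
df2 = pd.DataFrame({"key": [1, 2]}, dtype=np.int16)

# Cast the right-hand key to the left-hand key's dtype before merging,
# so the result dtype does not depend on implicit casting rules.
df2_aligned = df2.astype({"key": df1["key"].dtype})
merged = df1.merge(df2_aligned, on="key")

print(merged["key"].tolist())  # [1, 2]
print(merged["key"].dtype)     # int8
```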