
Rewriting "Pandasライブラリ活用入門" (Pandas for Everyone) in Polars

Published 2023/02/25

Introduction

Lately, polars has been getting attention as a fast data-analysis library that can replace pandas.
As part of learning polars myself, I rewrote the programs from Pythonデータ分析/機械学習のための基本コーディング! Pandasライブラリ活用入門 (Impress Books; the Japanese edition of Pandas for Everyone) in polars.

This article compares polars and pandas code, following the programs in the original book's GitHub repository.
So that it can be used as a reverse reference, the article is structured as follows:

  • Chapter: program
    • Section: task
      • Subsection: polars code
      • Subsection: pandas code

01-intro

Read a TSV file

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.head()

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.head()

Get the number of rows and columns, column names, index, and dtypes

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
print(df.shape)
print(df.columns)
print('polars has no index')
print(df.dtypes)
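
Since polars has no index, the closest substitute is an explicit row-number column. A minimal sketch, assuming the with_row_count method available in the polars version used here:

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
# add an explicit row-number column as a stand-in for the pandas RangeIndex
print(df.with_row_count('row_nr').head())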

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
print(df.shape)
print(df.columns)
print(df.index)
print(df.dtypes)

Convert to a NumPy array

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.to_numpy()

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.values

Summary statistics

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.describe()

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.describe()

Return a Series (a column)

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
country_df = df.get_column('country')
country_df.head(5)

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
country_df = df['country']
country_df.head(5)

Return a DataFrame (a subset of columns)

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
subset = df['country', 'continent', 'year']
subset.head(5)

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
subset = df[['country', 'continent', 'year']]
subset.head(5)
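
For reference, the same column subset can also be written with select, the expression-based way to pick columns in polars (a minimal sketch):

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
# select() picks columns by name, like the bracket syntax above
subset = df.select(['country', 'continent', 'year'])
subset.head(5)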

Extract specific rows and columns

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
#print(df[[0, 1]]) # first two rows

#print(df.head()) # first five rows

#print(df[0]) # first row

#print(df[[0, 99]]) # rows 0 and 99

#print(df[-1]) # last row

#print('polars has no iloc')

#print('polars has no iloc')

#print('polars has no iloc')

subset = df[:, ['year', 'pop']] # the year and pop columns
#print(subset.head())

subset = df[-5::2, :] # every other row, starting 4 rows before the last
#print(subset.head())

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
#print('in pandas, df[[0, 1]] raises an error')

#print(df.head()) # first five rows

#print(df.loc[0]) # first row

#print(df.loc[[0, 99]]) # rows 0 and 99

#print('in pandas, df.loc[-1] raises an error')

#print(df.iloc[0]) # first row

#print(df.iloc[[0, 99]]) # rows 0 and 99

#print(df.iloc[-1]) # last row

subset = df.loc[:, ['year', 'pop']] # the year and pop columns
#print(subset.head())

subset = df.iloc[-5::2, :] # every other row, starting 4 rows before the last
#print(subset.head())

Compute the mean per group

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.groupby("year").agg(pl.mean("lifeExp")).sort('year').head(6)

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.groupby('year')['lifeExp'].mean().head(6)

02-more_pandas

Create a Series (holding mixed types — who would ever use this?)

polars

import polars as pl
s = pl.Series('item', ['banana', 42], dtype = pl.Object)
s

pandas

import pandas as pd
s = pd.Series(['banana', 42])
s

Create a Series (the pandas version sets an index, but polars has no index)

polars

import polars as pl
s = pl.Series(['Wes', 'Creater'])
s

pandas

import pandas as pd
s = pd.Series(['Wes', 'Creater'], index=['person', 'who'])
s

Access elements

polars

import polars as pl
s = pl.Series(['Wes', 'Creater'])
print('polars has no loc')
print('polars has no iloc')
print(s[0])

pandas

import pandas as pd
s = pd.Series(['Wes', 'Creater'], index=['person', 'who'])
print(s.loc['person'])
print(s.iloc[0])
print(s[0])
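
One way to emulate the labelled lookup in polars is to keep the label as an ordinary column and filter on it. A minimal sketch (the 'label'/'value' column names are made up here):

import polars as pl

# keep what pandas stores as the index as a regular column
s = pl.DataFrame({'label': ['person', 'who'],
                  'value': ['Wes', 'Creater']})

# roughly s.loc['person'] in pandas
print(s.filter(pl.col('label') == 'person')['value'][0])
# roughly s.iloc[0] in pandas
print(s['value'][0])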

Create a DataFrame (polars has no index)

polars

import polars as pl
scientists = pl.DataFrame(
    data={
        'name': ['Rosaline Franklin', 'William Gosset'],
        'Occupation': ['Chemist', 'Statistician'],
        'Born': ['1920-07-25', '1876-06-13'],
        'Died': ['1958-04-16', '1937-10-16'],
        'Age': [37, 61]},
    columns=['name', 'Occupation', 'Born', 'Died', 'Age'])

# convert the date strings to dates
scientists = scientists \
    .replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict=False)) \
    .replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict=False))

scientists

pandas

import pandas as pd
scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],
          'Born': ['1920-07-25', '1876-06-13'],
          'Died': ['1958-04-16', '1937-10-16'],
          'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])
scientists

Read a CSV file (pandas has stronger date-parsing support)

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)

# convert the date strings to dates
scientists = scientists \
    .replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict=False)) \
    .replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict=False))

scientists
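
Depending on the polars version, read_csv can also attempt the date parsing itself; a minimal sketch using the parse_dates flag that appears later in this article (renamed try_parse_dates in newer releases):

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
# ask read_csv to try to parse date-like columns such as Born and Died
scientists = pl.read_csv(path, parse_dates=True)
scientists.dtypes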

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)

scientists

Series methods

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# convert the date strings to dates
scientists = scientists \
    .replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict=False)) \
    .replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict=False))

# methods from here on
ages = scientists['Age']
#print(ages.head())

#print('polars cannot access a column as an attribute')

#print(type(ages)) # type

#print(ages.mean()) # mean

#print(ages.shape) # number of rows

#print(ages.min()) # minimum

#print(ages.describe()) # summary

#print(ages[ages > ages.mean()]) # rows where age is above the mean

#print(ages[(ages > ages.mean()) & (ages > 75)]) # age above the mean and above 75

#print(ages[(ages > ages.mean()) & ~(ages > 75)]) # age above the mean and 75 or below

#print(ages + 100) # add 100 to every element

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)

# methods from here on
ages = scientists['Age']
#print(ages.head())

ages = scientists.Age # columns can also be accessed as attributes

#print(type(ages)) # type

#print(ages.mean()) # mean

#print(ages.shape) # number of rows

#print(ages.min()) # minimum

#print(ages.describe()) # summary

#print(ages[ages > ages.mean()]) # rows where age is above the mean

#print(ages[(ages > ages.mean()) & (ages > 75)]) # age above the mean and above 75

#print(ages[(ages > ages.mean()) & ~(ages > 75)]) # age above the mean and 75 or below

#print(ages + 100) # add 100 to every element

Filter data with boolean expressions

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# convert the date strings to dates
scientists = scientists \
    .replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict=False)) \
    .replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict=False))

# filtering from here on
#print(scientists.filter(pl.col('Age') > pl.col('Age').mean())) # rows where Age is above the mean

#print(scientists['Age'] > scientists['Age'].mean()) # a boolean column: True where Age is above the mean

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)

# filtering from here on
#print(scientists[scientists['Age'] > scientists['Age'].mean()]) # rows where Age is above the mean

#print(scientists['Age'] > scientists['Age'].mean()) # a boolean column: True where Age is above the mean

Parse the date columns that were read as strings and add date-typed columns to the DataFrame

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)

# date of birth
# date of death
scientists = scientists.with_columns([
    scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('born_dt'),
    scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('died_dt')
])

scientists.head()

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)

# date of birth
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
scientists['born_dt'] = born_datetime

# date of death
scientists['died_dt'] = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')

scientists.head()

Drop columns

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)

scientists = scientists.drop(['Born', 'Died'])

scientists.head()

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)

scientists = scientists.drop(['Born', 'Died'], axis=1)

scientists.head()

Convert the birth and death dates (read as strings) to dates, save, then read the file back in (the conversion doesn't survive the round trip)

polars

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)

# date of birth
# date of death
scientists = scientists.with_columns([
    scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('born_dt'),
    scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('died_dt')
])

# drop the original string columns
scientists = scientists.drop(['Born', 'Died'])

# save the DataFrame with the converted date columns
scientists.write_csv('scientists_clean.csv', sep=",")

# read the saved CSV back in
scientists = pl.read_csv('scientists_clean.csv')

scientists.dtypes

pandas

import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)

# date of birth
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
scientists['born_dt'] = born_datetime

# date of death
scientists['died_dt'] = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')

# drop the original string columns
scientists = scientists.drop(['Born', 'Died'], axis=1)

# save the DataFrame with the converted date columns
scientists.to_csv('scientists_clean.csv', index=False)

# read the saved CSV back in
scientists = pd.read_csv('scientists_clean.csv')

scientists.dtypes
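
If the goal is to actually keep the date dtypes across the round trip, a binary format such as Parquet preserves the schema. A minimal sketch on the polars side (the output file name is made up):

import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)

scientists = scientists.with_columns([
    scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict=False).alias('born_dt'),
    scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict=False).alias('died_dt')
]).drop(['Born', 'Died'])

# Parquet stores the schema, so the Date columns survive the round trip
scientists.write_parquet('scientists_clean.parquet')
pl.read_parquet('scientists_clean.parquet').dtypes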

03-intro_plotting

04-concat_merge

Concatenate three DataFrames vertically

polars

import polars as pl

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# concatenate the three DataFrames vertically
row_concat = pl.concat([df1, df2, df3])
row_concat

pandas

import pandas as pd

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)

# concatenate the three DataFrames vertically
row_concat = pd.concat([df1, df2, df3])
row_concat

Concatenate three DataFrames and return a DataFrame made of row 0 of each of the original three

An effect of the pandas index feature

polars

import polars as pl

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# concatenate the three DataFrames vertically
row_concat = pl.concat([df1, df2, df3])

row_concat[[0, len(df1), (len(df1) + len(df2))]]

pandas

import pandas as pd

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)

# concatenate the three DataFrames vertically
row_concat = pd.concat([df1, df2, df3])

row_concat.loc[0]

Concatenate three DataFrames and return row 0 of the concatenated DataFrame

polars

import polars as pl

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# concatenate the three DataFrames vertically
row_concat = pl.concat([df1, df2, df3])

row_concat[0]

pandas

import pandas as pd

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)

# concatenate the three DataFrames vertically
row_concat = pd.concat([df1, df2, df3])

row_concat.iloc[0]

Concatenate a DataFrame and a Series vertically

polars

import polars as pl

# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pl.read_csv(path1)

# a new Series
new_row_series = pl.Series('new', ['n1', 'n2', 'n3', 'n4'])

pl.concat([df1, pl.DataFrame(new_row_series)], how = "diagonal")

pandas

import pandas as pd

# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pd.read_csv(path1)

new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])

pd.concat([df1, new_row_series])

Concatenate two DataFrames vertically

polars

import polars as pl

# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pl.read_csv(path1)

# a new DataFrame
new_row_data = pl.DataFrame([['n1', 'n2', 'n3', 'n4']],
                            columns=['A', 'B', 'C', 'D'])

pl.concat([df1, new_row_data])

pandas

import pandas as pd

# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pd.read_csv(path1)

# a new DataFrame
new_row_data = pd.DataFrame([['n1', 'n2', 'n3', 'n4']],
                            columns=['A', 'B', 'C', 'D'])

pd.concat([df1, new_row_data])

Concatenate three DataFrames with the same column names horizontally (polars does not allow duplicate column names, so they have to be adjusted)

polars

import polars as pl

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# concatenate the three DataFrames horizontally
col_concat = pl.concat([df1,
                        df2.select([
                        pl.col('A').alias('A1'),
                        pl.col('B').alias('B1'),
                        pl.col('C').alias('C1')]),
                        df3.select([
                        pl.col('A').alias('A2'),
                        pl.col('B').alias('B2'),
                        pl.col('C').alias('C2')])],
                       how="horizontal")
col_concat

pandas

import pandas as pd

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)

# concatenate the three DataFrames horizontally
col_concat = pd.concat([df1, df2, df3], axis=1)
col_concat
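
The manual aliasing on the polars side can be shortened; a minimal sketch assuming the Expr.suffix helper, which renames every selected column at once (the resulting names are A_1, B_1, ... rather than A1, B1, ...):

import polars as pl

path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# suffix every column of df2 and df3 so the names no longer collide
col_concat = pl.concat(
    [df1,
     df2.select(pl.all().suffix('_1')),
     df3.select(pl.all().suffix('_2'))],
    how='horizontal')
col_concat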

Concatenate three DataFrames with the same column names horizontally, then extract the original A columns (the polars code is longer because duplicate column names are not allowed)

polars

import polars as pl

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# concatenate the three DataFrames horizontally
col_concat = pl.concat([df1,
                        df2.select([
                        pl.col('A').alias('A1'),
                        pl.col('B').alias('B1'),
                        pl.col('C').alias('C1')]),
                        df3.select([
                        pl.col('A').alias('A2'),
                        pl.col('B').alias('B2'),
                        pl.col('C').alias('C2')])],
                       how="horizontal")
col_concat[['A', 'A1', 'A2']]

pandas

import pandas as pd

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)

# concatenate the three DataFrames horizontally
col_concat = pd.concat([df1, df2, df3], axis=1)
col_concat['A']

Concatenate three DataFrames with different column names vertically (when column names differ, pandas seems to add columns automatically)

polars

import polars as pl

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)

# rename the columns
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'H', 'F', 'C']

# concatenate the three DataFrames vertically
pl.concat([df1, df2, df3], how = 'diagonal')

pandas

import pandas as pd

# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'

# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)

# rename the columns
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'H', 'F', 'C']

# concatenate the three DataFrames vertically
pd.concat([df1, df2, df3])

Merge DataFrames whose key columns have different names

polars

import polars as pl

# load the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'

person = pl.read_csv(path_survey_person)
site = pl.read_csv(path_survey_site)
survey = pl.read_csv(path_survey_survey)
visited = pl.read_csv(path_survey_visited)

# extract three rows, then merge the DataFrames whose key columns have different names
visited_sub = visited[[0, 2, 6]]
o2o = pl.concat([site, visited_sub], how = 'horizontal')
o2o

pandas

import pandas as pd

# load the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'

person = pd.read_csv(path_survey_person)
site = pd.read_csv(path_survey_site)
survey = pd.read_csv(path_survey_survey)
visited = pd.read_csv(path_survey_visited)

visited_sub = visited.loc[[0, 2, 6]]
o2o = site.merge(visited_sub, left_on=['name'], right_on='site')
o2o
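
The polars version above lines the rows up positionally with a horizontal concat; a key-based merge that mirrors the pandas site.merge(...) call can be written with join. A minimal sketch:

import polars as pl

path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'

site = pl.read_csv(path_survey_site)
visited = pl.read_csv(path_survey_visited)

# pick the same three rows, then join on the site name instead of relying on row order
visited_sub = visited[[0, 2, 6]]
o2o = site.join(visited_sub, left_on='name', right_on='site')
o2o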

Merge multiple DataFrames

polars

import polars as pl

# load the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'

person = pl.read_csv(path_survey_person)
site = pl.read_csv(path_survey_site)
survey = pl.read_csv(path_survey_survey)
visited = pl.read_csv(path_survey_visited)

# merge the DataFrames
ps = person.join(survey, left_on = 'ident', right_on = 'person', how='outer')
vs = visited.join(site, left_on='site', right_on='name', how='outer')
ps_vs = ps.join(vs, left_on=['taken'], right_on=['ident'], how='outer')
ps_vs.head()

pandas

import pandas as pd

# load the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'

person = pd.read_csv(path_survey_person)
site = pd.read_csv(path_survey_site)
survey = pd.read_csv(path_survey_survey)
visited = pd.read_csv(path_survey_visited)

# merge the DataFrames
ps = person.merge(survey, left_on='ident', right_on='person', how='outer')
vs = visited.merge(site, left_on='site', right_on='name', how='outer')
ps_vs = ps.merge(vs, left_on=['taken'], right_on=['ident'], how='outer')
ps_vs.head()

05-missing

Null checks (polars doesn't seem to have a null test for a single scalar)

polars

from numpy import NaN, NAN, nan
import polars as pl

test_series = pl.Series('nan', [None])
print(test_series.is_null())

test_series = pl.Series('nan', [42])
print(test_series.is_null())

test_series = pl.Series('nan', [NAN])
print(test_series.is_null())

test_series = pl.Series('nan', [NaN])
print(test_series.is_null())

pandas

from numpy import NaN, NAN, nan
import pandas as pd

print(pd.isnull(nan))

print(pd.isnull(42))

print(pd.isnull(NAN))

print(pd.isnull(NaN))
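
For a single scalar, plain Python covers what the heading alludes to: None can be compared with is, and math.isnan handles NaN (a minimal sketch):

import math

print(None is None)              # True: a missing value in the polars sense
print(math.isnan(float('nan')))  # True: NaN check for a single float
print(math.isnan(42))            # False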

Aggregate (groupby / agg)

polars

import polars as pl

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)

ebola.groupby('Cases_Guinea').agg(pl.count()).sort('count', reverse=True).head()

pandas

import pandas as pd

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)

ebola.Cases_Guinea.value_counts(dropna=False).head()

Count null values

polars

import polars as pl
import numpy as np

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)

np.count_nonzero( ebola.select(pl.all().is_null()).to_pandas() )

pandas

import pandas as pd
import numpy as np

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)

np.count_nonzero( ebola.isnull() )

Fill null values

polars

import polars as pl
import numpy as np

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)

#print(ebola.fill_null(0)[:10, :5]) # fill nulls with 0, then show the first 10 rows and first 5 columns

#print(ebola.fill_null(strategy="forward")[:10, :5]) # forward-fill nulls from the previous row, first 10 rows and 5 columns

#print(ebola.fill_null(strategy="backward")[:10, :5]) # backward-fill nulls from the next row, first 10 rows and 5 columns

#print(ebola.interpolate()[:10, :5]) # interpolate, first 10 rows and 5 columns

pandas

import pandas as pd
import numpy as np

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)

#print(ebola.fillna(0).iloc[:10, :5]) # fill nulls with 0, then show the first 10 rows and first 5 columns

#print(ebola.fillna(method='ffill').iloc[:10, :5]) # forward-fill nulls from the previous row, first 10 rows and 5 columns

#print(ebola.fillna(method='bfill').iloc[:10, :5]) # backward-fill nulls from the next row, first 10 rows and 5 columns

#print(ebola.interpolate().iloc[:10, :5]) # interpolate, first 10 rows and 5 columns

Fill null values and add the filled column to the DataFrame

polars

import polars as pl
import numpy as np

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)

ebola = ebola.with_columns([
    pl.col('Cases_Guinea').fill_null(strategy = 'min').alias('Cases_Guinea_filled')
])

ebola.head()

pandas

import pandas as pd
import numpy as np

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)

ebola['Cases_Guinea_filled'] = ebola['Cases_Guinea'].fillna(ebola['Cases_Guinea'].min())
ebola.head()

06-tidy

Melt the pivot table on the 'religion' column. Rename the value column from the default ('value')

polars

import polars as pl

path_pew = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/pew.csv'
pew = pl.read_csv(path_pew)

pew_long = pew.melt(id_vars = 'religion', value_name='count')
pew_long.head()

pandas

import pandas as pd

path_pew = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/pew.csv'
pew = pd.read_csv(path_pew)

pew_long = pd.melt(pew, id_vars='religion', value_name='count')
pew_long.head()

Melt the pivot table on ['year', 'artist', 'track', 'time', 'date.entered']. Rename the remaining categorical column from the default ('variable') to 'week', and rename the value column from the default ('value').

polars

import polars as pl

path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pl.read_csv(path_billboard)

billboard_long = billboard.melt(
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    value_name='rating'
)
# The melted variable column defaults to 'variable', so rename it to 'week' here
# (melt's variable_name argument, used in later sections, would also do this).
billboard_long.columns = list(map(lambda x: x.replace("variable", "week"), billboard_long.columns))

billboard_long.head()

pandas

import pandas as pd

path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pd.read_csv(path_billboard)

billboard_long = pd.melt(
    billboard,
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    var_name='week',
    value_name='rating'
)

billboard_long.head()

Melt the Ebola pivot table by reporting date (?) and add columns for status (cases / deaths).

polars

import polars as pl

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)

# melt
ebola_long = ebola.melt(id_vars=['Date', 'Day'])

# split
split_values = ebola_long.select(
    pl.col('variable').str.split_exact('_', 1)
).unnest('variable')
split_values.columns = ['status', 'country']

ebola_long = pl.concat([ebola_long, split_values], how = 'horizontal')
ebola_long.head()

pandas

import pandas as pd

path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)

# melt
ebola_long = pd.melt(ebola, id_vars=['Date', 'Day'])

# string accessor
variable_split = ebola_long['variable'].str.split('_')
status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)

ebola_long['status'] = status_values
ebola_long['country'] = country_values
ebola_long.head()

Aggregate the maximum and minimum temperatures in time order

polars

import polars as pl

path_weather = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/weather.csv'
weather = pl.read_csv(path_weather, parse_dates=True)

# melt
weather_melt = weather.melt(
    id_vars=['id', 'year', 'month', 'element'],
    variable_name = 'day',
    value_name='temp'
)
# cast the strings to numbers and drop rows containing nulls
weather_melt = weather_melt \
    .replace('temp', weather_melt.get_column('temp').cast(pl.Float32, strict=False)) \
    .drop_nulls()

weather_tidy = weather_melt \
.groupby(['id', 'year', 'month', 'day']) \
.agg([
    pl.col('temp').filter(pl.col('element') == 'tmax').mean().alias('tmax'),
    pl.col('temp').filter(pl.col('element') == 'tmin').mean().alias('tmin')
])

weather_tidy.sort(['id', 'year', 'month']).head()

pandas

import pandas as pd

path_weather = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/weather.csv'
weather = pd.read_csv(path_weather, parse_dates=True)

weather_melt = pd.melt(
    weather,
    id_vars=['id', 'year', 'month', 'element'],
    var_name='day',
    value_name='temp'
)

weather_tidy = weather_melt.pivot_table(
    index=['id', 'year', 'month', 'day'],
    columns='element',
    values='temp'
)

weather_tidy.reset_index().head()
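
A shorter polars route for the same reshaping, assuming DataFrame.pivot with values/index/columns arguments is available in the version used (a minimal sketch):

import polars as pl

path_weather = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/weather.csv'
weather = pl.read_csv(path_weather)

weather_melt = weather.melt(
    id_vars=['id', 'year', 'month', 'element'],
    variable_name='day',
    value_name='temp'
).with_columns([
    pl.col('temp').cast(pl.Float32, strict=False)
]).drop_nulls()

# spread the element column (tmax / tmin) back out into separate columns
weather_tidy = weather_melt.pivot(values='temp',
                                  index=['id', 'year', 'month', 'day'],
                                  columns='element')
weather_tidy.sort(['id', 'year', 'month']).head()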

I wasn't quite sure what the goal of this one was

polars

import polars as pl

path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pl.read_csv(path_billboard)

billboard_long = billboard.melt(
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    variable_name = 'week',
    value_name='rating'
)

billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
billboard_songs = billboard_songs.unique()
billboard_songs = billboard_songs.with_columns([
    pl.Series('id', list(range(len(billboard_songs))))
])

billboard_ratings = billboard_long.join(billboard_songs, on=['year', 'artist', 'track', 'time'])
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']]

pandas

import pandas as pd

path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pd.read_csv(path_billboard)

billboard_long = pd.melt(
    billboard,
    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
    var_name='week',
    value_name='rating'
)

billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
billboard_songs = billboard_songs.drop_duplicates()
billboard_songs['id'] = range(len(billboard_songs))

billboard_ratings = billboard_long.merge(
    billboard_songs, on=['year', 'artist', 'track', 'time']
)
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']]

07-apply

Apply a function to each element of a Series

polars

import polars as pl

df=pl.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def my_sq(x):
    # assert isinstance(x, int)
    return x ** 2

df['a'].apply(my_sq)

pandas

import pandas as pd

df=pd.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def my_sq(x):
    # assert isinstance(x, int)
    return x ** 2

df['a'].apply(my_sq)

Apply a function that takes each Series element plus a hyperparameter

polars

import polars as pl

df=pl.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def my_exp(x, e):
    return x ** e

df['a'].apply(lambda x: my_exp(x, 10))

pandas

import pandas as pd

df=pd.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def my_exp(x, e):
    return x ** e

df['a'].apply(my_exp, e=10)
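
For completeness, the same power operation can be expressed with polars expressions, without a Python-level apply (a minimal sketch):

import polars as pl

df = pl.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

# vectorised power instead of applying my_exp element by element
df.select((pl.col('a') ** 10).alias('a_pow_10'))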

Apply a function to each Series (column) of a DataFrame

polars

import polars as pl

df=pl.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def print_me(x):
    print(x)

df.select([
    pl.col('a').apply(print_me),
    pl.col('b').apply(print_me)
])

pandas

import pandas as pd

df=pd.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def print_me(x):
    print(x)

df.apply(print_me, axis=0)

Apply a function to each Series (column) of a DataFrame, part 2

polars

import polars as pl

df=pl.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def avg_3_apply(col):
    x = col[0]
    y = col[1]
    z = col[2]
    return (x + y + z) / 3


print(avg_3_apply(df.get_column('a')))
print(avg_3_apply(df.get_column('b')))

pandas

import pandas as pd

df=pd.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def avg_3_apply(col):
    x = col[0]
    y = col[1]
    z = col[2]
    return (x + y + z) / 3

df.apply(avg_3_apply)

Apply a function to each row of a DataFrame

polars

import polars as pl

df=pl.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def avg_2_apply(row):
    x = row[0]
    y = row[1]
    return (x + y) / 2

df.apply(lambda x: avg_2_apply(x))

pandas

import pandas as pd

df=pd.DataFrame({'a':[10,20,30],
                 'b':[20,30,40]})

def avg_2_apply(row):
    x = row[0]
    y = row[1]
    return (x + y) / 2

df.apply(avg_2_apply, axis=1)
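
As an aside, a row-wise apply is usually the slow path in polars; the same row mean can be written with expressions (a minimal sketch):

import polars as pl

df = pl.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

# expression version: no Python-level loop over the rows
df.select(((pl.col('a') + pl.col('b')) / 2).alias('avg'))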

08-groupby

Compute mean life expectancy by year

polars

import polars as pl

path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path_gapminder, sep='\t')

df.groupby('year').agg(pl.col('lifeExp').mean())

pandas

import pandas as pd

path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path_gapminder, sep='\t')

df.groupby('year')['lifeExp'].mean()

Compute mean life expectancy by year and continent

polars

import polars as pl

path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path_gapminder, sep='\t')

df.groupby(['year', 'continent']).agg(pl.col('lifeExp').mean()).sort(['year', 'continent'])

pandas

import pandas as pd

path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path_gapminder, sep='\t')

df.groupby(['year', 'continent'])['lifeExp'].mean()

09-models

Skipped, since this chapter is mostly about integration with scikit-learn and similar libraries.

Closing thoughts

I found polars pleasant to use; it feels close to writing C# or a functional programming language.
You do have to state explicitly things that pandas handles behind the scenes, such as removing or filling null values, but I expect this to improve over time.

Reference

https://github.com/chendaniely/pandas_for_everyone
