Rewriting "pandasライブラリ活用入門" in Polars
Introduction
Recently, polars has been drawing attention as a fast data-analysis library and an alternative to pandas.
As part of studying polars, I rewrote the programs from Pythonデータ分析/機械学習のための基本コーディング! Pandasライブラリ活用入門 (インプレスブックス) in polars.
This article compares polars and pandas code, following the programs in the original book's GitHub repository.
So that it can serve as a quick reference, the article is organized as follows:
- Chapter: program
- Section: task
- Subsection: polars code
- Subsection: pandas code
- Section: task
01-intro
Read a TSV file
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.head()
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.head()
Get the row/column counts, column names, index, and dtype information
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
print(df.shape)
print(df.columns)
print('polars has no index')
print(df.dtypes)
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
print(df.shape)
print(df.columns)
print(df.index)
print(df.dtypes)
Convert to a NumPy array
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.to_numpy()
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.values
Summary statistics
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.describe()
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.describe()
Return a Series (a column)
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
country_df = df.get_column('country')
country_df.head(5)
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
country_df = df['country']
country_df.head(5)
Return a DataFrame (a subset)
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
subset = df['country', 'continent', 'year']
subset.head(5)
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
subset = df[['country', 'continent', 'year']]
subset.head(5)
Extract arbitrary rows and columns
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
#print(df[[0, 1]]) # first two rows
#print(df.head()) # first five rows
#print(df[0]) # first row
#print(df[[0, 99]]) # rows 0 and 99
#print(df[-1]) # last row
#print('polars has no iloc')
#print('polars has no iloc')
#print('polars has no iloc')
subset = df[:, ['year', 'pop']] # year and pop columns
#print(subset.head())
subset = df[-5::2, :] # every second row over the last five rows
#print(subset.head())
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
#print('df[[0, 1]] raises an error in pandas')
#print(df.head()) # first five rows
#print(df.loc[0]) # first row
#print(df.loc[[0, 99]]) # rows 0 and 99
#print('df.loc[-1] raises an error in pandas')
#print(df.iloc[0]) # first row
#print(df.iloc[[0, 99]]) # rows 0 and 99
#print(df.iloc[-1]) # last row
subset = df.loc[:, ['year', 'pop']] # year and pop columns
#print(subset.head())
subset = df.iloc[-5::2, :] # every second row over the last five rows
#print(subset.head())
Compute the mean for each group
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path, sep='\t')
df.groupby("year").agg(pl.mean("lifeExp")).sort('year').head(6)
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path, sep='\t')
df.groupby('year')['lifeExp'].mean().head(6)
02-more_pandas
Create a Series (containing mixed types — who actually uses this?)
polars
import polars as pl
s = pl.Series('item', ['banana', 42], dtype = pl.Object)
s
pandas
import pandas as pd
s = pd.Series(['banana', 42])
s
Create a Series (the pandas version sets an index; polars has no index)
polars
import polars as pl
s = pl.Series(['Wes', 'Creator'])
s
pandas
import pandas as pd
s = pd.Series(['Wes', 'Creator'], index=['person', 'who'])
s
Access elements
polars
import polars as pl
s = pl.Series(['Wes', 'Creator'])
print('polars has no loc')
print('polars has no iloc')
print(s[0])
pandas
import pandas as pd
s = pd.Series(['Wes', 'Creator'], index=['person', 'who'])
print(s.loc['person'])
print(s.iloc[0])
print(s[0])
Create a DataFrame (polars has no index)
polars
import polars as pl
scientists = pl.DataFrame(
data={
'name': ['Rosaline Franklin','William Gosset'],
'Occupation':['Chemist','Statistician'],
'Born':['1920-07-25', '1876-06-13'],
'Died':['1958-04-16', '1937-10-16'],
'Age':[37,61]},
columns=['name', 'Occupation', 'Born','Died','Age'])
# convert to dates
scientists = scientists \
.replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False)) \
.replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False))
scientists
pandas
import pandas as pd
scientists=pd.DataFrame(
data={'Occupation':['Chemist','Statistician'],
'Born':['1920-07-25', '1876-06-13'],
'Died':['1958-04-16', '1937-10-16'],
'Age':[37,61]},
index=['Rosaline Franklin','William Gosset'],
columns=['Occupation', 'Born','Died','Age'])
scientists
Read a CSV file (pandas has stronger date handling)
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# convert to dates
scientists = scientists \
.replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False)) \
.replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False))
scientists
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)
scientists
Series methods
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# convert to dates
scientists = scientists \
.replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False)) \
.replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False))
# methods start here
ages = scientists['Age']
#print(ages.head())
#print('polars cannot access a column as an attribute')
#print(type(ages)) # type
#print(ages.mean()) # mean
#print(ages.shape) # number of rows
#print(ages.min()) # minimum
#print(ages.describe()) # summary
#print(ages[ages > ages.mean()]) # rows where age is above the mean
#print(ages[(ages > ages.mean()) & (ages > 75)]) # rows where age is above the mean and above 75
#print(ages[(ages > ages.mean()) & ~(ages > 75)]) # rows where age is above the mean and at most 75
#print(ages + 100) # add 100 to every element
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)
# methods start here
ages = scientists['Age']
#print(ages.head())
ages = scientists.Age
#print(type(ages)) # type
#print(ages.mean()) # mean
#print(ages.shape) # number of rows
#print(ages.min()) # minimum
#print(ages.describe()) # summary
#print(ages[ages > ages.mean()]) # rows where age is above the mean
#print(ages[(ages > ages.mean()) & (ages > 75)]) # rows where age is above the mean and above 75
#print(ages[(ages > ages.mean()) & ~(ages > 75)]) # rows where age is above the mean and at most 75
#print(ages + 100) # add 100 to every element
Filter data with boolean masks
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# convert to dates
scientists = scientists \
.replace('Born', scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False)) \
.replace('Died', scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False))
# methods start here
#print(scientists.filter(pl.col('Age') > pl.col('Age').mean())) # rows where age is above the mean
#print(scientists['Age'] > scientists['Age'].mean()) # a boolean column: True where age is above the mean
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)
# methods start here
#print(scientists[scientists['Age'] > scientists['Age'].mean()]) # rows where age is above the mean
#print(scientists['Age'] > scientists['Age'].mean()) # a boolean column: True where age is above the mean
Parse date columns read as strings and add date-typed columns to the DataFrame
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# birth date
# death date
scientists = scientists.with_columns([
scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('born_dt'),
scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('died_dt')
])
scientists.head()
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)
# birth date
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
scientists['born_dt'] = born_datetime
# death date
scientists['died_dt'] = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
scientists.head()
Drop columns
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
scientists = scientists.drop(['Born', 'Died'])
scientists.head()
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)
scientists = scientists.drop(['Born', 'Died'], axis=1)
scientists.head()
Convert the string birth and death dates to dates, save, then re-read (the conversion does not survive)
polars
import polars as pl
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pl.read_csv(path)
# birth date
# death date
scientists = scientists.with_columns([
scientists.get_column('Born').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('born_dt'),
scientists.get_column('Died').str.strptime(pl.Date, '%Y-%m-%d', strict = False).alias('died_dt')
])
# drop the columns
scientists = scientists.drop(['Born', 'Died'])
# save the DataFrame with the date-typed columns
scientists.write_csv('scientists_clean.csv', sep=",")
# re-read the saved CSV
scientists = pl.read_csv('scientists_clean.csv')
scientists.dtypes
pandas
import pandas as pd
path = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/scientists.csv'
scientists = pd.read_csv(path)
# birth date
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
scientists['born_dt'] = born_datetime
# death date
scientists['died_dt'] = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
# drop the columns
scientists = scientists.drop(['Born', 'Died'], axis=1)
# save the DataFrame with the date-typed columns
scientists.to_csv('scientists_clean.csv', index=False)
# re-read the saved CSV
scientists = pd.read_csv('scientists_clean.csv')
scientists.dtypes
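As the section notes, the date conversion does not survive: CSV stores plain text, so re-reading yields strings again. A binary format keeps the dtypes across the round trip. A minimal sketch using pandas' pickle support (to_parquet/read_parquet behaves the same way if pyarrow is installed); the small frame and file name here are illustrative:

```python
import pandas as pd

# a small stand-in frame with a real datetime column
df = pd.DataFrame({
    'name': ['Rosaline Franklin', 'William Gosset'],
    'born_dt': pd.to_datetime(['1920-07-25', '1876-06-13']),
})
# a binary format preserves the dtypes across save/load
df.to_pickle('scientists_clean.pkl')
df2 = pd.read_pickle('scientists_clean.pkl')
print(df2.dtypes)  # born_dt stays datetime64[ns]
```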
03-intro_plotting
04-concat_merge
Concatenate three DataFrames vertically
polars
import polars as pl
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)
# concatenate the three DataFrames vertically
row_concat = pl.concat([df1, df2, df3])
row_concat
pandas
import pandas as pd
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
# concatenate the three DataFrames vertically
row_concat = pd.concat([df1, df2, df3])
row_concat
Concatenate three DataFrames, then pull out the first row of each original DataFrame from the result
(a consequence of pandas' index feature)
polars
import polars as pl
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)
# concatenate the three DataFrames vertically
row_concat = pl.concat([df1, df2, df3])
row_concat[[0, len(df1), (len(df1) + len(df2))]]
pandas
import pandas as pd
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
# concatenate the three DataFrames vertically
row_concat = pd.concat([df1, df2, df3])
row_concat.loc[0]
Concatenate three DataFrames, then return row 0 of the result
polars
import polars as pl
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)
# concatenate the three DataFrames vertically
row_concat = pl.concat([df1, df2, df3])
row_concat[0]
pandas
import pandas as pd
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
# concatenate the three DataFrames vertically
row_concat = pd.concat([df1, df2, df3])
row_concat.iloc[0]
Concatenate a DataFrame and a Series vertically
polars
import polars as pl
# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pl.read_csv(path1)
# new Series
new_row_series = pl.Series('new', ['n1', 'n2', 'n3', 'n4'])
pl.concat([df1, pl.DataFrame(new_row_series)], how = "diagonal")
pandas
import pandas as pd
# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pd.read_csv(path1)
new_row_series = pd.Series(['n1', 'n2', 'n3', 'n4'])
pd.concat([df1, new_row_series])
Concatenate a DataFrame and another DataFrame vertically
polars
import polars as pl
# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pl.read_csv(path1)
# new DataFrame
new_row_data = pl.DataFrame([['n1', 'n2', 'n3', 'n4']],
columns=['A', 'B', 'C', 'D'])
pl.concat([df1, new_row_data])
pandas
import pandas as pd
# read the file
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
df1 = pd.read_csv(path1)
# new DataFrame
new_row_data = pd.DataFrame([['n1', 'n2', 'n3', 'n4']],
columns=['A', 'B', 'C', 'D'])
pd.concat([df1, new_row_data])
Concatenate three DataFrames with identical column names horizontally (polars forbids duplicate column names, so they must be renamed)
polars
import polars as pl
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)
# concatenate the three DataFrames horizontally
col_concat = pl.concat([df1,
df2.select([
pl.col('A').alias('A1'),
pl.col('B').alias('B1'),
pl.col('C').alias('C1')]),
df3.select([
pl.col('A').alias('A2'),
pl.col('B').alias('B2'),
pl.col('C').alias('C2')])],
how="horizontal")
col_concat
pandas
import pandas as pd
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
# concatenate the three DataFrames horizontally
col_concat = pd.concat([df1, df2, df3], axis=1)
col_concat
Concatenate three DataFrames with identical column names horizontally, then extract the original A columns (the polars code is longer because duplicate column names are forbidden)
polars
import polars as pl
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)
# concatenate the three DataFrames horizontally
col_concat = pl.concat([df1,
df2.select([
pl.col('A').alias('A1'),
pl.col('B').alias('B1'),
pl.col('C').alias('C1')]),
df3.select([
pl.col('A').alias('A2'),
pl.col('B').alias('B2'),
pl.col('C').alias('C2')])],
how="horizontal")
col_concat[['A', 'A1', 'A2']]
pandas
import pandas as pd
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
# concatenate the three DataFrames horizontally
col_concat = pd.concat([df1, df2, df3], axis=1)
col_concat['A']
Concatenate three DataFrames with different column names vertically (pandas automatically adds columns for the differing names)
polars
import polars as pl
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pl.read_csv(path1)
df2 = pl.read_csv(path2)
df3 = pl.read_csv(path3)
# rename the columns
df1.columns=['A','B','C','D']
df2.columns=['E','F','G','H']
df3.columns=['A','H','F','C']
# concatenate the three DataFrames vertically
pl.concat([df1, df2, df3], how = 'diagonal')
pandas
import pandas as pd
# Path
path1 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_1.csv'
path2 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_2.csv'
path3 = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/concat_3.csv'
# read the files
df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)
df3 = pd.read_csv(path3)
# rename the columns
df1.columns=['A','B','C','D']
df2.columns=['E','F','G','H']
df3.columns=['A','H','F','C']
# concatenate the three DataFrames vertically
pd.concat([df1, df2, df3])
Merge DataFrames whose key columns have different names
polars
import polars as pl
# read the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'
person = pl.read_csv(path_survey_person)
site = pl.read_csv(path_survey_site)
survey = pl.read_csv(path_survey_survey)
visited = pl.read_csv(path_survey_visited)
# extract three rows, then combine the DataFrames with differently named key columns
visited_sub = visited[[0, 2, 6]]
o2o = pl.concat([site, visited_sub], how = 'horizontal') # the rows happen to line up, so a horizontal concat stands in for the merge
o2o
pandas
import pandas as pd
# read the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'
person = pd.read_csv(path_survey_person)
site = pd.read_csv(path_survey_site)
survey = pd.read_csv(path_survey_survey)
visited = pd.read_csv(path_survey_visited)
visited_sub = visited.loc[[0, 2, 6]]
o2o = site.merge(visited_sub, left_on=['name'], right_on='site')
o2o
Merge multiple DataFrames
polars
import polars as pl
# read the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'
person = pl.read_csv(path_survey_person)
site = pl.read_csv(path_survey_site)
survey = pl.read_csv(path_survey_survey)
visited = pl.read_csv(path_survey_visited)
# merge the DataFrames
ps = person.join(survey, left_on = 'ident', right_on = 'person', how='outer')
vs = visited.join(site, left_on='site', right_on='name', how='outer')
ps_vs = ps.join(vs, left_on=['taken'], right_on=['ident'], how='outer')
ps_vs.head()
pandas
import pandas as pd
# read the data
path_survey_person = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_person.csv'
path_survey_site = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_site.csv'
path_survey_survey = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_survey.csv'
path_survey_visited = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/survey_visited.csv'
person = pd.read_csv(path_survey_person)
site = pd.read_csv(path_survey_site)
survey = pd.read_csv(path_survey_survey)
visited = pd.read_csv(path_survey_visited)
# merge the DataFrames
ps = person.merge(survey, left_on='ident', right_on='person', how='outer')
vs = visited.merge(site, left_on='site', right_on='name', how='outer')
ps_vs = ps.merge(vs, left_on=['taken'], right_on=['ident'], how='outer')
ps_vs.head()
05-missing
Null checks (polars seems to lack a scalar null test)
polars
from numpy import NaN, NAN, nan
import polars as pl
test_series = pl.Series('nan', [None])
print(test_series.is_null())
test_series = pl.Series('nan', [42])
print(test_series.is_null())
test_series = pl.Series('nan', [NAN])
print(test_series.is_null())
test_series = pl.Series('nan', [NaN])
print(test_series.is_null())
pandas
from numpy import NaN, NAN, nan
import pandas as pd
print(pd.isnull(nan))
print(pd.isnull(42))
print(pd.isnull(NAN))
print(pd.isnull(NaN))
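For a single scalar value, plain Python covers what pd.isnull does here, without either library. A small sketch (is_missing is a hypothetical helper, not part of either API; NaN is the only float that is not equal to itself):

```python
import math

def is_missing(x):
    # None is Python's null; a float NaN is detected with math.isnan
    return x is None or (isinstance(x, float) and math.isnan(x))

print(is_missing(float('nan')))  # True
print(is_missing(42))            # False
print(is_missing(None))          # True
```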
Aggregate (groupby / agg)
polars
import polars as pl
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)
ebola.groupby('Cases_Guinea').agg(pl.count()).sort('count', reverse=True).head()
pandas
import pandas as pd
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)
ebola.Cases_Guinea.value_counts(dropna=False).head()
Count the null values
polars
import polars as pl
import numpy as np
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)
np.count_nonzero( ebola.select(pl.all().is_null()).to_pandas() )
pandas
import pandas as pd
import numpy as np
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)
np.count_nonzero( ebola.isnull() )
Replace null values
polars
import polars as pl
import numpy as np
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)
#print(ebola.fill_null(0)[:10, :5]) # fill nulls with 0; show the first 10 rows and 5 columns
#print(ebola.fill_null(strategy="forward")[:10, :5]) # fill nulls with the previous row's value; show the first 10 rows and 5 columns
#print(ebola.fill_null(strategy="backward")[:10, :5]) # fill nulls with the next row's value; show the first 10 rows and 5 columns
#print(ebola.interpolate()[:10, :5]) # interpolate; show the first 10 rows and 5 columns
pandas
import pandas as pd
import numpy as np
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)
#print(ebola.fillna(0).iloc[:10, :5]) # fill nulls with 0; show the first 10 rows and 5 columns
#print(ebola.fillna(method='ffill').iloc[:10, :5]) # fill nulls with the previous row's value; show the first 10 rows and 5 columns
#print(ebola.fillna(method='bfill').iloc[:10, :5]) # fill nulls with the next row's value; show the first 10 rows and 5 columns
#print(ebola.interpolate().iloc[:10, :5]) # interpolate; show the first 10 rows and 5 columns
Replace null values and add the filled column to the DataFrame
polars
import polars as pl
import numpy as np
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)
ebola = ebola.with_columns([
pl.col('Cases_Guinea').fill_null(strategy = 'min').alias('Cases_Guinea_filled')
])
ebola.head()
pandas
import pandas as pd
import numpy as np
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)
ebola['Cases_Guinea_filled'] = ebola['Cases_Guinea'].fillna(ebola['Cases_Guinea'].min())
ebola.head()
06-tidy
Melt the pivot table on the 'religion' column; rename the value column from the default ('value')
polars
import polars as pl
path_pew = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/pew.csv'
pew = pl.read_csv(path_pew)
pew_long = pew.melt(id_vars = 'religion', value_name='count')
pew_long.head()
pandas
import pandas as pd
path_pew = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/pew.csv'
pew = pd.read_csv(path_pew)
pew_long = pd.melt(pew, id_vars='religion', value_name='count')
pew_long.head()
Melt the pivot table on ['year', 'artist', 'track', 'time', 'date.entered']; rename the variable column from its default ('variable') to 'week', and rename the value column from its default ('value').
polars
import polars as pl
path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pl.read_csv(path_billboard)
billboard_long = billboard.melt(
id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
variable_name='week',
value_name='rating'
)
billboard_long.head()
pandas
import pandas as pd
path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pd.read_csv(path_billboard)
billboard_long = pd.melt(
billboard,
id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
var_name='week',
value_name='rating'
)
billboard_long.head()
Melt the Ebola pivot table by reporting date, then add status (cases/deaths) and country columns.
polars
import polars as pl
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pl.read_csv(path_ebola, parse_dates=True)
# melt
ebola_long = ebola.melt(id_vars=['Date', 'Day'])
# split
split_values = ebola_long.select(
pl.col('variable').str.split_exact('_', 1)
).unnest('variable')
split_values.columns = ['status', 'country']
ebola_long = pl.concat([ebola_long, split_values], how = 'horizontal')
ebola_long.head()
pandas
import pandas as pd
path_ebola = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/country_timeseries.csv'
ebola = pd.read_csv(path_ebola, parse_dates=True)
# melt
ebola_long = pd.melt(ebola, id_vars=['Date', 'Day'])
# string accessor
variable_split = ebola_long['variable'].str.split('_')
status_values = variable_split.str.get(0)
country_values = variable_split.str.get(1)
ebola_long['status'] = status_values
ebola_long['country'] = country_values
ebola_long.head()
Aggregate the maximum and minimum temperatures over the time series
polars
import polars as pl
path_weather = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/weather.csv'
weather = pl.read_csv(path_weather, parse_dates=True)
# melt
weather_melt = weather.melt(
id_vars=['id', 'year', 'month', 'element'],
variable_name = 'day',
value_name='temp'
)
# cast the strings to numbers and drop rows with nulls
weather_melt = weather_melt \
.replace('temp', weather_melt.get_column('temp').cast(pl.Float32, strict = False)) \
.drop_nulls()
weather_tidy = weather_melt \
.groupby(['id', 'year', 'month', 'day']) \
.agg([
pl.col('temp').filter(pl.col('element') == 'tmax').mean().alias('tmax'),
pl.col('temp').filter(pl.col('element') == 'tmin').mean().alias('tmin')
])
weather_tidy.sort(['id', 'year', 'month']).head()
pandas
import pandas as pd
path_weather = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/weather.csv'
weather = pd.read_csv(path_weather, parse_dates=True)
weather_melt = pd.melt(
weather,
id_vars=['id', 'year', 'month', 'element'],
var_name='day',
value_name='temp'
)
weather_tidy = weather_melt.pivot_table(
index=['id', 'year', 'month', 'day'],
columns='element',
values='temp'
)
weather_tidy.reset_index().head()
I wasn't quite sure of the goal here (it normalizes the data into a songs table and a ratings table)
polars
import polars as pl
path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pl.read_csv(path_billboard)
billboard_long = billboard.melt(
id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
variable_name = 'week',
value_name='rating'
)
billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
billboard_songs = billboard_songs.unique()
billboard_songs = billboard_songs.with_columns([
pl.Series('id', list(range(len(billboard_songs))))
])
billboard_ratings = billboard_long.join(billboard_songs, on=['year', 'artist', 'track', 'time'])
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']]
pandas
import pandas as pd
path_billboard = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/billboard.csv'
billboard = pd.read_csv(path_billboard)
billboard_long = pd.melt(
billboard,
id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
var_name='week',
value_name='rating'
)
billboard_songs = billboard_long[['year', 'artist', 'track', 'time']]
billboard_songs = billboard_songs.drop_duplicates()
billboard_songs['id'] = range(len(billboard_songs))
billboard_ratings = billboard_long.merge(
billboard_songs, on=['year', 'artist', 'track', 'time']
)
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']]
07-apply
Apply a function to each element of a Series
polars
import polars as pl
df=pl.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def my_sq(x):
# assert isinstance(x, int)
return x ** 2
df['a'].apply(my_sq)
pandas
import pandas as pd
df=pd.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def my_sq(x):
# assert isinstance(x, int)
return x ** 2
df['a'].apply(my_sq)
Apply a function that takes each Series element plus a hyperparameter
polars
import polars as pl
df=pl.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def my_exp(x, e):
return x ** e
df['a'].apply(lambda x: my_exp(x, 10))
pandas
import pandas as pd
df=pd.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def my_exp(x, e):
return x ** e
df['a'].apply(my_exp, e=10)
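Extra arguments can also be bound up front with functools.partial from the standard library, which then reads the same for both libraries' apply. A sketch (my_exp is the function from the section above):

```python
from functools import partial

def my_exp(x, e):
    return x ** e

# bind the hyperparameter once; the result is a one-argument callable,
# so it can be passed to apply in either polars or pandas
my_exp_10 = partial(my_exp, e=10)
print(my_exp_10(2))  # 1024
```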
Apply a function to each Series (column) of a DataFrame
polars
import polars as pl
df=pl.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def print_me(x):
print(x)
df.select([
pl.col('a').apply(print_me),
pl.col('b').apply(print_me)
])
pandas
import pandas as pd
df=pd.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def print_me(x):
print(x)
df.apply(print_me, axis=0)
Apply a function to each Series of a DataFrame, part 2
polars
import polars as pl
df=pl.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def avg_3_apply(col):
x = col[0]
y = col[1]
z = col[2]
return (x + y + z) / 3
print(avg_3_apply(df.get_column('a')))
print(avg_3_apply(df.get_column('b')))
pandas
import pandas as pd
df=pd.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def avg_3_apply(col):
x = col[0]
y = col[1]
z = col[2]
return (x + y + z) / 3
df.apply(avg_3_apply)
Apply a function to each row of a DataFrame
polars
import polars as pl
df=pl.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def avg_2_apply(row):
x = row[0]
y = row[1]
return (x + y) / 2
df.apply(lambda x: avg_2_apply(x))
pandas
import pandas as pd
df=pd.DataFrame({'a':[10,20,30],
'b':[20,30,40]})
def avg_2_apply(row):
x = row[0]
y = row[1]
return (x + y) / 2
df.apply(avg_2_apply, axis=1)
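As a side note, row-wise apply is usually the slow path in pandas; the same two-column average can be computed vectorized. A sketch on the same small frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})
# row-wise mean without apply: axis=1 averages across the columns
row_means = df.mean(axis=1)
print(row_means)  # 15.0, 25.0, 35.0
```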
08-groupby
Compute the mean life expectancy per year
polars
import polars as pl
path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path_gapminder, sep='\t')
df.groupby('year').agg(pl.col('lifeExp').mean())
pandas
import pandas as pd
path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path_gapminder, sep='\t')
df.groupby('year')['lifeExp'].mean()
Compute the mean life expectancy per year and continent
polars
import polars as pl
path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pl.read_csv(path_gapminder, sep='\t')
df.groupby(['year', 'continent']).agg(pl.col('lifeExp').mean()).sort(['year', 'continent'])
pandas
import pandas as pd
path_gapminder = r'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'
df = pd.read_csv(path_gapminder, sep='\t')
df.groupby(['year', 'continent'])['lifeExp'].mean()
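The pandas result above is a Series with a MultiIndex; to get a flat DataFrame like the polars output, reset_index can be chained. A sketch on a tiny stand-in frame (the values are made up; the real data is the gapminder file above):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'continent': ['Asia', 'Europe', 'Asia', 'Asia'],
    'lifeExp': [40.0, 60.0, 45.0, 47.0],
})
# mean() yields a MultiIndex Series; reset_index turns it into columns
flat = df.groupby(['year', 'continent'])['lifeExp'].mean().reset_index()
print(flat)
```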
09-models
Mostly integration with scikit-learn and similar libraries, so omitted
Closing thoughts
polars felt comfortable to use, with a style close to C# and functional programming languages.
You do have to spell out things pandas handles behind the scenes, such as dropping or replacing null values, but I expect this to improve going forward.
References
- Pythonデータ分析/機械学習のための基本コーディング! Pandasライブラリ活用入門 (インプレスブックス)
- Original: Pandas for Everyone: Python Data Analysis, 2nd Edition
- GitHub repository for Pandas for Everyone