
Polars vs Pandas

Published on 2024/04/20

Intro

This is an example of using Polars, a Python DataFrame package, to compare the time it takes to read and process data against pandas.

Source code

https://github.com/mae-commits/tech_try

Polars try

Create new data with Polars, then write it to a CSV file

import numpy as np
import polars as pl
import pandas as pd

Create a random DataFrame with 10 million rows and 2 columns

pl_df = pl.DataFrame({
    'a': np.random.rand(10000000),
    'b': np.random.rand(10000000)})

%timeit pl_df.write_csv('df.csv', separator=',')


# pd_df = pl_df.to_pandas()

# %timeit pd_df.to_csv('df_new.csv', sep=',')
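
`%timeit` is an IPython magic, so the measurements in this article only run inside a notebook or IPython shell. As a rough sketch for plain scripts, the standard-library `timeit` module can be used instead. The helper name `time_call` and the stand-in workload are assumptions for illustration, and the helper reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit


def time_call(func, repeat=5, number=1):
    """Return the best wall-clock time in seconds for a single call of func."""
    # timeit.repeat returns one total per round; divide by `number` for a per-call time
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


# Stand-in workload (hypothetical); swap in e.g. lambda: pl_df.write_csv('df.csv')
best = time_call(lambda: sum(range(100_000)))
print(f"best of 5 runs: {best:.6f} s")
```

Passing a zero-argument lambda keeps the helper independent of the library being measured, so the same function can time the pandas and the Polars call.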

Pandas sum

import pandas as pd
import numpy as np


# load the 10-million-row CSV written by Polars above
pd_df = pd.read_csv('df.csv')

# compute sum of each column using Pandas
%timeit pd_df.sum()

Polars sum

import polars as pl
import numpy as np


# load the 10-million-row CSV written by Polars above
pl_df = pl.read_csv('df.csv')


# compute sum of each column using Polars
%timeit pl_df.sum()

# sigmoid function used in the apply tests below
from math import exp

def sig(z):
    return 1 / (1 + exp(-z))

Pandas apply

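# index the row by column label; integer keys on a labelled Series are deprecated in recent pandas
%timeit pd_df.apply(lambda a: sig(a['a'] + a['b'] / 100), axis=1)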

Polars apply

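# DataFrame.apply was renamed to map_rows in newer Polars releases; each row arrives as a tuple
%timeit pl_df.map_rows(lambda a: sig(a[0] + a[1] / 100))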
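
Both apply tests above call a Python function once per row, so they mostly measure Python call overhead rather than the libraries themselves. As a minimal sketch of the alternative, the same formula can be computed on whole columns at once. NumPy arrays stand in for the columns `a` and `b` here, and the array size and seed are assumptions for illustration, not the article's benchmark setup.

```python
from math import exp

import numpy as np


def sig(z):
    # scalar sigmoid, same formula as in the article
    return 1 / (1 + exp(-z))


rng = np.random.default_rng(0)
a = rng.random(1_000)  # stand-in for column 'a'
b = rng.random(1_000)  # stand-in for column 'b'

# Row by row: one Python call per row, as in the apply tests
row_wise = np.array([sig(x + y / 100) for x, y in zip(a, b)])

# Whole columns at once: the loop runs inside NumPy
vectorised = 1 / (1 + np.exp(-(a + b / 100)))

# Both approaches should agree up to floating-point rounding
print(np.allclose(row_wise, vectorised))
```

The column-wise form is usually the fairer way to compare DataFrame libraries, since both pandas and Polars are designed around whole-column operations.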

Create a second random DataFrame with 10 million rows and 2 columns (c and d)

pl_df_new = pl.DataFrame({
    'c': np.random.rand(10000000),
    'd': np.random.rand(10000000)})

%timeit pl_df_new.write_csv('df_add.csv', separator=',')

Pandas concat (horizontal)

pd_df = pd.read_csv('df.csv')

pd_df_add = pd.read_csv('df_add.csv')

%timeit pd.concat([pd_df, pd_df_add], axis=1)

# report the size of the CSV file on disk
import os
file_path = './df_add.csv'
# file = pd_df.to_csv(file_path)
file_size = os.path.getsize(file_path) / 1024**3

print(f"The size of the file is {file_size} gigabytes")

Pandas concat (vertical)

%timeit pd.concat([pd_df, pd_df_add], axis=0)

Polars concat (vertical)

pl_df = pl.read_csv('df.csv')

# load the second CSV and rename its columns to match pl_df for vertical concat
pl_df_add = pl.read_csv('df_add.csv')
pl_df_add_new = pl_df_add.clone().rename({
    'c': 'a', 'd': 'b'
    })

%timeit pl.concat([pl_df, pl_df_add_new])

Polars concat (horizontal)

print(pl_df_add.columns)
pl_df_add_new = pl_df_add.clone().rename({
    'c': 'c_copy', 'd': 'd_copy'
    })

%timeit pl.concat([pl_df, pl_df_add_new], how='horizontal')

Conclusion

Overall, Polars was 5-10 times faster than pandas in all of these tests, but the two libraries differ in some points, such as the shape of their results and how their operations are defined. Moreover, there are fewer guides for Polars than for pandas, because pandas is more widely used. For these reasons, it is best to choose Polars when speed is the deciding factor, since speed is its main advantage.
