
【A little knowledge】How to save Python objects in a smaller size


1. Preface

Have you ever needed to store the feature data you created in a smaller size while developing a machine learning model? I have.

2. Method

You can compress the data with a compression format such as gzip, bzip2, or xz.


import gzip
import bz2
import numpy as np

# Example with gzip
data = np.random.rand(1000, 1000)  # Create a numpy array
with gzip.open('data.gz', 'wb') as f:
    np.save(f, data)  # Save with gzip compression

# Example with bz2
with bz2.open('data.bz2', 'wb') as f:
    np.save(f, data)  # Save with bz2 compression

If you want to save a more complex Python object, you can combine these modules with pickle.

import pickle
import gzip
import bz2
import numpy as np

data = np.random.rand(1000, 1000)  # Create a numpy array
dictionary = {'sample':data}

with gzip.open('data.pickle.gz', 'wb') as f:
    pickle.dump(dictionary, f)  # Compress with gzip

with bz2.open('data.pickle.bz2', 'wb') as f:
    pickle.dump(dictionary, f)  # Compress with bz2
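
The xz format mentioned at the start of this section is not shown in the examples. Here is a minimal sketch, assuming the standard library `lzma` module (which reads and writes `.xz` files) is used in the same way as `gzip` and `bz2`. The file name `data.pickle.xz` and the small dictionary are only illustrative, and the file is written to a temporary directory.

```python
import lzma
import os
import pickle
import tempfile

dictionary = {'sample': [0.1, 0.2, 0.3]}  # Illustrative object to store

# Use a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.pickle.xz')  # illustrative file name

    with lzma.open(path, 'wb') as f:
        pickle.dump(dictionary, f)  # Compress with xz (LZMA)

    with lzma.open(path, 'rb') as f:
        restored = pickle.load(f)  # Load from the xz-compressed file

print(restored == dictionary)
```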


You can decompress and load the data in a similar way.

import gzip
import bz2
import numpy as np

# Example with gzip
with gzip.open('data.gz', 'rb') as f:
    data = np.load(f)  # Load with gzip compression

# Example with bz2
with bz2.open('data.bz2', 'rb') as f:
    data = np.load(f)  # Load with bz2 compression

import pickle
import gzip
import bz2

with gzip.open('data.pickle.gz', 'rb') as f:
    data = pickle.load(f)  # Load from gzip-compressed file

with bz2.open('data.pickle.bz2', 'rb') as f:
    data = pickle.load(f)  # Load from bz2-compressed file
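
Since saving and loading differ only in which `open` function is called, the two steps can be wrapped in a pair of helpers that pick the format from the file extension. This is only a sketch: `save_compressed`, `load_compressed`, and the `OPENERS` mapping are hypothetical names introduced here, not part of any library.

```python
import bz2
import gzip
import lzma
import os
import pickle
import tempfile

# Map file extensions to the matching standard-library open() function
OPENERS = {'.gz': gzip.open, '.bz2': bz2.open, '.xz': lzma.open}


def save_compressed(obj, path):
    """Pickle obj into path, compressing according to the file extension."""
    opener = OPENERS[os.path.splitext(path)[1]]
    with opener(path, 'wb') as f:
        pickle.dump(obj, f)


def load_compressed(path):
    """Load a pickled object from a file written by save_compressed."""
    opener = OPENERS[os.path.splitext(path)[1]]
    with opener(path, 'rb') as f:
        return pickle.load(f)


sample = {'sample': [1.0, 2.0, 3.0], 'label': 'demo'}
results = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for ext in OPENERS:
        path = os.path.join(tmp_dir, 'data.pickle' + ext)
        save_compressed(sample, path)
        results[ext] = load_compressed(path)

# Every format should give back an object equal to the original
print(all(loaded == sample for loaded in results.values()))
```

With helpers like these, switching the compression method only requires changing the file extension.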

3. Which should you use?

Let's compare these compression methods.
For this, I refer to the comparison table in the article listed at the end of this section.


gzip is a good first choice because it offers a good balance between time (for both compression and decompression) and compression ratio.
If you need a higher compression ratio, you can consider one of the other methods, but the decompression time can become a bottleneck when you run inference with a machine learning model.
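
The trade-off depends on your data, so it is worth measuring it yourself. The sketch below compresses the same pickled payload in memory with `gzip`, `bz2`, and `lzma` (xz) and reports the size and timing of each. The payload is an illustrative, repetitive list, so the numbers it prints are not representative of real feature data.

```python
import bz2
import gzip
import lzma
import pickle
import time

# Illustrative payload: a pickled dictionary with repetitive content
payload = pickle.dumps({'sample': list(range(1000)) * 50})

codecs = {
    'gzip': (gzip.compress, gzip.decompress),
    'bz2': (bz2.compress, bz2.decompress),
    'xz': (lzma.compress, lzma.decompress),
}

report = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(payload)
    compress_time = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_time = time.perf_counter() - start

    assert unpacked == payload  # The round trip must be lossless
    report[name] = (len(packed), compress_time, decompress_time)

print(f'original: {len(payload)} bytes')
for name, (size, c_time, d_time) in report.items():
    print(f'{name}: {size} bytes, '
          f'compress {c_time:.4f} s, decompress {d_time:.4f} s')
```

Replacing `payload` with your own pickled features should show which method fits your size and latency requirements.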


Comparison of compression options for the tar command (tar コマンドの圧縮オプション比較)
