
【A little knowledge】How to save Python objects in a smaller size

Published on 2024/04/23

1. Preface

Have you ever needed to store the feature data you created in a smaller size while developing a machine learning model? I have.

2. Method

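You can compress the data with a general-purpose compression format such as gzip, bzip2, or xz. Python's standard library handles them with the gzip, bz2, and lzma modules.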

Compress

import gzip
import bz2
import numpy as np

# Example with gzip
data = np.random.rand(1000, 1000)  # Create a numpy array
with gzip.open('data.gz', 'wb') as f:
    np.save(f, data)  # Save with gzip compression

# Example with bz2
with bz2.open('data.bz2', 'wb') as f:
    np.save(f, data)  # Save with bz2 compression
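xz, mentioned above, works the same way through the standard-library lzma module. The following is a minimal sketch, assuming a small array and a temporary directory (both chosen only to keep the example quick and self-cleaning; they are not from the original setup).

```python
import lzma
import os
import tempfile

import numpy as np

# Small array, assumed here only to keep the example fast
data = np.random.rand(100, 100)

# Hypothetical output location: a temporary directory
tmp_dir = tempfile.mkdtemp()
xz_path = os.path.join(tmp_dir, 'data.xz')

with lzma.open(xz_path, 'wb') as f:
    np.save(f, data)  # Save with xz (LZMA) compression

with lzma.open(xz_path, 'rb') as f:
    restored = np.load(f)  # Load from the xz-compressed file

print(np.array_equal(data, restored))
```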

If you want to save a more complex Python object, you can combine these modules with pickle.

import pickle
import gzip
import bz2
import numpy as np

data = np.random.rand(1000, 1000)  # Create a numpy array
dictionary = {'sample':data}

with gzip.open('data.pickle.gz', 'wb') as f:
    pickle.dump(dictionary, f)  # Compress with gzip


with bz2.open('data.pickle.bz2', 'wb') as f:
    pickle.dump(dictionary, f)  # Compress with bz2
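If the file size or the write time matters, the compression level can also be adjusted. Below is a rough sketch using the compresslevel argument of gzip.open (1 is the fastest, 9 the smallest and the default). The low-entropy sample data and the file names are assumptions for illustration; how much the level helps depends on your data.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Low-entropy sample data (assumed), so the effect of the level is visible
dictionary = {'sample': np.zeros((200, 200)) + np.arange(200)}

tmp_dir = tempfile.mkdtemp()  # hypothetical output location
sizes = {}
for level in (1, 9):
    path = os.path.join(tmp_dir, f'data.level{level}.pickle.gz')
    with gzip.open(path, 'wb', compresslevel=level) as f:
        pickle.dump(dictionary, f)  # Compress with the given gzip level
    sizes[level] = os.path.getsize(path)

# File size in bytes for each compression level
print(sizes)
```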

Decompress

You can decompress and load the data in a similar way.

import gzip
import bz2
import numpy as np

# Example with gzip
with gzip.open('data.gz', 'rb') as f:
    data = np.load(f)  # Load from gzip-compressed file
    print(data)

# Example with bz2
with bz2.open('data.bz2', 'rb') as f:
    data = np.load(f)  # Load from bz2-compressed file
    print(data)

Pickled objects can be loaded in the same way.

import pickle
import gzip
import bz2

with gzip.open('data.pickle.gz', 'rb') as f:
    data = pickle.load(f)  # Load from gzip-compressed file
    print(data)

with bz2.open('data.pickle.bz2', 'rb') as f:
    data = pickle.load(f)  # Load from bz2-compressed file
    print(data)
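Since saving and loading mirror each other, the two steps can be wrapped in a pair of small helpers that pick the module from the file extension. This is only a sketch: save_compressed, load_compressed, and the OPENERS table are hypothetical names introduced here, not part of any library.

```python
import bz2
import gzip
import lzma
import os
import pickle
import tempfile

# Map file extensions to the matching standard-library opener
OPENERS = {'.gz': gzip.open, '.bz2': bz2.open, '.xz': lzma.open}


def save_compressed(obj, path):
    """Pickle obj and compress it according to the extension of path."""
    opener = OPENERS[os.path.splitext(path)[1]]
    with opener(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Decompress and unpickle the object stored at path."""
    opener = OPENERS[os.path.splitext(path)[1]]
    with opener(path, 'rb') as f:
        return pickle.load(f)


# Round trip with each format in a temporary directory
tmp_dir = tempfile.mkdtemp()
features = {'sample': [1.0, 2.0, 3.0], 'label': 'demo'}
loaded = {}
for ext in OPENERS:
    path = os.path.join(tmp_dir, 'features.pickle' + ext)
    save_compressed(features, path)
    loaded[ext] = load_compressed(path)

print(all(value == features for value in loaded.values()))
```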

3. Which should you use?

Let's compare these compression methods.
Here, I introduce the comparison table from the article listed in the Reference section.

・Comparison

・Summary
gzip is a good first choice because it offers a good balance between time (both compression and decompression) and compression ratio.
If you need a higher compression ratio, you can consider another method, but the decompression time can become a bottleneck when you load the data for inference with a machine learning model.
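The referenced comparison was measured with the tar command, so it may be worth checking the trade-off on your own data. The sketch below is one rough way to do that in memory with gzip, bz2, and lzma. The sample array is an assumption (low-entropy values that compress well), and the timings are single runs, so treat the numbers as indicative only. Note that uniformly random floats such as np.random.rand output compress very little with any of these methods.

```python
import bz2
import gzip
import lzma
import pickle
import time

import numpy as np

# Assumed sample data: floats taking only 10 distinct values
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(500, 500)).astype(np.float64)
raw = pickle.dumps({'sample': data}, protocol=pickle.HIGHEST_PROTOCOL)

codecs = {
    'gzip': (gzip.compress, gzip.decompress),
    'bz2': (bz2.compress, bz2.decompress),
    'xz': (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(raw)
    compress_time = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_time = time.perf_counter() - start

    assert unpacked == raw  # the round trip must be lossless
    # Store (compressed size / original size, compress time, decompress time)
    results[name] = (len(packed) / len(raw), compress_time, decompress_time)

for name, (ratio, c_time, d_time) in results.items():
    print(f'{name:5s} size ratio={ratio:.3f} '
          f'compress={c_time:.3f}s decompress={d_time:.3f}s')
```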

Reference

tar コマンドの圧縮オプション比較 (Comparison of compression options for the tar command)
