🌪️

[ML] How to Shrink Data Size?

Published on 2024/05/02

1. Preface

I'll introduce some methods to reduce data size when saving data for ML.
I hope this article helps anyone facing data size issues.

2. Method

First of all, you should consider normalizing the data and changing its data type, like this:

# Shift values so the minimum becomes 0
data = data - data.min()

# Scale to the 0-255 range and convert to uint8
data = (data / data.max() * 255).astype(np.uint8)

If you are not already using an integer type, this method can make the data size smaller, so please try it.

・Example

import numpy as np

# Sample data (float64 by default, 8 bytes per element)
data = np.array([-1.2390452, -2.23907452, -10.32597245, -50.32486374627867582137])
# [ -1.2390452   -2.23907452 -10.32597245 -50.32486375]
print(data.nbytes)

# Shift values so the minimum becomes 0
data = data - data.min()

# Scale to the 0-255 range and convert to uint8
data = (data / data.max() * 255).astype(np.uint8)
# [255 249 207   0]
print(data.nbytes)

・Output

32
4

You have succeeded in reducing the data size to 1/8: float64 uses 8 bytes per element, while uint8 uses only 1 byte.
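Note that this conversion is lossy: each value is rounded to one of 256 levels. If you also keep the original min and max, you can approximately restore the original values later. Here is a minimal sketch; the helper names quantize and dequantize are my own, not from any library:

import numpy as np

def quantize(data):
    # Keep min and max so the conversion can be approximately inverted later
    d_min, d_max = data.min(), data.max()
    # Assumes d_max > d_min
    q = ((data - d_min) / (d_max - d_min) * 255).astype(np.uint8)
    return q, d_min, d_max

def dequantize(q, d_min, d_max):
    # Map uint8 values back to the original range (with rounding error)
    return q.astype(np.float64) / 255 * (d_max - d_min) + d_min

data = np.array([-1.2390452, -2.23907452, -10.32597245, -50.32486374627867582137])
q, d_min, d_max = quantize(data)
print(dequantize(q, d_min, d_max))
# Close to the original values, but not exactly equal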

3. When You Especially Want to Reduce Data Size

3.1 Use a dict and gzip

If you are currently saving data files one by one, consider collecting them in a dict and compressing it with gzip.

import gzip
import pickle

# Register each piece of data to a dict
train_dict = {}
train_dict['data_name'] = train_data

# Save the dict as a gzip-compressed pickle
def save_as_pickle_gzip(data, filepath):
    with gzip.open(filepath, 'wb') as f:
        pickle.dump(data, f)

This reduces the data size, but note that memory usage will increase, because everything stays in memory until you finish registering data to the dict.
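For completeness, here is a sketch of the matching load side, assuming the same gzip + pickle format (the file name train_data.pkl.gz is just an example):

import gzip
import pickle

def load_pickle_gzip(filepath):
    # Load a dict saved by save_as_pickle_gzip
    with gzip.open(filepath, 'rb') as f:
        return pickle.load(f)

train_dict = load_pickle_gzip('train_data.pkl.gz')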

4. When You Especially Want to Reduce Memory Usage

4.1 Save data separately

Conversely, if you are storing everything in a dictionary and you don't have enough memory, you may want to save each piece of data separately.

# Save each array individually (in the case of NumPy)
np.save(SAVE_PATH, data)
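As a sketch, you could write each array from the dict in section 3.1 to its own .npy file, then load only what you need. np.load also supports mmap_mode='r' for .npy files, which accesses the data on demand instead of reading the whole array into memory (the file names here are hypothetical):

import numpy as np

# Save each array to its own file instead of keeping one big dict in memory
for name, array in train_dict.items():
    np.save(f'{name}.npy', array)

# Later, load only the array you need; mmap_mode='r' maps the file
# lazily from disk instead of reading it all at once
data = np.load('data_name.npy', mmap_mode='r')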

5. Summary

Data size issues often come up when working with ML models.
At such times, I think it is worth considering normalizing the data and choosing a smaller data type.
