
What is Mode Imputation?

Published 2024/08/24

1. Mode imputation

Mode imputation is a technique for handling missing data, specifically in categorical or ordinal features.
It fills in missing values with the most frequent value (the mode) of that feature.
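At its simplest, this can be done directly in pandas without scikit-learn. A minimal sketch, using a hypothetical Series of colors:

```python
import pandas as pd

# Hypothetical categorical data with missing values
colors = pd.Series(['Red', 'Blue', 'Red', None, 'Blue', 'Red', None])

# The mode is the most frequent value; pandas ignores missing values
# when computing it, so 'Red' (3 occurrences) wins here
mode_value = colors.mode()[0]

# Fill every missing entry with the mode
filled = colors.fillna(mode_value)

print(filled.tolist())
# ['Red', 'Blue', 'Red', 'Red', 'Blue', 'Red', 'Red']
```

`Series.mode()` returns a Series because there can be ties; taking the first entry is a common convention and is also what scikit-learn's `most_frequent` strategy does.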

2. Example

Here is an example of imputation.
Mode imputation applies to categorical or ordinal features, so median imputation is used for the numerical columns in this example.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', None, 'Blue', 'Red', None],
    'Size': [10, 15, None, 14, 13, 10, None, 12],
    'Price': [100, None, 150, 125, None, 110, 130, None]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

numerical_features = ['Size', 'Price']
categorical_features = ['Color']

# Imputer for numerical features: fill with the median
numerical_transformer = SimpleImputer(strategy='median')
# Imputer for categorical features: fill with the mode
# (missing_values=None because the object column uses None, not NaN)
categorical_transformer = SimpleImputer(strategy='most_frequent', missing_values=None)

# Apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Fit the imputers and transform the data in one step
df_transformed = preprocessor.fit_transform(df)

# Convert the transformed array back to a DataFrame
df_transformed = pd.DataFrame(df_transformed, columns=numerical_features + categorical_features)

print("\nDataFrame after imputation:")
print(df_transformed)

・Output

Original DataFrame:
   Color  Size  Price
0    Red  10.0  100.0
1   Blue  15.0    NaN
2  Green   NaN  150.0
3    Red  14.0  125.0
4   None  13.0    NaN
5   Blue  10.0  110.0
6    Red   NaN  130.0
7   None  12.0    NaN

DataFrame after imputation:
   Size  Price  Color
0  10.0  100.0    Red
1  15.0  125.0   Blue
2  12.5  150.0  Green
3  14.0  125.0    Red
4  13.0  125.0    Red
5  10.0  110.0   Blue
6  12.5  130.0    Red
7  12.0  125.0    Red

3. Summary

Mode imputation is particularly useful when the majority of the data points for a feature fall into one category.
However, it can introduce bias when the values are not missing at random, so it should be used with caution depending on the nature of the data and the problem being addressed.
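The bias is easy to see on a toy example: the more values are missing, the more the imputed column over-represents the majority category. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical feature where half the values are missing
s = pd.Series(['A', 'A', 'B', None, None, None])

# Category shares among the observed values: A = 2/3, B = 1/3
before = s.value_counts(normalize=True)

# After mode imputation every missing value becomes 'A',
# so the shares shift to A = 5/6, B = 1/6
after = s.fillna(s.mode()[0]).value_counts(normalize=True)

print(before.to_dict())
print(after.to_dict())
```

If the missing entries were in fact mostly 'B', the imputed data would understate that category, which is exactly the bias described above.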
