What is mode imputation?
1. Mode imputation
Mode imputation is a technique for handling missing data, specifically in categorical or ordinal features.
It fills in missing values with the most frequent value of that feature.
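The idea can be sketched in plain Python without any library. This is only a minimal illustration: the helper name `mode_impute` is hypothetical, and it assumes missing entries are marked with `None`. On ties it keeps whichever value was seen first, which may differ from how a given library breaks ties.

```python
from collections import Counter


def mode_impute(values):
    """Return a copy of `values` with None replaced by the most frequent observed value."""
    # Learn the mode from the observed (non-missing) entries only.
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing observed, so there is no mode to fill with.
        return list(values)
    mode_value = Counter(observed).most_common(1)[0][0]
    # Replace each missing entry with that single value.
    return [mode_value if v is None else v for v in values]


colors = ['Red', 'Blue', 'Green', 'Red', None, 'Blue', 'Red', None]
print(mode_impute(colors))
# ['Red', 'Blue', 'Green', 'Red', 'Red', 'Blue', 'Red', 'Red']
```

Every missing entry receives the same value, which is why the technique is simple but can distort the category distribution.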
2. Example
Here is an example of imputation with scikit-learn.
Mode imputation applies to categorical or ordinal features, so it is used for the 'Color' column, while the numerical columns 'Size' and 'Price' use median imputation instead.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

data = {
    'Color': ['Red', 'Blue', 'Green', 'Red', None, 'Blue', 'Red', None],
    'Size': [10, 15, None, 14, 13, 10, None, 12],
    'Price': [100, None, 150, 125, None, 110, 130, None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

numerical_features = ['Size', 'Price']
categorical_features = ['Color']

# Imputer for numerical features: fill missing values (NaN) with the column median
numerical_transformer = SimpleImputer(strategy='median')

# Imputer for categorical features: fill missing values with the most frequent value.
# missing_values=None assumes 'Color' is an object column that stores missing entries as None.
categorical_transformer = SimpleImputer(strategy='most_frequent', missing_values=None)

# Apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Fit the imputers and transform the data
df_transformed = preprocessor.fit_transform(df)

# Convert the transformed array back to a DataFrame.
# Columns come out in transformer order: numerical first, then categorical.
df_transformed = pd.DataFrame(df_transformed, columns=numerical_features + categorical_features)
print("\nDataFrame after imputation:")
print(df_transformed)
・Output
Original DataFrame:
   Color  Size  Price
0    Red  10.0  100.0
1   Blue  15.0    NaN
2  Green   NaN  150.0
3    Red  14.0  125.0
4   None  13.0    NaN
5   Blue  10.0  110.0
6    Red   NaN  130.0
7   None  12.0    NaN

DataFrame after imputation:
   Size  Price  Color
0  10.0  100.0    Red
1  15.0  125.0   Blue
2  12.5  150.0  Green
3  14.0  125.0    Red
4  13.0  125.0    Red
5  10.0  110.0   Blue
6  12.5  130.0    Red
7  12.0  125.0    Red
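The fill values in the output (12.5 for Size, 125.0 for Price, Red for Color) can be checked by hand. The following is a small sketch using only the standard library; the helper `observed` is a hypothetical name introduced here, and the lists simply repeat the example data.

```python
from statistics import median, mode

size = [10, 15, None, 14, 13, 10, None, 12]
price = [100, None, 150, 125, None, 110, 130, None]
color = ['Red', 'Blue', 'Green', 'Red', None, 'Blue', 'Red', None]


def observed(values):
    """Keep only the non-missing entries."""
    return [v for v in values if v is not None]


# Median of the observed numerical values
size_fill = median(observed(size))    # middle of 10, 10, 12, 13, 14, 15
price_fill = median(observed(price))  # middle of 100, 110, 125, 130, 150

# Most frequent observed category
color_fill = mode(observed(color))

print(size_fill, price_fill, color_fill)
# 12.5 125 Red
```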
3. Summary
Mode imputation is particularly useful when the majority of the data points for a feature fall into one category.
However, it can introduce bias if the missing values are not random, so it should be used with caution depending on the nature of the data and the problem being addressed.
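One way to see the distortion is to compare category shares before and after imputation on the example 'Color' column. This is a rough sketch in plain Python; the helper `shares` is a hypothetical name, and the comparison only shows how the distribution shifts, not whether the missing values were actually random.

```python
from collections import Counter

colors = ['Red', 'Blue', 'Green', 'Red', None, 'Blue', 'Red', None]

# Mode of the observed values, then fill the gaps with it
observed = [c for c in colors if c is not None]
mode_value = Counter(observed).most_common(1)[0][0]
imputed = [mode_value if c is None else c for c in colors]


def shares(values):
    """Return the fraction of entries belonging to each category."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


before = shares(observed)  # shares among the 6 observed values
after = shares(imputed)    # shares among all 8 values after filling

print(before['Red'], after['Red'])
# 0.5 0.625
```

Here the share of Red rises from 3 of 6 to 5 of 8, while Blue and Green shrink. If the missing rows were in fact mostly Blue or Green, the imputed column would overstate Red.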