
【ML Method】Handling Missing Values in a DataFrame

Published on 2024/08/25

Filling missing values, also known as imputation, is a crucial step in data preprocessing. I'll introduce several methods for imputation, but the optimal one depends on the problem and on how you handle the data downstream.

1. For Tabular Data

Mean: Replace missing values with the mean of the column (used for continuous numerical data).
Median: Replace missing values with the median (useful when the data has outliers).
Mode: Replace missing values with the most frequent value (suitable for categorical data).
Constant: You can fill missing values with a specific constant, such as 0, -1, or a new category like 'Unknown'.

# Assign the result back; fillna(..., inplace=True) on a column selection
# is unreliable under pandas copy-on-write
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())     # mean
df['column_name'] = df['column_name'].fillna(df['column_name'].median())   # median
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])  # mode
df['column_name'] = df['column_name'].fillna(0)                            # constant
df['categorical_column'] = df['categorical_column'].fillna('Unknown')      # new category

・Apply to all columns

df = df.fillna(df.mean(numeric_only=True))    # mean of each numeric column
df = df.fillna(df.median(numeric_only=True))  # median of each numeric column
df = df.apply(lambda col: col.fillna(col.mode()[0]))  # mode of each column
df = df.fillna(0)          # constant for every column
df = df.fillna('Unknown')  # e.g. for object/categorical columns

2. For Time Series Data

Forward Fill (ffill): Propagate the last observed value forward to fill missing data.
Backward Fill (bfill): Use the next observed value to fill missing data backward.
Linear Interpolation: For time-series data, interpolate missing values linearly between available points.
Polynomial/Other Methods: You can also use higher-order polynomials or other interpolation methods, depending on the data (see the sketch after the snippet below).

df['column_name'] = df['column_name'].ffill()  # forward fill
df['column_name'] = df['column_name'].bfill()  # backward fill
df['column_name'] = df['column_name'].interpolate(method='linear')  # linear interpolation
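
As a minimal sketch of the polynomial and other interpolation options mentioned above (the column name and order=2 are illustrative; polynomial interpolation requires SciPy, and time-based interpolation assumes a DatetimeIndex):

# Higher-order polynomial interpolation (order=2 is an illustrative choice; needs SciPy)
df['column_name'] = df['column_name'].interpolate(method='polynomial', order=2)

# Time-aware interpolation when the DataFrame has a DatetimeIndex
df['column_name'] = df['column_name'].interpolate(method='time')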

3. Imputation Using Predictive Models

K-Nearest Neighbors (KNN) Imputation: Fill missing values based on the nearest neighbors' values.
Regression Imputation: Predict missing values using a regression model, where the missing feature is the target, and other features are used as predictors.
Iterative Imputer: A more advanced method in which missing values are imputed over repeated rounds of model predictions, with each incomplete feature modeled from the others.
Multiple Imputation by Chained Equations (MICE): Run the iterative imputation several times to obtain multiple completed datasets, then pool the imputations (or the downstream predictions) for more stable results. Rubin's rules are a popular pooling method (see the sketch after the snippet below).

# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = knn_imputer.fit_transform(df)  # returns a NumPy array

iterative_imputer = IterativeImputer(random_state=0)
df_imputed = iterative_imputer.fit_transform(df)
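
The snippet above covers KNN and basic iterative imputation. Below is a minimal sketch of regression imputation and a MICE-style workflow with scikit-learn; the BayesianRidge estimator, the five random seeds, and the simple averaging at the end are illustrative assumptions rather than full Rubin's-rules pooling.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Regression imputation: each incomplete feature is regressed on the other features
reg_imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
df_reg_imputed = reg_imputer.fit_transform(df)

# MICE-style multiple imputation: draw several completed datasets and pool them
# (simple averaging here; Rubin's rules would also combine the variances)
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df)
    for seed in range(5)
]
df_mice_pooled = np.mean(imputations, axis=0)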

4. Leave as Missing (if allowed)

Sometimes, leaving the missing data as it is and treating "missingness" as a separate category or feature can be beneficial. For example, some models like tree-based algorithms can handle missing values without imputation.
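
A minimal sketch of this approach, assuming a hypothetical numeric column 'column_name' and label column 'target': keep an explicit missing-indicator feature and pass the NaNs directly to a model that supports them, such as scikit-learn's HistGradientBoostingRegressor.

from sklearn.ensemble import HistGradientBoostingRegressor

# Keep the missingness itself as a binary feature
df['column_name_missing'] = df['column_name'].isna().astype(int)

# Histogram-based gradient boosting accepts NaN in the inputs without imputation
model = HistGradientBoostingRegressor()
model.fit(df[['column_name', 'column_name_missing']], df['target'])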

5. Summary

I introduced several imputation methods.
Each method has its pros and cons, and the best approach often depends on the data and the downstream task. You may need to experiment with different techniques to find the best one for your problem.
