Handling Missing Data¶
- MCAR – Missing Completely At Random (missingness is independent of any variable, observed or unobserved)
- MAR – Missing At Random (missingness depends only on observed variables, so it can be explained by other features)
- MNAR – Missing Not At Random (missingness depends on the unobserved value itself)
Mean Replacement¶
Definition:
Replace missing numerical values with the column mean.
When to Use:
- Numerical features
- Data roughly normally distributed
- Low percentage of missing values
- Baseline / quick prototype
Pros:
- Very simple and fast
- Keeps dataset size unchanged
- Easy to implement in pipelines
Cons:
- Reduces variance
- Distorts distribution
- Sensitive to outliers
- Can bias correlations
Note
Avoid when distribution is skewed or when missingness is not random.
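A minimal pandas sketch (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy numeric column with one missing entry.
df = pd.DataFrame({"age": [25.0, 30.0, np.nan, 35.0]})

# Mean is computed over observed values only: (25 + 30 + 35) / 3 = 30.0.
mean_age = df["age"].mean()
df["age"] = df["age"].fillna(mean_age)
```

In a scikit-learn pipeline, `SimpleImputer(strategy="mean")` does the same thing.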
Median Replacement¶
Definition:
Replace missing values with the column median.
When to Use:
- Numerical data
- Skewed distributions
- Presence of outliers
- Robust baseline
Pros:
- Robust to outliers
- Simple
- Maintains dataset size
Cons:
- Still reduces variance
- Ignores relationships between features
Note
Often better default than mean in real-world tabular datasets.
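The same idea with the median; the toy `income` column deliberately includes one extreme outlier to show why the median is the safer choice here:

```python
import numpy as np
import pandas as pd

# The outlier (1,000,000) would drag the mean far above typical values,
# but the median of the observed values is still 32,000.
df = pd.DataFrame({"income": [30_000.0, 32_000.0, np.nan, 1_000_000.0]})

median_income = df["income"].median()
df["income"] = df["income"].fillna(median_income)
```

In a pipeline, `SimpleImputer(strategy="median")` is the equivalent.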
Mode Replacement¶
Definition:
Fill missing values with the most frequent value.
When to Use:
- Categorical features
- Low missing percentage
Pros:
- Very simple
- Preserves dataset size
- Works well for low-cardinality categorical features
Cons:
- Can distort class distribution
- Adds bias toward dominant category
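A sketch for a toy categorical column (values are made up):

```python
import pandas as pd

# "red" is the most frequent observed value, so it fills the gap.
df = pd.DataFrame({"color": ["red", "blue", "red", None, "red"]})

mode_value = df["color"].mode()[0]   # mode() ignores missing values
df["color"] = df["color"].fillna(mode_value)
```

`SimpleImputer(strategy="most_frequent")` does the same and also works on string columns.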
Dropping Rows (Listwise Deletion)¶
Definition:
Remove rows containing missing values.
When to Use:
- Missing percentage is very small
- Large dataset
- Missingness is completely random (MCAR)
Pros:
- No artificial data introduced
- Statistically clean if MCAR
Cons:
- Reduces dataset size
- Risk of bias if missingness is not random
- Dangerous for small datasets
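In pandas this is a single call (toy data; note how much of the small frame disappears):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Any row with at least one NaN is removed: 4 rows shrink to 2.
df_clean = df.dropna()
```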
KNN Imputation¶
Definition:
Use k-nearest neighbors to estimate missing values from similar samples.
When to Use:
- Numerical data
- Moderate dataset size
- Features correlated
- Non-linear relationships
Pros:
- Uses multivariate information
- More accurate than mean/median
- Works for complex patterns
Cons:
- Computationally expensive
- Sensitive to feature scaling
- Poor performance in high dimensions
Note
Requires scaling (e.g., StandardScaler) before applying.
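A sketch with scikit-learn's `KNNImputer`, scaling first as the note suggests (the two correlated features are synthetic):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Two correlated toy features; one value is missing.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
])

# Scale first: otherwise the large-range feature dominates the distance.
# StandardScaler ignores NaNs when fitting and keeps them in the output.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each NaN becomes the average of that feature over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))
```

Here the neighbors of the incomplete row are the rows on either side of it, so the missing value lands between their values.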
Regression Imputation¶
Definition:
Predict missing values using regression models trained on other features.
Advanced approach: MICE (Multiple Imputation by Chained Equations).
When to Use:
- Numerical data
- Strong relationships between variables
- Medium to large datasets
Pros:
- Preserves relationships between features
- More statistically sound
- MICE handles uncertainty via multiple imputations
Cons:
- Assumes model is correct
- Risk of data leakage: fit the imputer on training data only, then apply it to the test set
- Computationally heavier
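A sketch of MICE-style imputation with scikit-learn's `IterativeImputer` (still experimental, hence the enable import); the near-linear toy data is synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with a near-exact linear relationship: feature 2 ≈ 10 × feature 1.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
])

# Each feature with missing values is regressed on the other features,
# iterating until the imputed values stabilise.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```

To avoid the leakage mentioned above, call `fit_transform` on the training split only and `transform` on the test split.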
Deep Learning Imputation¶
Definition:
Use neural networks (e.g., autoencoders) to reconstruct missing values.
When to Use:
- Large datasets
- High-dimensional data
- Complex nonlinear dependencies
- Image, text, or complex structured data
Pros:
- Captures complex nonlinear patterns
- Works well in high-dimensional spaces
- Powerful for large-scale data
Cons:
- Requires large data
- Harder to interpret
- Risk of overfitting
- Computationally expensive
Note
More common in research or high-scale ML systems.
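A full autoencoder setup (e.g. in PyTorch) would reconstruct all features jointly; as a lightweight stand-in, this sketch trains a small scikit-learn `MLPRegressor` on complete rows and uses its prediction as the imputed value (all data here is synthetic):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic data: the second feature depends nonlinearly on the first.
x = rng.uniform(0.0, 3.0, size=300)
y = x ** 2 + rng.normal(0.0, 0.05, size=300)

# Pretend y is missing for a row with x = 1.5 (true value ≈ 2.25).
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
net.fit(x.reshape(-1, 1), y)            # train on complete rows only
y_imputed = net.predict([[1.5]])[0]     # prediction fills the missing cell
```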
Get More Data¶
Definition:
Acquire missing values from external systems, users, logs, or other sources.
When to Use:
- Business-critical feature
- Missingness is systematic
- High ROI feature
Pros:
- Most accurate solution
- Improves overall data quality
- No statistical distortion
Cons:
- Costly
- Time-consuming
- Sometimes impossible
Summary¶
- Small dataset + low missing rate = Median or Drop
- Skewed data = Median
- Correlated features = Regression or KNN
- High dimensional + complex data = Deep learning
- High-stakes model (finance/health) = MICE
- Production quick baseline = Median