Handling Missing Data¶
- MCAR – Missing Completely At Random (missingness is independent of any variable, observed or unobserved)
- MAR – Missing At Random (missingness depends only on observed variables, so it can be explained by other features)
- MNAR – Missing Not At Random (missingness depends on the unobserved value itself)
Mean Replacement¶
Definition:
Replace missing numerical values with the column mean.
When to Use:
- Numerical features
- Data roughly normally distributed
- Low percentage of missing values
- Baseline / quick prototype
Pros:
- Very simple and fast
- Keeps dataset size unchanged
- Easy to implement in pipelines
Cons:
- Reduces variance
- Distorts distribution
- Sensitive to outliers
- Can bias correlations
Note
Avoid when distribution is skewed or when missingness is not random.
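A minimal pandas sketch (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy numeric column with one missing entry.
df = pd.DataFrame({"age": [25.0, 30.0, np.nan, 35.0]})

# Mean is computed over observed values only: (25 + 30 + 35) / 3 = 30.0.
mean_age = df["age"].mean()
df["age"] = df["age"].fillna(mean_age)
```

In a scikit-learn pipeline, `SimpleImputer(strategy="mean")` does the same thing.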
Median Replacement¶
Definition:
Replace missing values with the column median.
When to Use:
- Numerical data
- Skewed distributions
- Presence of outliers
- Robust baseline
Pros:
- Robust to outliers
- Simple
- Maintains dataset size
Cons:
- Still reduces variance
- Ignores relationships between features
Note
Often better default than mean in real-world tabular datasets.
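The same idea with the median; the toy `income` column deliberately includes one extreme outlier to show why the median is the safer choice here:

```python
import numpy as np
import pandas as pd

# The outlier (1,000,000) would drag the mean far above typical values,
# but the median of the observed values is still 32,000.
df = pd.DataFrame({"income": [30_000.0, 32_000.0, np.nan, 1_000_000.0]})

median_income = df["income"].median()
df["income"] = df["income"].fillna(median_income)
```

In a pipeline, `SimpleImputer(strategy="median")` is the equivalent.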
Mode Replacement¶
Definition:
Fill missing values with the most frequent value.
When to Use:
- Categorical features
- Low missing percentage
Pros:
- Very simple
- Preserves dataset size
- Works well for low-cardinality categorical features
Cons:
- Can distort class distribution
- Adds bias toward dominant category
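A sketch for a toy categorical column (values are made up):

```python
import pandas as pd

# "red" is the most frequent observed value, so it fills the gap.
df = pd.DataFrame({"color": ["red", "blue", "red", None, "red"]})

mode_value = df["color"].mode()[0]   # mode() ignores missing values
df["color"] = df["color"].fillna(mode_value)
```

`SimpleImputer(strategy="most_frequent")` does the same and also works on string columns.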
Dropping Rows (Listwise Deletion)¶
Definition:
Remove rows containing missing values.
When to Use:
- Missing percentage is very small
- Large dataset
- Missingness is completely random (MCAR)
Pros:
- No artificial data introduced
- Statistically clean if MCAR
Cons:
- Reduces dataset size
- Risk of bias if missingness is not random
- Dangerous for small datasets
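In pandas this is a single call (toy data; note how much of the small frame disappears):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Any row with at least one NaN is removed: 4 rows shrink to 2.
df_clean = df.dropna()
```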
KNN Imputation¶
Definition:
Use k-nearest neighbors to estimate missing values from similar samples.
When to Use:
- Numerical data
- Moderate dataset size
- Features correlated
- Non-linear relationships
Pros:
- Uses multivariate information
- More accurate than mean/median
- Works for complex patterns
Cons:
- Computationally expensive
- Sensitive to feature scaling
- Poor performance in high dimensions
Note
Requires scaling (e.g., StandardScaler) before applying.
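A sketch with scikit-learn's `KNNImputer`, scaling first as the note suggests (the two correlated features are synthetic):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Two correlated toy features; one value is missing.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
])

# Scale first: otherwise the large-range feature dominates the distance.
# StandardScaler ignores NaNs when fitting and keeps them in the output.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each NaN becomes the average of that feature over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
X_imputed = scaler.inverse_transform(imputer.fit_transform(X_scaled))
```

Here the neighbors of the incomplete row are the rows on either side of it, so the missing value lands between their values.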
Regression Imputation¶
Definition:
Predict missing values using regression models trained on other features.
Advanced approach: MICE (Multiple Imputation by Chained Equations).
When to Use:
- Numerical data
- Strong relationships between variables
- Medium to large datasets
Pros:
- Preserves relationships between features
- More statistically sound
- MICE handles uncertainty via multiple imputations
Cons:
- Assumes model is correct
- Risk of data leakage: fit the imputer on training data only, then apply it to the test set
- Computationally heavier
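A sketch of MICE-style imputation with scikit-learn's `IterativeImputer` (still experimental, hence the enable import); the near-linear toy data is synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with a near-exact linear relationship: feature 2 ≈ 10 × feature 1.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
])

# Each feature with missing values is regressed on the other features,
# iterating until the imputed values stabilise.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```

To avoid the leakage mentioned above, call `fit_transform` on the training split only and `transform` on the test split.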
Deep Learning Imputation¶
Definition:
Use neural networks (e.g., autoencoders) to reconstruct missing values.
When to Use:
- Large datasets
- High-dimensional data
- Complex nonlinear dependencies
- Image, text, or complex structured data
Pros:
- Captures complex nonlinear patterns
- Works well in high-dimensional spaces
- Powerful for large-scale data
Cons:
- Requires large data
- Harder to interpret
- Risk of overfitting
- Computationally expensive
Note
More common in research or high-scale ML systems.
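A full autoencoder setup (e.g. in PyTorch) would reconstruct all features jointly; as a lightweight stand-in, this sketch trains a small scikit-learn `MLPRegressor` on complete rows and uses its prediction as the imputed value (all data here is synthetic):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic data: the second feature depends nonlinearly on the first.
x = rng.uniform(0.0, 3.0, size=300)
y = x ** 2 + rng.normal(0.0, 0.05, size=300)

# Pretend y is missing for a row with x = 1.5 (true value ≈ 2.25).
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
net.fit(x.reshape(-1, 1), y)            # train on complete rows only
y_imputed = net.predict([[1.5]])[0]     # prediction fills the missing cell
```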
Get More Data¶
Definition:
Acquire missing values from external systems, users, logs, or other sources.
When to Use:
- Business-critical feature
- Missingness is systematic
- High ROI feature
Pros:
- Most accurate solution
- Improves overall data quality
- No statistical distortion
Cons:
- Costly
- Time-consuming
- Sometimes impossible
Summary¶
- Small dataset + low missing rate = Median or Drop
- Skewed data = Median
- Correlated features = Regression or KNN
- High dimensional + complex data = Deep learning
- High-stakes model (finance/health) = MICE
- Production quick baseline = Median