California Housing - Feature Engineering¶
Install packages
Append notebooks directory to sys.path
Import packages
Utility scripts:¶
KaggleDataExtractor:
Create data directory
Load dataset
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
<class 'pandas.core.frame.DataFrame'>
Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20433 non-null float64
1 latitude 20433 non-null float64
2 housing_median_age 20433 non-null float64
3 total_rooms 20433 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20433 non-null float64
6 households 20433 non-null float64
7 median_income 20433 non-null float64
8 median_house_value 20433 non-null float64
9 ocean_proximity 20433 non-null object
dtypes: float64(9), object(1)
memory usage: 1.7+ MB
Split¶
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.16 | 37.76 | 36.0 | 2781.0 | 574.0 | 1438.0 | 519.0 | 2.4598 | 155500.0 | NEAR BAY |
| 1 | -119.95 | 36.80 | 30.0 | 1233.0 | 214.0 | 620.0 | 199.0 | 3.4297 | 112500.0 | INLAND |
| 2 | -118.49 | 34.42 | 23.0 | 4166.0 | 756.0 | 2082.0 | 743.0 | 4.4107 | 213400.0 | <1H OCEAN |
| 3 | -122.24 | 37.79 | 27.0 | 1632.0 | 492.0 | 1171.0 | 429.0 | 2.3173 | 125000.0 | NEAR BAY |
| 4 | -121.45 | 36.86 | 11.0 | 1613.0 | 335.0 | 1617.0 | 342.0 | 3.1375 | 146200.0 | INLAND |
Convert the target variable to an array
Baseline model: predict the mean for every median_house_value
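The mean baseline can be sketched as follows (a minimal example with hypothetical `y_train` values, not the real split):

```python
import numpy as np

# Hypothetical target values standing in for median_house_value.
y_train = np.array([155500.0, 112500.0, 213400.0, 125000.0, 146200.0])

# Baseline: predict the training-set mean for every sample.
baseline_prediction = y_train.mean()
y_pred = np.full_like(y_train, baseline_prediction)

# RMSE of the baseline is the floor any real model should beat.
rmse = np.sqrt(np.mean((y_train - y_pred) ** 2))
print(baseline_prediction, rmse)
```

Any model that cannot beat this RMSE is adding no value over simply guessing the mean.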
Dimensionality Reduction¶
As dimensionality increases, data becomes sparse and it is difficult for the model to see patterns because there are no dense regions
- Overfitting becomes easier because the model can fit noise
- Computational cost increases (more memory usage, longer training, higher inference latency)
- Redundancy appears among correlated features, and irrelevant features contribute only noise
- Model stability suffers: small changes in the data can cause large changes in predictions
- The model becomes harder to debug and interpret
Correlation:
- A positive value means that as one variable increases, the other tends to increase as well
- A negative value means that as one variable increases, the other tends to decrease
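A minimal sketch of both correlation signs with a hypothetical toy frame (column names are illustrative, not from the dataset):

```python
import pandas as pd

# Toy data: bedrooms moves with rooms, vacancy moves against it.
df = pd.DataFrame({
    "rooms": [2, 4, 6, 8],
    "bedrooms": [1, 2, 3, 4],   # perfectly positive relationship
    "vacancy": [9, 7, 5, 3],    # perfectly negative relationship
})

corr = df.corr(numeric_only=True)
print(corr.loc["rooms", "bedrooms"])  # +1.0
print(corr.loc["rooms", "vacancy"])   # -1.0
```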
Principal Component Analysis (PCA) — Feature Extraction¶
Why It Is Important¶
- Reduce dimensionality while preserving variance
- Remove correlation and redundant features
- Improve generalization and computational efficiency
When to Use¶
Apply PCA when:
- Data is high-dimensional, correlated, or noisy
- Using models sensitive to distance/geometry:
- KNN, K-means, SVM
- Feature count impacts training/inference cost
Avoid when:
- Using neural networks (learn representations internally)
- Using tree models (need original feature structure)
- Interpretability or original feature meaning is required
Precondition¶
- Apply standardization:
- Zero mean, unit variance
- Prevent dominance of high-variance features
Model Impact¶
- KNN / K-means
  - More stable distance calculations
  - Less noise
- SVM
  - Faster training
  - Less overfitting in high dimensions
- Neural Networks
  - Usually unnecessary
  - Useful only for very high-dimensional, small datasets
- Tree Models
  - Can degrade performance
  - Lose interpretability
Trade-offs¶
| Aspect | PCA Applied | PCA Not Applied |
|---|---|---|
| Dimensionality | Reduced | High |
| Interpretability | Lost | Preserved |
| Noise | Reduced | Higher |
| Distance Stability | Improved | Degraded (high-dim) |
| Training Time | Lower | Higher |
| Information | Partial | Full |
| Complexity | Higher (extra step) | Lower |
Key Constraint¶
- PCA is linear:
- Cannot model non-linear relationships
- Assumes variance ≈ importance
| | total_rooms | total_bedrooms | households |
|---|---|---|---|
| total_rooms | 1.000000 | 0.931023 | 0.918161 |
| total_bedrooms | 0.931023 | 1.000000 | 0.979402 |
| households | 0.918161 | 0.979402 | 1.000000 |
Transform the 3 correlated columns of X_train into 2 principal components while preserving most of the variance
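The transformation above can be sketched with scikit-learn (synthetic correlated data stands in for the real `total_rooms`/`total_bedrooms`/`households` columns):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for three highly correlated columns.
rooms = rng.normal(2500, 800, size=500)
bedrooms = rooms * 0.2 + rng.normal(0, 30, size=500)
households = rooms * 0.19 + rng.normal(0, 25, size=500)
X = np.column_stack([rooms, bedrooms, households])

# Standardize first (zero mean, unit variance), then project 3 -> 2.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (500, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1: little variance lost
```

Because the three columns are nearly collinear, two components retain almost all of the variance.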
Feature Scaling (Normalization / Standardization)¶
Why It Is Important¶
- Ensure comparable feature magnitudes
- Prevent one feature from dominating due to scale
- Stabilize distance computations
- Improve optimization convergence (neural networks)
- Ensure consistent L1/L2 regularization
When to Use¶
Apply scaling when:
- Using distance-based models:
- KNN, K-means, SVM, PCA
- Using gradient-based models:
- Neural networks
- Features have different numeric ranges
Avoid when:
- Using tree-based models (Random Forest, XGBoost)
- Use threshold splits
- Do not rely on distance or scale
Key Constraint¶
- Scaling changes feature scale only:
- Does NOT reduce dimensionality
- Does NOT remove noise
- Does NOT fix curse of dimensionality
Take each value, subtract the mean, and divide by the standard deviation
- Mean = 0
- Standard deviation = 1
Rescale values into the 0-1 range, useful when the maximum is already known, as with RGB values (255)
- Small values between 0 and 1 are easier to work with than values from 0 to 255
Create pipeline with StandardScaler
Create pipeline with Normalizer
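Both pipelines can be sketched as below. Note an assumption here: the 0-1 behaviour described above corresponds to scikit-learn's `MinMaxScaler`; `Normalizer`, by contrast, rescales each *row* to unit norm, so `MinMaxScaler` is used in this sketch:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical matrix with very different column ranges
# (a 0-255 pixel-like column next to a small-valued column).
X = np.array([[0.0, 1.0],
              [127.5, 2.0],
              [255.0, 3.0]])

# Standardization: zero mean, unit variance per column.
standardized = make_pipeline(StandardScaler()).fit_transform(X)
print(standardized.mean(axis=0))  # ~[0, 0]
print(standardized.std(axis=0))   # ~[1, 1]

# Min-max scaling: map each column into [0, 1].
minmaxed = make_pipeline(MinMaxScaler()).fit_transform(X)
print(minmaxed.min(axis=0), minmaxed.max(axis=0))  # [0, 0] and [1, 1]
```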
Categorical Encoding — One-Hot / Dummy Encoding¶
Why It Is Important¶
- Convert categorical variables into numeric representation without introducing ordinal relationships.
- Prevent models from inferring false ordering or artificial distances between categories.
- Preserve categorical independence via binary indicator features.
Example constraint:
- Label encoding:
  - Red = 1, Blue = 2, Green = 3
  - Implies: Green > Blue > Red and distance(Green, Red) = 2 (invalid)
- One-hot encoding:
  - Represent each category as an independent binary vector:
| Red | Blue | Green |
|---|---|---|
| 1 | 0 | 0 |
Environmental Context (When)¶
Apply one-hot encoding when:
- Model requires numeric input with linear or geometric assumptions:
- Linear Regression
- Logistic Regression
- KNN
- K-means
- SVM (especially linear)
- Model behavior depends on:
- Distance metrics
- Linear combinations
Avoid or deprioritize when:
- Using tree-based models:
- Random Forest
- XGBoost
- LightGBM
These models:
- Perform threshold-based splits (e.g., if color == "Red")
- Do not rely on distance
- Do not assume linear relationships
Execution Logic (How)¶
- Enumerate categories: identify all unique values in the categorical feature
- Instantiate binary columns: create one column per category
- Assign indicator values:
  - 1 => category present
  - 0 => otherwise
- Replace original feature: drop the original categorical column and use the binary feature matrix as model input
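The execution steps above can be sketched with pandas (a miniature, hypothetical slice of the `ocean_proximity` column):

```python
import pandas as pd

# Miniature stand-in for the ocean_proximity column.
df = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY"]})

# One binary indicator column per category.
dummies = pd.get_dummies(df["ocean_proximity"])
print(list(dummies.columns))  # ['<1H OCEAN', 'INLAND', 'NEAR BAY']

# Replace the original feature with the indicator matrix.
encoded = pd.concat([df.drop(columns=["ocean_proximity"]), dummies], axis=1)
print(encoded)
```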
Comparative Analysis & Trade-offs¶
| Dimension | One-Hot Encoding Applied | Label Encoding Applied |
|---|---|---|
| Ordinal Assumption | None | Introduced (invalid for nominal) |
| Distance Semantics | Preserved (no artificial order) | Distorted |
| Model Compatibility | Linear, distance-based models | Tree-based models |
| Dimensionality | Increased (one column/category) | Constant (single column) |
| Interpretability | High (explicit categories) | Lower (encoded values ambiguous) |
| Memory Usage | Higher | Lower |
Key Constraints¶
- One-hot encoding increases feature dimensionality:
- Especially problematic with high cardinality (hundreds/thousands of categories)
- Can negatively impact:
- Memory usage
- Training time
- Model generalization (sparsity)
Usage Constraints¶
Avoid one-hot encoding when:
- Feature has high cardinality
- Using tree-based models
- Memory or performance constraints are critical
| | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|
| 0 | False | False | False | True | False |
| 1 | False | True | False | False | False |
| 2 | True | False | False | False | False |
| 3 | False | False | False | True | False |
| 4 | False | True | False | False | False |
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.16 | 37.76 | 36.0 | 2781.0 | 574.0 | 1438.0 | 519.0 | 2.4598 | 155500.0 | NEAR BAY | False | False | False | True | False |
| 1 | -119.95 | 36.80 | 30.0 | 1233.0 | 214.0 | 620.0 | 199.0 | 3.4297 | 112500.0 | INLAND | False | True | False | False | False |
| 2 | -118.49 | 34.42 | 23.0 | 4166.0 | 756.0 | 2082.0 | 743.0 | 4.4107 | 213400.0 | <1H OCEAN | True | False | False | False | False |
| 3 | -122.24 | 37.79 | 27.0 | 1632.0 | 492.0 | 1171.0 | 429.0 | 2.3173 | 125000.0 | NEAR BAY | False | False | False | True | False |
| 4 | -121.45 | 36.86 | 11.0 | 1613.0 | 335.0 | 1617.0 | 342.0 | 3.1375 | 146200.0 | INLAND | False | True | False | False | False |
Drop the indicator column with the fewest occurrences (the dropped category is implied when all remaining indicators are 0)
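A minimal sketch of dropping the rarest indicator (hypothetical category counts; here "ISLAND" occurs least often, as in the dataset):

```python
import pandas as pd

# Hypothetical category column where "ISLAND" is rarest.
s = pd.Series(["NEAR BAY", "INLAND", "ISLAND", "NEAR BAY", "INLAND", "NEAR BAY"])

dummies = pd.get_dummies(s)

# Drop the indicator with the fewest occurrences to avoid the
# dummy-variable trap: the dropped category is implied when all
# remaining columns are 0.
rarest = dummies.sum().idxmin()
reduced = dummies.drop(columns=[rarest])
print(rarest)                 # ISLAND
print(list(reduced.columns))  # ['<1H OCEAN' excluded here; see assumptions]
```

This differs from pandas' `drop_first=True`, which always drops the first column alphabetically rather than the least frequent one.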
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.16 | 37.76 | 36.0 | 2781.0 | 574.0 | 1438.0 | 519.0 | 2.4598 | 155500.0 | NEAR BAY | False | False | True | False |
| 1 | -119.95 | 36.80 | 30.0 | 1233.0 | 214.0 | 620.0 | 199.0 | 3.4297 | 112500.0 | INLAND | False | True | False | False |
| 2 | -118.49 | 34.42 | 23.0 | 4166.0 | 756.0 | 2082.0 | 743.0 | 4.4107 | 213400.0 | <1H OCEAN | True | False | False | False |
| 3 | -122.24 | 37.79 | 27.0 | 1632.0 | 492.0 | 1171.0 | 429.0 | 2.3173 | 125000.0 | NEAR BAY | False | False | True | False |
| 4 | -121.45 | 36.86 | 11.0 | 1613.0 | 335.0 | 1617.0 | 342.0 | 3.1375 | 146200.0 | INLAND | False | True | False | False |
| | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|
| 0 | True | False | False | False | False |
| 1 | True | False | False | False | False |
| 2 | True | False | False | False | False |
| 3 | False | True | False | False | False |
| 4 | True | False | False | False | False |
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -118.07 | 33.87 | 28.0 | 2399.0 | 436.0 | 1613.0 | 429.0 | 3.6339 | 220100.0 | <1H OCEAN | True | False | False | False | False |
| 1 | -118.26 | 34.02 | 40.0 | 1259.0 | 362.0 | 1499.0 | 327.0 | 1.8382 | 126400.0 | <1H OCEAN | True | False | False | False | False |
| 2 | -118.51 | 34.16 | 23.0 | 11154.0 | 1995.0 | 4076.0 | 1809.0 | 5.4609 | 500001.0 | <1H OCEAN | True | False | False | False | False |
| 3 | -120.04 | 36.95 | 36.0 | 1528.0 | 347.0 | 1334.0 | 304.0 | 1.3594 | 48300.0 | INLAND | False | True | False | False | False |
| 4 | -117.91 | 33.65 | 24.0 | 1494.0 | 494.0 | 814.0 | 459.0 | 2.1074 | 181300.0 | <1H OCEAN | True | False | False | False | False |
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -118.07 | 33.87 | 28.0 | 2399.0 | 436.0 | 1613.0 | 429.0 | 3.6339 | 220100.0 | <1H OCEAN | True | False | False | False |
| 1 | -118.26 | 34.02 | 40.0 | 1259.0 | 362.0 | 1499.0 | 327.0 | 1.8382 | 126400.0 | <1H OCEAN | True | False | False | False |
| 2 | -118.51 | 34.16 | 23.0 | 11154.0 | 1995.0 | 4076.0 | 1809.0 | 5.4609 | 500001.0 | <1H OCEAN | True | False | False | False |
| 3 | -120.04 | 36.95 | 36.0 | 1528.0 | 347.0 | 1334.0 | 304.0 | 1.3594 | 48300.0 | INLAND | False | True | False | False |
| 4 | -117.91 | 33.65 | 24.0 | 1494.0 | 494.0 | 814.0 | 459.0 | 2.1074 | 181300.0 | <1H OCEAN | True | False | False | False |
Binning (Discretization / Grouping)¶
Why It Is Important¶
- Encode non-linear relationships into discrete intervals for models assuming linearity.
- Reduce noise by aggregating continuous values into stable groups.
- Absorb outliers into boundary bins to limit their influence.
- Improve interpretability via human-readable intervals.
When to Use¶
Apply binning when:
- Using models with linear assumptions:
- Linear Regression
- Logistic Regression
- Using models that operate on discrete distributions:
- Naive Bayes
- Continuous feature exhibits:
- Non-linear relationship with target
- High variance / noise
- Outliers
Avoid or deprioritize when:
- Using tree-based models:
- Random Forest
- XGBoost
- LightGBM
- These models perform implicit threshold-based splits (dynamic binning)
- Using neural networks:
- Learn non-linear patterns directly
- Using distance-based models:
- KNN
- K-means
- Binning in this case:
  - Destroys distance semantics
  - Makes nearby values appear unrelated
Key Constraint¶
- Binning introduces information loss:
  - Replaces continuous variation with discrete intervals
  - May reduce predictive precision
- Manual binning in tree-based models:
  - Duplicates internal splitting logic
  - Can degrade performance by removing fine-grained thresholds
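The binning step can be sketched with pandas. The first bin reproduces the `median_age_less_than_30` indicator; the second shows the more general interval binning with `pd.cut` (bin edges and labels are illustrative):

```python
import pandas as pd

# Hypothetical slice of housing_median_age values.
df = pd.DataFrame({"housing_median_age": [36.0, 30.0, 23.0, 27.0, 11.0]})

# Simple threshold bin, as used for median_age_less_than_30.
df["median_age_less_than_30"] = (df["housing_median_age"] < 30).astype(int)
print(df["median_age_less_than_30"].tolist())  # [0, 0, 1, 1, 1]

# Interval binning with pd.cut is the more general alternative.
df["age_bin"] = pd.cut(df["housing_median_age"], bins=[0, 15, 30, 60],
                       labels=["new", "mid", "old"])
print(df["age_bin"].tolist())  # ['old', 'mid', 'mid', 'mid', 'new']
```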
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | NEAR BAY | NEAR OCEAN | median_age_less_than_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.16 | 37.76 | 36.0 | 2781.0 | 574.0 | 1438.0 | 519.0 | 2.4598 | 155500.0 | NEAR BAY | False | False | True | False | 0 |
| 1 | -119.95 | 36.80 | 30.0 | 1233.0 | 214.0 | 620.0 | 199.0 | 3.4297 | 112500.0 | INLAND | False | True | False | False | 0 |
| 2 | -118.49 | 34.42 | 23.0 | 4166.0 | 756.0 | 2082.0 | 743.0 | 4.4107 | 213400.0 | <1H OCEAN | True | False | False | False | 1 |
| 3 | -122.24 | 37.79 | 27.0 | 1632.0 | 492.0 | 1171.0 | 429.0 | 2.3173 | 125000.0 | NEAR BAY | False | False | True | False | 1 |
| 4 | -121.45 | 36.86 | 11.0 | 1613.0 | 335.0 | 1617.0 | 342.0 | 3.1375 | 146200.0 | INLAND | False | True | False | False | 1 |
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | NEAR BAY | NEAR OCEAN | median_age_less_than_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -118.07 | 33.87 | 28.0 | 2399.0 | 436.0 | 1613.0 | 429.0 | 3.6339 | 220100.0 | <1H OCEAN | True | False | False | False | 1 |
| 1 | -118.26 | 34.02 | 40.0 | 1259.0 | 362.0 | 1499.0 | 327.0 | 1.8382 | 126400.0 | <1H OCEAN | True | False | False | False | 0 |
| 2 | -118.51 | 34.16 | 23.0 | 11154.0 | 1995.0 | 4076.0 | 1809.0 | 5.4609 | 500001.0 | <1H OCEAN | True | False | False | False | 1 |
| 3 | -120.04 | 36.95 | 36.0 | 1528.0 | 347.0 | 1334.0 | 304.0 | 1.3594 | 48300.0 | INLAND | False | True | False | False | 0 |
| 4 | -117.91 | 33.65 | 24.0 | 1494.0 | 494.0 | 814.0 | 459.0 | 2.1074 | 181300.0 | <1H OCEAN | True | False | False | False | 1 |
Clustering (Unsupervised Grouping)¶
Why It Is Important¶
- Partition data into groups of similar data points without labels.
- Discover latent structure not explicitly defined:
- Customer segments
- Fraud patterns
- User behavior
- Enable feature engineering via cluster-derived features:
- Cluster ID
- Distance to cluster centers
- Support data compression and summarization:
- Reduce large datasets to representative groups
- Enable anomaly detection:
- Identify points far from cluster structure (e.g., fraud, system anomalies)
When to Use¶
Apply clustering when:
- Labels are unavailable
- Data is expected to contain natural group structure
- Distance/similarity metrics are meaningful
- Use cases include:
- Customer behavior analysis
- Transaction pattern detection
- System monitoring
Use clustering outputs as features when:
- Enhancing downstream models:
- Linear models
- Tree models
- Need to introduce non-linear structure into simpler models
- (Clustering acts similarly to binning for linear models)
Combine with preprocessing when:
- Applying standardization (ensure valid distance computation)
- Applying PCA (reduce dimensionality before clustering)
Preprocess data¶
- Apply standardization (scale features)
- Optionally apply PCA (reduce dimensionality)
Key Constraints¶
- Clustering assumes meaningful distance/similarity metrics
- Performance degrades in:
- High-dimensional spaces (curse of dimensionality)
- Results depend on:
- Algorithm choice (K-means, DBSCAN, hierarchical)
- Data distribution and scaling
- Often requires:
- Standardization (mandatory for distance-based clustering)
- PCA (optional, improves clustering in high dimensions)
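The standardize-then-cluster recipe can be sketched with scikit-learn (synthetic longitude/latitude-like data with two obvious groups; coordinates are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two well-separated synthetic groups of (longitude, latitude) points.
group_a = rng.normal([-122.0, 37.8], 0.1, size=(100, 2))
group_b = rng.normal([-118.2, 34.0], 0.1, size=(100, 2))
X = np.vstack([group_a, group_b])

# Standardize first so both coordinates contribute equally to distances.
model = make_pipeline(StandardScaler(),
                      KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(X)

# The cluster IDs can then be one-hot encoded into extra features.
print(np.unique(labels))  # [0 1]
```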
array([[ True, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, True, False],
...,
[False, False, True, ..., False, False, False],
[False, False, False, ..., False, True, False],
[False, False, False, ..., False, True, False]], shape=(17000, 7))
Feature Selection (Subset Selection / Dimensionality Reduction via Selection)¶
Why It Is Important¶
- Reduce overfitting by eliminating features that enable memorization of noise.
- Improve model performance by retaining only informative features.
- Reduce computational cost (training/inference).
- Improve interpretability by simplifying feature space.
- Handle multicollinearity by selecting representative features and removing redundant ones.
When to Use¶
Apply feature selection when:
- Feature space is large (> 50–100 features).
- Using models sensitive to:
- Linear relationships:
- Linear Regression
- Logistic Regression
- Distance/similarity metrics:
- KNN
- K-Means
- Using models sensitive to:
- Noise and dimensionality:
- SVM
- Interpretability is required.
Model-specific benefits:
- Linear / Logistic Regression:
- Improves stability and coefficient interpretability
- KNN / K-Means:
- Improves distance quality (removes irrelevant dimensions)
- SVM:
- Reduces noise and accelerates training
Methods¶
Filter Methods (model-agnostic):
- Compute statistical metrics per feature:
- Correlation
- Variance Threshold
- Mutual Information
- Remove features based on thresholds
Wrapper Methods (model-dependent):
- Iterate over feature subsets
- Train model and evaluate performance
- Example:
- Recursive Feature Elimination (RFE)
Embedded Methods (model-integrated):
- Perform selection during training:
- L1 regularization (Lasso): drives coefficients to zero
- Tree-based feature importance
- Boruta algorithm
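The three method families above can be sketched side by side (synthetic data where the target depends on the first two features and the third is pure noise):

```python
import numpy as np
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)

# y depends on features 0 and 1; feature 2 is noise.
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=300)

# Filter: mutual information scores each feature independently.
mi = mutual_info_regression(X, y, random_state=1)
print(mi.round(2))  # noise feature scores near 0

# Wrapper: RFE iteratively drops the least important feature.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(rfe.support_)  # [ True  True False]

# Embedded: the L1 penalty shrinks the noise coefficient to (near) zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))
```

All three agree here because the signal is strong; on real data they can disagree, which is why the trade-off table below matters.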
| Method | What It Does | How It Works | Strengths | Limitations / Constraints | Best Use Case |
|---|---|---|---|---|---|
| Correlation (Pearson) | Removes redundant linear relationships | Computes pairwise correlation, drops one of highly correlated features | Simple, effective for multicollinearity | Only captures linear relationships | Linear models, multicollinearity control |
| Mutual Information | Captures dependency with target | Measures information gain between feature and target | Detects non-linear relationships | Slower, less interpretable than correlation | General feature relevance (non-linear) |
| RFE (Recursive Feature Elimination) | Selects optimal subset via iterative pruning | Trains model, removes least important features iteratively | High-quality subset selection | Computationally expensive | Medium feature sets, high accuracy requirement |
| L1 (Lasso) | Enforces sparsity (zero coefficients) | Adds L1 penalty → minimizes loss + λ·Σ\|w\| → coefficients shrink to zero | Efficient, built-in selection | Unstable with correlated features | Linear models, high-dimensional data |
| Tree Feature Importance | Ranks features by predictive contribution | Aggregates impurity reduction (Gini/MSE) across splits | Captures non-linearities, interactions | Bias toward high-cardinality features | Tree-based models (RF, XGBoost, LightGBM) |
| Permutation Importance | Measures impact on model performance | Shuffles feature → measures performance drop | Model-agnostic, more reliable importance | Computationally expensive | Post-training validation of importance |
| Boruta | Identifies all relevant features (all-relevant) | Compares real vs shadow features using RF importance | Robust, avoids missing weak signals | Expensive, slower | High-stakes feature selection |
Key Constraints¶
- Removing features may:
- Discard useful signal if improperly configured
- Wrapper methods:
- Require cross-validation to control overfitting
- Embedded methods:
- Depend on model assumptions:
- L1 => sparsity assumption
- Trees => split-based importance bias
- Feature selection must be:
- Applied consistently in training and inference pipelines
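One way to guarantee consistent application at training and inference time is to put the selector inside the pipeline itself (a minimal sketch with synthetic data and a hypothetical `SelectKBest` step):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] + 2 * X_train[:, 1] + rng.normal(0, 0.1, size=200)
X_new = rng.normal(size=(3, 5))

# Selection fitted inside the pipeline: the same columns chosen during
# training are automatically applied to any data passed at inference.
model = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
model.fit(X_train, y_train)
print(model.predict(X_new).shape)  # (3,)
```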