Car Price Prediction
Linear Regression
Dataset:
Install packages
Append notebooks directory to sys.path
Import packages
Utility scripts:
KaggleDataExtractor:
Create data directory
Download dataset from Kaggle
Data Preparation
Load dataset
| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
Clean column names
| | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
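The renaming shown in the table above can be sketched as follows. The notebook's own code is not visible here, so this is a minimal sketch that assumes the cleaning lowercases each name and replaces spaces with underscores; the two-column frame is a stand-in for the real dataset.

```python
import pandas as pd

# Tiny stand-in frame using two of the original column names.
df = pd.DataFrame({"Engine HP": [335.0], "Driven_Wheels": ["rear wheel drive"]})

# Lowercase every column name and replace spaces with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(df.columns))
```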
Inspect DataFrame types
make object
model object
year int64
engine_fuel_type object
engine_hp float64
engine_cylinders float64
transmission_type object
driven_wheels object
number_of_doors float64
market_category object
vehicle_size object
vehicle_style object
highway_mpg int64
city_mpg int64
popularity int64
msrp int64
dtype: object
Select only object type columns
Clean column values
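A minimal sketch of these two steps, assuming the value cleaning mirrors the column-name cleaning (lowercase, spaces to underscores). The small frame and the name `string_columns` are illustrative, not taken from the notebook.

```python
import pandas as pd

# Stand-in frame; dtype=object mirrors the dtypes listing above.
df = pd.DataFrame({
    "make": pd.Series(["BMW", "Rolls-Royce"], dtype=object),
    "engine_fuel_type": pd.Series(
        ["premium unleaded (required)", "regular unleaded"], dtype=object
    ),
    "engine_hp": [335.0, 453.0],
})

# Select only the columns whose dtype is object (the string columns).
string_columns = list(df.dtypes[df.dtypes == "object"].index)

# Normalize the values the same way as the column names.
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(" ", "_", regex=False)

print(df)
```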
Exploratory Data Analysis
Column summary
| | column | dtype | sample_unique | n_unique |
|---|---|---|---|---|
| 0 | make | object | [bmw, audi, fiat, mercedes-benz, chrysler, nis... | 48 |
| 1 | model | object | [1_series_m, 1_series, 100, 124_spider, 190-cl... | 914 |
| 2 | year | int64 | [2011, 2012, 2013, 1992, 1993, 1994] | 28 |
| 3 | engine_fuel_type | object | [premium_unleaded_(required), regular_unleaded... | 10 |
| 4 | engine_hp | float64 | [335.0, 300.0, 230.0, 320.0, 172.0, 160.0] | 356 |
| 5 | engine_cylinders | float64 | [6.0, 4.0, 5.0, 8.0, 12.0, 0.0] | 9 |
| 6 | transmission_type | object | [manual, automatic, automated_manual, direct_d... | 5 |
| 7 | driven_wheels | object | [rear_wheel_drive, front_wheel_drive, all_whee... | 4 |
| 8 | number_of_doors | float64 | [2.0, 4.0, 3.0, nan] | 3 |
| 9 | market_category | object | [factory_tuner,luxury,high-performance, luxury... | 71 |
| 10 | vehicle_size | object | [compact, midsize, large] | 3 |
| 11 | vehicle_style | object | [coupe, convertible, sedan, wagon, 4dr_hatchba... | 16 |
| 12 | highway_mpg | int64 | [26, 28, 27, 25, 24, 20] | 59 |
| 13 | city_mpg | int64 | [19, 20, 18, 17, 16, 26] | 69 |
| 14 | popularity | int64 | [3916, 3105, 819, 617, 1013, 2009] | 48 |
| 15 | msrp | int64 | [46135, 40650, 36350, 29450, 34500, 31200] | 6049 |
Price distribution
The price distribution has a long tail: a few cars are priced very high.
Convert the price to a log scale to reduce skewness. When training a machine learning model, a skewed target variable can lead to suboptimal performance.
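One way to apply the log transform is sketched below. Using `np.log1p` (log of 1 + x, which stays defined for a price of 0) is an assumption; the notebook may use a plain logarithm instead. The prices are the first values listed for `msrp` in the column summary.

```python
import numpy as np

# First few prices from the msrp column.
msrp = np.array([46135, 40650, 36350, 29450])

# log1p compresses the long tail of high prices.
log_price = np.log1p(msrp)

# expm1 is the inverse, used later to turn predictions back into prices.
recovered = np.expm1(log_price)

print(log_price)
```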
Missing values
Validation Framework
Set split sizes
- Training dataset: 60%
- Validation dataset: 20%
- Test dataset: 20%
Split DataFrame into train, validation and test sets
| | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity | msrp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 | 14410 |
| 1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 | 19685 |
Define prediction target
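The split and target steps above can be sketched like this. The shuffling seed, the ten-row stand-in frame, and the use of `np.log1p` for the target are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in data: ten rows with one feature and the price.
df = pd.DataFrame({"engine_hp": np.arange(10.0), "msrp": np.arange(10) * 1000})

# 20% validation, 20% test, the remainder (60%) for training.
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Shuffle row positions so each split is a random sample (seed is arbitrary).
rng = np.random.default_rng(2)
idx = rng.permutation(n)

df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# The target is the log-transformed price.
y_train = np.log1p(df_train["msrp"].values)
y_val = np.log1p(df_val["msrp"].values)
y_test = np.log1p(df_test["msrp"].values)

# Drop the price from the feature frames so it cannot leak into training.
for part in (df_train, df_val, df_test):
    del part["msrp"]
```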
Linear Regression
Linear Regression formula:
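The formula itself did not survive in this export. A standard way to write it, with \(w_0\) as the bias term, \(w_j\) as the weight of feature \(j\), and \(x_{ij}\) as the value of feature \(j\) for observation \(i\) (the symbol names other than \(w_0\) are assumptions), is:

\(g(x_i) = w_0 + \sum_{j=1}^{n} w_j \cdot x_{ij}\)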
Choosing an observation
make rolls-royce
model phantom_drophead_coupe
year 2015
engine_fuel_type premium_unleaded_(required)
engine_hp 453.0
engine_cylinders 12.0
transmission_type automatic
driven_wheels rear_wheel_drive
number_of_doors 2.0
market_category exotic,luxury,performance
vehicle_size large
vehicle_style convertible
highway_mpg 19
city_mpg 11
popularity 86
Name: 10, dtype: object
Selecting features
Defining weights
- bias term (w0): the baseline prediction, i.e. how much a car should cost if you know nothing about it.
Defining linear regression formula
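A minimal sketch of the formula for one observation, using the `engine_hp`, `city_mpg` and `popularity` values of the car chosen above. The weights are made-up numbers for illustration, not the notebook's values, and the result is on the log-price scale.

```python
# Feature values of the chosen observation: engine_hp, city_mpg, popularity.
xi = [453.0, 11.0, 86.0]

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
w0 = 7.17
w = [0.01, 0.04, 0.002]


def linear_regression(xi, w0, w):
    """Return the bias plus the weighted sum of the features."""
    pred = w0
    for j in range(len(w)):
        pred += w[j] * xi[j]
    return pred


log_pred = linear_regression(xi, w0, w)
print(log_pred)
```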
Linear Regression Vector form
Defining weights
Select observations
Define the linear regression formula in vector form
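In vector form the loop becomes a dot product. The sketch below prepends a column of ones so the bias term is handled by the same product; the weights are again hypothetical, and the three rows are taken from the observations used in this notebook.

```python
import numpy as np

# Hypothetical bias and weights (engine_hp, city_mpg, popularity).
w0 = 7.17
w = np.array([0.01, 0.04, 0.002])
w_new = np.concatenate([[w0], w])

# Three observations, one per row.
X = np.array([
    [148.0, 24.0, 1385.0],
    [132.0, 25.0, 2031.0],
    [453.0, 11.0, 86.0],
])

# A leading column of ones multiplies the bias term.
ones = np.ones(X.shape[0])
X_new = np.column_stack([ones, X])

# One matrix-vector product yields a prediction for every row.
preds = X_new.dot(w_new)
print(preds)
```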
Training a Linear Regression model
Get observations
array([[ 148., 24., 1385.],
[ 132., 25., 2031.],
[ 148., 28., 640.],
[ 90., 16., 873.],
[ 385., 15., 5657.],
[ 170., 22., 873.],
[ 500., 14., 520.],
[ 315., 21., 3916.],
[ 543., 10., 67.],
[ 202., 13., 5657.],
[ 453., 11., 86.],
[ 182., 20., 1385.],
[ 162., 20., 436.],
[ 553., 16., 2774.],
[ 272., 18., 3105.],
[ 160., 29., 640.],
[ 348., 17., 1439.],
[ 151., 18., 436.],
[ 210., 15., 1851.],
[ 164., 18., 5657.]])
\(X^{t} \cdot X\)
Create an inverse matrix
Check that \((X^{t} \cdot X) \cdot (X^{t} \cdot X)^{-1}\) is an identity matrix
Calculate the weights (coefficients) of the regression model.
Separate the weights for each feature and the bias term
Define train function
Train to get the weights
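The steps above correspond to the normal equation, \(w = (X^{t} \cdot X)^{-1} \cdot X^{t} \cdot y\). The following is a minimal sketch of a train function under that assumption; the function name and the synthetic data are illustrative.

```python
import numpy as np


def train_linear_regression(X, y):
    """Solve the normal equation and return (bias, feature weights)."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])      # column of ones for the bias term
    XTX = X.T.dot(X)                    # Gram matrix X^t . X
    XTX_inv = np.linalg.inv(XTX)        # its inverse
    w_full = XTX_inv.dot(X.T).dot(y)    # all weights, bias first
    return w_full[0], w_full[1:]


# Synthetic data generated from known weights, so the result can be checked.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X.dot(np.array([1.0, -3.0, 0.5]))

w0, w = train_linear_regression(X, y)

# Sanity check: the Gram matrix times its inverse is (numerically) the identity.
X1 = np.column_stack([np.ones(X.shape[0]), X])
XTX = X1.T.dot(X1)
is_identity = np.allclose(XTX.dot(np.linalg.inv(XTX)), np.eye(XTX.shape[0]))

print(w0, w, is_identity)
```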
Baseline Model
RMSE
Root Mean Square Error
- Get the difference between each actual and predicted value
- Square each difference
- Take the average of the squared differences
- Take the square root of the average
Define RMSE function
Calculate the RMSE for actual and predicted values
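A minimal sketch of the RMSE function, following the steps listed above. The example arrays are made up.

```python
import numpy as np


def rmse(y, y_pred):
    """Root mean square error between actual and predicted values."""
    error = y - y_pred              # difference per observation
    mse = (error ** 2).mean()       # average of the squared differences
    return np.sqrt(mse)             # back to the units of the target


y_actual = np.array([1.0, 2.0, 3.0])
y_predicted = np.array([1.0, 2.0, 5.0])
score = rmse(y_actual, y_predicted)
print(score)
```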
Validating the model
Feature Engineering
Get current year to calculate car age
Prepare the features
Apply some feature engineering
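The age feature can be sketched as below. Several details are assumptions: the export does not show whether the "current year" is the calendar year or the latest year in the dataset, so the sketch takes it as a `reference_year` parameter (2017 is only an illustrative value); the base feature list and filling missing values with 0 are also assumed.

```python
import pandas as pd

# Assumed base numeric features.
BASE = ["engine_hp", "city_mpg", "popularity"]


def prepare_X(df, reference_year):
    """Return the feature matrix with an added car-age column."""
    df = df.copy()                              # leave the caller's frame untouched
    df["age"] = reference_year - df["year"]     # older cars get a larger age
    features = BASE + ["age"]
    return df[features].fillna(0).values        # assumption: missing values become 0


df = pd.DataFrame({
    "year": [2011, 2015],
    "engine_hp": [335.0, None],
    "city_mpg": [19, 18],
    "popularity": [3916, 2031],
})

X = prepare_X(df, reference_year=2017)
print(X)
```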
Check if model has improved
Categorical features
Add number of doors
Check if RMSE had any improvement
Inspect make attribute
Add Make to preparation
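Categorical values such as the number of doors or the make can be fed to a linear model as binary (one-hot) columns. This is a minimal sketch; the column naming scheme and keeping only the five most frequent makes are assumptions, not necessarily what the notebook does.

```python
import pandas as pd

df = pd.DataFrame({
    "number_of_doors": [2.0, 4.0, None],
    "make": ["bmw", "toyota", "bmw"],
})

features = []

# One binary column per door count; a missing value yields 0 in all of them.
for v in [2, 3, 4]:
    name = f"num_doors_{v}"
    df[name] = (df["number_of_doors"] == v).astype(int)
    features.append(name)

# One binary column per frequent make (assumed cutoff: top 5).
top_makes = list(df["make"].value_counts().head(5).index)
for m in top_makes:
    name = f"make_{m}"
    df[name] = (df["make"] == m).astype(int)
    features.append(name)

print(df[features])
```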
Check if RMSE had any improvement
Add other categorical features
Check if RMSE had any improvement
Regularization
Add a penalty to the model so the coefficients don't grow too large
Extend linear regression function with regularization
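A common way to add this penalty is to add a small value `r` to the diagonal of \(X^{t} \cdot X\) before inverting it (ridge regression). The sketch below assumes that approach; the function name, the default `r`, and the synthetic data with a duplicated column are illustrative.

```python
import numpy as np


def train_linear_regression_reg(X, y, r=0.001):
    """Normal equation with r added to the diagonal of the Gram matrix."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    XTX = XTX + r * np.eye(XTX.shape[0])   # penalty term (also applied to the bias here)
    w_full = np.linalg.inv(XTX).dot(X.T).dot(y)
    return w_full[0], w_full[1:]


# A duplicated column makes the unregularized Gram matrix singular;
# with the penalty the weights stay finite and are shared between the copies.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
X = np.column_stack([x, x])
y = 1.0 + 2.0 * x

w0, w = train_linear_regression_reg(X, y, r=0.01)
print(w0, w)
```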
Check if RMSE had any improvement
Tuning the model
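The export does not show which parameter is tuned. A plausible sketch, assuming the regularization strength `r` is chosen by comparing RMSE on the validation set, looks like this; the candidate values and the synthetic data are made up.

```python
import numpy as np


def train_linear_regression_reg(X, y, r):
    """Regularized normal equation; returns (bias, feature weights)."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w_full = np.linalg.inv(XTX).dot(X.T).dot(y)
    return w_full[0], w_full[1:]


def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())


# Synthetic train and validation sets drawn from the same linear rule plus noise.
rng = np.random.default_rng(1)
true_w = np.array([0.5, -1.0, 2.0])
X_train = rng.normal(size=(60, 3))
X_val = rng.normal(size=(20, 3))
y_train = 3.0 + X_train.dot(true_w) + rng.normal(scale=0.1, size=60)
y_val = 3.0 + X_val.dot(true_w) + rng.normal(scale=0.1, size=20)

# Train once per candidate r and record the validation RMSE.
scores = {}
for r in [0.0, 0.0001, 0.001, 0.01, 0.1, 1, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, r)
    scores[r] = rmse(y_val, w0 + X_val.dot(w))

# Keep the r with the lowest validation error.
best_r = min(scores, key=scores.get)
print(best_r, scores[best_r])
```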
Using the model
Concatenate train and validation datasets
| | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet | cobalt | 2008 | regular_unleaded | 148.0 | 4.0 | manual | front_wheel_drive | 2.0 | NaN | compact | coupe | 33 | 24 | 1385 |
| 1 | toyota | matrix | 2012 | regular_unleaded | 132.0 | 4.0 | automatic | front_wheel_drive | 4.0 | hatchback | compact | 4dr_hatchback | 32 | 25 | 2031 |
Concatenate train and validation targets
Train model with complete dataset (train + validation) then predict y_test
Inspect a single observation
{'make': 'toyota',
'model': 'sienna',
'year': 2015,
'engine_fuel_type': 'regular_unleaded',
'engine_hp': 266.0,
'engine_cylinders': 6.0,
'transmission_type': 'automatic',
'driven_wheels': 'front_wheel_drive',
'number_of_doors': 4.0,
'market_category': nan,
'vehicle_size': 'large',
'vehicle_style': 'passenger_minivan',
'highway_mpg': 25,
'city_mpg': 18,
'popularity': 2031}
Predict its value
Compare with actual value
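The final steps can be sketched end to end as follows. Everything here is illustrative: the frames are tiny stand-ins, the targets are generated from a made-up rule (log price = 9 + 0.005 × engine_hp) rather than real prices, and undoing the log with `np.expm1` assumes the target was built with `np.log1p`. Only the `engine_hp` and `city_mpg` values of the single car come from the observation shown above.

```python
import numpy as np
import pandas as pd


def train_linear_regression(X, y):
    """Normal equation; returns (bias, feature weights)."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    w_full = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w_full[0], w_full[1:]


# Stand-in train and validation frames.
df_train = pd.DataFrame({"engine_hp": [148.0, 132.0, 385.0], "city_mpg": [24, 25, 15]})
df_val = pd.DataFrame({"engine_hp": [170.0, 500.0], "city_mpg": [22, 14]})

# Synthetic log-price targets from a made-up rule, so the fit can be checked.
y_train = 9.0 + 0.005 * df_train["engine_hp"].values
y_val = 9.0 + 0.005 * df_val["engine_hp"].values

# Concatenate train and validation data and targets.
df_full_train = pd.concat([df_train, df_val]).reset_index(drop=True)
y_full_train = np.concatenate([y_train, y_val])

# Train on the combined data.
w0, w = train_linear_regression(df_full_train.values, y_full_train)

# Predict for a single car given as a dict, then undo the log transform.
car = {"engine_hp": 266.0, "city_mpg": 18}
X_car = pd.DataFrame([car])[["engine_hp", "city_mpg"]].values
log_pred = w0 + X_car.dot(w)[0]
price_pred = np.expm1(log_pred)

print(log_pred, price_pred)
```

To compare with the actual value, apply `np.expm1` to the corresponding entry of the test target in the same way, so both numbers are in price units.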