Credit Risk Score¶
- Binary classification
- \(1\): Default
- \(0\): No default
Dataset:
kaggle-credit-scoring or github-credit-scoring
Install packages
Append notebooks directory to sys.path
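A minimal sketch of the `sys.path` step, assuming the utility scripts live in a sibling `notebooks` directory (the actual layout may differ):

```python
import sys
from pathlib import Path

# Hypothetical layout: shared utility scripts live in a sibling "notebooks" directory.
notebooks_dir = str(Path.cwd().parent / "notebooks")
if notebooks_dir not in sys.path:
    sys.path.append(notebooks_dir)

print(notebooks_dir in sys.path)  # -> True
```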
Utility scripts:¶
KaggleDataExtractor:
Create data directory
Download dataset from Kaggle
Pass notebook variables to shell command
"Status","Seniority","Home","Time","Age","Marital","Records","Job","Expenses","Income","Assets","Debt","Amount","Price"
1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
1,0,1,36,26,1,1,1,46,107,0,0,310,910
1,1,2,60,36,2,1,1,75,214,3500,0,650,1645
1,29,2,60,44,2,1,1,75,125,10000,0,1600,1800
1,9,5,12,27,1,1,1,35,80,0,0,200,1093
1,0,2,60,32,2,1,3,90,107,15000,0,1200,1957
Data Preparation¶
Load dataset
| | Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
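The load step could look like this. The sketch reads an inline sample in the same shape as the CSV preview above; in the notebook, `pd.read_csv` would point at the downloaded file instead:

```python
import pandas as pd
from io import StringIO

# A few rows in the same shape as the downloaded CSV (illustrative sample,
# not the full file; in the notebook this would be the dataset path).
csv_sample = (
    '"Status","Seniority","Home","Time","Age","Marital","Records","Job",'
    '"Expenses","Income","Assets","Debt","Amount","Price"\n'
    '1,9,1,60,30,2,1,3,73,129,0,0,800,846\n'
    '1,17,1,60,58,3,1,1,48,131,0,0,1000,1658\n'
    '2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985\n'
)

df = pd.read_csv(StringIO(csv_sample))
print(df.shape)  # -> (3, 14)
```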
Inspect all columns at once
| | 0 | 1 | 2 |
|---|---|---|---|
| Status | 1 | 1 | 2 |
| Seniority | 9 | 17 | 10 |
| Home | 1 | 1 | 2 |
| Time | 60 | 60 | 36 |
| Age | 30 | 58 | 46 |
| Marital | 2 | 3 | 2 |
| Records | 1 | 1 | 2 |
| Job | 3 | 1 | 3 |
| Expenses | 73 | 48 | 90 |
| Income | 129 | 131 | 200 |
| Assets | 0 | 0 | 3000 |
| Debt | 0 | 0 | 0 |
| Amount | 800 | 1000 | 2000 |
| Price | 846 | 1658 | 2985 |
Data summary
| | column | dtype | sample_unique | n_unique |
|---|---|---|---|---|
| 0 | Status | int64 | [1, 2, 0] | 3 |
| 1 | Seniority | int64 | [9, 17, 10, 0, 1, 29] | 47 |
| 2 | Home | int64 | [1, 2, 5, 3, 6, 4] | 7 |
| 3 | Time | int64 | [60, 36, 12, 48, 18, 24] | 11 |
| 4 | Age | int64 | [30, 58, 46, 24, 26, 36] | 50 |
| 5 | Marital | int64 | [2, 3, 1, 4, 5, 0] | 6 |
| 6 | Records | int64 | [1, 2] | 2 |
| 7 | Job | int64 | [3, 1, 2, 0, 4] | 5 |
| 8 | Expenses | int64 | [73, 48, 90, 63, 46, 75] | 94 |
| 9 | Income | int64 | [129, 131, 200, 182, 107, 214] | 353 |
| 10 | Assets | int64 | [0, 3000, 2500, 3500, 10000, 15000] | 160 |
| 11 | Debt | int64 | [0, 2500, 260, 2000, 500, 99999999] | 183 |
| 12 | Amount | int64 | [800, 1000, 2000, 900, 310, 650] | 285 |
| 13 | Price | int64 | [846, 1658, 2985, 1325, 910, 1645] | 1419 |
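A summary like the one above can be assembled by hand. The sketch below uses a tiny stand-in frame; in the notebook `df` is the full credit-scoring table:

```python
import pandas as pd

# Tiny stand-in frame; in the notebook df is the full credit-scoring table.
df = pd.DataFrame({"Status": [1, 1, 2, 0], "Records": [1, 1, 2, 2]})

# One row per column: dtype, a sample of its unique values, and the unique count.
summary = pd.DataFrame({
    "column": df.columns,
    "dtype": [df[c].dtype for c in df.columns],
    "sample_unique": [df[c].unique()[:6].tolist() for c in df.columns],
    "n_unique": [df[c].nunique() for c in df.columns],
})
```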
Clean column names
| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
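The cleaning step is a one-liner; a sketch with a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"Status": [1], "Seniority": [9]})

# Lower-case the column names so the rest of the notebook uses a uniform style.
df.columns = df.columns.str.lower()
print(list(df.columns))  # -> ['status', 'seniority']
```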
Decode number variables
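A sketch of the decoding step. The code-to-label mappings are recovered from the decoded rows shown in this notebook; codes that never appear there (the remaining home/marital/job levels, and any `unk` placeholders) are assumptions that would need checking against the dataset's documentation:

```python
import pandas as pd

df = pd.DataFrame({
    "status": [1, 1, 2],
    "home": [1, 1, 2],
    "marital": [2, 3, 2],
    "records": [1, 1, 2],
    "job": [3, 1, 3],
})

# Mappings visible in this notebook's decoded output; other codes are assumptions.
status_values = {0: "unk", 1: "ok", 2: "default"}
home_values = {1: "rent", 2: "owner"}
marital_values = {1: "single", 2: "married", 3: "widow"}
records_values = {1: "no", 2: "yes"}
job_values = {1: "fixed", 2: "partime", 3: "freelance"}

for col, mapping in [("status", status_values), ("home", home_values),
                     ("marital", marital_values), ("records", records_values),
                     ("job", job_values)]:
    df[col] = df[col].map(mapping)
```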
Inspect decoding results
| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |
Inspect values range
| | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 763317.0 | 1060341.0 | 404382.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 8703625.0 | 10217569.0 | 6344253.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3500.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 166.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 99999999.0 | 99999999.0 | 99999999.0 | 5000.0 | 11140.0 |
Check series for large numbers
Replace values
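The replacement step can be sketched as follows. `99999999` acts as a missing-value sentinel in `income`, `assets`, and `debt` (visible as the max in the summary above); swapping it for `NaN` removes it from `describe()`'s counts and extremes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [129, 131, 99999999],
    "assets": [0, 3000, 99999999],
    "debt": [0, 0, 99999999],
})

# 99999999 is a missing-value sentinel in these columns; NaN drops it
# from count/mean/max in describe().
for col in ["income", "assets", "debt"]:
    df[col] = df[col].replace(99999999, np.nan)

print(df["income"].max())  # -> 131.0
```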
Check if values were replaced
| | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4421.0 | 4408.0 | 4437.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 131.0 | 5403.0 | 343.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 86.0 | 11573.0 | 1246.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3000.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 165.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 959.0 | 300000.0 | 30000.0 | 5000.0 | 11140.0 |
Check for status values
Remove unlabeled data
Verify removal
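A sketch of the filtering step, assuming undecoded statuses ended up as `"unk"`:

```python
import pandas as pd

df = pd.DataFrame({"status": ["ok", "default", "unk"], "income": [129, 200, 107]})

# Rows whose status could not be decoded carry no usable label; drop them.
df = df[df["status"] != "unk"].reset_index(drop=True)
print(len(df))  # -> 2
```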
Split datasets
- 60% train
- 20% validation
- 20% test
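The 60/20/20 split is usually done with two chained `train_test_split` calls: hold out 20% for test first, then take 25% of the remaining 80% for validation:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10)})

# 20% out first, then 25% of the remaining 80% -> 60/20/20 overall.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train), len(df_val), len(df_test))  # -> 6 2 2
```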
Drop indexes
Encode labels to integer
- Default: 1
- Ok: 0
Drop target column status
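The two label steps together, as a sketch:

```python
import pandas as pd

df_train = pd.DataFrame({"status": ["ok", "default", "ok"], "income": [129, 200, 107]})

# Target: 1 for default, 0 for ok; then remove status so it cannot leak into the features.
y_train = (df_train["status"] == "default").astype(int).values
del df_train["status"]

print(y_train.tolist())  # -> [0, 1, 0]
```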
Decision Tree¶
Example of a simple decision tree
```mermaid
flowchart TD
    A[Start] --> B{Debt > Income?}
    B -- Yes --> C[Default]
    B -- No --> D{Income < 100?}
    D -- Yes --> C
    D -- No --> E[OK]
```
One Hot Encoding
Train Decision Tree
Validation
Prediction on validation dataset
Prediction on train dataset
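A sketch of training and scoring, on synthetic stand-in features (in the notebook these come from the `DictVectorizer`). Note that an unconstrained tree scores a perfect AUC on its own training data, which is exactly the symptom the next section discusses:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in features; in the notebook these come from the DictVectorizer.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = rng.integers(0, 2, size=200)
X_val = rng.normal(size=(50, 3))
y_val = rng.integers(0, 2, size=50)

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# ROC AUC is computed on the probability of the positive class.
val_auc = roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1])
train_auc = roc_auc_score(y_train, dt.predict_proba(X_train)[:, 1])
print(train_auc)  # a fully grown tree fits the training data perfectly -> 1.0
```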
Overfitting¶
The model memorizes the training data but fails to generalize: when new data arrives, it does not know how to handle it.
This can happen when the tree is too deep, letting the model learn every combination present in the training set.
Train the tree with hyperparameter of max_depth
New prediction on validation dataset
New prediction on train dataset
```
|--- records=no <= 0.50
| |--- seniority <= 6.50
| | |--- amount <= 862.50
| | | |--- class: 0
| | |--- amount > 862.50
| | | |--- class: 1
| |--- seniority > 6.50
| | |--- income <= 103.50
| | | |--- class: 1
| | |--- income > 103.50
| | | |--- class: 0
|--- records=no > 0.50
| |--- job=partime <= 0.50
| | |--- income <= 74.50
| | | |--- class: 0
| | |--- income > 74.50
| | | |--- class: 0
| |--- job=partime > 0.50
| | |--- assets <= 8750.00
| | | |--- class: 1
| | |--- assets > 8750.00
| | | |--- class: 0
```
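A rules dump like the one above comes from sklearn's `export_text`. A minimal sketch on synthetic data (the real call would pass the one-hot feature names from the vectorizer):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic example; the real dump uses the one-hot feature names.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X, y)

print(export_text(dt, feature_names=["income", "assets"]))
```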
Decision Trees parameter tuning¶
max_depth
min_samples_leaf
Both max_depth and min_samples_leaf
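The joint sweep is a pair of nested loops collecting one AUC per combination. A sketch on synthetic stand-in data (in the notebook X/y come from the prepared dataset):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data; in the notebook X/y come from the prepared dataset.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 3))
y_train = (X_train[:, 0] + rng.normal(size=300) > 0).astype(int)
X_val = rng.normal(size=(100, 3))
y_val = (X_val[:, 0] + rng.normal(size=100) > 0).astype(int)

scores = []
for depth in [4, 5, 6]:
    for leaf in [1, 2, 5, 10, 15]:
        dt = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf)
        dt.fit(X_train, y_train)
        auc = roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1])
        scores.append((depth, leaf, auc))

df_scores = pd.DataFrame(scores,
                         columns=["max_depth", "min_samples_leaf", "roc_auc_score"])
```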
| | max_depth | min_samples_leaf | roc_auc_score |
|---|---|---|---|
| 0 | 4 | 1 | 0.761283 |
| 1 | 4 | 2 | 0.761283 |
| 2 | 4 | 5 | 0.761283 |
| 3 | 4 | 10 | 0.761283 |
| 4 | 4 | 15 | 0.763726 |
Improve visualization
| min_samples_leaf \ max_depth | 4 | 5 | 6 |
|---|---|---|---|
| 1 | 0.761 | 0.766 | 0.751 |
| 2 | 0.761 | 0.767 | 0.765 |
| 5 | 0.761 | 0.768 | 0.762 |
| 10 | 0.761 | 0.762 | 0.778 |
| 15 | 0.764 | 0.772 | 0.785 |
| 20 | 0.761 | 0.774 | 0.774 |
| 100 | 0.756 | 0.763 | 0.776 |
| 200 | 0.747 | 0.759 | 0.768 |
| 500 | 0.680 | 0.680 | 0.680 |
Visualize in a heatmap
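Pivoting the scores frame gives the grid shown above, and the heatmap is then a single call. A self-contained sketch with a few illustrative scores:

```python
import pandas as pd

df_scores = pd.DataFrame({
    "max_depth": [4, 4, 5, 5],
    "min_samples_leaf": [1, 2, 1, 2],
    "roc_auc_score": [0.761283, 0.761283, 0.766, 0.767],
})

# Rows = min_samples_leaf, columns = max_depth, matching the table above.
df_pivot = df_scores.pivot(index="min_samples_leaf",
                           columns="max_depth",
                           values="roc_auc_score").round(3)

# The heatmap is then one call (requires seaborn):
# import seaborn as sns
# sns.heatmap(df_pivot, annot=True, fmt=".3f")
```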
Train with newer parameters
Ensembles¶
Combining multiple models into one stronger predictor
Random Forest¶
Prediction with Random Forest
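A sketch of the Random Forest step on synthetic stand-in data; `n_estimators` controls how many trees are averaged:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data; in the notebook X/y come from the prepared dataset.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_val = rng.normal(size=(100, 3))
y_val = (X_val[:, 0] > 0).astype(int)

# n_estimators controls how many trees are averaged; random_state pins the run.
rf = RandomForestClassifier(n_estimators=10, random_state=1)
rf.fit(X_train, y_train)

y_pred = rf.predict_proba(X_val)[:, 1]
print(round(roc_auc_score(y_val, y_pred), 3))
```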
Get scores for Decision Tree
| | n_estimators | roc_auc_score |
|---|---|---|
| 0 | 10 | 0.774473 |
| 1 | 20 | 0.803532 |
| 2 | 30 | 0.815075 |
| 3 | 40 | 0.815686 |
| 4 | 50 | 0.817082 |
Plot parameters
Add depth parameter
| | max_depth | n_estimators | roc_auc_score |
|---|---|---|---|
| 0 | 5 | 10 | 0.787699 |
| 1 | 5 | 20 | 0.797731 |
| 2 | 5 | 30 | 0.800305 |
| 3 | 5 | 40 | 0.799708 |
| 4 | 5 | 50 | 0.799878 |
Plot parameters
Choose better parameter
Verifying best value for min_samples_leaf
| | min_samples_leaf | n_estimators | roc_auc_score |
|---|---|---|---|
| 0 | 1 | 10 | 0.791365 |
| 1 | 1 | 20 | 0.808496 |
| 2 | 1 | 30 | 0.811584 |
| 3 | 1 | 40 | 0.817839 |
| 4 | 1 | 50 | 0.817058 |
Plotting min_samples_leaf
Defining best value for min_samples_leaf
Training the model with new hyperparameter
Boosting¶
Models are trained sequentially, with each model correcting the errors of the previous one
Train model
Check ROC AUC
Watchlist for evaluation
Get evaluation outputs from training
Parse xgb training metrics output
Parse metrics
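The captured stdout follows xgboost's default `[iter]\ttrain-auc:…\tval-auc:…` format, so parsing is plain string splitting. A sketch on an illustrative sample of that output:

```python
import pandas as pd

# Captured stdout in xgboost's default eval format (illustrative sample).
output = (
    "[0]\ttrain-auc:0.86653\tval-auc:0.77999\n"
    "[10]\ttrain-auc:0.95512\tval-auc:0.81115\n"
    "[20]\ttrain-auc:0.97648\tval-auc:0.81877\n"
)

rows = []
for line in output.strip().split("\n"):
    num_iter, train_part, val_part = line.split("\t")
    rows.append({
        "num_iter": int(num_iter.strip("[]")),
        "train_auc": float(train_part.split(":")[1]),
        "val_auc": float(val_part.split(":")[1]),
    })

df_metrics = pd.DataFrame(rows)
print(df_metrics.loc[2, "val_auc"])  # -> 0.81877
```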
| | num_iter | train_auc | val_auc |
|---|---|---|---|
| 0 | 0 | 0.86653 | 0.77999 |
| 1 | 10 | 0.95512 | 0.81115 |
| 2 | 20 | 0.97648 | 0.81877 |
| 3 | 30 | 0.98844 | 0.81613 |
| 4 | 40 | 0.99393 | 0.81407 |
Plot metrics
Plot only validation
Parameter tuning¶
Tuning Learning Rate (ETA)
Plotting results
Tuning Max Depth
Plot scores
Plotting filtered scores
Fixing best max_depth
Tuning Min child weight
Plot scores
Setting best min_child_weight
Selecting the final model¶
Decision Tree
Evaluate Decision Tree
Random forest
Evaluate Random Forest
XGBoost
Evaluate XGBoost
Train best model with all training data
Transform full training and test datasets
Create DMatrix for XGBoost
Train model
Evaluate chosen model