Training¶
A machine learning model is essentially a mathematical function that learns patterns from historical data so it can make predictions about new, unseen data.
It takes features (inputs) and produces a prediction (output).
Example:
| Age | Income | Loan Default |
|---|---|---|
| 25 | 3000 | No |
| 45 | 9000 | No |
| 30 | 2000 | Yes |
The model learns patterns like:
"People with lower income and certain age ranges have a higher probability of default."
After training, when a new example arrives, the model uses the learned patterns to estimate the outcome.
The dataset is typically split into three parts to ensure the model learns properly and generalizes to new data.
- Train
- Validation (optional but common)
- Test
Dataset Splits¶
Train¶
The training dataset is used to teach the model the patterns in the data.
During training, the model:
- Makes a prediction
- Compares the prediction to the correct answer (label)
- Calculates the error (loss)
- Adjusts its internal parameters to reduce that error
This process repeats many times until the model learns stable patterns.
Note
The model is not memorizing the data. It is adjusting parameters to minimize prediction error.
Example: A logistic regression model adjusts its weights so the predicted probability of default becomes closer to the real label.
Validation¶
The validation dataset is used to evaluate the model during training.
It helps answer questions like:
- Is the model improving?
- Is the model overfitting the training data?
- Which hyperparameters work best?
The validation set is never used to update the model weights.
It is only used to measure performance and guide decisions like:
- Choosing hyperparameters
- Early stopping
- Selecting the best model version
Example hyperparameters:
- learning rate
- number of trees
- tree depth
- regularization strength
Test¶
The test dataset is used for the final unbiased evaluation of the model.
It simulates real world unseen data.
Note
The test dataset must never influence training decisions.
Otherwise the model may leak information and the evaluation becomes unreliable.
The test metric represents the expected production performance.
Choosing the Algorithm¶
Choosing the algorithm means selecting the type of mathematical model that will learn from the data.
Different algorithms learn patterns in different ways.
Examples:
| Algorithm | Typical Use |
|---|---|
| Linear Regression | Predict continuous numbers |
| Logistic Regression | Binary classification |
| Decision Trees | Interpretable rule-based models |
| Random Forest | Robust general-purpose model |
| Gradient Boosting (XGBoost, LightGBM) | High performance tabular data |
| Neural Networks | Complex patterns like images, text |
How to Choose an Algorithm¶
1. Problem Type¶
The task determines the model family.
| Problem | Example | Model Type |
|---|---|---|
| Regression | Predict house price | Linear regression, XGBoost |
| Classification | Fraud detection | Logistic regression, Random Forest |
| Ranking | Search results | Gradient boosting |
| NLP | Text classification | Transformers |
| Computer Vision | Image detection | CNN |
2. Data Characteristics¶
The nature of the dataset strongly influences model choice.
Important questions:
- How large is the dataset?
- Are features mostly numerical or categorical?
- Is the data tabular, image, text, or time-series?
Example:
Tabular business datasets usually perform best with:
- Gradient Boosting
- Random Forest
Deep learning is often unnecessary.
3. Interpretability Requirements¶
Some domains require models to be explainable.
Examples:
| Industry | Requirement |
|---|---|
| Finance | Must explain credit decisions |
| Healthcare | Must justify predictions |
| Insurance | Regulatory transparency |
In these cases engineers may prefer:
- Logistic Regression
- Decision Trees
- Explainable Gradient Boosting
Instead of black-box deep learning.
4. Performance vs Complexity¶
Some models are simple but fast.
Others are powerful but complex.
Example trade-off:
| Model | Pros | Cons |
|---|---|---|
| Logistic Regression | Simple, interpretable | Limited complexity |
| Random Forest | Good performance | Larger model |
| Gradient Boosting | Excellent accuracy | Slower training |
| Neural Networks | Extremely powerful | Hard to interpret |
A good cientist starts simple first and increases complexity only if necessary.
Training the Model¶
Once the algorithm is chosen, the model must learn the parameters that best fit the data.
Training generally follows this loop:
- Input features into the model
- Model produces a prediction
- Compare prediction with the true label
- Compute a loss function
- Update model parameters to reduce loss
This process is called optimization.
Example:
Gradient Descent is the algorithm used to train many ML models by minimizing prediction error (loss).
Evaluating the Model¶
After training, the model is evaluated using metrics that match the business problem.
Examples:
| Task | Metrics |
|---|---|
| Classification | Accuracy, Precision, Recall, F1 |
| Fraud Detection | ROC-AUC, Precision-Recall |
| Regression | RMSE, MAE |
| Ranking | NDCG |
The goal is not only high performance, but stable performance on unseen data.
A model that performs extremely well on training data but poorly on validation/test data is overfitting.
Model Versioning¶
Every training run must be reproducible and traceable.
A training pipeline should version:
- Code (Git)
- Dataset (data version)
- Features / transformations
- Hyperparameters
- Model artifact
- Evaluation metrics
This guarantees that any model can be retrained or audited later.
Models are typically stored in a Model Registry, which tracks model versions and lifecycle stages.
flowchart LR
A[Experiment] e1@ --> B[Staging]
B e2@ --> C[Production]
C e3@ --> D[Archived]
%% Animation
e1@{ animate: true }
e2@{ animate: true }
e3@{ animate: true }
Automated Training Pipeline (CI/CD)¶
In production systems, training runs through an automated pipeline, not manual scripts.
Typical pipeline:
flowchart LR
A[Data] e1@ --> B[Feature Engineering]
B e2@ --> C[Train]
C e3@ --> D[Evaluate]
D e4@ --> E[Register Model]
%% Animation
e1@{ animate: true }
e2@{ animate: true }
e3@{ animate: true }
e4@{ animate: true }
Key practices:
-
CI (Continuous Integration)
- Validates training code
- Runs tests and data checks
- Ensures pipeline reproducibility
-
CD (Continuous Delivery)
- Promotes models automatically if metrics pass thresholds
- Registers model in the Model Registry
- Triggers deployment workflows
Example flow:
flowchart LR
A[Code change / new data] e1@ --> B[Training pipeline]
B e2@ --> C[Model evaluation]
C e3@ --> D[Metric threshold check]
D e4@ --> E[Model Registry]
E e5@ --> F[Deployment candidate]
%% Animation
e1@{ animate: true }
e2@{ animate: true }
e3@{ animate: true }
e4@{ animate: true }
e5@{ animate: true }
Summary¶
Training a model is not just running an algorithm.
It is a systematic process of:
- Preparing data
- Choosing the right algorithm
- Training and tuning the model
- Evaluating generalization
- Selecting the best version for production
This disciplined approach ensures the model performs reliably when deployed in real-world systems.