Machine Learning Design Principles¶
- Ownership
- Security controls
- Fault tolerance
- Recoverability
- Reusability
- Reproducibility
- Resource optimization
- CI/CD, CT (Continuous training)
- Monitoring & Analysis
- Sustainability (minimize environmental impact)
Machine Learning Lifecycle¶
```mermaid
flowchart
    A[Business Goal] e1@--> B[ML Problem Framing]
    B e2@--> C[Data Processing]
    C e3@--> D[Model Development]
    D e4@--> E[Deployment]
    E e5@--> F[Monitoring]
    F e6@--> A
    e1@{ animate: true }
    e2@{ animate: true }
    e3@{ animate: true }
    e4@{ animate: true }
    e5@{ animate: true }
    e6@{ animate: true }
```
Business Goal¶
- Discuss and agree on the level of model explainability
- Monitor model compliance with business requirements
- Validate the data permissions, privacy and license terms
- Determine key performance indicators
- Define overall return on investment (ROI) and opportunity cost
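The ROI bullet above can be made concrete with a simple calculation. This is an illustrative sketch with made-up figures, not a prescribed formula:

```python
def ml_roi(annual_benefit, build_cost, annual_run_cost, years):
    """Simple ROI: (total benefit - total cost) / total cost."""
    total_cost = build_cost + annual_run_cost * years
    total_benefit = annual_benefit * years
    return (total_benefit - total_cost) / total_cost

# Hypothetical numbers: $500k/yr benefit, $300k to build, $100k/yr to run, 3 years
roi = ml_roi(annual_benefit=500_000, build_cost=300_000,
             annual_run_cost=100_000, years=3)  # 1.5, i.e. 150% ROI
```

Opportunity cost can be folded in by subtracting the return of the next-best alternative from `annual_benefit`.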
Machine Learning Problem Framing¶
- Establish machine learning roles and responsibilities
- Prepare a profile template
- Document resources required
- Establish model improvement strategies
- Experiments
- Hyper-parameter optimization
- Establish a lineage tracker system
- Pipelines
- Feature Store
- Model Registry
- Establish feedback loops across ML lifecycle phases
- Model monitoring
- Review fairness and explainability
- Design data encryption and obfuscation
- PII
- Masking
- Use APIs to abstract model changes so they do not break consuming applications
- API Gateway
- Adopt a machine learning microservice strategy
- Serverless functions
- Serverless containers
- Define relevant evaluation metrics
- Identify if machine learning is the right solution
- Consider AI services and pre-trained models
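"Define relevant evaluation metrics" can be sketched in plain Python for a binary classifier. Function and variable names here are illustrative, not from any particular library:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Which metric matters depends on the business goal framed above, e.g. recall for fraud detection, precision for spam filtering.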
Architecture:
```mermaid
flowchart
    subgraph Process Data
    A[Collect Data]
    end
    subgraph Prepare Data
    B[Preprocess Data]
    C[Feature Engineering]
    end
    D[Train, Tune & Evaluate]
    E[Deploy]
    F[Monitor]
    G[Alarm Manager]
    H[Scheduler]
    I[Model Registry]
    subgraph Feature Stores
    J[Online Feature Store]
    K[Offline Feature Store]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> |Detect Drift, etc|G
    G --> |Run monitoring on schedule|H
    G --> |Performance Feedback loop Ex: Adjust bias|B
    G --> |Active Learning Loop Ex: New data|D
    A --> |Store artifacts|I
    D --> |Store Model Version|I
    I --> |Fetch Artifacts|E
    C --> |Store Features|J
    J --> |Copy to Offline|K
    K --> |Batch Inference|E
    J --> |Fetch Features|D
    K --> |Fetch Features|D
```
Data Processing¶
- Profile data to improve quality
- Data wrangling
- Data exploration
- Create tracking and version control mechanisms
- Model Registry
- Experiments
- Code versioning on Git
- Ensure least privilege access
- Secure data and modeling environment
- Protect sensitive data
- Enforce data lineage
- Keep relevant data
- Remove PII
- Use a data catalogue
- Use a data pipeline
- Automate managing data changes (MLOps)
- Use a modern data architecture (data lake)
- Use managed data labeling
- Use data wrangler tools for interactive analysis
- Enable feature reusability
- Feature Store
- Minimize Idle resources
- Implement data lifecycle policies
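"Profile data to improve quality" can start as simply as measuring missingness per column. A minimal pure-Python sketch (the function name and dict-of-rows layout are illustrative):

```python
def profile_missing(rows, columns):
    """Fraction of missing (None) values per column, for a list of row dicts."""
    n = len(rows)
    return {col: sum(1 for row in rows if row.get(col) is None) / n
            for col in columns}

rows = [{"age": 34, "city": None},
        {"age": None, "city": "Lima"},
        {"age": 28, "city": "Quito"}]
report = profile_missing(rows, ["age", "city"])
```

In practice this is where data wrangler tools and a data catalogue take over, but the idea is the same: quantify quality before modeling.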
Data Collection¶
- Label: Set target variable values
- Ingest: Can be stream, batch, micro-batch, CDC (change data capture), event-driven, API-based, log-based, manual or other methods
- Aggregate: Data can come from multiple sources
Data Preparation¶
Data Preprocessing:
- Clean: Missing data & Outliers
- Partition: Partition by dimension for efficient access
- Scale: Decide whether a distributed system like Spark is needed
- Unbias & Balance: Deal with over-representation of classes
- Augment: Add new or additional data
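The "Clean" step (missing data and outliers) can be sketched for a single numeric column. This is an illustrative approach, median imputation plus z-score clipping, not the only valid one:

```python
import statistics

def clean_column(values, z_thresh=3.0):
    """Impute None with the median, then clip values beyond z_thresh std devs."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    out = []
    for v in values:
        v = med if v is None else v
        if sd and abs(v - mean) / sd > z_thresh:
            # Clip the outlier to the threshold boundary rather than dropping it
            v = mean + z_thresh * sd * (1 if v > mean else -1)
        out.append(v)
    return out

cleaned = clean_column([1, 2, None, 3])  # [1, 2, 2, 3]
```

Whether to impute, clip, or drop depends on the data and the model; the point is to make the choice explicit and repeatable.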
Feature Engineering:
- Feature Selection: Which features are most important
- Feature Transformation: Normalization, Encoding, etc
- Feature Creation: Combine or transform the features you have into new ones
- Feature Extraction: Derive structured information from raw fields, e.g. from an address field
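Two of the transformations above, min-max normalization and one-hot encoding, fit in a few lines of plain Python (illustrative helpers; in practice a library or a Feature Store transform would do this):

```python
def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

scaled = min_max_scale([10, 20, 30])          # [0.0, 0.5, 1.0]
color = one_hot("red", ["red", "green", "blue"])  # [1, 0, 0]
```

Storing such transforms alongside the features (as noted under "Enable feature reusability") keeps training and inference consistent.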
Model Development: Training and Tuning¶
- Automate operations through MLOps and CI/CD
- Establish reliable packaging patterns to access approved public libraries
- Container registry
- Secured governed Machine Learning environment
- Encrypted service communications
- Protect against data poisoning threats
- Enable CI/CD/CT automation with traceability
- Ensure feature consistency across training and inference
- Establish data bias detection and mitigation
- Optimize training and inference instance types
- Establish a model performance evaluation pipeline
- Model Registry
- Establish feature statistics
- Model monitoring
- Experiments
- Perform a performance trade-off analysis
- Accuracy vs complexity
- Bias vs variance
- Precision vs recall
- Tests with experiments
- Detect performance issues when using transfer learning
- Select optimal computing instance size
- Use managed build environments
- Select local training for small scale experiments
- Select an optimal ML framework (PyTorch, TensorFlow, scikit-learn)
- Use automated machine learning
- Use distributed training
- Stop cloud resources when not in use
- Start training with small datasets
- Use warm-start and checkpointing hyperparameter tuning
- Define sustainable performance criteria
- Select energy-efficient algorithms
- Archive or delete unnecessary training artifacts
- Use efficient model tuning methods
- Bayesian or hyperband, not random or grid search
- Limit concurrent training jobs
- Tune only most important hyperparameters
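The "efficient model tuning" bullets favor hyperband-style methods over random or grid search. The core idea, successive halving, can be sketched in pure Python (names and the toy score function are illustrative; real tuners like SageMaker AMT or Optuna handle this for you):

```python
def successive_halving(configs, evaluate, budget=1, eta=3):
    """Evaluate all configs on a small budget, keep the top 1/eta,
    then repeat with eta times the budget until one config remains."""
    survivors = list(configs)
    while len(survivors) > 1:
        scored = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda s: s[0], reverse=True)  # higher score is better
        survivors = [cfg for _, cfg in scored[: max(1, len(survivors) // eta)]]
        budget *= eta  # surviving configs earn more training budget
    return survivors[0]

# Toy example: the "hyperparameter" is an int; the best value is 7.
best = successive_halving(range(10), lambda cfg, budget: -abs(cfg - 7))
```

This spends most compute on promising configurations, which is also why it pairs well with the sustainability bullets (fewer wasted training hours).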
Deployment¶
- Establish deployment environment metrics
- Event buses
- Topics
- Monitoring tools
- Protect against adversarial and malicious activities
- Use an appropriate deployment and testing strategy
- Blue/Green deployments
- Canary deployments
- Linear deployments
- A/B Testing
- Evaluate cloud vs edge options
- Choose an optimal deployment in the cloud
- Real-time, serverless, asynchronous, batch
- Right-size model hosting instance
- Align SLAs
- Latency vs serverless/batch/asynchronous deployments
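A canary deployment, as listed above, routes a small, sticky slice of traffic to the new model version. A minimal routing sketch using a stable hash so a given request id always hits the same variant (names are illustrative; real routing lives in the load balancer or endpoint config):

```python
import zlib

def route(request_id: str, canary_percent: int = 10) -> str:
    """Assign a request to 'canary' or 'stable' by hashing its id into
    buckets 0-99; the mapping is deterministic across processes."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return "canary" if bucket < canary_percent else "stable"

variant = route("user-1234")  # same user always gets the same variant
```

If the canary's metrics hold up, `canary_percent` is ramped toward 100 (a linear deployment); if not, it drops to 0 with no redeploy.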
Monitoring¶
- Enable model observability and tracking
- Synchronize architecture and configuration, and check for skew across environments
- Restrict access to intended legitimate consumers
- Secure inference endpoints
- Monitor human interactions with data for anomalous activity
- Allow for automatic scaling of the model endpoint
- Ensure a recoverable endpoint with a managed version control
- Git
- Container Registry
- Evaluate model explainability
- Evaluate data drift
- Monitor, detect and handle model performance degradation
- Establish an automated re-training framework
- Jenkins
- GitHub Actions
- AWS Step Functions
- Review updated data/features for retraining
- Include human-in-the-loop monitoring
- Monitor usage and costs by ML activity
- Monitor return on investment for ML models
- Monitor endpoint usage and right-size the instance
- Measure material efficiency
- Measure provisioned resources / business outcome
- Retrain only when necessary
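"Evaluate data drift" is often implemented with the Population Stability Index (PSI) between a training baseline and live traffic. A pure-Python sketch (bin count and thresholds are conventional choices, not fixed rules):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for v in sample:
            counts[sum(v > e for e in edges)] += 1
        n = len(sample)
        return [max(c / n, 1e-6) for c in counts]  # floor to avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = list(range(100))
no_drift = psi(baseline, list(range(100)))        # ~0
big_drift = psi(baseline, [v + 50 for v in baseline])  # well above 0.25
```

Feeding a PSI threshold into the alarm manager from the architecture diagram closes the loop: drift triggers the automated re-training framework, so retraining only happens when necessary.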