Machine Learning Design Principles¶
- Ownership
- Security controls
- Fault tolerance
- Recoverability
- Reusability
- Reproducibility
- Resource optimization
- CI/CD, CT (Continuous training)
- Monitoring & Analysis
- Sustainability (minimize environmental impact)
Machine Learning Lifecycle¶
```mermaid
flowchart
    A[Business Goal] e1@--> B[ML Problem Framing]
    B e2@--> C[Data Processing]
    C e3@--> D[Model Development]
    D e4@--> E[Deployment]
    E e5@--> F[Monitoring]
    F e6@--> A
    e1@{ animate: true }
    e2@{ animate: true }
    e3@{ animate: true }
    e4@{ animate: true }
    e5@{ animate: true }
    e6@{ animate: true }
```
Business Goal¶
- Discuss and agree on the level of model explainability
- Monitor model compliance with business requirements
- Validate the data permissions, privacy and license terms
- Determine key performance indicators
- Define overall return on investment (ROI) and opportunity cost
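The ROI bullet above can be made concrete with a simple calculation. This is an illustrative sketch with made-up figures, not a prescribed formula:

```python
def ml_roi(annual_benefit, build_cost, annual_run_cost, years):
    """Simple ROI: (total benefit - total cost) / total cost."""
    total_cost = build_cost + annual_run_cost * years
    total_benefit = annual_benefit * years
    return (total_benefit - total_cost) / total_cost

# Hypothetical numbers: $500k/yr benefit, $300k to build, $100k/yr to run, 3 years
roi = ml_roi(annual_benefit=500_000, build_cost=300_000,
             annual_run_cost=100_000, years=3)  # 1.5, i.e. 150% ROI
```

Opportunity cost can be folded in by subtracting the return of the next-best alternative from `annual_benefit`.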
Machine Learning Problem Framing¶
- Establish machine learning roles and responsibilities
- Prepare a profile template
- Document resources required
- Establish model improvement strategies
- Experiments
- Hyper-parameter optimization
- Establish a lineage tracker system
- Pipelines
- Feature Store
- Model Registry
- Establish feedback loops across ML lifecycle phases
- Model monitoring
- Review fairness and explainability
- Design data encryption and obfuscation
- PII
- Masking
- Use APIs to abstract model changes so they do not break consuming applications
- API Gateway
- Adopt a machine learning microservice strategy
- Serverless functions
- Serverless containers
- Define relevant evaluation metrics
- Identify if machine learning is the right solution
- Consider AI services and pre-trained models
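"Define relevant evaluation metrics" can be sketched in plain Python for a binary classifier. Function and variable names here are illustrative, not from any particular library:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Which metric matters depends on the business goal framed above, e.g. recall for fraud detection, precision for spam filtering.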
Architecture:
```mermaid
flowchart
    subgraph Process Data
    A[Collect Data]
    end
    subgraph Prepare Data
    B[Preprocess Data]
    C[Feature Engineering]
    end
    D[Train, Tune & Evaluate]
    E[Deploy]
    F[Monitor]
    G[Alarm Manager]
    H[Scheduler]
    I[Model Registry]
    subgraph Feature Stores
    J[Online Feature Store]
    K[Offline Feature Store]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> |Detect Drift, etc|G
    G --> |Run monitoring on schedule|H
    G --> |Performance Feedback loop Ex: Adjust bias|B
    G --> |Active Learning Loop Ex: New data|D
    A --> |Store artifacts|I
    D --> |Store Model Version|I
    I --> |Fetch Artifacts|E
    C --> |Store Features|J
    J --> |Copy to Offline|K
    K --> |Batch Inference|E
    J --> |Fetch Features|D
    K --> |Fetch Features|D
```
Data Processing¶
- Profile data to improve quality
- Data wrangling
- Data exploration
- Create tracking and version control mechanisms
- Model Registry
- Experiments
- Code versioning on Git
- Ensure least privilege access
- Secure data and modeling environment
- Protect sensitive data
- Enforce data lineage
- Keep relevant data
- Remove PII
- Use a data catalogue
- Use a data pipeline
- Automate managing data changes (MLOps)
- Use a modern data architecture (data lake)
- Use managed data labeling
- Use data wrangler tools for interactive analysis
- Enable feature reusability
- Feature Store
- Minimize Idle resources
- Implement data lifecycle policies
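"Profile data to improve quality" can start as simply as measuring missingness per column. A minimal pure-Python sketch (the function name and dict-of-rows layout are illustrative):

```python
def profile_missing(rows, columns):
    """Fraction of missing (None) values per column, for a list of row dicts."""
    n = len(rows)
    return {col: sum(1 for row in rows if row.get(col) is None) / n
            for col in columns}

rows = [{"age": 34, "city": None},
        {"age": None, "city": "Lima"},
        {"age": 28, "city": "Quito"}]
report = profile_missing(rows, ["age", "city"])
```

In practice this is where data wrangler tools and a data catalogue take over, but the idea is the same: quantify quality before modeling.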
Data Collection¶
- Label: Set target variable values
- Ingest: Can be stream, batch, micro-batch, CDC (change data capture), event-driven, API-based, log-based, manual or other methods
- Aggregate: Data can come from multiple sources
Data Preparation¶
Data Preprocessing:
- Clean: Missing data & Outliers
- Partition: Partition by dimension for efficient access
- Scale: Decide whether a distributed system like Spark is needed
- Unbias & Balance: Deal with over-representation of classes
- Augment: Add new or additional data
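The "Clean" step (missing data and outliers) can be sketched for a single numeric column. This is an illustrative approach, median imputation plus z-score clipping, not the only valid one:

```python
import statistics

def clean_column(values, z_thresh=3.0):
    """Impute None with the median, then clip values beyond z_thresh std devs."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    out = []
    for v in values:
        v = med if v is None else v
        if sd and abs(v - mean) / sd > z_thresh:
            # Clip the outlier to the threshold boundary rather than dropping it
            v = mean + z_thresh * sd * (1 if v > mean else -1)
        out.append(v)
    return out

cleaned = clean_column([1, 2, None, 3])  # [1, 2, 2, 3]
```

Whether to impute, clip, or drop depends on the data and the model; the point is to make the choice explicit and repeatable.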
Feature Engineering:
- Feature Selection: Which features are most important
- Feature Transformation: Normalization, Encoding, etc
- Feature Creation: Combine or transform the features you have into new ones
- Feature Extraction: Derive structured information from raw fields, e.g. from an address field
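Two of the transformations above, min-max normalization and one-hot encoding, fit in a few lines of plain Python (illustrative helpers; in practice a library or a Feature Store transform would do this):

```python
def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

scaled = min_max_scale([10, 20, 30])          # [0.0, 0.5, 1.0]
color = one_hot("red", ["red", "green", "blue"])  # [1, 0, 0]
```

Storing such transforms alongside the features (as noted under "Enable feature reusability") keeps training and inference consistent.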
Model Development: Training and Tuning¶
- Automate operations through MLOps and CI/CD
- Establish reliable packaging patterns to access approved public libraries
- Container registry
- Secured governed Machine Learning environment
- Encrypted service communications
- Protect against data poisoning threats
- Enable CI/CD/CT automation with traceability
- Ensure feature consistency across training and inference
- Establish data bias detection and mitigation
- Optimize training and inference instance types
- Establish a model performance evaluation pipeline
- Model Registry
- Establish feature statistics
- Model monitoring
- Experiments
- Perform a performance trade-off analysis
- Accuracy vs complexity
- Bias vs variance
- Precision vs recall
- Tests with experiments
- Detect performance issues when using transfer learning
- Select optimal computing instance size
- Use managed build environments
- Select local training for small scale experiments
- Select an optimal ML framework (PyTorch, TensorFlow, scikit-learn)
- Use automated machine learning
- Use distributed training
- Stop cloud resources when not in use
- Start training with small datasets
- Use warm-start and checkpointing hyperparameter tuning
- Define sustainable performance criteria
- Select energy-efficient algorithms
- Archive or delete unnecessary training artifacts
- Use efficient model tuning methods
- Bayesian or hyperband, not random or grid search
- Limit concurrent training jobs
- Tune only most important hyperparameters
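The "efficient model tuning" bullets favor hyperband-style methods over random or grid search. The core idea, successive halving, can be sketched in pure Python (names and the toy score function are illustrative; real tuners like SageMaker AMT or Optuna handle this for you):

```python
def successive_halving(configs, evaluate, budget=1, eta=3):
    """Evaluate all configs on a small budget, keep the top 1/eta,
    then repeat with eta times the budget until one config remains."""
    survivors = list(configs)
    while len(survivors) > 1:
        scored = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda s: s[0], reverse=True)  # higher score is better
        survivors = [cfg for _, cfg in scored[: max(1, len(survivors) // eta)]]
        budget *= eta  # surviving configs earn more training budget
    return survivors[0]

# Toy example: the "hyperparameter" is an int; the best value is 7.
best = successive_halving(range(10), lambda cfg, budget: -abs(cfg - 7))
```

This spends most compute on promising configurations, which is also why it pairs well with the sustainability bullets (fewer wasted training hours).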
Deployment¶
- Establish deployment environment metrics
- Event buses
- Topics
- Monitoring tools
- Protect against adversarial and malicious activities
- Use an appropriate deployment and testing strategy
- Blue/Green deployments
- Canary deployments
- Linear deployments
- A/B Testing
- Evaluate cloud vs edge options
- Choose an optimal deployment in the cloud
- Real-time, serverless, asynchronous, batch
- Right-size model hosting instance
- Align SLAs
- Latency vs serverless/batch/asynchronous deployments
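A canary deployment, as listed above, routes a small, sticky slice of traffic to the new model version. A minimal routing sketch using a stable hash so a given request id always hits the same variant (names are illustrative; real routing lives in the load balancer or endpoint config):

```python
import zlib

def route(request_id: str, canary_percent: int = 10) -> str:
    """Assign a request to 'canary' or 'stable' by hashing its id into
    buckets 0-99; the mapping is deterministic across processes."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return "canary" if bucket < canary_percent else "stable"

variant = route("user-1234")  # same user always gets the same variant
```

If the canary's metrics hold up, `canary_percent` is ramped toward 100 (a linear deployment); if not, it drops to 0 with no redeploy.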
Monitoring¶
- Enable model observability and tracking
- Synchronize architecture and configuration, and check for skew across environments
- Restrict access to intended legitimate consumers
- Secure inference endpoints
- Monitor human interactions with data for anomalous activity
- Allow for automatic scaling of the model endpoint
- Ensure a recoverable endpoint with a managed version control
- Git
- Container Registry
- Evaluate model explainability
- Evaluate data drift
- Monitor, detect and handle model performance degradation
- Establish an automated re-training framework
- Jenkins
- GitHub Actions
- AWS Step Functions
- Review updated data/features for retraining
- Include human-in-the-loop monitoring
- Monitor usage and costs by ML activity
- Monitor return on investment for ML models
- Monitor endpoint usage and right-size the instance
- Measure material efficiency
- Measure provisioned resources / business outcome
- Retrain only when necessary
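"Evaluate data drift" is often implemented with the Population Stability Index (PSI) between a training baseline and live traffic. A pure-Python sketch (bin count and thresholds are conventional choices, not fixed rules):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for v in sample:
            counts[sum(v > e for e in edges)] += 1
        n = len(sample)
        return [max(c / n, 1e-6) for c in counts]  # floor to avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = list(range(100))
no_drift = psi(baseline, list(range(100)))        # ~0
big_drift = psi(baseline, [v + 50 for v in baseline])  # well above 0.25
```

Feeding a PSI threshold into the alarm manager from the architecture diagram closes the loop: drift triggers the automated re-training framework, so retraining only happens when necessary.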