Machine Learning Design Principles

  • Ownership
  • Security controls
  • Fault tolerance
  • Recoverability
  • Reusability
  • Reproducibility
  • Resource optimization
  • CI/CD, CT (Continuous training)
  • Monitoring & Analysis
  • Sustainability (minimize environmental impact)

Machine Learning Lifecycle

flowchart
  A[Business Goal] e1@--> B[ML Problem Framing]
  B e2@--> C[Data Processing]
  C e3@--> D[Model Development]
  D e4@--> E[Deployment]
  E e5@--> F[Monitoring]
  F e6@--> A

  e1@{ animate: true }
  e2@{ animate: true }
  e3@{ animate: true }
  e4@{ animate: true }
  e5@{ animate: true }
  e6@{ animate: true }

Business Goal

  • Discuss and agree on the level of model explainability
  • Monitor model compliance with business requirements
  • Validate the data permissions, privacy and license terms
  • Determine key performance indicators
  • Define overall return on investment (ROI) and opportunity cost
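
The ROI definition above is just arithmetic; a minimal sketch with illustrative, made-up numbers:

```python
def roi(benefit: float, total_cost: float) -> float:
    """Return on investment as a ratio: (benefit - cost) / cost."""
    return (benefit - total_cost) / total_cost

# Illustrative numbers only: a model saving $120k/yr against
# $80k/yr of combined build and run cost.
print(roi(benefit=120_000, total_cost=80_000))  # 0.5, i.e. 50% ROI
```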

Machine Learning Problem Framing

  • Establish Machine Learning roles and responsibilities
  • Prepare a profile template
    • Document resources required
  • Establish model improvement strategies
    • Experiments
    • Hyper-parameter optimization
  • Establish a lineage tracker system
    • Pipelines
    • Feature Store
    • Model Registry
  • Establish feedback loops across ML lifecycle phases
    • Model monitoring
  • Review fairness and explainability
  • Design data encryption and obfuscation
    • PII
    • Masking
  • Use APIs to shield consuming applications from breaking model changes
    • API Gateway
  • Adopt a machine learning microservice strategy
    • Serverless functions
    • Serverless containers
  • Define relevant evaluation metrics
  • Identify if machine learning is the right solution
  • Consider AI services and pre-trained models
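
Defining relevant evaluation metrics could be sketched like this (pure Python for clarity; the fraud-detection framing and the labels are hypothetical):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical fraud-detection labels: in that framing, recall on the
# positive (fraud) class usually matters more than raw accuracy.
p, r = precision_recall([0, 0, 1, 1, 1, 0, 1, 0],
                        [0, 0, 1, 0, 1, 0, 1, 1])
print(p, r)  # 0.75 0.75
```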

Architecture:

flowchart

subgraph Process Data
    A[Collect Data]
end

subgraph Prepare Data
    B[Preprocess Data]
    C[Feature Engineering]
end

D[Train, Tune & Evaluate]
E[Deploy]
F[Monitor]
G[Alarm Manager]
H[Scheduler]
I[Model Registry]

subgraph Feature Stores
    J[Online Feature Store]
    K[Offline Feature Store]
end

A --> B
B --> C
C --> D
D --> E
E --> F
F --> |Detect Drift, etc|G
G --> |Run monitoring on schedule|H
G --> |Performance Feedback loop Ex: Adjust bias|B
G --> |Active Learning Loop Ex: New data|D
A --> |Store artifacts|I
D --> |Store Model Version|I
I --> |Fetch Artifacts|E
C --> |Store Features|J
J --> |Copy to Offline|K
K --> |Batch Inference|E
J --> |Fetch Features|D
K --> |Fetch Features|D

Data Processing

  • Profile data to improve quality
    • Data wrangling
    • Data exploration
  • Create tracking and version control mechanisms
    • Model Registry
    • Experiments
    • Code versioning on Git
  • Ensure least privilege access
  • Secure data and modeling environment
  • Protect sensitive data
  • Enforce data lineage
  • Keep relevant data
    • Remove PII
  • Use a data catalogue
  • Use a data pipeline
  • Automate managing data changes (MLOps)
  • Use a modern data architecture (data lake)
  • Use managed data labeling
  • Use data wrangler tools for interactive analysis
  • Enable feature reusability
    • Feature Store
  • Minimize Idle resources
  • Implement data lifecycle policies
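
Protecting sensitive data (the "Remove PII" / "Masking" items above) might look like this minimal redaction sketch; the regex is illustrative, not production-grade PII detection:

```python
import re

# Illustrative pattern only; real PII detection should use a vetted
# library or managed service, not an ad-hoc regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    """Replace email addresses with a fixed mask token."""
    return EMAIL.sub("[EMAIL]", text)

print(mask_pii("Contact alice@example.com for access."))
# Contact [EMAIL] for access.
```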

Data Collection

  • Label: Set target variable values
  • Ingest: Can be stream, batch, micro-batch, CDC, event-driven, API-based, log-based, manual or other methods
  • Aggregate: Data can come from multiple sources
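
Aggregating data from multiple sources can be sketched as a last-write-wins merge keyed by record id (sources and schema here are hypothetical):

```python
# Hypothetical records arriving from two sources, e.g. a batch file
# and a stream, with one overlapping record id.
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
stream = [{"id": 2, "amount": 20}, {"id": 3, "amount": 30}]

def aggregate(*sources):
    """Merge records by id; later sources win on collisions."""
    merged = {}
    for source in sources:
        for record in source:
            merged[record["id"]] = record
    return list(merged.values())

rows = aggregate(batch, stream)
print(len(rows))  # 3
```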

Data Preparation

Data Preprocessing:

  • Clean: Missing data & Outliers
  • Partition: Partition by dimension for efficient access
  • Scale: Decide whether a distributed system like Spark is needed
  • Unbias & Balance: Deal with over-representation of classes
  • Augment: Add new or additional data
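
The cleaning step (missing data and outliers) might be sketched like this; the z-score threshold is illustrative, not a recommendation:

```python
from statistics import mean, stdev

def clean(values, z_max=2.0):
    """Drop missing values (None) and z-score outliers.
    A minimal sketch of the 'Clean' step; z_max is illustrative."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), stdev(present)
    return [v for v in present if abs(v - mu) <= z_max * sigma]

data = [10, 12, None, 11, 13, 500, 9]  # 500 is an injected outlier
print(clean(data))  # [10, 12, 11, 13, 9]
```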

Feature Engineering:

  • Feature Selection: Which features are most important
  • Feature Transformation: Normalization, Encoding, etc
  • Feature Creation: Derive new features by transforming or combining existing ones
  • Feature Extraction: Extract structured information, e.g. from an address field
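
Two of the transformations above, normalization and one-hot encoding, as minimal pure-Python sketches (a real pipeline would use a library such as scikit-learn):

```python
def min_max(values):
    """Feature transformation: scale a numeric feature to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Feature transformation: encode a categorical feature as 0/1 columns."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

print(min_max([10, 20, 30]))            # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```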

Model Development: Training and Tuning

  • Automate operations through MLOps and CI/CD
  • Establish reliable packaging patterns to access approved public libraries
    • Container registry
  • Secure and govern the Machine Learning environment
  • Encrypted service communications
  • Protect against data poisoning threats
  • Enable CI/CD/CT automation with traceability
  • Ensure feature consistency across training and inference
  • Establish data bias detection and mitigation
  • Optimize training and inference instance types
  • Establish a model performance evaluation pipeline
    • Model Registry
  • Establish feature statistics
    • Model monitoring
    • Experiments
  • Perform a performance trade-off analysis
    • Accuracy vs complexity
    • Bias vs variance
    • Precision vs recall
    • Tests with experiments
  • Detect performance issues when using transfer learning
  • Select optimal computing instance size
  • Use managed build environments
  • Select local training for small scale experiments
  • Select an optimal ML framework (PyTorch, TensorFlow, scikit-learn)
  • Use automated machine learning
  • Use distributed training
  • Stop cloud resources when not in use
  • Start training with small datasets
  • Use warm-start and checkpointing hyperparameter tuning
  • Define sustainable performance criteria
  • Select energy-efficient algorithms
  • Archive or delete unnecessary training artifacts
  • Use efficient model tuning methods
    • Bayesian optimization or Hyperband, not random or grid search
    • Limit concurrent training jobs
    • Tune only the most important hyperparameters
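
The preference for Hyperband-style tuning over grid or random search rests on successive halving; a toy sketch of that idea (the `evaluate` function is a stand-in for a real training run):

```python
def evaluate(cfg, budget):
    # Stand-in for a real training run: score grows with budget and
    # peaks near a hypothetical best learning rate of 0.1.
    return budget - abs(cfg["lr"] - 0.1) * 100

def successive_halving(configs, budgets=(1, 3, 9), keep=0.5):
    """Toy sketch of successive halving, the core idea behind
    Hyperband: evaluate every config on a small budget, keep the
    best fraction, and rerun the survivors with more budget."""
    survivors = list(configs)
    for budget in budgets:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget),
                        reverse=True)
        survivors = ranked[: max(1, int(len(ranked) * keep))]
    return survivors[0]

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)]
best = successive_halving(configs)
print(best)  # {'lr': 0.1}
```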

Deployment

  • Establish deployment environment metrics
    • Event buses
    • Topics
    • Monitoring tools
  • Protect against adversarial and malicious activities
  • Use an appropriate deployment and testing strategy
    • Blue/Green deployments
    • Canary deployments
    • Linear deployments
    • A/B Testing
  • Evaluate cloud vs edge options
  • Choose an optimal deployment in the cloud
    • Real-time, serverless, asynchronous, batch
  • Right-size model hosting instance
  • Align SLAs
    • Latency vs serverless/batch/asynchronous deployments
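
A canary deployment boils down to routing a small, sticky slice of traffic to the new version; a minimal sketch (the version names are hypothetical):

```python
import hashlib

def route(request_id: str, canary_weight: float = 0.1) -> str:
    """Canary deployment sketch: deterministically route a small,
    sticky slice of traffic to the new model version. Hashing the
    request id keeps routing stable per caller."""
    bucket = hashlib.sha256(request_id.encode()).digest()[0] / 256
    return "model-v2-canary" if bucket < canary_weight else "model-v1"

routed = [route(f"user-{i}") for i in range(1000)]
# Count should land close to canary_weight * 1000.
print(routed.count("model-v2-canary"))
```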

Monitoring

  • Enable model observability and tracking
  • Synchronize architecture and configuration, and check for skew across environments
  • Restrict access to intended legitimate consumers
    • Secure inference endpoints
  • Monitor human interactions with data for anomalous activity
  • Allow for automatic scaling of the model endpoint
  • Ensure a recoverable endpoint with a managed version control
    • Git
    • Container Registry
  • Evaluate model explainability
  • Evaluate data drift
  • Monitor, detect and handle model performance degradation
  • Establish an automated re-training framework
    • Jenkins
    • GitHub Actions
    • AWS Step Functions
  • Review updated data/features for retraining
  • Include human-in-the-loop monitoring
  • Monitor usage and costs by ML activity
  • Monitor return on investment for ML models
  • Monitor endpoint usage and right-size the instance
  • Measure material efficiency
    • Measure provisioned resources / business outcome
  • Retrain only when necessary
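
Evaluating data drift can start with a simple mean-shift check, though a real monitor would use per-feature tests such as Kolmogorov-Smirnov or PSI; a minimal sketch with an illustrative threshold:

```python
from statistics import mean, stdev

def drift_score(baseline, live):
    """How many baseline standard deviations the live mean has
    shifted. A minimal sketch; real monitoring would run a proper
    statistical test over many features."""
    return abs(mean(live) - mean(baseline)) / stdev(baseline)

# Illustrative numbers: live data has shifted upward vs. baseline.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5]
live = [14.0, 15.0, 13.5, 14.5, 14.0]

score = drift_score(baseline, live)
print(score > 3.0)  # True -> raise an alarm, consider retraining
```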