The purpose of MLOps is to address the challenges that arise in machine-learning projects. Typical project types:
- New product/capability
- Automation/assistance of existing manual tasks
- Replacement of existing ML system
-
Scoping
- Define Project
- Decide on key metrics
- Model Accuracy
- Latency
- Throughput (QPS, queries per second)
- Cloud/Edge/Browser computing?
- Cloud
- flexible computing power
- Edge
- Lower latency
- Offline processing (keeps working through network incidents)
- Real-time/Batch
- Logging
- Security
- Privacy
- e.g. patient record
- Estimate required resources and timeline
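The latency and throughput metrics above can be estimated with a small benchmark before committing to a cloud or edge deployment. A minimal sketch, where `predict` and `requests` are hypothetical stand-ins for the deployed model and a sample workload:

```python
import statistics
import time

def measure_serving_metrics(predict, requests):
    """Measure per-request latency and overall throughput (QPS).

    `predict` and `requests` are hypothetical stand-ins for the
    deployed model and a sample workload."""
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        predict(req)                      # one model call per request
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        "throughput_qps": len(requests) / elapsed,
    }

# Benchmark a dummy model on 100 dummy requests.
metrics = measure_serving_metrics(lambda x: x * 2, list(range(100)))
```

Comparing the measured p95 latency and QPS against the targets agreed on during scoping shows early whether the chosen deployment (cloud, edge, or browser) can meet them.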
-
Data
- Defining the data and establishing the baseline
- Labeling & Organizing the data
- Data label consistency
- e.g. Audio transcription
- Um, today's weather/Um...today's weather/today's weather?
- Volume normalization
- silence before/after each audio clip?
- Validating the data
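Label consistency can often be enforced mechanically once the team agrees on a convention. A minimal sketch that normalizes audio-transcription labels so the three variants above map to one label; the filler-word list is an assumption:

```python
import re

# Filler words to strip -- an assumed convention; agree on one list
# with all labelers.
FILLERS = {"um", "uh"}

def normalize_transcript(text):
    """Map transcript variants to one labeling convention:
    lowercase, drop punctuation, strip filler words, collapse spaces."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in FILLERS)

# All three variants from the notes above collapse to one label:
# "Um, today's weather" / "Um...today's weather" / "today's weather?"
```

Running every incoming label through one normalizer is usually cheaper than re-briefing labelers, and it makes inconsistencies measurable.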
-
Modeling, Model Development
- Two types of AI/ML development
- Model centric (tends to be research and academia)
- Data centric (tends to be production system)
- Challenges in model development
- Doing well on training data set
- Doing well on dev/test data set
- Doing well on business metrics/project goals
- Inputs
- Code (algorithm/model)
- Hyperparameters
- Data
- In research/academia, adjusting the code/hyperparameters is relatively emphasized.
- In production system, focus is more on the data
-
- Performing error analysis
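Error analysis is often done by tagging misclassified examples and counting errors per tag, which shows where extra data or label fixes would help most. A minimal sketch; the tags and examples are hypothetical:

```python
from collections import Counter

def error_counts_by_tag(examples):
    """Count misclassifications per tag.

    `examples` is a list of dicts with 'label', 'prediction',
    and 'tags' keys (a hypothetical record format)."""
    counts = Counter()
    for ex in examples:
        if ex["prediction"] != ex["label"]:
            counts.update(ex["tags"])     # attribute the error to every tag
    return counts

examples = [
    {"label": "scratch", "prediction": "ok", "tags": ["dark", "blurry"]},
    {"label": "ok", "prediction": "ok", "tags": ["dark"]},
    {"label": "scratch", "prediction": "ok", "tags": ["dark"]},
]
# 'dark' accounts for 2 of the errors, 'blurry' for 1 -- so collecting
# more well-lit training data would be the first thing to try.
```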
-
Deployment
- Deploying the model to production
- Gradual ramp-up of traffic
- Rollback possibility
- ML System Deployment patterns
- Shadow deployment
- Gradual replacement of the manual process by running the ML system in parallel
- e.g. You've built a new system for making loan-approval decisions. For now, its output is not used in any decision-making process, and a human loan officer is solely responsible for deciding which loans to approve, but the system's output is logged for analysis.
- Canary deployment
- A small portion of the traffic is handled by the ML system first, to avoid significant negative impact
- Blue-green deployment
- Old (Blue) / New (Green)
- Keep the Blue version running and switch traffic to the Green version (gradually or all at once)
- Quick rollback to the previous version is possible because the Blue version is kept running
- The recommended approach is to gradually automate the manual process; aim for full automation only after a certain maturity level is reached.
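A canary rollout can be sketched by hashing a request id to a stable bucket, so a small, deterministic fraction of traffic reaches the new model while the rest stays on the old one. The 5% fraction and the blue/green model names are assumptions:

```python
import hashlib

# Fraction of traffic sent to the new (Green) model -- an assumed
# starting point; tune it to your risk tolerance.
CANARY_FRACTION = 0.05

def route(request_id, old_model="blue", new_model="green"):
    """Deterministically route a request: the same id always hits
    the same model, so each user gets a consistent experience."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # stable value in [0, 1)
    return new_model if bucket < CANARY_FRACTION else old_model
```

Raising `CANARY_FRACTION` step by step, while watching the monitoring metrics below, implements the gradual ramp-up with rollback that these notes recommend.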
-
Monitoring
- Spotting data/concept drift
- e.g. Audio recognition ML project
- Users get older and their voices change => emotion detection may become inaccurate
- Spotting issues in data pipeline
- Pipeline monitoring
- If multiple ML models are deployed as microservices, the output of ML model 1 feeds into and affects the output of ML model 2
- Methods
- Use dashboards (most common)
- Software Metrics
- e.g. Traffic load monitoring
- Input metrics
- e.g. avg. audio length, image brightness, number of missing values
- Output metrics
- e.g. frequency of recognition errors, inaccurate recommendations
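An input metric such as average image brightness can be watched with a simple threshold on how far it shifts from the training-time baseline. A minimal sketch; the z-score threshold of 3.0 is an assumed starting point:

```python
import statistics

def drifted(baseline, window, z_threshold=3.0):
    """Flag drift when the production window's mean moves more than
    `z_threshold` baseline standard deviations from the baseline mean.
    The threshold of 3.0 is an assumed starting point."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(window) - mu) / sigma
    return z > z_threshold

# e.g. average image brightness per batch, recorded at training time:
baseline = [10.0, 10.5, 9.5, 10.2, 9.8] * 4
```

Wiring a check like this into the dashboard turns the input metrics above from something you eyeball into something that can page you.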
-
Maintaining
- Improving/fixing the model based on the new data coming in from the production environment
- Configuration
- Data Collection
- Feature Extraction
- Data Verification
- Machine Resource Management
- ML Code
- Analysis Tools
- Process Management Tools
- Serving Infrastructure
- Monitoring
Ref. D. Sculley et al., NIPS 2015: Hidden Technical Debt in Machine Learning Systems
Notes
- Data drift is, in short, a change in the input data distribution (the tendency of X changes).
- Concept drift: the X -> Y mapping changes; the underlying relationship changes.
- e.g. The lighting in a room changed, so the photos taken are overall brighter, which hides the scratches on the objects.
- Data Provenance

References
- https://towardsdatascience.com/machine-learning-in-production-why-you-should-care-about-data-and-concept-drift-d96d0bc907fb
- https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/
- https://youtu.be/06-AZXmwHjo
- http://arxiv.org/abs/2011.09926
- https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf