What is MLOps? A quite controversial definition
What is MLOps? Is it only DevOps applied to machine learning? I don’t think so. The Ops part isn’t as important as we believe it is. It’s funny, but the machine learning part of MLOps isn’t essential either. What?! How is that even possible?
Why Do We Need MLOps?
Before we start defining what MLOps is, we have to think about the reason why it exists. According to Brian T. O’Neill, 85% of big data projects fail. Other reports bring even worse news: only 22% of companies successfully deploy machine learning in production. So why do we need MLOps? We need it to get business results from machine learning. The goal is more complex than merely getting ML into production.
MLOps consists of five parts
In my opinion, MLOps is a mix of five things: business knowledge, data quality testing, machine learning, distributed systems, and Ops practices.
We aren’t doing machine learning for the sake of doing machine learning. We want business results. Even the “AI startups” whose founders need AI only to look better while talking to investors need it only because of business results—in this case, getting funding. I think the fundamental requirement for MLOps engineers is understanding the business context.
What are you trying to achieve?
It looks like a trivial question. However, I have seen large teams working for months and not knowing why they are doing it. There is nothing more demotivating than wondering whether you are doing useless work. Without a clearly defined goal, the team will eventually doubt the usefulness of their project. They may be right.
What is the model supposed to automate? What is the current process or the equivalent manual process?
Can we do business without the machine learning model? How is it possible? Most likely, there is a manual process, and the team is trying to automate a part of the process. Make sure you understand which part of the process has to be automated.
How will you know that the model has achieved the business goal?
How would you know whether you have finished the task? What is the metric used to evaluate the model? What value is good enough to deploy the model in production?
What is the cost of wrong predictions of the model?
Every machine learning model makes mistakes. What happens when it returns a wrong classification label or predicts an incorrect numeric value? Don’t stop thinking about it when you find the immediate consequence. We may use the incorrect inference to make subsequent decisions. What is the second-order consequence of the mistake?
In the talk “The Only Truly Hard Problem in MLOps”, Todd Underwood says that data quality is the only problem in MLOps because subtle changes in the distribution of the data have a tremendous impact on the quality of the model.
I see data quality as the issue killing most of the machine learning projects. It happens for two reasons.
We obtain the input data for training at the beginning of the project, and the project runs for months. Because of concept drift, the model can no longer deal with the real data. The model learned old patterns, but the current reality looks different. We can mitigate this problem by retraining the models regularly.
The second reason is similar to concept drift: the distribution of values in production may differ from the distribution in the training set, and outdated training data can cause that too.
However, quite often, the training data is a hand-picked, perfect set of values that don’t cover all cases existing in reality. MLOps engineers should calculate distributions of training data and regularly compare them with the production data. When the distributions are no longer similar, it’s time to retrain the model. If the distributions of the values differ before you start training, you don’t even have enough data to begin training the first version of the model.
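One way to compare the two distributions is the Population Stability Index (PSI). The sketch below is only an illustration for a single numeric feature, written with the standard library. The function names, the bucket count, and the 0.2 threshold (a commonly quoted rule of thumb) are my assumptions, not a fixed standard.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # values outside the training range are clamped into the edge buckets
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # a small floor keeps log() defined for empty buckets
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(training_values, production_values, threshold=0.2):
    """Assumed rule of thumb: a PSI above the threshold signals a shift."""
    return psi(training_values, production_values) > threshold
```

In practice, you would run such a check per feature on a schedule and alert when any feature crosses the threshold.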
Eventually, we have to deal with machine learning issues. Of course, machine learning engineers who train the model will take care of most problems. However, there’s still work for MLOps people because some issues occur only in production.
Is the Model Numerically Stable?
What does numerical stability mean? It is a fancy way of saying that the model is resilient to noise in the input data. Does a slight change of input lead to a completely different prediction? Such a model is numerically unstable. What we want are models that can deal with slightly damaged data.
Why does it matter to MLOps engineers? Because noisy data will happen in production. What can we do about it? Test. We can generate some noisy test data and test the model before deployment.
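A noise test can be as simple as the sketch below. It assumes the model is exposed as a hypothetical `predict` callable that accepts a list of numeric features. The noise scale and the number of trials are arbitrary values you would tune to the noise you expect in production.

```python
import random


def prediction_flip_rate(predict, rows, noise_scale=0.01, trials=20, seed=0):
    """Share of noisy inputs whose prediction differs from the clean prediction."""
    rng = random.Random(seed)  # fixed seed keeps the test repeatable
    flips = total = 0
    for row in rows:
        clean = predict(row)
        for _ in range(trials):
            # Gaussian noise proportional to each value (absolute for zeros)
            noisy = [x + rng.gauss(0.0, noise_scale * (abs(x) or 1.0)) for x in row]
            flips += predict(noisy) != clean
            total += 1
    return flips / total if total else 0.0
```

A deployment pipeline could fail the build when the flip rate exceeds a limit agreed with the business. For regression models, you would compare the size of the change in the output instead of checking for equality.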
Does your model discriminate against a group of people? Is it biased?
What can you do? You can slice your test data into smaller datasets using the features most likely to reveal bias. In the next step, you can calculate the positive/negative rate, true/false positive rate, true/false negative rate, and AUC for every slice separately. The goal is to have identical (or similar enough) results across all portions of data.
Is this enough? No. But if your model doesn’t pass such a simple test, you can be sure it is unfair.
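The slicing described above can be sketched as follows. I assume binary labels and predictions stored in plain dictionaries with hypothetical `label` and `prediction` keys. AUC is left out because it needs scores instead of hard labels.

```python
from collections import defaultdict


def rates_by_slice(records, slice_key):
    """Positive rate, true positive rate, and false positive rate per slice."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for r in records:
        c = counts[r[slice_key]]
        if r["prediction"] == 1:
            c["tp" if r["label"] == 1 else "fp"] += 1
        else:
            c["fn" if r["label"] == 1 else "tn"] += 1

    def ratio(num, den):
        return num / den if den else None  # None when the slice has no such cases

    report = {}
    for group, c in counts.items():
        report[group] = {
            "positive_rate": ratio(c["tp"] + c["fp"], sum(c.values())),
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
        }
    return report


def max_gap(report, metric):
    """Largest difference of one metric between any two slices."""
    values = [m[metric] for m in report.values() if m[metric] is not None]
    return max(values) - min(values) if values else 0.0
```

What gap is “similar enough” is a business decision, so the sketch only reports the gap and leaves the threshold to you.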
Most teams are “lucky” and never have a problem with model regression. Of course, I’m being sarcastic. It’s not “luck”: if you never deploy a model in production, you won’t need to worry about regression testing. MLOps engineers should implement a test that compares the predictions of a newly trained model with the predictions of the old model. The inference quality shouldn’t decrease. The new model should be at least as good as the previous one.
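A minimal version of such a regression test might look like this. It assumes both models are hypothetical callables and that accuracy on a fixed, labeled holdout set is the metric that matters. Swap in whatever metric the business agreed on.

```python
def accuracy(predict, examples):
    """Share of (features, label) pairs that the model predicts correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


def passes_regression_check(new_predict, old_predict, examples, tolerance=0.0):
    """True when the candidate is at least as accurate as the current model."""
    return accuracy(new_predict, examples) + tolerance >= accuracy(old_predict, examples)


def disagreements(new_predict, old_predict, examples):
    """Inputs where the two models differ, as (input, old, new) for manual review."""
    return [
        (x, old_predict(x), new_predict(x))
        for x, _ in examples
        if old_predict(x) != new_predict(x)
    ]
```

The list of disagreements is often more useful than the aggregate number because it shows which cases changed.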
Preventing Data Leakage
MLOps engineers are the people who know what data is available in production. We should use this knowledge to detect data leakage.
Is the machine learning engineering team training the model using data unavailable at the time of inference? The training dataset may contain values “from the future.” The production data won’t. Guess how it will affect your model. Of course, to spot such a problem, we need to know the business process and understand how the system produces the data.
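If the training data records when each value was produced, a simple check can flag values “from the future.” The sketch below assumes a hypothetical row layout in which every feature is stored together with its timestamp. The feature names in the example are made up.

```python
from datetime import datetime


def find_leaky_features(row, prediction_time):
    """Names of features recorded after the moment the prediction is made.

    `row` maps a feature name to a (value, recorded_at) pair.
    """
    return sorted(
        name
        for name, (_, recorded_at) in row.items()
        if recorded_at > prediction_time
    )


# Hypothetical training row for a prediction made at noon on 10 January.
row = {
    "basket_value": (42.0, datetime(2024, 1, 10, 11, 59)),
    "refund_requested": (True, datetime(2024, 1, 12, 9, 0)),
}
leaks = find_leaky_features(row, datetime(2024, 1, 10, 12, 0))
```

Here, `leaks` contains only `refund_requested`, because that value didn’t exist yet when the prediction was supposed to happen.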
Most likely, the machine learning model isn’t deployed as a module of another application. Usually, we deploy them as external systems accessed through an API. This approach decouples the model from other production code, but it also creates a ton of new problems.
End-to-end Request Tracking
How will you debug your machine learning system if there are no correlation IDs and you can’t track requests across all services? Make sure you pass a request ID between services and log it in every one of them.
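The idea can be sketched without any web framework. The header name `X-Request-ID` is a common convention, not a requirement, and `predict` and `call_downstream` are hypothetical callables standing in for the model and the next service.

```python
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name
logger = logging.getLogger("inference")


def with_request_id(headers):
    """Return a copy of the headers that is guaranteed to carry a request ID."""
    headers = dict(headers)
    headers.setdefault(REQUEST_ID_HEADER, str(uuid.uuid4()))
    return headers


def handle(headers, features, predict, call_downstream):
    """Log every step with the request ID and forward the same ID downstream."""
    request_id = with_request_id(headers)[REQUEST_ID_HEADER]
    logger.info("request_id=%s received", request_id)
    prediction = predict(features)
    logger.info("request_id=%s prediction=%s", request_id, prediction)
    call_downstream({REQUEST_ID_HEADER: request_id}, prediction)
    return prediction
```

The important part is that the service reuses an incoming ID instead of generating a new one, so a single search in the logs shows the whole path of a request.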
Shadow Deployments and Canary Releases
How will you test a freshly trained model in production before it handles all traffic? There are two ways to do it. The first is a shadow deployment. In this case, the new model processes all requests, but its output isn’t used. Instead, we log the values returned by the model and review them.
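A shadow deployment boils down to the wrapper below. Both models are hypothetical callables. A real service would probably call the shadow model asynchronously so that it can’t slow down the response, which this synchronous sketch ignores.

```python
import logging

logger = logging.getLogger("shadow")


def predict_with_shadow(request_id, features, live_model, shadow_model):
    """Serve the live model's answer and run the shadow model only for logging."""
    live = live_model(features)
    try:
        shadow = shadow_model(features)
        logger.info("request_id=%s live=%s shadow=%s", request_id, live, shadow)
    except Exception:
        # a failing shadow model must never break the real response
        logger.exception("request_id=%s shadow model failed", request_id)
    return live
```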
A canary release is a technique of directing a small percentage of traffic to the new version of a service. We slowly increase the share of requests handled by the new model and monitor the situation. If we are satisfied with the result, we switch to the new model entirely. If not, we revert to the previous implementation.
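Usually, a load balancer or a service mesh splits the traffic, but the logic is easy to show in code. This sketch assumes we route by hashing a request ID, so the same request always goes to the same model. The function names are hypothetical.

```python
import hashlib


def routes_to_canary(request_id, canary_percent):
    """Deterministically assign a request to the canary based on its ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # value from 0 to 99
    return bucket < canary_percent


def route_and_predict(request_id, features, stable_model, canary_model, canary_percent):
    """Send the request to the canary or to the stable model."""
    use_canary = routes_to_canary(request_id, canary_percent)
    model = canary_model if use_canary else stable_model
    return model(features)
```

Increasing `canary_percent` step by step gives the gradual rollout, and setting it to 0 is the revert.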
At some point, you will have to switch off the model urgently. For example, you may want to prevent propagating garbage downstream when you get a batch of input data that causes erroneous predictions. Can you just undeploy the model? What will happen to the client services?
You may think it isn’t your problem because you aren’t responsible for the client services. True. It isn’t your problem until you become a root cause of a larger problem.
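One way to make switching off safe for the clients is to keep the endpoint alive and answer with a fallback. The class below is a sketch of that idea, not a full circuit breaker. The `fallback` callable (for example, a simple business rule) and the response format are my assumptions.

```python
class ModelSwitch:
    """Wraps a model so it can be switched off without breaking the callers."""

    def __init__(self, model, fallback):
        self.model = model
        self.fallback = fallback  # e.g. a business rule or a constant default
        self.enabled = True

    def predict(self, features):
        if self.enabled:
            try:
                return {"prediction": self.model(features), "source": "model"}
            except Exception:
                pass  # fall through to the fallback when the model fails
        return {"prediction": self.fallback(features), "source": "fallback"}
```

Returning the `source` field lets the client services decide whether they want to treat a fallback answer differently.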
Finally, we can talk about the Ops part. It may be a surprise, but this part is easy. It is easy because we won’t try to reinvent the wheel. The Ops people and the DevOps culture produced solutions to all problems we can encounter. The only thing required from us is learning about those solutions and applying them. Their best practices work. Seriously. We don’t need to invent anything new.
What do we need in the Ops area?
Ironically, machine learning teams, whose job is automation, do a lot of manual work when they deploy models. It doesn’t mean we should automate the decision about deploying a new model. A human should click the button starting the deployment, but everything else must happen automatically.
Resource Usage Metrics
It would be good to know when the model suddenly starts using twice the usual amount of RAM. It would be great if you were automatically alerted when it happens.
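Your monitoring stack most likely has alert rules for this already, so treat the following only as an illustration of the rule itself. It assumes you collect memory samples in megabytes somewhere else and compares every new sample with the median of a recent window. The window size and the factor of two are arbitrary.

```python
from collections import deque
from statistics import median


class MemoryWatch:
    """Flags a sample that is at least `factor` times the recent median."""

    def __init__(self, window=60, factor=2.0):
        self.samples = deque(maxlen=window)
        self.factor = factor

    def observe(self, used_mb):
        """Record a sample and return True if it should trigger an alert."""
        baseline = median(self.samples) if self.samples else None
        self.samples.append(used_mb)
        return baseline is not None and used_mb >= self.factor * baseline
```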
Computation Parity Between Training and Inference
I’m not sure whether I should list this requirement in the Ops part or the Machine Learning part.
Does your production service preprocess input data exactly as the machine learning engineers preprocessed the data during training? Do you use the same libraries? Do you use identical versions of those libraries? Do you know how a dependency upgrade affects the model?
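The version question is the easiest one to automate. The sketch below assumes the training job saves the versions of its libraries next to the model, and the inference service compares them with what it has installed at startup. The function names are hypothetical. `importlib.metadata` is part of the standard library.

```python
from importlib import metadata


def installed_versions(packages):
    """Versions of the given packages in the current environment (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(training_versions, serving_versions):
    """Return {package: (training, serving)} for every version that differs."""
    return {
        name: (expected, serving_versions.get(name))
        for name, expected in training_versions.items()
        if serving_versions.get(name) != expected
    }
```

A service could refuse to start, or at least log a warning, when `version_mismatches` returns anything. It doesn’t prove that the preprocessing code is identical, but it catches the cheapest class of mistakes.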
Besides the areas mentioned earlier, we have many MLOps-specific tools, such as feature stores, model registries, and ML experiment tracking software. Sometimes you may see people claiming that such tooling is the only problem of MLOps engineers. They want to sell you their ML platform ;)
Do you think it is too much? Am I exaggerating? Well, all of those tasks must be done by someone. Why an MLOps engineer? Because we’re the people who should understand all of those areas.