Deploying a machine learning model is very different from deploying any other kind of software.
Because machine learning models learn general rules from specific examples, we are never sure how they will behave in production. Of course, proper training set preparation diminishes the risk. Nevertheless, we can always be surprised by the results.
To mitigate these problems, we typically roll out machine learning models in three deployment stages.
First, we deploy the model in shadow mode, which means the model generates predictions, but we don’t use them for anything.
Shadow deployment is a crucial stage because it is the first time we see how the model will perform in production. The only people seeing the results are the developers from the ML team.
Typically, we duplicate the requests and send all production traffic to both the currently deployed model and the model tested in shadow mode. We log all requests and model predictions so we can review them later and look for surprising predictions. To make the analysis more manageable, I recommend attaching the same correlation ID to the production and shadow predictions for each request. Because we duplicate 100% of production traffic, we can also verify how fast the model responds under real load. After all, a perfect model which delivers forecasts too late is useless.
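A minimal sketch of the request duplication could look like the code below. All names here (`handle_request`, `timed_call`, the `log` callback, the record fields) are hypothetical, and I assume the models are plain callables and the features are JSON-serializable. For simplicity, the shadow model runs synchronously; in a real service, you would most likely call it in the background so it can't slow down the response.

```python
import json
import time
import uuid


def timed_call(model, features):
    """Return (prediction, latency in seconds) for a single model call."""
    started = time.perf_counter()
    prediction = model(features)
    return prediction, time.perf_counter() - started


def handle_request(features, production_model, shadow_model, log):
    """Serve the production prediction and log the shadow prediction for later review."""
    # The same id goes into both log records, so we can join them during the analysis.
    correlation_id = str(uuid.uuid4())

    prediction, latency = timed_call(production_model, features)
    log(json.dumps({
        "correlation_id": correlation_id,
        "variant": "production",
        "features": features,
        "prediction": prediction,
        "latency_s": latency,
    }))

    try:
        shadow_prediction, shadow_latency = timed_call(shadow_model, features)
        log(json.dumps({
            "correlation_id": correlation_id,
            "variant": "shadow",
            "features": features,
            "prediction": shadow_prediction,
            "latency_s": shadow_latency,
        }))
    except Exception as error:
        # A failing shadow model must never affect what the user sees.
        log(json.dumps({
            "correlation_id": correlation_id,
            "variant": "shadow",
            "error": repr(error),
        }))

    # Only the production prediction leaves this function.
    return prediction
```

The important design choice is that the function returns only the production prediction, and an exception in the shadow model ends up in the log instead of in the response.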
During the analysis, we don’t need to review all logged predictions. It is sufficient to pick a subset of model predictions randomly and compare only those results. Of course, we must pay attention to unusually high/low forecasts. In general, if something looks too good to be true, it is, most likely, wrong. Sorry to disappoint you ;)
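The review described above can be sketched in a few lines. The function names are hypothetical, and the z-score rule with a threshold of 3 is only my assumption of what "unusually high/low" may mean. Pick whatever rule fits your data.

```python
import random
import statistics


def sample_for_review(pairs, sample_size, seed=0):
    """Pick a random subset of (production, shadow) prediction pairs."""
    rng = random.Random(seed)  # fixed seed, so the review is reproducible
    return rng.sample(pairs, min(sample_size, len(pairs)))


def flag_unusual(predictions, z_threshold=3.0):
    """Return predictions more than z_threshold standard deviations from the mean."""
    mean = statistics.fmean(predictions)
    stdev = statistics.pstdev(predictions)
    if stdev == 0:
        return []  # all predictions are identical, nothing stands out
    return [p for p in predictions if abs(p - mean) / stdev > z_threshold]
```

For example, `flag_unusual([10.0] * 20 + [1000.0])` returns `[1000.0]`, and that is the forecast we should look at first.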
If we are satisfied with the model performance during the shadow deployment, we can show the results to the users.
In software engineering, a canary release is often used as a synonym for A/B testing.
During an A/B test, we show a different product version to a small percentage of users. In machine learning, we do the same.
At this stage, we want to slowly increase the amount of traffic handled by the new model to see if anybody complains about the results. We start with handling 1% of traffic with the new model. It is too little to test anything because the users won’t even notice the difference. We do it anyway to check whether we configured everything correctly. Later, we increase the traffic to 20%, 50%, and finally to 100%.
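One way to split the traffic is to hash a user identifier into a bucket, as in the sketch below. The function names are hypothetical, and hashing the user id is my assumption. You could also split by request or by session. The advantage of hashing is that a user who got the new model at 20% still gets it at 50%.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_variant(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new model."""
    return "new" if bucket(user_id) < canary_percent else "old"
```

Because the bucket depends only on the user id, increasing `canary_percent` adds new users to the canary group without moving anyone back to the old model.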
In our case, a canary release takes a few days. Every day, we **analyze the predictions generated by the new model, make sure the users are satisfied with the results, and decide whether we should increase the percentage of traffic**.
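The daily decision can be written down as a tiny function. This is only a sketch: `ROLLOUT_STEPS` uses the percentages mentioned earlier, while the function name and the rule "stay at the current level when the review went badly" are my assumptions.

```python
ROLLOUT_STEPS = [1, 20, 50, 100]  # percentage of traffic handled by the new model


def next_traffic_percent(current: int, users_satisfied: bool) -> int:
    """Move to the next rollout step only when the daily review went well."""
    if not users_satisfied or current not in ROLLOUT_STEPS:
        return current  # hold the current level and investigate
    position = ROLLOUT_STEPS.index(current)
    return ROLLOUT_STEPS[min(position + 1, len(ROLLOUT_STEPS) - 1)]
```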
Finally, we switch off the old model and start working on yet another improvement.
After the canary release stage, we have the model deployed in production and handling 100% of the traffic.
We usually keep the old, inactive version for at least a few days, so we can easily roll back to the previous model. However, if you analyzed the results correctly, such a need should rarely arise.
The only thing I have to say about the production deployment stage is a reminder to **clean up the old model versions. You can keep the files and the code, but don’t keep them running forever.** If variant B handles 100% of traffic and you have already started working on variant C, it is time to retire variant A.
What about deploying the first model?
The process described above works well when we deploy a new, improved version of an existing model.
What if we deploy the first model ever? What if we can’t compare the results with anything because nothing else exists?
The process doesn’t change in the shadow deployment stage. After all, we don’t show the results to the users, so they shouldn’t even notice we are testing anything. During the canary release, we can compare the results with a baseline. It is always good to tell the users you are trying something new that may not work correctly.
For example, in the case of recommendations, we can use the top 10 products as the A variant and the newly created machine learning model as the B variant. If we have a classification model, the A variant may be a constant value. Of course, sometimes using such baselines is unacceptable for business reasons.
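Both baselines take only a few lines of code. In the sketch below, the function names are hypothetical. I assume that "top products" means the most frequently purchased ones and that the constant value is the most common label in the historical data.

```python
from collections import Counter


def top_n_baseline(purchases, n=10):
    """Variant A for recommendations: the n most frequently purchased products."""
    return [product for product, _ in Counter(purchases).most_common(n)]


def constant_baseline(labels):
    """Variant A for classification: always predict the most common label."""
    most_common_label = Counter(labels).most_common(1)[0][0]
    # The returned "model" ignores its input on purpose.
    return lambda features: most_common_label
```

If the machine learning model can't beat such a baseline during the canary release, it is not ready to handle more traffic.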
You can always introduce ML as a new feature and call it a beta or early-access version. It is the first step anyway. You don’t need to overthink it.