Drifting Away: Testing ML Models in Production

May 27, 2021 11:35 AM (PT)


Deploying machine learning models has become a relatively frictionless process. However, properly deploying a model with a robust testing and monitoring framework is a vastly more complex task. There is no one-size-fits-all solution when it comes to productionizing ML models, oftentimes requiring custom implementations utilising multiple libraries and tools. There is, however, a set of core statistical tests and metrics one should have in place to detect phenomena such as data and concept drift to prevent models from becoming unknowingly stale and detrimental to the business.


Combining our experiences from working with Databricks customers, we do a deep dive on how to test your ML models in production using open source tools such as MLflow, SciPy and statsmodels. You will come away from this talk armed with knowledge of the key tenets for testing both model and data validity in production, along with a generalizable demo which uses MLflow to assist with the reproducibility of this process.

In this session watch:
Chengyin Eng, Data Science Consultant, Databricks
Niall Turbitt, Senior Data Scientist, Databricks



Chengyin Eng: Hi, everyone. Welcome to our talk, Drifting Away: Testing Machine Learning Models in Production. By the end of this talk, you will be armed with a suite of tests and open source package options to test both your model and data validity in production. Let me start by introducing myself. My name is Chengyin Eng. I’m a data scientist at Databricks. I implement data science solutions for clients and also teach machine learning. My background is in computer science, statistics and environmental studies. Prior to Databricks, I was in the life insurance industry.

Niall Turbitt: Hi, everyone. My name is Niall Turbitt. I’m a senior data scientist at Databricks. Similar to Chengyin, I work on the EMEA ML Practice Team based out of Europe. With this, my time is spent working with customers to build and deploy scalable machine learning solutions, as well as delivering classes which focus on data science and machine learning with Spark. Prior to Databricks, my background has involved building scalable data-driven and machine learning solutions across a range of domains, such as supply chain forecasting, logistics optimization, and recommender systems.

Chengyin Eng: Cool. For this talk, we are going to discuss the machine learning life cycle, why we should care about drift and tests, and not only what we should monitor, but also how we should monitor. Lastly, we will also be showing you a demo of the workflow that we have implemented. Currently we see a lot of companies adopting machine learning in their business, but according to a Gartner analysis, a surprising 85% of data science projects actually fail. Another technology research analysis shows that only 4% of companies have successfully deployed their machine learning models. The percentages of failures here are quite jarring. Let’s dig into why. There are a lot of reasons why machine learning projects fail; here we focus just on the production phase. The top reason why machine learning models fail to perform in a long-term production setting is that most data scientists neglect the importance of maintaining machine learning models post deployment.
Many of us think that our jobs are complete after we ship a model to production. Hence we overlook the importance of retraining and testing our models and data in production to ensure that the model quality is consistent across time. Perhaps it’s not so surprising after all, considering that MLOps, which is a set of best practices for machine learning operationalization, is a relatively nascent field. There’s a lot of confusion around where to apply the statistical tests, and in particular, what tests to use. This is the area our talk hopes to shed light on. The first question we hope to answer here is: what are the statistical tests that we should use to monitor models in production? There is some proprietary software that partially addresses this question, but we want to pull back the layers on proprietary software and examine it from a statistical perspective.
And the second question we would like to answer is: what tools can I use to coordinate the monitoring of data and models? We would like to demonstrate how you can set this up using only open source tools. For this talk we’ll be focusing on testing tabular data in a batch scenario, but the statistical tests are relevant in streaming and real-time scenarios as well. You can also borrow the same overarching framework and adapt it to other types of data, such as images and text. In this talk we will not be covering any strategy of model deployment. We will not be covering any unit tests or integration tests for your code. And because testing and monitoring are so unique to each use case and domain, this talk is by no means a prescriptive model of when to update your model. Instead we’ll be covering how to detect feature changes and model staleness. Before we dive into the nitty-gritty of tests, Niall, can you talk a little bit about the ML life cycle and why we should monitor it?

Niall Turbitt: Absolutely. So before we can address the topic of how to implement monitoring, I think we need to first establish where monitoring comes into play in the overall life cycle of a machine learning system. If we start from the foundations where we are first presented with a business problem, we might have a business stakeholder who comes to us with a problem that they believe can be solved with machine learning. We’re going to work with the team to identify what success looks like for this project, and in particular, establish a measurable business metric to evaluate the success of our model. With this scoping done, we then want to start working with our data team to establish whether we do have the necessary data in place to solve such a problem. After some work, we have the necessary data pipelines in place to start doing some exploratory data analysis, where we’re going to start to create some features and then think about, well, do we have some predictive features that enable us to predict the target?
We’re then going to get into the fun part of modeling our data and enter this iterative feedback loop of fitting a model to our data and evaluating our model performance. And as we start to look at results, we want to iteratively adjust our data collection procedure, we might want to update our feature generation process and adjust our model hyperparameters to reach a final model that we’re finally satisfied with. At this stage we probably feel like we’re nearly there. We have a model that’s performing well against our held-out test dataset, and it’s just a matter of pressing a deploy button. However, in some respects, this is really only the beginning of a model’s life cycle. The following components of model deployment and model monitoring are often what people refer to as MLOps, as Chengyin mentioned. In effect, this is asking, well, how do you actually operationalize an ML model? How we decide on when and how to deploy a model is going to be highly dependent on a multitude of factors.
From how quickly new data arrives to how long our model takes to train, there’s really no silver bullet for all machine learning deployments. It’s ultimately going to be very domain- and problem-specific. When developing a model to solve a given business problem, how a model is maintained once that model is in production is often drastically overlooked. If there’s one thing that I would urge you to take away from this talk in particular, it would be to take considerable time when developing an ML solution to think about how the productionized version of that solution will be measured and monitored. Why monitor? Model deployment is really not the end. If we think about it, the power of a predictive model is a result of the model’s ability to identify patterns between some set of input features and a prediction target. Our model therefore will only perform well on new data as long as that new data is somewhat similar enough to the data that it was initially trained on.
Even if we train a model on a suitable dataset, more often than not the distribution of incoming data will change over time. These changes can arise from an infinite number of sources, but just to name a few: if we think about errors, upstream errors are inevitable. A large proportion of changes in data will be a result of some error that comes from an upstream data generation process. Externally as well, we can think about market changes, human behavior changes, or really any external factors that might influence the underlying data collected. Something like changing user preferences over time would have a huge impact on our model. This plethora of sources of change can really impact how our model then performs and ultimately lead to model performance degradation. Models will degrade over time: rather than it being a question of if a model will degrade, it’s more a question of when a model will degrade, and whether we can identify this before or when it happens. This is the chief motivation for continuously monitoring a model once it’s deployed to production.
I started to allude to it, but one of the core reasons for model degradation is a phenomenon called drift. And in order for us to understand what statistical tests we should have in place for a production model, we first must understand the various types of drift that can occur. The most prevalent forms of drift are the following. Feature drift, which you might have often heard referred to as data drift or covariate shift, occurs when the underlying input data has changed in some way: the distribution of features that we trained our model on has significantly changed. Label drift is then where the distribution of our label has significantly changed, related to some outside influence. This can typically be caused by a shift in the actual underlying features themselves. Prediction drift is highly related to label drift, but instead of being related to an outside influence, it’s directly related to the model itself: it is a shift in the predictions coming from our model.
Concept drift then arises from an effect that happens as a result of some outside influence and changes the underlying pattern that our model has learned, so that it’s no longer valid. The underlying relationship between our features and the label that we’re trying to predict has evolved over time. Just to double-click on some of this: looking at feature, label and prediction drift, these can all rear their heads in very similar circumstances and often occur in a confounding manner, so one can cause the other. Some literature buckets these different forms of drift under the same umbrella of data drift, and Chengyin will walk through how we can use the same set of tests to identify these different forms of drift. But just to illustrate what something like feature drift looks like, let’s first look at the following two feature distributions. Feature drift in the context of a categorical feature may be that the observed distribution of instances per class differs from what we expect.
If we then look at the context of a numeric feature, feature drift can become apparent if we have a feature like age where we find that the mean and the variance of this input feature change over time. On the other hand, concept drift can manifest itself in a number of different manners, and ultimately each of these different forms requires a different method to detect it, which makes it somewhat tricky. If we have a look into each of these: sudden concept drift will be where the drift happens abruptly due to some unforeseen circumstances. A black swan event like the COVID pandemic last year would be a prime example of this. Gradual or incremental concept drift would be where our data gradually evolves over time. And lastly, recurring concept drift will happen somewhat periodically, potentially at a certain time of year, with something like Black Friday in retail being a canonical example of recurring concept drift.
Now that we’re armed with this knowledge of the various forms that drift can take, probably the most important question that we want to ask ourselves is: what action do we want to take as a result of identifying drift? Feature drift, where the underlying input data has changed, may warrant investigation into the actual feature generation process. We may want to retrain our model to capture this change in the underlying data. Label drift, where we see significant change in the ground truth of our label, may be a flag to investigate the label generation process and then additionally investigate if there’s an underlying change in the input features as well. Prediction drift, where the distribution of predictions from our model has significantly changed over time, might then warrant some investigation into both the model training and data generation processes, and, pending that, assessing what impact a shift in predictions would have on the business itself.
And then lastly, concept drift, where some external factors have impacted the relationship between our input features and our label. We might want to have a look at how we can incorporate some additional information, perhaps in the form of new features through some additional feature engineering, or in extreme cases we may have to consider a completely alternative approach to the problem; alternatively, we could get away with doing some retraining or tuning of our model. Once we know the different types of drift that we want to identify and the actions that we want to take if we spot them, what Chengyin will now do is take us through the statistical tests that we can utilize to identify these.

Chengyin Eng: Now let’s look at the actual tests and the statistical checks that we should have in place. For the model features and target, we should monitor their basic summary statistics and also their distributions. We should also monitor the business metric as well, just in case changes in the business metric affect the relevancy and impact of the model. And last but not least, we should also monitor the model performance, and only replace the existing model in production if the new candidate model performs at least equally well or better. Now let’s move on to the specific monitoring tests on data. We should first identify which features are numerical or categorical. For numeric features, we can compute the median or mean, minimum, maximum, and percentage of missing values. The statistical tests that we can apply to check the means are the two-sample Kolmogorov-Smirnov test, also known as the KS test, with Bonferroni correction, and the Mann-Whitney test. The test for variance here is the Levene test. Before we cover the tests for the categorical features, we’ll go through the basic concepts of the tests that we’ll be incorporating in our demo.
First up, let’s examine the KS test with Bonferroni correction. The KS test is useful when we want to compare two continuous distributions. The null hypothesis here is that distributions X and Y come from the same population. If distributions X and Y are different enough at a P value of a certain alpha significance level, usually 0.05, then we reject the null hypothesis and conclude that distributions X and Y do not come from the same population. Sometimes we might encounter a type one error, which is when we reject the null hypothesis when it is actually true. As we repeat this KS test multiple times for N feature comparisons, the rate of false positives actually increases. This is why we need the Bonferroni correction: to adjust the alpha value by the total number of feature comparisons to reduce the family-wise error rate. Note that you will see this framework of hypothesis testing repeated for other statistical tests as well.
The Levene test specifically compares the variances between two continuous distributions. Its null hypothesis is that both distributions come from populations with equal variances. If the Levene statistic has a P value lower than a preset alpha level of, say, 0.05, then we reject the null hypothesis and conclude that both populations, or both distributions rather, have different variances. Now let’s pivot to the tests for categorical features. Similar to numerical features, we can compute the percentage of missing values. Other summary statistics that we can include are the mode, which is the most frequently occurring value, and the number of unique levels. In terms of statistical tests, we can leverage the one-way Chi-square test.
The one-way Chi-square test compares the expected column distribution and observed column distribution of the categorical variable. The null hypothesis is that the observed distribution, which is the incoming data, is equal to the expected distribution, which is your production data. Similar to before, if the Chi-square statistic has a P value lower than alpha 0.05, then we can reject the null hypothesis. So far we have been talking a lot about the tests that we should use for monitoring the data. In the next slide, we’ll be discussing the monitoring tests that will be useful for the models. Regarding the models, there might be a lot of different aspects that one might care about. The first could be the relationship between the target and the features. We may want to investigate the correlation changes between the inputs and the target.
For a numerical target, we can use the Pearson coefficient to calculate the correlation. For categorical targets, we can use frequency tables, also known as contingency tables. The second piece is model performance, which may be the most obvious to many of you. For regression models we can check the mean squared error, error distribution plots, R-squared, et cetera. For classification models, we can look at the ROC, the confusion matrix, F1-score, and so on. If we want to assess a model in more granular terms, we can also investigate the model performance on just particular slices of data, maybe by month, by product type, you name it. Lastly, we should also be cognizant of the time taken to train a model. If a new model is taking triple the time to train compared to before, this could be a telltale sign that something is fishy.
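To make the correlation checks concrete, here is a small sketch with simulated listing data (the column names and numbers are invented for illustration): a Pearson coefficient for a numeric target, and a contingency table for a categorical one.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(7)

# Numeric target: Pearson correlation between an input feature and price
bedrooms = np.random.randint(1, 6, 300)
price = bedrooms * 120 + np.random.normal(0, 40, 300)
r, p_value = stats.pearsonr(bedrooms, price)

# Categorical target: a frequency (contingency) table between a
# categorical feature and a binned label
df = pd.DataFrame({
    "room_type": np.random.choice(["entire_home", "private_room"], 300),
    "price_band": np.random.choice(["low", "high"], 300),
})
contingency = pd.crosstab(df["room_type"], df["price_band"])
```

Recomputing these on each new batch and comparing against the training-time values is one simple way to spot a changed input-target relationship.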
Let’s look at the tools that we can use to measure and monitor models in production. Currently there is no single open source solution that provides a robust means for us to do this. We have decided to pull from several open source libraries, implement some of the tests on our own, and incorporate these tests in our demo workflow. In particular, we used MLflow for model tracking and Delta for data tracking. For statistical tests, we used the two-sample KS test, Levene test and Chi-square test from SciPy and statsmodels respectively. For visualizations we use the seaborn library. For those of you who are new to MLflow, we will briefly introduce what MLflow is. It is an open source tool that helps with MLOps.
There are four components altogether: Tracking, Projects, Models and Model Registry. All of these components can help with the reproducibility of our ML project. In the demo, we’ll be using the Tracking, Models and Model Registry components. They help us keep track of our model parameters, metrics, and artifacts. They can help us save models along with their dependency requirements, and help us manage the model life cycle. All of this together, as you will see in our demo, allows us to reproduce results and retrieve historical runs.

Niall Turbitt: Great. Thanks Chengyin. What we have here are a series of notebooks in which we’re going to simulate a scenario where we want to deploy and maintain a model in production, and this model is going to be required to make predictions on a monthly basis. Because we’ve been in lockdown for a while, we wanted to see where we could take our next vacation. We’ve chosen to use a dataset containing Airbnb listings in Hawaii, where our aim is to predict the price of a new listing, given attributes such as the number of bedrooms and the neighborhood that a property is in. There are a few things to set up here. What we have included is a number of notebooks, including this one that we’ll be walking through, under this following bitly link. So if you do want to run this yourself and take a look through the code in further detail, go to bit.ly/dais_2021_drifting_Hawaii.
And you’ll find this notebook along with the two associated setup notebooks, which include utility functions to do our monitoring, and then the actual training setup itself. The training setup will involve some training functions that we use to actually train the model, in addition to creating the various datasets that we use to replicate different months of data coming through. There are a few requirements that we’ve also outlined here; in particular, we’re testing this on a Databricks ML runtime. Everything that you see here will be using open source libraries and packages. We’re using MLflow to track the various parameters, Delta to version our data itself, and for the actual underlying statistical tests, we’ll be using SciPy in particular. What we’re going to do here is simulate a batch inference scenario where we’re going to train, deploy and then maintain an actual model in production to predict those listing prices in Hawaii on a monthly basis.
Data arrives monthly, and our workflow, once we have an initial model deployed, is to load a new month of incoming data and apply any incoming data checks. This will be doing the kinds of tests that Chengyin had mentioned around error and drift evaluation. We’re going to then identify and address any errors in that data itself. We’re then going to train a new model and apply any model validation checks that we will want to apply before moving a model to production. If those checks pass, then we’ll deploy the new candidate model to production. If those checks fail, then we won’t deploy the candidate model. And like we said throughout, we’ll be using MLflow and Delta, in addition to SciPy, to do the actual testing and also versioning of our data and models. And although we’re specifically doing this in a batch setting for a supervised ML problem, the same statistical tests that we’re using are applicable to streaming and real-time settings.
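The monthly loop just described can be condensed into a small sketch. Every helper below is an illustrative placeholder, not one of the actual functions from the demo notebooks; "model quality" is reduced to a single score where lower is better, as with RMSE.

```python
def feature_checks_pass(incoming, reference):
    # Stand-in for the real checks (missingness, KS, Levene, Chi-square);
    # here we just require incoming values to be non-negative.
    return min(incoming) >= 0

def train_model(history):
    # Stand-in for training; the "score" is just the mean of the data.
    return sum(history) / len(history)

def monthly_update(incoming, gold_table, production_score):
    """One iteration of the batch workflow: check, append, train, validate."""
    if not feature_checks_pass(incoming, reference=gold_table):
        raise ValueError("Incoming data failed checks; investigate upstream")
    gold_table = gold_table + incoming          # append to the gold table
    candidate_score = train_model(gold_table)
    # Deploy the candidate only if it is at least as good as production
    promote = candidate_score <= production_score
    return gold_table, candidate_score, promote

gold, score, promote = monthly_update([2.0, 3.0], [1.0, 2.0],
                                      production_score=3.0)
```

The point of the sketch is the ordering: data checks gate the append to the gold table, and model checks gate the promotion to production.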
A couple of the other notebooks contain, like I said, just the setup and the actual instantiation of the various methods that we’ll be using throughout. I’ll open these up in a new tab as well, just so that we can have a look at them as we use them throughout. I am going to be creating some widgets; these are Databricks utilities that we can use to parameterize a notebook. We’ll be using these variables further down, which I’ll explain in more detail. Let’s simulate the first month of training. What’s going to happen is that we have an initial month of data come in, and we want to train a model and then deploy that model to production. Because we have no historic data to compare against, nor an existing model in production to compare a new candidate model against, we won’t go through a robust set of checks and balances before pushing this model to production. But in a real world setting, you would obviously want to ensure that the model is robustly tested before moving it to production.
Let’s create this incoming month of data and start model training. In particular, throughout, what we’re going to be doing is appending to this gold Delta table. This is going to be the Delta table that we’re going to use for training, and that we’re happy has cleaned, processed data. In particular we have this gold delta path, which we’re removing first, just in case it’s already there, to create a clean new version of that Delta table. We’re going to then load in this month zero delta path. Throughout, we’ll be using some of these variables, which have been created in the training setup. Just to show you what that does look like: you’ll see a number of imports and some notebook configs. Everything that is executed here, which has been done through these first couple of commands, is basically instantiating those variables in this notebook setting.
We’ve loaded that month of data and saved it out to this gold delta path. What we’re going to do is actually trigger just our first month of training, and I’ll break down what exactly is going on here. The real meat of the actual MLflow tracking and training of our model comes from this train sklearn RF model method. In particular, this is going to take a run name, which is going to be the MLflow run name. It’s going to take a gold delta path, which is going to be the path to this Delta table. It’s going to take model parameters, which is going to be a dictionary of parameters that we want to feed into the sklearn random forest regressor model that we’ll be using. And this misc parameters is going to be any arbitrary parameters that we ultimately want to track and use for pre-processing when we’re using this train sklearn method.
Let’s have a look at what’s actually contained in that method to uncover it. In particular, it will be calling this create sklearn RF pipeline, which is just going to build that sklearn pipeline: it contains numeric stages and categorical stages, combines them together, and adds on a random forest regressor stage with the model parameters that we’re feeding into our function. And in particular, this will be called within our train sklearn RF model with, like I said, the run name for the MLflow run, the delta path for that gold Delta table that we’ll be using, the model parameters, and any additional arbitrary parameters that we ultimately want to track and then feed into our pre-processing stage. I’m not going to go through this line by line, but I should just capture what we’re generally doing.
Throughout the monitoring notebook in particular, what I’ll be guiding you through is not line by line what the code is doing, but conceptually where you would apply the tests, where you do training, and what tests you would do at each stage. In particular, we’re going to be using MLflow to track and log our model parameters, plus any artifacts that we may want to use after we have initially trained the model. We’re also going to enable MLflow autologging, which is basically going to log any parameters that we want to track to MLflow. Additionally, it’s going to track the actual model artifacts themselves, and then we’re going to track some additional parameters that we want to use down the line.
Firstly, what we’re going to do is load in the actual Delta table itself, and importantly, what we’re going to log is the delta path: what is the path to that gold Delta table that we’re using, and what is the version of that Delta table that we’re using? This will be important as we go through subsequent months of data, because ultimately what we want to say is: given this version of data, what did that data look like? What were the distributions that we were looking at here, and as we come through subsequent months, can we compare those two things? Additionally, what this is then doing is logging out various parameters that we may want to see down the line and have a record of: the number of instances, what was the month, what was the number of training instances used, what was the number of test instances used?
But importantly, here is what is going to happen: we’re tracking all this additional information, we’re creating our random forest pipeline, and we’re fitting the model itself to our training data. Here we’re also making a record of, well, what was the actual schema that was used to fit that model? What does that model expect at prediction time? And then additionally, when evaluating the model itself, can we record metrics, which will be important again down the line to be able to say, okay, this model performed this way in this certain month, and as we go through those months, to be able to see, well, what was the performance through time?
Let’s execute this cell, which will firstly just set up this training. Let’s trigger the actual run itself. This, like I said, will start to create that MLflow run and trigger the actual model training. Once that model is trained, what we want to do is utilize the MLflow Model Registry. In particular, this enables us to track the lineage of our models. You’ll see, whilst this model is training, what’s going to happen is that it appears as a run in MLflow itself. Everything we’re doing here is using open source tools, albeit MLflow is stitched in quite nicely to the Databricks workspace, such that whenever we execute an MLflow run, we see it appear as a run in our sidebar. If I click on this, I can go out to the actual MLflow experiment.
And in particular, we see that we now have this first MLflow run. Clicking into this run, where we see the run name is month zero, what we can see is that I have my various parameters that I’ve logged. A lot of these are automatically captured when we set mlflow.autolog, but in particular, some of the custom parameters that we have logged are the delta path and the delta version. This will become important as we go through time to say: this run was created on this month of data, so in this particular month, what was the version of data that it used? Additionally, we’ve got the various metrics for our test and training sets. And importantly, we also have the actual model artifact itself. In particular, this will be the pipeline that we’ve created with the fitted random forest model.
We can then load this in at a later stage to see, well, how do these historic predictions do against future versions of this model? Other things that we have also logged include this CSV, which captures some summary statistics about the incoming data. We have done this by just creating a pandas DataFrame and saving it out as a CSV, where you see we have the likes of the mean, the max, and the various summary statistics that we might ultimately want to use down the line. The preprocessing parameters JSON is just a dictionary that we created, logged as a JSON, which contains some arbitrary miscellaneous parameters that we might want to use down the line, such as what was the month, the target column, and the various categorical columns and numeric columns used at that point in time.
Coming back to our actual notebook itself, we have a model that’s been trained, and what we’re going to use is this Model Registry. Like I was saying, this tracks the lineage of our models. We can say this model is in staging, this is in production, and be able to see where a model actually resides, which helps manage the workflow of models through their life cycle. In particular, what we’re going to do is create this version one in the Model Registry. If I actually navigate to the Model Registry itself, what I will see is that I have a version one that’s just recently been created, and what we can do is migrate this to staging or production, and then start working with our model once it’s in those various stages. In particular, for this first month, what we’re going to be doing is just transitioning to production straight away.
Again, what you would want to do is robustly check that this model is performing as expected, including against some of those data slices that Chengyin had mentioned, but for the purposes of our demo, we’re transitioning straight to production, such that in later months we’ll see that we can compare against our production model. Let’s replicate going through to month one, when new data is arriving. In particular, what we are trying to simulate here are some real world scenarios where you might have upstream data issues, or you might actually have feature drift as well. And we’ve actually combined two things here. Notably, we have recreated some upstream data cleansing process changes where our neighborhood cleansed column is missing some entries. In particular, some neighborhoods which should be present are now not present. Additionally, what we’re doing is recreating a scenario where some upstream data generation procedure has introduced a scaling issue with one of our features.
In particular, we have review scores rating: the overall rating of the reviews for a given listing, which was previously bounded between zero and 100. What we’ve done in our notebook setup is actually rescale it to be between zero and five, to recreate a new star rating system. What we aim to do with some checks, such as those distribution checks and error checks that we were mentioning previously, is see whether we can detect this before we even start model training. The workflow that we are going to follow here is, firstly, ingesting data and applying some feature checks to that new incoming month of data. Once we have resolved any issues, we’re then going to append that to our gold delta table, and we’re then going to do model training. Once we have the model itself, we can then do actual model checks to see, well, how does that new model compare against the older historic model?
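The scaling issue being simulated is easy to reproduce; a minimal sketch, with a made-up frame standing in for the listings data:

```python
import pandas as pd

# review_scores_rating was historically bounded between 0 and 100.
df = pd.DataFrame({"review_scores_rating": [100.0, 80.0, 50.0, 0.0]})

# Simulate the upstream change to a 0-5 star rating system.
df["review_scores_rating"] = df["review_scores_rating"] * (5.0 / 100.0)
```

The marginal order of listings is unchanged, which is exactly why a point estimate like the mean alone can miss this; the distribution checks below are what catch the range shift.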
And then if that given model passes all checks and balances, we migrate it to production itself. Before we start with that, I do want to just go back to our models. We have this current model; if I refresh this, you’ll see that we have this version one, which is now in production. What are we simulating? In this first month we’re simulating feature drift, because our feature, historically bounded between zero and 100, has now drifted to be between zero and five, and also upstream data errors, where some cleansing issue has gone wrong and some neighborhoods are no longer being presented. In terms of the various feature checks that we’re going to be using, we’re going to break these down into subsets: checks that we can apply to all features, such as missingness checks, where we ask, well, what are we expecting in terms of null values; and some numeric feature checks, where we have a look at some naive summary statistics. How do these differ from the previous months of data that we’ve seen?
Some distribution checks, coming back to the Kolmogorov–Smirnov test that Chengyin had walked us through, and also the Levene test that we’d seen for continuous variables. On the categorical side of things, we’re going to want to do a Chi-square test to see, well, how does the expected count for each level compare between our historic data and our new data? And then also checking, what was the historic mode, and what is the new current mode? What we’re going to be doing is loading in this month one error delta path. We’ve actually created this delta table in the setup, which recreates some of these errors. We’re going to load that in and we’re going to create some summary statistics. This is actually using the same summary statistics function that’s used within the training function itself, which we then log out. And what we’re going to do is pick up the run from the current production model. What does that mean? Again, if I go back to my model registry, where I see this first version, and click into it: all the MLflow registry is really doing is tracking the lineage of models.
Within this, there is a pointer back to the original run that we first logged for month zero. Again, we see what the delta version was, what the delta path was, for month zero. And ultimately what we can do here is get the underlying run from the production model. We can then say, well, I can load the delta table, given that I know the delta path and the delta version, and then compute some summary statistics about that historic data that we used to train that previous model. The thing that we’re comparing here is: I have a new month of data coming in, but I want to compare that against the historic data that we previously trained on. Let’s start doing some checks. These are just going to be very basic checks, seeing what the proportion of nulls is in this new incoming month of data, and we’ve set this null proportion threshold. In particular, this check null proportion is a method that we have defined in our training setup notebook. Again, I’m not going to go through what is really going on in the underpinnings here, but we’re just going to loop through each of the different features and say, does the amount of nulls exceed a certain threshold?
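Fetching the run behind the current production model could look something like the following sketch, using the MLflow client API (the function and model name are illustrative and assume a configured tracking server, so the function is only defined here, not invoked):

```python
def get_production_run_data(model_name: str):
    """Return the (delta_path, delta_version) params logged on the run that
    produced the current production version of `model_name`.

    Sketch only: requires mlflow and a configured tracking/registry server.
    """
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    # Find the model version currently in the Production stage.
    prod_versions = client.get_latest_versions(model_name, stages=["Production"])
    run = client.get_run(prod_versions[0].run_id)
    # The custom params logged at training time travel with the run.
    return run.data.params.get("delta_path"), run.data.params.get("delta_version")
```

With the path and version in hand, the historic delta table can be loaded back exactly as it was at training time.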
This threshold we’ve just arbitrarily set 2.5 in a real world setting, you would want to kind of do trial and error as to what would be a significant amount of nulls to be able to flag just through a very kind of arbitrary threshold. And what we’re seeing is that neighborhood plans is displaying nulls, a kind of a proportion of nulls that exceeds 0.6. And now that would really mean that we would want to look a bit further into what is this neighborhood cleanse? Thinking back to what we’re trying to replicate here, some upstream data issues, we’re seeing that neighborhood cleansed, we had inserted missing values to these. We’ve really been able to catch that straight away. Secondly, what we’re going to do is apply some statistic checks, and this is going to say, what is my historic data? What is my new incoming data? And I’m going to basically say for my numeric columns, according to some statistical threshold limit, that is to say for all of the various statistics in this list, what I want to check is, does this new incoming data significantly increase or decrease by a certain threshold limit compared to what we were previously seeing?
Just to break this down: we’ve got summary statistics from this new incoming month, we’ve got the summary statistics that our current production model was trained on, and we’ve got some numeric columns that we want to loop through. In particular, the target column is one that we want to include in our analysis. And we’re saying, if anything exceeds this 50% bound of what we had previously seen, flag it to us. This is a kind of naive way of doing this; you could be smarter in how you implement it, but this is really just a basic check to see: what were the previous summary statistics, what are the new summary statistics, and is there a significant increase, just in terms of a very arbitrary threshold? What’s really standing out here? A couple of things stand out, in terms of the median and max for bedrooms and maximum nights having increased, but the things that would flag an error here, or really warrant further investigation, would be review scores rating and price. For review scores rating, the mean and the median have decreased substantially, from around 95 to 3.48.
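One naive way to implement that summary-statistic comparison (the helper name and the example values, echoing the review scores rating drop, are illustrative):

```python
import pandas as pd


def check_stat_deltas(stats_old: pd.DataFrame, stats_new: pd.DataFrame,
                      statistics, threshold: float = 0.5):
    """Flag (column, statistic, old, new) tuples whose relative change between
    the historic and new summary statistics exceeds `threshold` (0.5 = 50%)."""
    flagged = []
    for col in stats_old.index.intersection(stats_new.index):
        for stat in statistics:
            old, new = stats_old.loc[col, stat], stats_new.loc[col, stat]
            if old != 0 and abs(new - old) / abs(old) > threshold:
                flagged.append((col, stat, old, new))
    return flagged


# Historic vs new month summary stats (indexed by column, one column per statistic).
old = pd.DataFrame({"mean": [95.0], "50%": [96.0]}, index=["review_scores_rating"])
new = pd.DataFrame({"mean": [3.5], "50%": [3.48]}, index=["review_scores_rating"])
flagged = check_stat_deltas(old, new, statistics=["mean", "50%"])
```

Both the mean and the median blow through the 50% bound, which is exactly the signal that warrants a closer look.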
On the price front as well, what we’re seeing is that there is a significant change; there’s an increase across the board. What we would want to look at is, well, what are the distributions for some of these features that are displaying these significant changes? We’re going to create some box plots, which give a very quick way of detecting, visually, whether there is a significant difference. We see that review scores rating, which for our current production data was between zero and 100, is for this new incoming data between zero and five. This is a very quick test to say something is not correct here. We can further validate this by running our Levene test and our KS test. These are the tests that Chengyin walked us through: in particular, the Levene test to say, is there a significant difference in our variances, and the KS test with the Bonferroni correction to say, does the Kolmogorov–Smirnov test detect a significant change in those two distributions? Just to run through this, what we’re going to do is use the current production model, take the data frame that it used versus the current incoming month of data and the numeric columns to loop through, and basically ask, do you detect any difference?
And this P threshold is set at the top, and what that’s saying is: what is the significance level that I would like to set? Similarly with the Kolmogorov–Smirnov test, what we’re saying is: compare the historic data versus the new incoming data, what are the numeric columns that I want to loop through, and again, what is the threshold that I want to check against? And one thing that comes up here is, for the KS test with a Bonferroni corrected alpha level, we see that review scores rating has a significant p-value. What that’s saying is that the new review scores rating distribution has significantly changed from the previous one. This would very much warrant us investigating: well, what is the new distribution, how is that data generated upstream? At this stage there would be cause for concern, to see, well, could there be an upstream data issue? On the categorical feature side of things, we’re also going to implement a Chi-square test to see, well, are the levels of our categorical columns statistically different in terms of expected count? We see that host is superhost is significant, but only just.
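All three tests can be run with SciPy directly; a sketch with synthetic samples standing in for the historic and incoming data (feature names and counts are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Historic vs new samples for a numeric feature; the new data simulates
# the 0-100 rating being rescaled to 0-5.
historic = rng.uniform(0, 100, size=500)
incoming = rng.uniform(0, 5, size=500)

num_features = ["review_scores_rating"]  # numeric features under test
alpha = 0.05
# Bonferroni correction: divide alpha by the number of tests performed.
bonferroni_alpha = alpha / len(num_features)

# Levene test: has the variance changed significantly?
_, levene_p = stats.levene(historic, incoming)

# Two-sample Kolmogorov-Smirnov test: have the distributions changed?
_, ks_p = stats.ks_2samp(historic, incoming)
ks_drift = ks_p < bonferroni_alpha

# Chi-square test on a categorical feature's level counts (e.g.
# host_is_superhost: t / f), expected counts scaled to the new total.
historic_counts = np.array([300, 200])
incoming_counts = np.array([260, 240])
expected = historic_counts / historic_counts.sum() * incoming_counts.sum()
_, chi2_p = stats.chisquare(incoming_counts, f_exp=expected)
```

With these inputs the KS and Levene tests both flag the rescaled feature, mirroring what the demo surfaces for review scores rating.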
Again, this would warrant some investigation, but it’s not actually a result of anything that we have done or synthesized. The action here would really be, now that we have caught these issues, to work with our data team to ask: okay, has any upstream data processing significantly changed? In this case, what we’re going to replicate is: okay, we have resolved these issues for review scores rating and neighborhood cleansed, we’re happy with the changes made, and we now have this month one fixed delta path, which we’ll load. We’re going to then append that to our gold delta path. What we now have is this gold delta table with this new month of data, and we can now undertake a second month of training. For month one, we’re going to use the exact same model parameters again, and the same miscellaneous parameters.
And what you’ll note is we have this second run that has been added to our experiment. What I’ll first do, however, is register this model and then transition it to staging. And then I can show you what that looks like. If we, again, go through our experiments, this is just bringing me to a separate tab in the MLflow UI. And in particular, what we see is we have the second month, if I navigate into that, we have the delta path, the delta version, what was previously zero is now one, because we’ve used a new version of our table for month one, all the parameters is asked we’ve seen before. Metrics, what we’ll see is the test, the test R-squared is actually slightly different and we still have those same artifacts that we previously seen before. In particular, what we’ll now see in the actual model registry, if we navigate to that, is that we have a version two of this, which is nine staging.
And this is ultimately the thing that we want to compare. I have a model in production from our previous month, and I have a new model currently in staging. Can I compare those two models before I migrate the new candidate model to production? I should run it through a series of tests, in terms of validating, well, what are the predictions, and how does it perform against the current production model? There are a multitude of ways to actually do this. We’re going to apply a very naive test here, in saying that I have my current staging run, which I can get again because we can track the lineage directly back to the underlying MLflow run. I can get the underlying MLflow run entity, pick up the model that’s in staging, get that run and compare it against my current production run.
What this takes is two different MLflow runs. It’s going to apply a minimum model R-squared threshold: that’s to say, compare these two models according to this metric and check whether the new model exceeds the current one by a certain threshold; if it does, we’re going to say you can proceed with transitioning from staging to production. What we want to catch here is whether there is significant degradation in model performance, for instance something like prediction drift, where the predictions have significantly shifted from what we’ve previously seen. And ultimately, before we move a model through to production, we want to ask, well, how is our actual metric performing over time? Ideally you would want to track this over time. Our staging run’s R-squared score is not great, but we’re really just trying to show the methodology as to how you would orchestrate the models themselves.
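The comparison itself reduces to a simple threshold check on the metric; a sketch (names and values illustrative):

```python
def compare_model_perf(staging_r2: float, prod_r2: float,
                       min_delta: float = 0.0) -> bool:
    """Return True if the staging model's R-squared beats the production
    model's by at least `min_delta`, i.e. it is safe to promote."""
    return (staging_r2 - prod_r2) >= min_delta


# Illustrative values: the new candidate is marginally better.
promote = compare_model_perf(staging_r2=0.62, prod_r2=0.60, min_delta=0.01)
```

In practice the two R-squared values would be read from the respective MLflow runs’ logged metrics rather than hard-coded.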
We see that it is marginally better according to our threshold, which we set initially with the widgets at the very start, so what we’re going to do is transition our model to production, given that it’s now performing better. If I go back to my models in the registry, what we’ll now see is that version two has been migrated to production, and the previous production model has been archived. If I navigate into that, I can go back to my version two, and again, what that is pointing back to is my month one run, and ultimately to this underlying model itself that I can then load back in. Let’s move through to our last scenario, where in month two some new data arrives. In this case, what we’re replicating is a case where our price has significantly changed, but our features have not.
And what we’re trying to really synthesize here is some kind of seasonality. We’re saying some new month of data contains listing entries, which were recorded during peak vacation season. As a result, the price for every listing has been increased by some arbitrary amount. What do we simulate in here? We’re trying to simulate some label drift in the sense that previously our label had a certain distribution, but now it’s shifted over a bit and also concept drift in the sense that the underlying relationship between features and label has changed due to that seasonality that we’re trying to introduce. Coming back to what we were just discussing around different flavors of concept drift that recurrent concept drift that can happen in some kind of cadence.
We’re going to be applying the same feature checks as before. That incoming data, before we even start model training, we want to run it through a series of checks to try and catch before we instigate training. Is there any differences with the incoming new data? Again, what we’re going to do is load in this second month of data, we’re going to create some summary statistics about that. We’re going to get the underlying MLflow run from the current production model. We’re going to load in the delta table that was used to train the previous month of data. From that month one, and we’re then going to load in the summary statistics that we have from previous to four.
In particular, we’re going to kick off with our missing this checks again, what is the proportion of nulls? We see that nothing’s really well, nothing is exceeding this null-proportional threshold. Again, this is something that you could arbitrarily choose on the numeric side of things. What we want to check is for all of our numeric features, is there any glaringly obvious changes in the mean, the median and the standard deviation, the min and the max and notably what you’ll see is the mean and both the median price have significantly changed in terms of their fit. They’re greater than 50% higher than they were in the previous month. All across the board, this would be kind of ringing alarm bells to say, okay, I should look into this price because that distribution has really changed since what I’d previously trained the previous month model off.
Looking at a box plot of that, it quickly becomes apparent, for the features that did exceed that threshold, that where price was previously between zero and around 25,000, it has now vastly increased to between zero and around 75,000. That distribution, even just eyeballing it, has changed. We can apply some statistical rigor to that by applying some variance checks, and ultimately the KS test, which you would hope would detect that distribution shift. The KS test is basically asking: between these two distributions, is there a significant change? And we’re seeing that for price, there is indeed a statistically significant difference between those two distributions. At this stage, it would be cause for concern to say, okay, there is some difference in this data. Ultimately, what you could choose to do is go forth and say, okay, if I retrain the model, can I incorporate these new shifts in the data and learn something from them?
Additionally, what you could also do is embark on some new feature engineering, where you incorporate some seasonality features, to say, okay, this data was recorded during this month, et cetera, to integrate some seasonality information into your model itself. On the categorical side of things, we see that some categorical variables have statistically different expected counts. However, on looking at this, it wasn’t anything of concern; but again, it goes to show that sometimes you raise these errors and it might not necessarily be a significant change, but you should still address these things and treat them accordingly. The action here would be: okay, we’ve observed this label shift, and what we want to do is actually just retrain the model and see how our performance does.
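Those seasonality features could be as simple as deriving the month and a peak-season flag from a date column; a sketch, assuming a hypothetical recorded_at column (the column name and season months are made up for illustration):

```python
import pandas as pd

# Illustrative: derive simple seasonality features so the model has a
# chance to learn the recurrent price shift.
df = pd.DataFrame({"recorded_at": pd.to_datetime(["2021-01-15", "2021-07-04"])})
df["month"] = df["recorded_at"].dt.month
df["is_peak_season"] = df["month"].isin([6, 7, 8])  # assumed peak months
```

Richer encodings (cyclical sine/cosine month features, holiday calendars) follow the same pattern.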
We’re going to go forth and say, add that new month of data to our gold delta table, we’re then going to train a new model, albeit on month the field kind of months of data. We’ve got month zero, month one, month two, now coming in, and now we see that we have a new model in our experiment. Again, what we’ll do is firstly, register that model we’re then going to transition that model to staging just again to illustrate that I’ll come back to the actual registered models itself will be nicely that we have a version three in staging and applying those same model checks that we previously saw. We’re going to get the underlying MLflow run from the current staging model. We’re again going to apply the same statistic checks to say, well, how does the performance of the new candidate model?
How does its test R-squared compare against the previous R-squared value? We see that there is a significant under-performance here, albeit we’re comparing them against different test sets. We see that the test R-squared is significantly different, just in terms of a percentage reduction. Again, this would be very much cause to look in and investigate: well, why is the new model underperforming? And additionally, could I then address this? Can I retrain the model with different hyperparameters and do some hyperparameter tuning? Or could I incorporate some additional new features that would capture that seasonality? In this case, we’re saying the current staging model is underperforming, therefore investigate further. This really shows you the end to end process of how you can use open source tools such as MLflow and Delta to assist with reproducibility, and really capture, through time, being able to go back and say, okay, this is my model.
At this point in time, this was the version of data that I used. But then also, on the testing and monitoring side of things, you can use packages such as SciPy to say, okay, apply these tests to these datasets through time, and be able to say, okay, there is some change here, and there are some errors apparent in my data. Hopefully that shows you a simple walkthrough of how to incorporate these things, and I’ll pass it back to Chengyin to wrap us up with a conclusion. Thanks.

Chengyin Eng: Thanks, Niall, for the demo. Hopefully the concepts we covered in the slides became clearer through the demo. To recap, just remember that model deployment is not the end. Continuously monitoring and measuring your models in production is the key to ensuring that your machine learning models remain relevant and valuable to the business. Secondly, there is no one-size-fits-all solution to monitoring and testing. Hopefully you will be able to take our example framework and apply it to your own domain and use case, and you should also add other relevant tests to capture the metrics that you’re interested in.
Lastly, regardless of the use case you are building, always make sure that you are able to track and reproduce your model results. It is critical to keep a record of historic performance as a baseline, and in the event of a new deployment failure, your system should be robust enough to enable rollbacks. If you’re interested in learning more about this topic, we have included several literature references and also package options for your reference after the talk. Hopefully you have learned something helpful from our talk today. And don’t forget to give us feedback and tell us what your biggest takeaway from our talk today is. Hope to see you again.

Chengyin Eng

Chengyin Eng is a data science consultant at Databricks, where she implements data science solutions and delivers machine learning training to cross-functional clients. She received her M.S. in Comput...

Niall Turbitt

Niall Turbitt is a Senior Data Scientist on the Machine Learning Practice team at Databricks. Working with Databricks customers, he builds and deploys machine learning solutions, as well as deliver...