For any organization whose core product or business depends on ML models (think Slack search, Twitter feed ranking, or Tesla Autopilot), ensuring that production ML models are performing with high efficacy is crucial. In fact, according to the McKinsey report on model risk, defective models have led to revenue losses of hundreds of millions of dollars in the financial sector alone. However, in spite of the significant harms of defective models, tools to detect and remedy model performance issues for production ML models are missing.
Based on our experience building ML debugging and robustness tools at MIT CSAIL and managing large-scale model inference services at Twitter, Nvidia, and now at Verta, we developed a generalized model monitoring framework that can monitor a wide variety of ML models, work unchanged in batch and real-time inference scenarios, and scale to millions of inference requests. In this talk, we focus on how this framework applies to monitoring ML inference workflows built on top of Apache Spark and Databricks. We describe how we can supplement the massively scalable data processing capabilities of these platforms with statistical processors to support the monitoring and debugging of ML models.
Learn how ML Monitoring is fundamentally different from application performance monitoring or data monitoring. Understand what model monitoring must achieve for batch and real-time model serving use cases. Then dig in with us as we focus on the batch prediction use case for model scoring and demonstrate how we can leverage the core Apache Spark engine to easily monitor model performance and identify errors in serving pipelines.
Manasi Vartak: Hi everyone. Good afternoon. My name is Manasi Vartak and I am the founder and CEO of Verta. We are a machine learning operations company, and today I’m super excited to be talking about a key part of machine learning operations or MLOps, which is model monitoring. As we put more and more models into production, we want to make sure that they’re working as expected, and that’s what I’ll be talking about today. In particular, I’m excited to highlight an upcoming launch of a new framework we’re releasing from Verta for model monitoring. And today I’ll be talking about how that can help you monitor your Spark ML workflows and other types of ML workflows that you might have. So before I go there, let me start off with a little bit of background about myself. I have a super technical background. I did my PhD in computer science at MIT.
I was actually at Spark summit in 2017 where I released ModelDB, which was a result of my PhD, the first sort of model management experiment management software. And that has now inspired a lot of other tools of that sort. Today I’m going to be talking about yet another tool and hopefully that one will be even more impactful than ModelDB was. So that’s a bit about me. At Verta, as I mentioned, we work on machine learning operations, which means we take models from training all the way to production, we run them there, make sure they don’t fall off the rails. And we do that at scale. We’re also super privileged to be serving models for some of the largest tech companies in the valley and companies that are in insurance and finance. And you’ll see some examples of the work that we’ve done as we go along.
So what will we cover today? I will start with why model monitoring? Why is now the time to be thinking about model monitoring? What is model monitoring? There are a lot of different terms used around monitoring. So I’ll define some of those. I’ll then describe the framework that we have built. Then of course, the demo, which is going to be the meat of this talk. We’ll monitor some cool pipelines in Spark ML, and look at the kinds of analysis that you can do there once you have monitoring set up. And then we’ll wrap up with some key takeaways. So with that, I’m going to leave that there. Do share your feedback. For me, it’s even more fun if you can put your questions and answers into the chat area so that we can have a conversation. And then of course, feel free to follow up with me after. All right, with that out of the way, let’s dig in.
Why do we care about model monitoring? ML models are used across all functions in a business. And I’m going to highlight a few that come from the financial sector because that’s kind of where the rubber hits the road. AI and ML are being used to make decisions automatically about where investments should be made. So these are your robo advisory firms. They decide where you should put money for your 401(k) and so on, hard earned dollars. ML is being used to detect fraud, for anti-money laundering, to make better lending decisions. And you’ll see an example of how when market dynamics change, the lending models also need to change. And then there is a whole new area around robotic process automation that tries to remove humans in the loop and optimize the processes. So here, this is just a sampling of finance companies, from pretty small startups all the way to Fortune 1000 companies, that are using AI and ML in really interesting ways.
So while these models are being used kind of every day, they don’t always work as expected. Here are two examples from the media. The first one is about lending decisions. So when COVID hit, the lending models used a particular feature, which is a FICO score, but with loan deferment and the stimulus packages, the FICO scores for a lot of people actually went up instead of going down. And so FICO ceased to be a good indicator of someone’s creditworthiness. And so the key lending models stopped working and then had to be reassessed in pretty short order. Another example that is probably very familiar to this audience is Apple Card. The credit limits that were assigned to men and women were wildly different. And it took a huge social media stir for Apple to actually see that this was going on. So those are just a couple of examples.
These are mainly from the media. So let me match them up with what we hear when we’re talking to AI and ML teams across the board. So here’s a few examples. The first one is from an ad tech company. They actually lost revenue of 20K in 10 minutes and had no idea how. That’s just mind blowing. If this was a database system or even a DDoS attack that was going on, you would be able to identify things pretty darn quickly. In their case, they had to dig through all kinds of logs and databases to figure out what the hell went on that caused that huge dent in revenue. The second example is from a major Silicon Valley company. And the observation from this team manager was that more and more… It so happens that now the people building models are very different from the people using them.
And so the software engineers who are using the models typically have no insight into how the model works or how it was built. And so it’s hard for them to develop an intuition for when a model is behaving as expected and what guard rails they would need to put into place. The final example here is from e-commerce, a big e-commerce vendor in the US. Models are often used to make pricing decisions for them. And so one bad decision on pricing can lead to a significant revenue loss. And so preempting bad decisions can actually save them a fair amount of money. So this is a sampling of what model monitoring translates to in the field and why it’s becoming so important as more and more models are used to run businesses. So how do we solve these problems? I’ve already alluded to it. That’s the topic of my talk, model monitoring.
And we’ve been doing monitoring in software for a while. There has been APM, there’s observability, now there’s logging software and so on. What’s special about models and how can we actually solve this problem? That’s where I’m going to go to next. What is model monitoring? There are a few different terms that are used interchangeably, whether it’s AI assurance or it’s performance monitoring and so on. So I’ll present how we think about model monitoring and what parts of monitoring are part of that definition and what parts are not. So at a top level, the goal of model monitoring is ensuring that model results are always of high quality. You’ll notice that this really talks about model results. There is a different set of tools that you would use to monitor service health, latency, throughput, all of that, but that’s not the topic of my talk.
Today I’m focusing on model result quality. And there are three things that we need to do for model monitoring in this realm. The first one is fairly basic, know when your models are failing. And this turns out to be super hard. And I’ll talk about each of these in turn. But once you first know that your model is failing, second, figure out why your model is failing. That’s kind of root cause analysis for you. And then once you do know why a model is failing, help with fast recovery. And so that’s the last bit. Let’s look at each of these in turn. So how do we know that a model is failing? If this was regular software and you had a bug in your software, you’re going to know there’s a stack trace, there’s an incorrect value that is getting produced and so on.
With models, it’s very different. And that’s for two reasons. One is, the way that we figure out if a model result makes sense is through ground truth. So we’re getting some input to the model, we’re producing the output, which is a prediction, but we need an external system to actually tell us whether that prediction was correct or incorrect. And most times that feedback is not instantaneous. It’s going to take you several weeks or months to figure out if your model made the right prediction. And in that time, you’re going to make tens of thousands, if not millions of predictions. And so you can’t wait that long to figure out if your model actually worked. That can be an additional signal you can put in once you get the data, but that’s too reactive and too late in the cycle to actually make a difference.
The other unique thing with models is that the model is often going to be one part of your decision chain. So after a model makes a prediction, that could be used as input to another data pipeline, maybe even involving some business or legal logic that’s going to figure out the ultimate outcome. And so a lot of times you actually don’t know how your model performed. And so you need to fall back to proxies where you can figure out, based on the data that you are seeing, what is your best estimate of whether the model will fail. And I already alluded to this a bit earlier when I said data. How do we guesstimate whether a model is failing? There are a few telltale signs. If you think about what a model is, you’re taking in training data. There are assumptions about the domain.
There are assumptions about the underlying data generating process that actually produced the model. And so the thing that you can actually monitor is the data, and an argument can be made that the data captures your assumptions. But putting that aside, let’s look at the training data. Let’s look at the test data. All of us have heard of drift. Here it concretely means that you’re looking at any data about the model that you can get your hands on and you’re going to figure out if that data looks different from what you expected. So this could be the output distribution. Are you giving more loans to applicants now than you were before? That should be a warning sign. You could be looking at the input here, so every feature across as appropriate to figure out if the distribution of the features has changed in a way that’s maybe alarming and could cause the model to have issues downstream. You could be looking not only at the raw data, but the featurized data.
So think about text analytics. You could have raw data, you’re going to produce N grams, maybe you’re going to embed it. You could actually be looking at the distribution of N grams. You could be looking at what your embeddings look like. And so if you introspect the model, there is a variety of different datasets that you can look at for distributions and see if those distributions have changed. So turns out that data is a pretty good proxy to know if a model is failing. And that’s an area that I’m going to focus on a ton as I describe the framework, but that’s a quick preview. To know whether a model is failing, we’re going to look at the data going in and out of it and see if it matches up with historical trends.
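As a rough illustration of using data as a proxy, here is a minimal Python sketch that compares the distribution of model outputs in a reference window against a recent window. The talk does not name a specific drift metric, so the total variation distance used here is an assumption, and the loan decision counts are made up for the example.

```python
from collections import Counter


def distribution(values):
    """Turn a list of categorical values into {value: relative frequency}."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(reference, current):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(reference) | set(current)
    return 0.5 * sum(abs(reference.get(k, 0.0) - current.get(k, 0.0)) for k in keys)


# Hypothetical model outputs: loan decisions in a reference window versus a recent window.
reference_outputs = ["approve"] * 30 + ["deny"] * 70
current_outputs = ["approve"] * 60 + ["deny"] * 40

drift = total_variation_distance(distribution(reference_outputs), distribution(current_outputs))
print(f"output drift: {drift:.2f}")
```

The same comparison could be applied to any input feature, to n-gram counts, or to bucketed embedding values, as long as both windows are summarized the same way.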
All right. The second challenge that I had mentioned earlier was once you know that a model is failing, how do you figure out why it’s failing? So this is root cause analysis. Going back to my setup earlier here, your model might be prediction three, but it actually turns out that there are three different pipelines that are feeding it. And so if you find that your prediction three is looking a bit different than what you expect, any of these boxes could actually be the reason why your model is actually failing. And so you need fairly sophisticated machinery to figure out which of these boxes is actually the root cause of the failure that you are seeing. And then the third part of what we think about as model monitoring is closing the loop.
Because model monitoring is great, it’s going to tell you, there is a failure, here’s where it is, but you want to be able to take action. And we think about this in two ways. One is, know about the problem before it happens so you can take action. So if there’s a missing feature and you detect it, you want to have some ways to compensate for it. Similarly, if we are in the pipeline situation from the previous slide, we can then… If we detect an error on the upstream data, then we can accurately stop the error from propagating downstream. The other part of closing the loop is integrating with the rest of the ML pipeline. So this means if we see an error, we want to retrain the model. If we see an error, maybe we want to go to the data labeling software or fall back to a previous version. So it’s not only important to detect an error, but we want to take actions to remedy it.
So that’s the third piece. So for us to do model monitoring, you need to know when a model is failing, figure out why it’s failing and then take actions to remediate it. You might ask, what’s the alternative? What do people do today? And we talked to a lot of folks who are using this type of setup. They have their model, it’s getting some input, it’s producing output. Those get dumped to logs because that’s the easiest thing to do. And then you have a set of analysis pipelines, whether in Spark, Airflow, or just SQL queries, that are going to produce reports for you. Now, this is okay when you have one model. Where it really breaks down is when you have lots of models. So one of our customers actually had upwards of 35 notebooks to just keep track of the results of these model pipelines, the quality of data and so on.
And the maintenance burden, every time you need to onboard a new model or you change a model, it’s just massive. In addition, if you think about the pipeline jungle from before, it’s difficult to get a global view of what’s changing if you are only looking at the data set or the model level. And the most important one, because time is probably our most precious resource, it took them about a quarter to get something reasonable set up and going, whereas an actual monitoring system can get you there in less than a day. So that’s the alternative that people are doing, but monitoring can be better. And to solve the monitoring problem and why I’m excited about it, there are some very meaty problems that we want to figure out. First one, how do you measure quality in the absence of ground truth? We talked about this before.
Second one is actually customization. Every model is different. It has different quality metrics. So what is a way to make your monitoring flexible enough so that it can be used for a variety of models? The third one is pipeline jungles. We’ve talked about this sufficiently. ML model lineage data pipelines are just very convoluted. And so you want to have good support for that. Accessibility, because more non-experts are using models, we want to make sure that any monitoring tool can actually help them get started and make sense of what the tool is generating. And the last one is scale, that’s where Spark really shines. A lot of the data sets you’re going to be using are large-scale data. And so Spark ends up being a very good fit for monitoring your ML pipelines. All right. So at this point, I’ve covered why you need model monitoring and what is model monitoring to us.
Next, I’m going to talk about the framework that we built at Verta to solve this problem. And as I mentioned, this is part of an upcoming release, and I’m going to put a link at the end of my talk, where you can go check it out. ML monitoring is fairly greenfield and we’re breaking new ground. So we’d love to get the community’s input on how the abstractions resonate, what features would be interesting and so on. All right. So we had a few goals when we built the system. The first one was to make it flexible. We want to monitor models running on any platform. We want to monitor any kind of model, not just relational, not just CSV or tabular models, but text models, image models and so on.
We also want to monitor data pipelines. We want to monitor batch and live models. They’re all models and part of the ML ecosystem, we want to monitor them all. Second, we want to make it customizable. If you want to use metrics out of the box, that’s great, but you might have some very particular model that needs a specific statistic, and you should be able to plug that in. And then finally, we want to close the loop because that’s where the rubber meets the road. Monitoring is helpful insofar as it helps you solve a problem. And so we want to make sure that we can automatically recover and also resolve alerts.
Okay. So how does the system work? So here’s a quick screenshot of the flow. I’m going to start at the left. You have a bunch of data generating elements. These can be models, these can be model pipelines. There’s also some process that’s capturing ground truth. We capture that data raw as well as statistics and we put it into a time series database that is built for statistics. So an example of a statistic might be a histogram, it can be your min-max values for a particular column, it could be n-grams for your text data and so on. All of them go and live here. You can configure how these statistics are created. You can configure alerts. And then once that data is in this database, we can do rich analytics on top of it. So you can visualize, you can debug, you can get notified when something happens. And of course you want to take automated actions.
So that’s at a high level, how the system operates. The most interesting bits for the audience here today are probably going to be, how are we capturing this information? What does data ingest look like? So I’ll cover that briefly and then we’ll switch to the demo. So there are a few key abstractions that we use. This is just your data. It could be a data frame, it could be a column as you wish. These could be mini-batches if it’s a real-time system. We have this concept called profilers. These are functions you run on your data to compute a statistic. So an example of a profiler, as I mentioned earlier, would be a histogram profiler, something that computes a histogram on top of your data. What that’s going to produce are summary samples. So these are statistics. An example would be one summary sample, maybe your histogram of age for the data from last week.
Another summary sample might be same histogram of age, but for the previous week. Another one might be histogram of age, but a year back. All of those would be summary samples and they fall into the bucket of the age histogram summary. So a summary is a collection of summary samples, and those summary samples are created by profilers. Now, as you have more data, you could define new profilers, you could use old profilers, you can apply multiple profilers to the same data as you wish, there’s a lot of flexibility. But once you have these samples in the system, then you can go off and do interesting analysis on top of them.
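To make the relationship between data, profilers, summary samples and summaries concrete, here is a small Python sketch of those abstractions. The class names, the 20-year bucket width, the dates and the ages are all hypothetical choices for illustration; this is not Verta’s actual API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List

# A profiler is any function that maps a batch of values to a statistic.
Profiler = Callable[[List[float]], Dict[str, int]]


def age_histogram_profiler(ages: List[float]) -> Dict[str, int]:
    """Count ages into fixed 20-year buckets (the bucket width is an arbitrary choice)."""
    buckets: Dict[str, int] = {}
    for age in ages:
        low = int(age // 20) * 20
        label = f"{low}-{low + 19}"
        buckets[label] = buckets.get(label, 0) + 1
    return buckets


@dataclass
class SummarySample:
    """One statistic computed over one window of data."""
    window_start: date
    statistic: Dict[str, int]


@dataclass
class Summary:
    """A named collection of summary samples, all produced by the same profiler."""
    name: str
    profiler: Profiler
    samples: List[SummarySample] = field(default_factory=list)

    def profile(self, window_start: date, data: List[float]) -> SummarySample:
        """Run the profiler on one window of data and record the resulting sample."""
        sample = SummarySample(window_start, self.profiler(data))
        self.samples.append(sample)
        return sample


# Hypothetical weekly batches of the "age" column.
age_summary = Summary("age_histogram", age_histogram_profiler)
age_summary.profile(date(2021, 5, 17), [23, 35, 41, 58, 62])  # the previous week
age_summary.profile(date(2021, 5, 24), [19, 22, 27, 31, 44])  # last week

for sample in age_summary.samples:
    print(sample.window_start, sample.statistic)
```

Keeping the profiler separate from the summary is what allows several profilers to be applied to the same data, or one profiler to be reused across many columns.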
Before I jump to the demo, I’ll just note that the same framework works, whether you are in batch or live. The only thing that really changes is how do you do aggregation of these samples? And we’ve come up with a few clever ways of doing that. In the demo, I’m going to be focusing on the batch setting for ease of use, but happy to discuss real time during the questions. Okay. So a bit of preview on what we’re going to do for monitoring Spark ML pipelines. When you think about a typical Spark ML workflow, let’s assume that you have a batch prediction pipeline with Spark. So we have data coming in, a bunch of ETL, we get a prediction. And assume that new data arrives every day. And at some point, this is the thing, the last prediction is what you care about. At some point you want to figure out what is going on with your prediction, is your prediction not behaving as expected? Is something going off in another part of the system?
And so we’re going to monitor a Spark ML pipeline and see what we can learn about telltale signs that the model is not behaving correctly. The specific demo setup that I’m going to be using is a CSV dataset. So it’s a tabular dataset. I’m going to apply three string indexers. I’m going to assemble the vectors. I’m going to apply a GBDT, and then we’re going to make a prediction. So fairly straightforward, and we’ll walk through some interesting analysis that we can do. So with that, let me switch to the demo. All right. Okay. So let’s dig in to the demo part of the talk. So what I have going on here is, I have some training data that, as I mentioned, is in tabular format. This is an insurance cross sell use case. So I have a bunch of attributes of customers and I’m figuring out whether to cross sell a new insurance product to them, and the response is yes or no, whether I should.
So let me get to the meat of it. I have some Verta setup going on here. Then I am adding Spark here, balancing the data to get some interesting results for the demo. Here I build a Spark pipeline. As I mentioned earlier, it’s a fairly simple pipeline. I have a gender column indexer, vehicle age indexer, vehicle damage indexer, there is an assembler and then a GBDT classifier. Tried LR, didn’t quite work, and therefore, GBDT it was. That’s a simple pipeline. I’ve then fit the pipeline to my Spark data frame. So that’s just your usual stuff, nothing monitoring related has happened quite yet. Now we get into the Verta monitoring framework. Here, I’m defining what we call a monitored entity. That’s just really some name and metadata that you are attaching to the thing that you’re monitoring. So I’m calling it the Spark monitoring demo.
And then I actually start monitoring the pipeline. So what I’ve done is I’ve defined this convenience function, which is monitor this. I’m giving it the fitted pipeline, the data that I’m going to be running through it, some metadata I want to associate with the statistics that are coming out of the pipeline, and then when was that logged? This could be monkey patched, but I actually want to walk through what the code is doing for it to make sense. So I have this little profile function written here, and what it’s doing is it’s determining the column type. So for continuous columns, it’s going to compute a continuous histogram, for discrete columns it’s going to do a binary histogram. For all the columns, it’s also going to compute missing values. And so depending on the column type I have chosen to profile my data in a particular manner. Depending on your model, you might choose to do it differently.
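The idea of choosing profilers by column type can be sketched in plain Python as follows. This is not the profile function from the demo: the function names, the rule for deciding that a column is continuous, the number of bins, the column names and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def missing_values(column):
    """Two-bucket discrete histogram: how many values are present and how many are missing."""
    missing = sum(1 for v in column if v is None)
    return {"present": len(column) - missing, "missing": missing}


def discrete_histogram(column):
    """Count occurrences of each distinct non-missing value."""
    return dict(Counter(v for v in column if v is not None))


def continuous_histogram(column, bins=4):
    """Equal-width histogram over the non-missing values; returns one count per bin."""
    values = [v for v in column if v is not None]
    counts = [0] * bins
    if not values:
        return counts
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # fall back to 1.0 when all values are equal
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the maximum lands in the last bin
        counts[index] += 1
    return counts


def is_continuous(column):
    """Assumed rule: a column is continuous if every non-missing value is a number."""
    present = [v for v in column if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )


def profile(columns):
    """Apply a histogram profiler chosen by column type, plus missing values for every column."""
    summaries = {}
    for name, column in columns.items():
        if is_continuous(column):
            summaries[f"{name}_histogram"] = continuous_histogram(column)
        else:
            summaries[f"{name}_histogram"] = discrete_histogram(column)
        summaries[f"{name}_missing"] = missing_values(column)
    return summaries


# Hypothetical batch shaped like the insurance cross-sell data (column name -> values).
batch = {
    "annual_premium": [2630.0, 40454.0, 33536.0, None, 28619.0],
    "vehicle_damage": ["Yes", "No", "Yes", "Yes", None],
}
for name, statistic in profile(batch).items():
    print(name, statistic)
```

On a real Spark data frame the same dispatch would typically be driven by the schema's column types rather than by inspecting values.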
So that’s very domain specific. And that’s part of the reason why we’ve built the framework as an API. You can pick and choose what profilers you want to apply to your data. So what monitor this is going to do is it’s going to run the training data frame through the fitted pipeline and it’s going to collect a whole bunch of summary statistics that we can then visualize and analyze. So suppose I’ve run this and it takes a while, so I’m not going to do that in real time. Let’s go look at the data that the system has captured. For Spark monitoring demo, here are what we call the samples that we’ve captured. And you’ll see, this is for a classification model. The column is vintage, and I’m computing the missing values. It’s telling me that it’s a discrete histogram. The buckets are something present and something missing, and you’ll see that across the board.
And so if we look at say annual premium histogram, which I know is a column in my data, I can actually go look at what the distribution of that column looked like over time. And the way to read this chart is, these are the buckets of your histogram and the X axis is time. So over time, it’s telling you, how did your distribution change or evolve? Let’s look at another one. This is the number of missing values of vintage. So you’re looking at how the missing values change over time. I see no change. Wonderful, wonderful. This is another example of the age histogram. And notice that all of these are generated just by running this monitor this function. So it’s taking in a data frame, it’s applying profilers and then creating this whole list of summary samples for us. So that’s the ingest part.
The data is only as useful as the alerts that it can help you define and manage. And so I’ve gone off and I’ve created alerts on every summary that I’ve defined, and I have defined a lot of summaries. So if you look here, I have 380 summary samples. I have actually built alerts on every summary and I have 134 alerts going on that I have defined. So it’s going to tell me when anything in my pipeline changes at all. So next, let’s go and actually run some data through the pipeline, and I’ll give you a sneak preview. I’ve actually sub-selected the data. So we see some interesting distributional shifts and that has to do with just what are the samples we’re sending in. So once I sent that in, I actually saw that I started getting active alerts. An active alert is something that is currently firing and it’s telling me something needs my attention.
And you will see that I already have 10 alerts on this fairly simple pipeline. Now, the alert that I really care about, because that is the output, is the prediction alert. And so that’s what we’re seeing here. And immediately you can see that there is a distributional shift, which merits my attention because I was predicting zeros and ones, and now I’m not cross-selling to anyone. So that’s terrible for my business. I want to figure out what’s going on. If you notice here, there’s 10 alerts. So the first thing I want to do is I want to go look at what’s the particular alert that I care about. So I have pulled it up here. I’m getting the specific alert. It’s telling me that it came from this monitored entity. When was it evaluated?
Notice that it’s a reference alert in our system. What that means is, you’re comparing the new distribution to an old golden distribution. So in this case, it’s comparing every new distribution to what it expects those distributions to look like. And I’ve put a threshold on that. And that’s how the alert is getting generated. I really want to know why it alerted though. It’s great to know that the prediction is off, but why is it off? That is where getting the context around the alert is very, very useful. There’s a lot to unpack in this function, but I’ll throw out the highlights. Because of the way that we have built the framework and the rich metadata gathering system underlying it, you can actually look at what are other alerts that are going on on that data set. So this is telling me not only is the prediction column alerting, the vehicle damage column is also alerting.
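A reference alert of the kind described here, a golden distribution plus a threshold, might look roughly like the following Python sketch. The class name, the use of total variation distance, the threshold of 0.2 and the prediction counts are all assumptions for illustration, not the framework’s actual implementation.

```python
from dataclasses import dataclass
from typing import Dict


def normalize(histogram: Dict[str, float]) -> Dict[str, float]:
    """Convert bucket counts into bucket shares that sum to 1."""
    total = sum(histogram.values())
    return {k: v / total for k, v in histogram.items()} if total else {}


@dataclass
class ReferenceAlert:
    """Fires when a new histogram drifts too far from a fixed golden reference."""
    summary_name: str
    reference: Dict[str, float]
    threshold: float

    def distance(self, current: Dict[str, float]) -> float:
        """Total variation distance between the reference and the current histogram."""
        ref, cur = normalize(self.reference), normalize(current)
        keys = set(ref) | set(cur)
        return 0.5 * sum(abs(ref.get(k, 0.0) - cur.get(k, 0.0)) for k in keys)

    def is_firing(self, current: Dict[str, float]) -> bool:
        """True when the drift from the reference exceeds the configured threshold."""
        return self.distance(current) > self.threshold


# Hypothetical golden distribution: balanced predictions on the training data.
prediction_alert = ReferenceAlert(
    summary_name="prediction_histogram",
    reference={"0": 500, "1": 500},
    threshold=0.2,
)

print(prediction_alert.is_firing({"0": 480, "1": 520}))  # small wobble around the reference
print(prediction_alert.is_firing({"0": 1000, "1": 0}))   # every prediction is now 0
```

The threshold is the main tuning knob: too low and every batch alerts, too high and a shift like the all-zero predictions in the demo could go unnoticed.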
And oh, by the way, if I look at the ancestors on this chain, which is my Spark ML pipeline, I can see that the previous dataset had alerts on vehicle damage and previously insured. If I go back one more step, I find that there are alerts on vehicle damage and previously insured. You’ll start seeing the trend. As you keep going further backwards, you see that previously insured is alerting, previously insured is alerting. Very quickly, you’ve been able to identify that, yes, you’re seeing the issue on predictions, but what’s actually causing the issue is all the way back here, which is in the data: there’s something off with the previously insured column. And if we look at that column, you’ll actually see that I’ve only sub-selected the previously insured users and that’s why it’s telling me that distributional shift has happened. So there’s a lot more to talk about here, but for the sake of time, I’m going to wrap up the demo.
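The walk back through the ancestors can be thought of as a graph traversal over dataset lineage. Below is a minimal Python sketch of that idea. The dataset names, the lineage map and the set of alerting columns are hypothetical stand-ins for the demo pipeline, and the function is not part of the framework.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical lineage: each dataset maps to the dataset(s) it was derived from.
parents: Dict[str, List[str]] = {
    "predictions": ["assembled"],
    "assembled": ["vehicle_damage_indexed"],
    "vehicle_damage_indexed": ["vehicle_age_indexed"],
    "vehicle_age_indexed": ["gender_indexed"],
    "gender_indexed": ["input"],
    "input": [],
}

# Hypothetical active alerts: dataset -> columns that are currently alerting.
active_alerts: Dict[str, Set[str]] = {
    "predictions": {"prediction", "vehicle_damage"},
    "assembled": {"vehicle_damage", "previously_insured"},
    "vehicle_damage_indexed": {"vehicle_damage", "previously_insured"},
    "vehicle_age_indexed": {"previously_insured"},
    "gender_indexed": {"previously_insured"},
    "input": {"previously_insured"},
}


def trace_alerts(
    dataset: str,
    parents: Dict[str, List[str]],
    active_alerts: Dict[str, Set[str]],
) -> List[Tuple[str, List[str]]]:
    """Walk upstream breadth-first and list every ancestor dataset that has active alerts."""
    trail: List[Tuple[str, List[str]]] = []
    seen: Set[str] = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        columns = active_alerts.get(current, set())
        if columns:
            trail.append((current, sorted(columns)))
        queue.extend(parents.get(current, []))
    return trail


trail = trace_alerts("predictions", parents, active_alerts)
for name, columns in trail:
    print(name, columns)

# The last entry is the most upstream dataset that is still alerting.
root_dataset, root_columns = trail[-1]
print("most upstream alerting dataset:", root_dataset, root_columns)
```

With a branching lineage, the last entry is only one of possibly several upstream candidates, so a real tool would likely report every alerting source rather than a single one.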
I’m happy to take questions about the demo after the talk. And of course, when folks try out the framework, do let us know if we can add other things here. So I’m going to go back to slides now. Awesome. So that was a very brief demo of our framework. I’m going to wrap up with some key takeaways and I’ll drop in a link for where you can try it out. If you take nothing else out of the talk, you should remember this: ML models are now driving key user experiences and business decisions.
And so we want to have guard rails in place that will let us know when the model is not behaving as expected. Model monitoring ensures that result quality is always high. And when done right, model monitoring can actually save you revenue, 20K in this case in 10 minutes. Identify failing models before social media does as in the Apple Card case and safely democratize AI. We are super excited to be releasing this framework for everyone to use. And so do sign up on that website and check it out. We would love to get the community involved and hear whether this framework can solve your ML monitoring needs. Thank you.
Manasi Vartak is the founder and CEO of Verta, an MIT spinoff building an open-core MLOps platform for the full ML lifecycle. Verta grew out of Manasi's Ph.D. work at MIT on ModelDB, the first open-so...