The presentation introduces Intuit AI Model Monitoring Service (MMS). MMS is an in-house Spark-based solution developed by Intuit AI to provide ongoing monitoring for both data (statistics of model input/output etc.) and model metrics (precision, recall, AUC etc.) of in-production ML models. The project is soon to be open-source. MMS aims to tackle multiple challenges of in-production ML model monitoring:
– So, before we begin, we just want to thank you all for attending this session. So we hope you are all staying safe wherever you are during these uncertain times because of the ongoing pandemic, we are glad you could make it to this virtual session.
All right, let’s get started.
So I’m Sumanth Venkatasubbaiah, I have Qingbo Hu co-presenting with me here. So we are both part of Intuit. For some of you may not know what Intuit does, but what the company is all about. So we are the company behind TurboTax, the most widely used income tax application software in the US. So Intuit is also the parent company of QuickBooks, the most widely used accounting software geared mainly towards small and medium-sized businesses, accountants and individuals. Both me and Qingbo are part of the Intuit AI services team. Our primary mission is to build scalable AI services so that we could accelerate the adoption of AI across our Intuit’s products and services, and help Intuit become an AI driven expert platform.
Here’s the agenda. So we plan to cover quite a few things. So we have structured the talk in such a way that we’ll go with a high level overview to begin with, and then dive deep into the internals. So most importantly, this being Spark and AI summit, so we will go over in-depth as to how we are leveraging Apache Spark to monitor our in production ML models.
So first let’s get started with machine learning at Intuit. So briefly, machine learning use cases at Intuit. So at TurboTax, as I mentioned earlier, so it is the most widely used tax filing software in the US. Our customers have a diverse background, and we understand each of our customers tax filing situation is different. So we want to but ensure the tax filing experience is completely personalized. And we use a lot of machine learning models to personalize this tax filing experience. Also, security, risk and fraud: Well Intuit being in the financial services industry, as you can imagine, we deal with a lot of highly sensitive data. As it turns out, there are a lot of bad actors out there as well, who are trying to commit fraud. So we heavily rely on machine learning models to keep these bad actors out of our systems and also to ensure our customer data is safe and secure. So we also have a lot of other ML use cases, I’m not getting the details of each one of these, but we have just for… And starters, we have a credit card recommendations, we have small businesses, cash flow forecasting, we use machine learning for image recognition, or text extraction.
Some of the features in our services is customers can take a picture and upload the receipts. In the backend, we have to extract information from these receipts and forms, the tax forms so that information can seamlessly be passed in the backend. It’s really exciting to see how machine learning is transforming our customer experience overall.
All right, so let’s talk a bit about challenges in monitoring the production, machine learning models. So first of all, why do we even need to monitor ML models once they’re deployed in production? Well, for starters, the input a data distribution can change for a period of time. And, hence, the accuracy of the machine learning models will change with the change in the input data as well. So we should monitor all the models in production as much as possible.
So just to give you a brief sense of ML landscape at Intuit. So it’s pretty diverse. So there are a lot of data science teams, as you can imagine, and each having its own standards and processes. So we are in fact, in the process of standardizing the entire model development lifecycle as we speak, but the current status, it’s pretty diverse, models are deployed across in a variety of ways. Some of them are batch models, some of them are online, the models are deployed in Kubernetes, are easy to just to name a few.
Also, all this leads to a lot of fragmentation, a lot of diversity in the corresponding underlying data as well. The model inputs, the labels and the prediction data. So our data is scattered all across various different locations, for example, S3, Vertica, Hive, you name it. And also, the file formats of these data varies a lot. The data can be in Parquet, can be in JSON, AVRO, tabular form. And also, there is this encryption. So because we Intuit deals with a lot of highly sensitive data, so we do encrypt most of our highly sensitive data, like sometimes the entire data set might be encrypted, or in some cases, certain fields within the data sets might be encrypted, or I mean case of highly sensitive data, both the outer data set as well as some of the internal columns might be encrypted. So this just shows how diverse the data is. And also, the labels are delayed. Sometimes it can take up to a few days to a few weeks for us to get the labels. Another challenge is the data quality. So most of the ML data is not clean. A lot of cleansing and transformation needs to happen before any of this data could be used effectively for machine learning purposes and for monitoring internally And lastly, the scale at which Intuit works, there are millions of customers millions of data points collected every single day, and the size scales to several terabytes, and it’s really hard to process all this data without a distributed framework. So because of all these challenges, so when we first encountered, how do we monitor all these models?
So as expected, because there is a lot of diversity, a number of Intuit teams had built all this point solutions to monitor their machine learning models. So this led to both parallel efforts, as well as the resulting code could not be reused. And it was not scalable in some cases, as well. So to address all these issues so our team, we went ahead and built this model monitoring service with the sole purpose of being able to continuously monitor the in production models for data drift and model DK for a period of time.
So MMS, as we call it internally, enables data scientists and model owners to configure their monitoring pipelines with a variety of model inputs, labels, and predictions completely in a configurable way, so that they don’t have to write any code to use some of the built-in components of MMS. MMS also provides a default metrics library, which could be used by the different data science, or the modern data scientists, or the model owners to compute a variety of metrics. And we have a suite of library, which is extensible, which could be contributed back to by the teams if they find some metrics are not already present or implemented. So there is also an API using, which these metrics could be extended. MMS provides a total abstraction for model owners from the underlying processing infrastructure. And we plan to open source soon. So the core components of MMS is built using Scala and Spark. And we plan to open source MMS soon.
All right, so let’s get a quick understanding of how MMS leverages Spark. So if you’re familiar with Spark, this is a no brainer, given all the challenges that we discussed in one of the previous slides. So Spark has native support to read a wide variety of file formats, including Parquet, JSON, CSV, and so on.
So we use Spark to infer the schema of the generic data sources. And also, we use Spark SQL, which provides a clear a query interface for any structured data format. So now to address scalability, as it turns out, Spark is highly scalable. So just for illustrative purposes, we’ve just put up a table here, which shows the scale of some of our sample data sets. This in no way compares to some of the production data sets that we have, but just for illustrative purposes. So if we were to run the processing jobs in an EMR instance, let’s say for example, M5 X large nodes, we could compute thousands of metrics for a variety of data sets, which has been millions and billions of rows with thousands of columns in reasonable tabs, and we could easily scale out or scale down these EMR notes to compute to have these computed within, or within our given SLS.
All right, so I now hope you have a brief understanding of what MMS is. But let’s do a deep dive of the internals of it, and Qingbo will walk you through the internals. Thank you. – Thanks so much for the introduction, brief introduction about what MMS is, and what the challenges Intuit AI is facing to monitor in production models. So next, I will dive deep into the architecture of the model monitoring service.
So to start with the customer of our services actually can be a data scientist, can be a machine learning engineer, or can be a data analyst. So in that case, I think the level of their coding skills ranges a lot. So that’s why when we start like design the system, we wanted to be starting from point that can only use like the config, and then using API to use the system. So basically how the user uses as a user interacted with our API layer, and then it will create a workspace of the modern monitoring service. And the different pipelines in the workspace will take the different inputs, and they also generate the metrics, which is the output and then store it. So there’s no coding required if you only use existing metrics or segmentation logic. I will explain what the segmentation and metrics means in the later part. And also, if you’re advanced user of the system, so you can use our standard programming APIs to create your customized metrics and segmentation logic. And then whatever you created there can be also seamlessly used in the system. So model monitoring service, the workspace consists of two types of a pipeline, first one is called the data pipeline, and then the second one is called the metric pipelines. So, in the later part, we will dive deep into each of them and see what they are and how do you use them.
So before you use the system, we asked this user to basically write their config file to register your model. So you can see here this is example of what the user have to register our model in the system. So if you take a look at it, it contains some model metadata. For example, like the models game, models version, which business unit that you come from, and also some additional like a tax all this different things. And also, it can contain some optional field for example, predictive value field. This is one configuration to tell the system which field in your data frame contains your predictive value. And also, the actual value field that it contains your label, right? But all these fields are optional ones. So when user provider config like this, the model monitoring service system will recognize each of the field and then register your model.
And then after that, model monitoring service system basically will create a two types of pipeline. One is called the data pipeline and another one is called the metric pipeline. I think Sumanth previously mentioned, one of the problems we see is that the data in order to monitor in production model the data come from different data sources, and varies a lot from different data ranges. So one of the problems when we build the system and trying to survey existing solutions, we find out there’s no unified way to both allow you to integrate different multiple data sources, and also compute the metrics based on it. That’s why we introduced the first pipeline called a data pipeline. So the functionality of the data pipeline is basically trying to load the multiple data sources, from different storages and in different formats. And it can come from different date ranges. For example, you can load today’s label, but then you load like a 30 days yours prediction result. In that case, you can join them, integrate them and then calculate your accuracy precision. Let’s say it’s a binary classification problem for this model, and then in order to integrate the data sources, so we provide an easy way for them to specify the logic, how to integrate all this different data sources. That’s why we’re using Spark SQL because SQL is a very widely used language in pretty much like in many different like people, and then the results of the integration cycle will be standardized and into a standard data format, and then store them in storage. So the standard data format allows that any output of the data pipeline will have the same schema. In that case, we can provide a standard API for user how to specify, how to calculate the metrics. In that case, all the metrics can be reused for any model. And the second pipeline is called the metric pipeline. So the metric pipeline is basically generated those metrics you need to monitor and then store them. So the metric pipeline can load from two different data sources. First one is the output of the data pipeline, which is essentially the integration result of your multiple input data sources. And then it can also load from the output of other metric pipeline, which means they can generate metrics based on previous generated metrics. So metric pipeline also support a metric dependency so you can generate the metric based on the other metrics within the same execution context, and also it can generate metrics from previous executions of other or maybe the same metric pipeline. So in addition to metrics, we also support segmentation logic. So that allows you splitting a global data set into multiple segments. And all the metrics used in the global data set can be reused in the segment. So you do not need to write any new code. And we also provide kind of this API to write a customized metrics and a segmentation.
So let’s first talk about the architecture of the data pipeline, how data pipeline works. So as you can see in this figure, basically it uses the user provided configs, trying to create a different data sources here. So for example, we have different data sources in S3 and then each of them will load one data set from the S3, and then user can specify the SQL statement. And then we will integrate all this data source, and then standardize them into a standard schema, what we call the relational event. The naming for that one is basically saying the data have certain internal relation between them after you integrate them. And then we store this standard data into either HDFS or S3.
So from the users perspective, how can they write data pipeline? So as I mentioned, there’s no coding required only configs. So what they need to write here in the config is basically a bunch of sources and also this integration cycle. So the first data source we listed here is actually representing the models, input and output data. So as we can see here, they need to provide the source type in this case it’s S3 data per se. And also the file format is a JSON. And then besides that, they can provide a database containing the date as the partition.
You also need to require that to give a name for this source. So in the later part in your SQL, you can use this name to refer to this source. As also you can see here for the model input and output, it also has a special field called a date range. So this is saying basically, I need to load the one days ago model input and output data. So this is a way to integrate the data sources from multiple data ranges. So in the second data source is let’s suppose that’s your label. So it’s your newly created label from today, but previously, we loaded one day so because a model input and output. We also loaded the labels collected by today, as well as some additional dimensional data, which is not a part of your models input or output, but it can be essential to analyze your models performance.
So later in the final part is the integration cycle. So this cycle is basically we use the entity-id this field from all these three sources, and then join them. And from the models input output source, we take out its output let’s say one field is called output, which is a pulling value. And from the labels this source we take another pulling value, let’s say that’s the label. And also, from the dimensional data, we have integer field called account age, which is the age of the account let’s suppose that’s the field that we need to analyze the data.
And then after we join, the result of this integration signal will be standardized and stored in the storage and then let’s take a look at the…
In the next step what the metric pipeline will look like. So the metric pipeline, where essentially we can have multiple metric pipeline, they can load from the relational event, which is the output of the data pipeline, can also load it from previous result of other metric pipeline, which is called a monitoring event in this case, so basically can load from these two types of data sources,
and then it can go to the metrics need to be monitored through the configs user provided, and then register all these metrics need to be monitored and then store them.
And for each of the metric pipeline, so basically, it looks like this. So either of the two data types we load, we actually generate this purple figure here, the purple shape here is basically to generate the view based on the data structure. So the view is more user friendly, which is used in the API. So you can write your own metrics. And each of the orange boxes here is actually workflow. And each of the module in the workflow can be either a metric module, which is to generate some metrics to be monitored and also can be a segmentation module. So as you can see here, it’s in sequential fashion, which means the workflow will be executing in sequentially and this ensures any metrics get generated from previous module can be seen from the later modules. So this ensures metrics dependency within the same workflow can be resolved. So besides the metrics you can see another one is the segmentation module, which is the middle orange box here leads to different branches. So this will actually separate the relational event,
the output of your model into different segments, and each of them will have the same data structure. So that’s why you can reuse all your metrics in your global data set. Also, in each of the segment.
So let’s give an example what the user needs to write for a metric pipeline. So as you can see here, they can define this pipeline, which to specify a source. So in this case, is basically saying I need to load from the latest result from the data pipeline, which is a model event source, and name in the next step, so they can define a workflow, which contains multiple modules. In the first module in this case, we’re actually generating a metric called the confusion matrix for binary classification problem.
And besides the module type equals to metric, which defines this is a metric module, and also the class name to define where is the logic for this metric, it can also allow the user to further configure this, it’s basically in this config field of this module. So as you can see here, user can provide a sequel like a condition to tell the system, what is a true positive example, and what is a false positive example. In our previous example, we see that we take out the pulling value of the prediction value, and also the pulling value of the labels. So as you can see here, if there’s a true positive, the prediction value should equals to true and the label should also equal to true something like that. So in the second module, we calculate the accuracy after this confusion matrix. So as you can see, there’s no configuration need to be set up here. The reason is that it can also take advantage of the metric dependency. So basically this module will be looking for the result of the confusion matrix. And they automatically get all this count, all these entries in the matrix and then compute the accuracy for you. So this saves you a lot of time because you do not need to recalculate all this counting the confusion matrix in order to generate the accuracy.
In the third module, we listed here as a segmentation. So this segmentation logic is based on a numerical field, which is the account age, the account age field we extracted from our dimensional data. So we further split our data set into six different buckets according to the account age, for example, it can be account age less than one and also between one, two, three and all the way to any account age having a value greater than nine. And then inside this module, you can define a sub workflow which specify what are the metrics you need to generate for each of the segment. So as you can see, what we can do here is basically copy paste all the confusion matrix definition, the configuration and the accuracy configuration directly here. So they have the exact the same configuration. In that case, we can generate confusion matrix and the accuracy for each of these segments again. So you do not need to rewrite your code on anything basically, they can be reused.
I think that covers all the content we want to introduce today. Thank you very much for attending the session.
Qingbo Hu is a machine learning engineer from Intuit AI Engineering team and the acting tech lead for Intuit's Model Monitoring Service project. Before joining Intuit, he worked as a data scientist in LinkedIn. Qingbo earned his Ph.D from University of Illinois at Chicago, majoring in computer science.
Sumanth Venkatasubbaiah is a Senior Engineering Manager at Intuit, where he is responsible for building scalable AI services and capabilities aimed at helping Intuit to become an AI-driven expert platform. He has previously built Big data and ML systems at Verizon and Apple. Sumanth holds a Masters in Computer Science from the University of North Carolina at Charlotte.