Productionalizing Machine Learning Solutions with Effective Tracking, Monitoring, and Management

May 26, 2021 04:25 PM (PT)

Intuit products increasingly rely on AI solutions to drive in-product experiences and customer outcomes (a realization of Intuit’s AI-driven expert platform strategy). In order to provide complete confidence to Intuit customers through reliable and predictable experiences, we need to ensure the health of all AI solutions by continuously monitoring, managing and understanding them within Intuit products. 

At Intuit, we have deployed hundreds of machine learning models in production to solve a variety of problems, such as:

  • Cash Flow forecasting
  • Security, risk and fraud
  • Document understanding
  • Connecting customers to the right agents

With so many models in production, it becomes very important to monitor and manage these models in a centralized manner. With very few open source tools available to monitor and manage ML models, data scientists find it very difficult to properly track their models. Moreover, different personas in the organization are looking for different information from the models. For example, the DevOps team is interested in operational metrics. Financial analysts are interested in determining the operational cost of a model and the legal and compliance teams might want to find if the models are explainable and privacy compliant.

At Intuit, we have designed and developed a system that tracks and monitors ML models across the various stages of the model development lifecycle. In this Summit, we will be presenting the various challenges in building such a central system. We will also share the overall architecture and the internals of this system.

In this session watch:
Sumanth Venkatasubbaiah, Senior Engineering Manager, Intuit
Pankaj Rastogi, Developer, Intuit



Sumanth Venkata…: Thank you, everybody for joining the session. So we hope you and your loved ones are staying safe from the ongoing pandemic. And we’re just glad that you could make it to the session.
I am Sumanth Venkatasubbaiah and I have with me Pankaj, my colleague. We are both part of the AI Services group at Intuit. We are focused on building capabilities to help with the various stages of the model development lifecycle. Our team is also involved in building general-purpose, scalable AI services to help accelerate the adoption of AI across the company, across the variety of products and services within the company.
All right, so here’s the agenda. We plan to cover quite a few things. We have structured the talk to begin by giving you some high-level context, and then we’ll get into the details of the system. We’ll be sharing our experience in building a variety of capabilities that help with the various stages of the model development lifecycle. We don’t want to be prescriptive. It’s just our take on how to solve some of the challenges that we have encountered at Intuit.
So, about Intuit, for some of you who may not know what Intuit does or who we are. We are the company that develops some of the most widely used financial software, both within the U.S. and outside the U.S. For example, within our portfolio of products we have TurboTax, QuickBooks, Mint, and Credit Karma. TurboTax is the most widely used tax preparation software within the U.S. QuickBooks is the accounting software geared mainly towards small and medium-sized businesses, accountants, and individuals. And we also have Mint and Credit Karma, which both serve the personal finance management aspects of our lives. Our mission is to power prosperity around the world. And the way we have decided to do this is with the help of an AI-driven expert platform. We’ll be touching upon some aspects of our journey so far in the next few minutes of the session.
All right, so I’ll go over some of the machine learning driven experiences within some of our products. For example, TurboTax is the tax preparation software. We have a variety of users, with a variety of backgrounds, using the software, and each user’s experience with the product is unique unto themselves. So we do a lot of personalization behind the scenes. For example, we use a variety of NLP-based models to provide contextual help for some of the questions that our users might have. For QuickBooks, there are some statistics that go like this: within the U.S., 20% of small businesses fail within their first year, 30% fail within their first two years, and up to 50% fail within their first five years.
Right? And one of the reasons why most of these small businesses fail is that they don’t have much clarity into the future cash flow of their businesses. So we use a variety of ML-driven techniques to help them understand their cash flow forecasts, so that they are successful in their businesses. Mint can categorize a variety of transactions from the accounts that are linked to Mint. These transactions are automatically categorized based on the type of business, for example a restaurant, an airline, or a movie. So we have a variety of ML techniques which enable us to automatically categorize these transactions.
At Intuit, we want to power prosperity around the world with the help of an AI-driven expert platform. This platform-centric approach is helping us solve many of the common problems that we encounter each and every day. For example, we have built a document understanding platform, which allows our users to literally take a picture of a receipt or form and upload it to our app. This platform can parse the forms. It can literally understand what’s in the receipt or the form, and then put it back in the application.
We could generalize this. We could reuse some of the capabilities built here within some of our other product offerings. We have an ML platform which is built mainly on top of Amazon Web Services by leveraging a variety of native or managed services. We have also built a lot of abstractions on top of AWS services, which primarily power the whole ML ecosystem within the company. We have 400+ models deployed in production, with 25+ trainings which are run on a daily basis. We have built a feature store, which houses machine learning features. We have close to 8,000 features there, with close to 15 billion updates every single day. And we use a variety of services like SageMaker and Kubernetes. We also use some non-AWS services like Argo Workflows, which was built by Intuit and open-sourced. And we leverage Spark and Beam for some of the data processing aspects of this journey.
The ML platform provides capabilities to train and deploy models. However, what we realized over a period of time was that there is a need to track the various stages of the model development lifecycle. Some of the lifecycle stages are exploration or data transformations [inaudible], the experiments that are run as part of training, or monitoring the predictive performance of the model once it’s deployed to production. And all these things have to be done in a consistent way. There are challenges in building the central system. For example, even though we have an ML platform, not 100% of the models are leveraging this platform, due to a variety of historical reasons. So for us to be able to track across all these models which are in production, there are a lot of challenges. As you can see, there are a lot of standards which are specific to each team.
There is a lot of variation in how each team accesses data. For example, data can reside in S3, or it can be in a database, or it can be in a variety of formats, like JSON or Parquet. The training environments can be different, and there are numerous model development and deployment environments as well. Although we’re trying to converge onto a single platform, we are not there yet, which makes it really challenging to build a central system. And why do we need a central system? Because we need consistency in how we track and how we collect some of the metadata.
If there is no consistency and metadata is collected manually, it doesn’t scale, and it doesn’t get updated by model owners. How often would you want to interrupt your workflow just to come and update some metadata on a regular basis? Our model owners don’t like it. So there has to be an automated way to achieve this.
So we started to think about how we can address some of these challenges. At the same time, we want our customers to have complete confidence in their product experiences. For example, let’s say there’s a transaction in Mint, and this transaction is at a restaurant. If Mint, for whatever reason, categorizes it as an airline transaction, then our customers will lose confidence in the system. For our customers to have this confidence in our systems, we have to build capabilities which let our data scientists, our model owners, have complete visibility into the various stages of the model development lifecycle. Also, there are a variety of stakeholders in this space. For example, execs, data scientists, and analysts each want to know different things. So we have to provide observability and visibility into all these stages. As we set out to build this system, these were some of the goals, and I hope you’ve got some context. Pankaj, my colleague, will explain our approach to solving these challenges in the next few minutes of the session. Thank you.

Pankaj Rastogi: Thank you, Sumanth, for taking us through Intuit’s journey towards becoming an AI-driven expert platform company. Hello guys, I am Pankaj Rastogi. I work as a staff engineer at Intuit. Let’s continue our discussion on AI monitoring and management. Why do we need it? Who is asking for it? What is the customer benefit? The last thing that you want to build in your company is a system that nobody wants to use. The answer to these questions depends on your role in the organization.
So let’s say you are a financial analyst in the budgeting group. Your goal is to plan for the next quarter and the next year. So you need to know how much it costs to train a model and how much it costs to host a model, so that you can effectively plan for the next quarter or year. If you are from the operations team, you want to know the CPU utilization and the instance type used for training or hosting these models. That way you may want to buy reserved instances for this instance type if they are in the cloud.
If you are a compliance officer, you want to make sure that your models are explainable. You don’t want to deploy a black box in production where you cannot explain the inference. If you are a consumer of these models, you want to know the uptime. Let’s say you have a model in prod which does not have four nines of uptime; you don’t want to use that model in your real-time scenario. Or you want to know the latency of the model. If the prediction from a model comes within a few milliseconds, then you may want to use it in a real-time scenario. But if the prediction takes a few minutes, then you can use it in batch mode. Executives in the org are looking for answers like: how many models are in production, how many models are in pre-prod, who is consuming these models, and how many calls are made to these models?
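As a rough illustration of the uptime and latency questions a model consumer asks, the sketch below works out how much downtime "four nines" allows per year and picks a serving mode from a latency figure. The function names and the 100 ms real-time budget are assumptions made for this example, not values from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability):
    """Minutes per 365-day year a service may be down at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def serving_mode(latency_ms, realtime_budget_ms=100):
    """Pick a serving mode from latency. The 100 ms budget is an illustrative assumption."""
    return "real-time" if latency_ms <= realtime_budget_ms else "batch"


# Four nines (99.99%) leaves roughly 52.6 minutes of downtime per year.
four_nines = allowed_downtime_minutes(0.9999)
```

A model answering in a few milliseconds would come back as "real-time" here, while one taking minutes would come back as "batch".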
Machine learning engineers and data scientists are always looking to tune their models. They are looking at various training runs to see how their training runs compare, and which one to deploy to prod. So, as you can see, there are so many personas in our company and they’re all looking for different answers. There are multiple solutions. Either you can build different solutions that address their individual concerns, or you can solve it for all of them in a single system. At Intuit, we are trying to solve this problem holistically. And that’s the reason we are building AI monitoring and management.
You can think of the system as consisting of three primary modules: track, monitor, and manage. In the track module, what we have done is we have built a Python package using the MLflow open source framework. It gives our customers the capability to push data into the central tracking server. We have provided APIs that are very similar to the MLflow APIs, where you can push hyperparameters. You can push model artifacts like your config file, your training file, or your model file, or other metrics that you want to collect during different stages. You can think of this module as a push-based approach, where the customer is responsible for pushing data to the central tracking server.
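A minimal sketch of the push-based pattern described here, assuming a client whose method names loosely mirror MLflow's logging calls. `TrackingClient`, `Run`, and the in-memory "server" are hypothetical stand-ins for illustration; they are not Intuit's package or the MLflow API itself.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Run:
    """Tracked data for one training run (hypothetical schema)."""
    run_id: str
    params: dict = field(default_factory=dict)
    metrics: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)


class TrackingClient:
    """Hypothetical client that pushes data to a central tracking server.

    The "server" here is an in-memory dict; a real client would send requests.
    """

    def __init__(self):
        self._runs = {}

    def start_run(self, run_id):
        self._runs[run_id] = Run(run_id)
        return self._runs[run_id]

    def log_param(self, run_id, key, value):
        self._runs[run_id].params[key] = value

    def log_metric(self, run_id, key, value, step=0):
        self._runs[run_id].metrics.append(
            {"key": key, "value": float(value), "step": step, "ts": time.time()}
        )

    def log_artifact(self, run_id, path):
        # Only the path is recorded here; a real client would upload the file.
        self._runs[run_id].artifacts.append(path)

    def export(self, run_id):
        """Serialize a run's params, artifact paths, and metric names as JSON."""
        run = self._runs[run_id]
        return json.dumps({
            "run_id": run.run_id,
            "params": run.params,
            "artifacts": run.artifacts,
            "metric_keys": sorted({m["key"] for m in run.metrics}),
        })


client = TrackingClient()
client.start_run("run-001")
client.log_param("run-001", "learning_rate", 0.01)
client.log_metric("run-001", "auc", 0.91, step=1)
client.log_artifact("run-001", "config.yaml")
```

The caller decides what to log and when, which is what makes this a push model: nothing reaches the tracking server unless the model owner's code sends it.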
The second module is the monitor module. This module consists of various monitoring services, like the model monitoring service, where we monitor the input features to the model and the predictions from the model. It also has webhook event handlers that are listening to events like Git commit and Git merge. And then we have schedulers that go to the data lake and pull the information that we are interested in. So now what do we do with the information that we have collected in our central database? That’s where the third module, manage, comes into the picture.
This module provides various insights to the customer and lets them take meaningful action. One example would be that the monitoring service detects model decay or data drift, and this helps the model owner decide whether to retrain a model or not. And he or she can take that decision in the manage module.
We want to track the model throughout its lifecycle. What that means is that we have to build a lot of integrations. In the early stages, when you are doing data exploration, you have to evaluate different data sources. You are looking for the statistical significance of different features. You are discarding features at this time. So you want to keep track of all of these decisions. A few months later, when somebody asks you why you picked a certain feature or why you did not pick a certain feature, you should be able to go back to that decision and know the reasons for it. Later on you move to the experiment stage, where you are experimenting with different ML frameworks or different algorithms. You want to keep track of all these experiments. And you will be doing testing and training later on, where you will be tuning your model. You want to compare these models. So you need to make sure that you have captured all this information in the central tracking server.
And then once the model is in prod, you want to make sure that you are monitoring the operational metrics, like your uptime and your SLA. And at some point you will start to monitor the features going into the model and the predictions coming from the model. So as you can see, our clients use the tracking module to push data to the central tracking server, and all the monitoring services also use the same Python package to push data to the central tracking server. So there is only a single way to push data to this database. And the way we retrieve data is through REST APIs. In our manage module we have built our own dashboards, which provide interesting insights to the users. But clients are free to use the same REST APIs to build their own dashboards.
So let’s go a little deeper into these modules. The first one is the model monitoring service. Let’s say you have a model deployed in prod which provides inference in a few milliseconds, and it is up 24 by 7. One of the features that it uses is an income field, and the upstream system sends you income as biweekly income. You train this model on this feature and deploy it in prod, and it is running perfectly fine, providing inferences in a few milliseconds. Now one fine day the upstream system starts to send this feature as monthly income. Your model will still provide inference. It will still be up 24 by 7. Would you call this a healthy model? No, because the inferences will all be wrong, since one of the inputs to the model has changed dramatically.
And that’s the motivation for building this service. What it does is compute various metrics on the input features and the predictions, like min, max, count, count distinct, and median, which help us detect such changes. This service is mostly config driven: our customers have to provide two config files, and it has two pipelines. The first pipeline is the data pipeline, which reads the first config file to learn which data sources to go to in order to read the input data.
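To make the income example concrete, here is a small sketch that computes the summary metrics named above for one feature and compares a baseline window with a current window. The sample values, the helper names, and the choice of the median ratio as the comparison are assumptions for illustration, not the service's actual logic.

```python
import statistics


def summarize(values):
    """Compute min, max, count, count distinct, and median for one feature."""
    return {
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "count_distinct": len(set(values)),
        "median": statistics.median(values),
    }


def median_shift_ratio(baseline, current):
    """Ratio of current median to baseline median; close to 1.0 means no shift."""
    return summarize(current)["median"] / summarize(baseline)["median"]


# Made-up sample data: the same people reported biweekly, then monthly.
biweekly_income = [1800, 2000, 2200, 2500, 3000]
# Assumption for illustration: monthly income is 26/12 times the biweekly figure.
monthly_income = [v * 26 / 12 for v in biweekly_income]

baseline = summarize(biweekly_income)
ratio = median_shift_ratio(biweekly_income, monthly_income)
```

The model would keep serving predictions in both cases, but the median more than doubles between the two windows, which is the kind of change these metrics are meant to surface.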
And the second pipeline is the metrics pipeline, which computes all the metrics that I just mentioned. Its output is integrated with our own anomaly detection service. So if we detect an anomaly in any of the metrics, we alert the user, and the user can then take some action.
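The talk does not say how the anomaly detection service works internally. As a stand-in, the sketch below applies a plain z-score rule to a daily series of one computed metric; the threshold, the function name, and the sample numbers are all assumptions for illustration.

```python
import statistics


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history`. A simple z-score rule, for illustration only."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # With a constant history, any different value counts as anomalous.
        return latest != mean
    return abs(latest - mean) / stdev > threshold


# Made-up daily medians of the income feature while it was reported biweekly.
daily_median_income = [2180, 2210, 2195, 2205, 2190, 2200, 2215]

alert_normal = is_anomalous(daily_median_income, 2198)  # an ordinary day
alert_shift = is_anomalous(daily_median_income, 4767)   # after a switch to monthly
```

Only the second value would trigger an alert, at which point the model owner can investigate the upstream change.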
In the tracking module, as I said, we have used MLflow. So if you look, the APIs are very similar to what MLflow offers. But we have made certain modifications to the API to meet our own needs, and some of them are listed here. We have made this service auto-discoverable so that any team within Intuit can use it. We have added authentication and authorization to all the APIs that we offer, using our own gateway server. And then the third one concerns the log artifact API in MLflow. It assumes that the client has the ability to write to the S3 bucket. So we have added IAM AssumeRole capability to our client code. That way this permission issue is handled automatically.
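One way the AssumeRole step could look in client code is sketched below. The helper name, role ARN, bucket, and file paths are hypothetical; the only AWS specifics relied on are the STS `assume_role` call with `RoleArn` and `RoleSessionName`, and the shape of the `Credentials` it returns.

```python
def assume_role_credentials(sts_client, role_arn, session_name):
    """Call STS AssumeRole and return keyword arguments for a boto3 client.

    `sts_client` is expected to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Usage with boto3 (not executed here; the ARN, bucket, and paths are placeholders):
#   import boto3
#   kwargs = assume_role_credentials(
#       boto3.client("sts"),
#       "arn:aws:iam::123456789012:role/example-tracking-writer",
#       "tracking-client",
#   )
#   s3 = boto3.client("s3", **kwargs)
#   s3.upload_file("model.pkl", "example-artifact-bucket", "runs/run-001/model.pkl")
```

Because the temporary credentials come from the assumed role, the caller's own identity does not need direct write access to the artifact bucket.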
And there are some changes that we are working on in this package, around how you package the client and server code. If you look at MLflow, the server-side code and the client code are packaged together, which makes it very heavy. I think it is around 250 MB. So we are looking at ways to make it a thin client. That way it is easy for our customers to use.
And we have also expanded the schema of MLflow. As I said, we want to track the model throughout its lifecycle. So we want to capture model metadata. If somebody is consuming this model, we want to capture the customer information, like who’s using this model. If there are multiple versions of the same model deployed in prod, we want to get metadata about these versions as well. And we want to capture whether the model has gone through the different review processes. That’s why we wanted to expand the schema, so that we can capture all this additional information.
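The kinds of additional fields mentioned here could be modeled roughly as follows. Every class and field name in this sketch is an assumption made for illustration; the talk does not show the actual schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelVersion:
    version: str
    stage: str  # e.g. "pre-prod" or "prod"


@dataclass
class ModelRecord:
    """Hypothetical record combining model, consumer, version, and review metadata."""
    name: str
    owner: str
    model_type: str
    consumers: list = field(default_factory=list)  # who is calling the model
    versions: list = field(default_factory=list)   # ModelVersion entries
    reviews: dict = field(default_factory=dict)    # review name -> passed or not

    def prod_versions(self):
        """Versions of this model that are currently deployed in prod."""
        return [v.version for v in self.versions if v.stage == "prod"]


record = ModelRecord(name="txn-categorizer", owner="team-a", model_type="multi-class")
record.consumers.append("mobile-app")
record.versions += [
    ModelVersion("1", "prod"),
    ModelVersion("2", "prod"),
    ModelVersion("3", "pre-prod"),
]
record.reviews["privacy"] = True
```

With a record like this, a question such as "which versions are live and has the privacy review passed?" becomes a simple lookup rather than a hunt across teams.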
And the third module, the manage module, is mostly a work in progress, where we present insights to the customer through various dashboards and let them take meaningful actions. Some of the actions are listed here. Let’s say you detect model decay or data drift. This will help you take the decision to retrain your model. The second would be, let’s say you have a few training runs done in pre-prod and you know which model to deploy to prod. You can pick that particular model and push it to prod. Or let’s say you have a model running in prod and nobody’s consuming it. Do you want to keep running that model? No. So you can decommission that model from this module. Or let’s say you have multiple versions deployed in prod, and now you want to retire one of the older versions. These are a few of the use cases that we are working on in our manage module.
As I mentioned, we provide insights to our customers using various dashboards, and all these dashboards are powered by REST APIs. A few of them are mentioned here. For example, we can give you a dashboard which shows the number of models by model type. Let’s say you are a data scientist and you’re planning to build a multi-class classification model. You can come to this dashboard and double click on it. You can find all the owners of these models. You can find out what features they have used. You can find out their training runs and their hyperparameters. Maybe you can partner with them. Maybe you can see that the problem that you’re working on is already solved by some other team. That is the reason for building these dashboards: they will help us speed up the development lifecycle.
And some of the changes that we plan to work on in the future are around ethical AI. Let’s say you have a model in prod and one feature that it uses is income. But in your training data, you do not have equal representation from all the income groups. Let’s say you’re missing out on low-income group data. When you train this model, it will not provide the right inference for those belonging to the low-income group. That’s what we want to address. And we want to make our models explainable, which means we capture the feature importance for each prediction. And if there is any personally identifiable information used for any model inference, we want to make sure that we have done a proper risk assessment. So these are a few of the things that we are going to work on in the future. If any of the work that we mentioned today excites you, please join us. We are hiring. And if you have any questions, do let us know. Thank you so much for joining today. Thanks.

Sumanth Venkatasubbaiah

Sumanth Venkatasubbaiah is a Senior Engineering Manager at Intuit, where he is responsible for building scalable AI services and capabilities aimed at helping Intuit to become an AI-driven expert plat...

Pankaj Rastogi

Pankaj Rastogi is a Staff Software Engineer at Intuit with 10+ years of experience in building frameworks for Big Data and Machine Learning pipelines. At Intuit, he is responsible for the design and d...