Building ML models is a time consuming endeavor that requires a thorough understanding of feature engineering, selecting useful features, choosing an appropriate algorithm, and performing hyper-parameter tuning. Extensive experimentation is required to arrive at a robust and performant model. Additionally, keeping track of the models that have been developed and deployed may be complex. Solving these challenges is key for successfully implementing end-to-end ML pipelines at scale.
In this talk, we will present a seamless integration of automated machine learning within a Databricks notebook, thus providing a truly unified analytics lifecycle for data scientists and business users with improved speed and efficiency. Specifically, we will show an app that generates and executes a Databricks notebook to train an ML model with H2O’s Driverless AI automatically. The resulting model will be automatically tracked and managed with MLflow. Furthermore, we will show several deployment options to score new data on a Databricks cluster or with an external REST server, all within the app.
Elena: Hey, everybody. My name is Elena Boiarskaia, and I’ll be presenting with my colleague, Eric Gudgion from h2o.ai. I’m excited to talk with you about how to accelerate your machine learning pipelines with AutoML and MLflow. Our agenda today is to talk about some challenges that you typically face with machine learning, and then I’ll propose a solution that integrates a couple of different machine learning solutions, as well as some solutions from MLflow and Spark. I’ll present a data scientist’s workflow with a live demo, and then we’ll hand it off to Eric to talk about DevOps and deployment of models.
Some of the challenges that we typically face with machine learning start with feature engineering. How do you select the right features? How do you transform those features into essentially a numeric matrix to create a model, and, furthermore, how do you transform those features effectively to get the most performance out of the data that you have? Ultimately, you may need to reuse those features that you’ve engineered further down in your pipeline. Feature engineering is really a huge effort. Then, when it comes to model training, you’re going to have to decide what scoring metric to use and select the algorithms that are going to work, which is non-trivial. You’ll need to perform some experimentation there, then tune those algorithms with hyperparameter tuning, and then ultimately decide: do you want to use ensemble methods to improve performance?
Perhaps you’ll want to be more considerate of the interpretability of the model, and so on. Lots of questions there. Finally, when you do come up with a good model that seems to perform as expected, you’re going to have to decide about model deployment. How do you track the experiments you’ve created? How do you decide which experiment to ultimately put into production? How do you keep that model object, which can be pretty complicated with all the feature transformers needed and all the different algorithms you’ve selected, ready for deployment? And ultimately, how do you deploy those models in different environments?
When it comes time to deploy models, that’s not where the process stops. You’re going to need that model to present results, maybe to the business, or really look at the model results in real time, look at some dashboards, and track the metrics of the model to see how it’s performing over time. Lots to think about there as well, because how do you basically live with that model once you’ve put it into production? Our solution today focuses on integrating H2O’s Driverless AI with MLflow and H2O Wave. I’ll be showing you all of the pieces shortly, but just at a quick high level, Driverless AI offers an automated machine learning platform that includes automated feature engineering and feature selection, and actually gives you some options for custom feature transformers, models, and scoring metrics as well.
It’s really an extensible platform that also includes quite complex, advanced feature engineering right off the bat. Through Driverless AI, you’re going to have a genetic algorithm that runs, which includes algorithm selection and hyperparameter tuning of those algorithms, and again, feature engineering and feature selection, and then ultimately provides a model that is explainable. It has an explainability component using Shapley values and others, including reason codes. Then it will produce a standalone model object. That model object, we call it a MOJO, and you’ll hear us refer to it. It contains all of these pieces in a standalone manner. It’s going to have the required feature engineering and the required algorithms for model scoring. Whatever data you put into Driverless AI, the model object will contain the necessary feature transformers as well as the model scoring piece to put it in production and score new data without you having to re-engineer features and so on.
The main part of the integration that I’ll be showing you is bringing Driverless AI MOJOs, the result of the Driverless AI run, into MLflow. Once we put that model object into MLflow, we get the full power of using MLflow to track the experiments, to look at projects, to share with the team, and ultimately to deploy models and register models for versioning and tracking purposes. Finally, that model object can also be used inside Wave, which is our solution for rapid Python-based app development to create visualizations and dashboards. It can be connected to real-time data streams and really produce beautiful results that can easily be shared with anyone, again, in this app-based environment. You’re really flexible there, whether you want the data scientists to benefit from this application or you want to share this application with the business user and just have them look at the dashboards that they need to see when it comes time to consume the results of this model.
I’ll be showing you the data scientist’s workflow from my perspective as a data scientist. Typically, the start of my end-to-end pipeline will be working with my data, and I will leverage Databricks Delta to store and manage that data and ultimately prepare it for modeling. The data preparation will happen for me on Spark using Databricks notebooks, which I’ll show you. Then I can send that data over to Driverless AI, which is the AutoML platform. Once that result comes in, I can log that model using MLflow. Then, in my environment, inside my Databricks notebook, I can take that model and score new data using Spark as a fully scalable solution. Then I can even update my Databricks Delta tables.
This workflow is a pretty typical workflow for a data scientist. However, sometimes I don’t necessarily want to write all of that code. My workflow is typically the same throughout. We’ve actually added an application through Wave that allows you to have a sample workflow that will set you up for much quicker experimentation. One option that we’ve built is to import the data, set up the Driverless AI experiment, again sending that data from Databricks Delta over to Driverless AI, and then generate a notebook that will run and log the experiment. This notebook can actually be sent directly to a Databricks cluster. I don’t necessarily have to write the code from scratch. I can directly get this Wave app to generate this code for me. It looks something like this, right? This is just our version of a Databricks notebook generator, where I say where my data is stored, which could be an S3 location or a Delta table query, specify the address of my Driverless AI instance, and then set up the experiment, so I can specify the target.
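As a rough illustration of the generator idea described here, the sketch below renders notebook source text from three form inputs. It is not the Wave app’s actual code: the cell templates, function name, and example values are hypothetical. The only convention taken from Databricks is the Python "source" notebook format, which starts with a `# Databricks notebook source` line and separates cells with `# COMMAND ----------`.

```python
from string import Template

# Cell separator used by Databricks "source" format Python notebooks.
CELL_SEPARATOR = "\n\n# COMMAND ----------\n\n"

# Hypothetical cell templates; a real generated notebook would contain
# the data import, experiment setup, and MLflow logging code instead.
CELL_TEMPLATES = [
    Template('data_location = "$data_location"'),
    Template('dai_address = "$dai_address"'),
    Template('target_column = "$target_column"'),
]


def generate_notebook(data_location: str, dai_address: str, target_column: str) -> str:
    """Render notebook source text from the values a user typed into the form."""
    values = {
        "data_location": data_location,
        "dai_address": dai_address,
        "target_column": target_column,
    }
    cells = [template.substitute(values) for template in CELL_TEMPLATES]
    return "# Databricks notebook source\n" + CELL_SEPARATOR.join(cells) + "\n"


# Example inputs are placeholders, not real locations.
notebook_source = generate_notebook(
    data_location="s3://example-bucket/train.csv",
    dai_address="http://dai.example.com:12345",
    target_column="default_flag",
)
print(notebook_source)
```

Keeping the templates as data makes it easy to add or reorder cells without touching the rendering function.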
This is all in a nice user interface. If I’m somebody who doesn’t write code or who prefers not to write code, I can actually leverage this to generate the notebook and go from there. Then it’ll generate this notebook and actually give me the option to specify my Databricks instance and send the notebook to it. Let me show you that live. Here’s an example of a generated notebook created through that Wave app. This code was generated for me based on my input parameters. It allows me to start from importing the data, whether it’s, again, stored in an S3 location, or I can connect through JDBC and actually run a query against a Delta table to get the data set that way. We offer a Python client and an R client as well.
This example is in Python. Connecting to Driverless AI directly from my Databricks instance allows me to interface with Driverless AI through code, and I can set up my experiment. I can set up my data path, view the data sets that are already available on that Driverless AI instance, and then set up the experiment run. It looks something like this, where I’m previewing the experiment and I’m seeing that we’re going to test a variety of models. In this example, LightGBM and XGBoost models will be tested. We’ll see whether creating an ensemble model gives us a good result or not. We may end up with a single model or an ensemble, and then here are some feature engineering transformers that will be tested. In this example, we’re trying several different categorical encoders, including one-hot encoding, frequency encoding, interaction terms, and weight-of-evidence transformers.
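To make two of the categorical encoders named here concrete, the following is a minimal pure-Python sketch of frequency encoding and weight-of-evidence encoding for a binary target. It is not Driverless AI’s implementation; the function names, the additive smoothing, and the toy data are assumptions chosen for illustration.

```python
import math
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


def weight_of_evidence(values, targets, smoothing=0.5):
    """Return {category: WoE} for a binary target (1 = event, 0 = non-event).

    WoE = ln(share of all events in the category / share of all non-events in it).
    `smoothing` is added to the counts so an empty cell never produces log(0);
    this particular smoothing scheme is an assumption, not a standard.
    """
    events = Counter(v for v, t in zip(values, targets) if t == 1)
    non_events = Counter(v for v, t in zip(values, targets) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    woe = {}
    for category in set(values):
        event_share = (events[category] + smoothing) / (total_events + smoothing)
        non_event_share = (non_events[category] + smoothing) / (total_non_events + smoothing)
        woe[category] = math.log(event_share / non_event_share)
    return woe


# Toy column and target, made up for the example.
colors = ["red", "red", "blue", "green", "blue", "red"]
targets = [1, 1, 0, 0, 1, 0]

color_frequency = frequency_encode(colors)
color_woe = weight_of_evidence(colors, targets)
print(color_frequency)
print(color_woe)
```

A positive weight of evidence means the category is over-represented among events, a negative one means it is over-represented among non-events.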
If we have text data, it’s going to use some text transformers. It really depends on the data set. It does not have to be fully numeric. It’s going to test appropriate transformers given the types of columns that your data set has. It’ll use a genetic algorithm to identify what works best given the optimization metric that I’ve chosen, which in my case is AUC. Once I set this experiment to run, I can actually go to my instance of Driverless AI. This is what it looks like. I can look at the experiments that I’ve created and actually interface with them using a point-and-click interface. It does have a nice UI, and this UI maps one-to-one with the Python client. I can always do either-or, and so once I set up this experiment, I can always revisit it.
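For readers less familiar with the AUC metric mentioned here, a small sketch of its pairwise definition follows: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. This quadratic-time version is for illustration only; the labels and scores are made-up toy values.

```python
def auc(labels, scores):
    """Area under the ROC curve via its pairwise-ranking definition."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(labels, scores)
print(example_auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```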
This is a visualization of the genetic algorithm that I was just mentioning. Each dot represents a single model. Here it’s a LightGBM model where 27 features were selected. We’re seeing if we can do better. We can look here and we see a LightGBM model with 28 features, and you can see the features changing in the middle of the screen. Each dot is a separate model, and we keep going along until we identify the best-performing model. The best-performing model has identified a couple of new features for me. Driverless AI created new features, which are interaction terms in this case. It found several interaction terms that would benefit my predictions. Again, I’m optimizing by AUC, and I can review the ROC curve and so on. I can do a lot of my diagnostics this way. This is, again, always available to me, whether I’m going through a notebook or directly through the UI.
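The general idea of a genetic search over feature sets can be sketched with a toy example. This is not Driverless AI’s algorithm: the fitness function is a hypothetical stand-in for training and validating a model, and the feature names and their values are invented. Each candidate is a mask of kept features, the best half of each generation survives, and children are produced by crossover and mutation.

```python
import random

# Hypothetical per-feature usefulness; a real run would train and score a model.
FEATURE_VALUE = {"age": 0.30, "income": 0.25, "zip": -0.05, "noise_1": -0.10, "noise_2": -0.08}
FEATURES = list(FEATURE_VALUE)


def fitness(mask):
    """Toy stand-in for a validation score of the model built on the kept features."""
    return sum(FEATURE_VALUE[name] for name, keep in zip(FEATURES, mask) if keep)


def crossover(a, b, rng):
    """Uniform crossover: take each feature decision from one parent at random."""
    return tuple(x if rng.random() < 0.5 else y for x, y in zip(a, b))


def mutate(mask, rng, rate=0.1):
    """Flip each feature on or off with probability `rate`."""
    return tuple((not keep) if rng.random() < rate else keep for keep in mask)


def evolve(generations=30, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.random() < 0.5 for _ in FEATURES) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]  # keep the best half unchanged
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
        best_per_generation.append(max(fitness(mask) for mask in population))
    return max(population, key=fitness), best_per_generation


best_mask, best_per_generation = evolve()
best_features = [name for name, keep in zip(FEATURES, best_mask) if keep]
print(best_features, best_per_generation[-1])
```

Because the survivors are carried over unchanged, the best score can never get worse from one generation to the next, which matches the picture of dots steadily improving over time.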
Here I can now look at the MOJO scoring pipeline, as I was mentioning. That contains the interaction terms needed for me to deploy this model, as well as the LightGBM model with the hyperparameters. This is that object that I can either download and deploy myself, or take back to my notebook and log through MLflow. Here’s the example of that. I take that model object, the MOJO pipeline. It’s an artifact of the experiment, and I log that artifact using the log model interface from MLflow. This is the main part of the integration that we’ve created. Then I can also actually log the metrics that are put out by the model. Here’s the example of that.
Which of the metrics to log is up to me. Driverless AI produces a lot of different metrics for me to review. Here’s an example where I wanted to keep more metrics to review later. You can see here that there are quite a lot of different parameters that were used to set up the experiment itself, and also different metrics. Even though I optimized by AUC, I also have access to many other metrics to compare my experiments. Then, here’s the model artifact itself. The pipeline MOJO is going to be stored as an artifact inside MLflow. Then, when it’s time for me to use this model as a data scientist, I may just stay in the same notebook, take a new Spark data frame, and use this model from MLflow to score it. It’s a very simple interface here. I just call the model that I stored in MLflow and apply it to a new Spark data frame.
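Since the choice of which metrics to keep is left to the data scientist, one simple approach is to filter the experiment’s metrics down to a chosen list before logging. The sketch below assumes the metrics arrive as a plain dictionary; the helper name and the metric values are hypothetical. The result is a flat name-to-float dictionary, which is the shape that MLflow’s `mlflow.log_metrics` accepts.

```python
def select_metrics(available, wanted):
    """Keep only the metrics named in `wanted`; fail loudly on unknown names."""
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"metrics not produced by the experiment: {missing}")
    return {name: float(available[name]) for name in wanted}


# Made-up values standing in for what an experiment might report.
available_metrics = {"AUC": 0.91, "LOGLOSS": 0.32, "ACCURACY": 0.88, "F1": 0.74}

metrics_to_log = select_metrics(available_metrics, ["AUC", "LOGLOSS"])
print(metrics_to_log)
```

Failing on a misspelled metric name, rather than silently skipping it, avoids runs that look complete but are missing the number you meant to compare.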
You can see my predictions here. This is where maybe my workflow as a data scientist will end, but once I’ve selected a model, that MOJO pipeline, I can actually also send it over to a Wave app. That’s the Wave app that I mentioned that we might use for, say, monitoring this model in real time. This is a drift monitoring app. It takes the model and looks through the different features that are inputs into the model. As they’re coming in, in real time, we’re reviewing them. How are they doing? Are there any alerts that I should be aware of? Is there some drift happening? These are dashboards that are built completely custom for use cases that are highly specific to the customer. This one is specific to the model that is underneath this application, and I can review the features and how each feature is performing as new data comes in. We can even set up alerts that let us know if there’s something that we should be aware of, or maybe even take a different version of the model and put it back in production.
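The talk does not say how the drift app measures drift. As one common option, the sketch below computes a population stability index (PSI) for a single numeric feature, comparing how training values and live values fall into the same bins. The bin edges, the floor used to avoid empty bins, and the sample values are all assumptions for illustration.

```python
import math


def bin_shares(values, bin_edges, floor=1e-6):
    """Share of values falling in each bin; `bin_edges` must be sorted ascending."""
    counts = [0] * (len(bin_edges) + 1)
    for value in values:
        # The bin index is the number of edges at or below the value.
        counts[sum(value >= edge for edge in bin_edges)] += 1
    # A small floor keeps empty bins from producing log(0) or division by zero.
    return [max(count / len(values), floor) for count in counts]


def population_stability_index(training, live, bin_edges):
    """PSI = sum over bins of (live share - training share) * ln(live share / training share)."""
    expected = bin_shares(training, bin_edges)
    actual = bin_shares(live, bin_edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Made-up feature values: the live data has shifted upward by 5.
training_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
live_values = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [3.5, 6.5]

psi_same = population_stability_index(training_values, training_values, edges)
psi_shifted = population_stability_index(training_values, live_values, edges)
print(psi_same, psi_shifted)
```

An alert could then be raised when the index for any feature exceeds a threshold chosen for the use case.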
This is where we can add all of these pieces. Now, I’ll hand it over to Eric to discuss the DevOps workflow for once the model is created and ready to be deployed.
Eric: Great. Thanks, Elena. When we take the models into production, we’re going to consider a couple of different things. The first is really how these models are going to be consumed. Would they be batch-oriented? Would we run them, as Elena showed us, through a notebook on the cluster, or would they be deployed in some external batch mechanism, or maybe they’ll be used in real time? The thing is that the model that Elena created can be used in any of those ways. We just have to pick the one that’s right for our type of workload. One of those considerations, of course, is the latency that we expect to deliver to the requester. For example, if it’s a real-time application, do we need to get predictions done within a second and give that response back, or do we have a batch window that might be several hours long?
The data size is important. When we look at using Databricks, we’re able to scale across the cluster and ingest massive amounts of data and score that on the workers. It’s important to consider what type of data you have and what those SLAs are. When we actually go into production, one of the things that’s really important is to understand if there are errors occurring at scoring time, or other types of metrics. How long is a prediction taking? Is the system up and running? Those types of things, and getting those back to the right people, getting them back to the data scientists so they can understand if the data’s changed, are some things that we’ll look at. We saw an example of the drift application that Elena showed us. That’s an example of where the data scientists can see if the data in production is different from what’s happening in the model in real time.
We start off by also auto-generating the notebook. The idea behind this auto-generation is to make it super simple to consume these models that have been created. We have several different ways we can consume a model, but the idea of importing the notebook and making it simple to use is a cornerstone of this. There are four steps that we’ll show you that we go through to deploy this.
The first way to actually do this is to take the notebook and then use the C++ library. You can see in the first cell here, we actually give you a list of different Python versions so you can install the module based on the type of scorer that you’re going to run. Then you load the data the same way that Elena did earlier. What we end up doing is defining the path to the model that we want to use, and then going ahead and executing it. Those steps are the same for all the models that we’re going to score in production.
The first one we refer to as the internal scorer, and this is where we’re going to use the power of the Databricks cluster to run this model. The idea here is that we can run and scale across those worker nodes and use our API through this notebook. You can do things like scheduling. You can do any of those operations that you can do in the notebook, just through that UI.
Once we have actually scored all the data, a data frame will be returned with the predictions, and they can then be used in downstream processing. Here’s the example execution. It’s fast. If we look at cell five, we can see that we scored just over 39,000 records in about 1.18 seconds. The predictions are returned in a data frame, so if you’re using the predictions later in your pipeline, they just look like a data frame that you can consume within your process. Everything that we’re doing here doesn’t change depending on the model that you’re using, because we specify the model. We will do all the processing and all the creation of these target features and the results for you, just by calling that function.
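To put the quoted timing in perspective, the arithmetic below converts it to throughput. It uses 39,000 as a rounded-down stand-in for "just over 39,000", so the result is approximate: roughly 33,000 records per second, or about 30 microseconds per record.

```python
records_scored = 39_000     # "just over 39,000" in the talk, rounded down here
elapsed_seconds = 1.18      # "about 1.18 seconds"

records_per_second = records_scored / elapsed_seconds
microseconds_per_record = elapsed_seconds / records_scored * 1_000_000

print(f"{records_per_second:,.0f} records/s")
print(f"{microseconds_per_record:.1f} microseconds per record")
```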
If you look at the other ways to score, the external REST API is another way to do this. This uses an external inference server that the Databricks worker can connect to. The reason that we’d use this instead of an internal scorer is that perhaps we have an inference server that is being used by other applications or other users in the organization. This gives you a way to have one simple model that’ll be consumed by all sorts of different applications. That central point also gives you the ability to call the inference server and pass a data frame, but also pass a list of models that you want to use.
That’s really handy when you want to do things like compare what my new model predicts versus my old model. You can now, in one really easy step inside of Databricks, make that call and get back the new predictions and the old predictions. It’s a really easy way to maybe test models before they go into production as well. Again, it’s the same idea. It returns a data frame with the predictions for each of the different models that you specified. You can use those further down in your pipeline.
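Once the old and new predictions come back side by side, a small comparison report can summarize how much the two models disagree. The sketch below works on two plain lists of scores for the same rows; the function name, the 0.5 decision threshold, and the sample scores are assumptions for illustration.

```python
def compare_predictions(old_scores, new_scores, threshold=0.5):
    """Summarize the disagreement between two models scored on the same rows."""
    if len(old_scores) != len(new_scores) or not old_scores:
        raise ValueError("both models must score the same non-empty set of rows")
    differences = [abs(new - old) for old, new in zip(old_scores, new_scores)]
    # A decision "flips" when the two scores fall on opposite sides of the threshold.
    flipped = sum(
        (old >= threshold) != (new >= threshold)
        for old, new in zip(old_scores, new_scores)
    )
    return {
        "rows": len(differences),
        "mean_abs_difference": sum(differences) / len(differences),
        "max_abs_difference": max(differences),
        "decisions_flipped": flipped,
    }


# Made-up scores for four rows.
old_model_scores = [0.10, 0.45, 0.80, 0.55]
new_model_scores = [0.12, 0.55, 0.78, 0.40]

report = compare_predictions(old_model_scores, new_model_scores)
print(report)
```

The count of flipped decisions is often more useful than the average difference, because it shows how many rows would actually be treated differently by the new model.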
Here’s an example of that execution. In the first cell, we have one model listed, the pipeline.mojo, but you could have other models as well. We will go and make those calls and then give you back the results. We see here that the performance, as you would expect, is a little different, but you’re able to get some additional flexibility by using these models in different parts of the organization as well, and have one single focal point to call. The nice thing here is that the model execution feeds back into all the model monitoring that we talked about earlier, things like drift detection and some of the advanced model monitoring, and this enables us to give all the visibility to the data scientists and also to the DevOps team that deployed the models.
The other way that we see customers use Databricks is that sometimes they’ll just use Databricks as a data source for a particular model that was created. An example of this is using Delta tables. If you have a big data set of tables that you want to use as Delta lakes, then you’re able to leverage those through a JDBC call into the warehouse, grab the data, score the data, and then write the results back either to tables directly in Databricks, or use the Databricks bulk uploader to upload those results back into the system. That might feed other downstream processes as well. Either way, you have multiple ways to consume the models that Elena showed us how to build, and to use them in all sorts of different pipelines.
Now, when we go to complete our set of scoring in our pipeline, there are a few things we want to think about. We’ve talked about monitoring to make sure the data’s okay, the systems are up, and we’re meeting the right latency goals for the deployment we want. What would you do with the results? We probably don’t want to update the existing table, right? That’s the gold copy of your customer file, for example. Probably what you want to do instead is create a new table and insert all the results with some key that you can do a join on. You can do all those types of operations inside of a notebook in Databricks to join those tables together. That means that the existing customer table is preserved.
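The pattern described here, leaving the customer table untouched and writing predictions to a separate table that joins back on a key, can be sketched with Python’s built-in SQLite module standing in for Delta tables. The table names, column names, and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse

# The "gold copy" customer table, which scoring never modifies.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Linus")],
)

# Predictions go into their own table, keyed by the customer id.
conn.execute("CREATE TABLE predictions (id INTEGER, scored_at TEXT, probability REAL)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [(1, "2021-05-01", 0.12), (2, "2021-05-01", 0.87)],
)

# Join the two on the key; customers without a prediction still appear.
rows = conn.execute(
    """
    SELECT c.id, c.name, p.probability
    FROM customers AS c
    LEFT JOIN predictions AS p ON p.id = c.id
    ORDER BY c.id
    """
).fetchall()
customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(rows)
```

Storing a scoring timestamp alongside each prediction also makes it possible to keep several scoring runs in the same table and age out old ones.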
It also means that you can age those things as well. If you use some of the features inside of Databricks to have different snapshots over time, you can refer back to them. One really easy way to do this, of course, is to use the bulk upload at the end of doing all the predictions, or, if you’re using the internal scorers, to actually do that directly to the data frame as a bunch of rows or a single row. Of course, it’s really important to always get that feedback on what’s happening with the data. Has the data changed? Is the model having a different set of transforms done to it at execution time? Are latency changes being noticed, things like that?
Recording those metrics back into the system, as Elena showed us using MLflow, is an ideal way to do that. If you have an external inference server, then using some of the drift detection side of MLflow is a way to do this as well. With that, I just want to show you an example of one of the tables that we sometimes create, where we store the results in an external table. You can see, we actually use the target variables as a way to save the predictions, but we have a key. That key is the ID that allows us to join it back to the original table, using a notebook inside of Databricks.
I’m going to pass it back to Elena now to wrap up the presentation. Thank you.
Elena: Thanks, Eric. Just to summarize, we have an integration with Databricks and H2O AI that offers a really end-to-end pipeline from data management all the way to model deployment, and actually consuming the model results. Through leveraging the power of MLflow and Driverless AI, we get both highly scalable model training and scoring as well as advanced automated machine learning features that include feature engineering, feature selection, and producing really highly accurate and explainable models. I’m very excited to share this with you. I do look forward to your feedback from our session. Thank you for attending.
As a Senior Solutions Engineer with H2O.ai, Elena is passionate about helping customers solve advanced data science problems while maximizing business value. Coming from a diverse quantit...
Eric is a Senior Principal Solutions Architect who is passionate about performance and scalability. Eric’s role enables him to help customers adopt H2O.ai within their enterprises.