ML development brings many new complexities beyond the traditional software development lifecycle. Unlike software projects, ML projects cannot simply be abandoned once they are successfully delivered and deployed; they must be continuously monitored to verify that model performance still satisfies all requirements. In most ML use cases, we have to deal with updates to our training set, which can influence model performance. In addition, most models require certain data pre- and post-processing at runtime, which makes the deployment process even more challenging. In this talk, we will show how MLflow can be used to build an automated CI/CD pipeline that can deploy a new version of the model, and the code around it, to production. In addition, we will show how the same approach can be used in the training pipeline, which will retrain the model on the arrival of new data and deploy the new version of the model if it satisfies all requirements.
– Hi everyone. Today Michael and I will be talking to you about running continuous integration and delivery on your ML pipelines using a new open source tool, CICD Templates from Databricks Labs.
So Michael and I both work on an internal analytics team within Databricks.
So today we’re gonna be talking about three things. First of all, give a little bit of background into what sort of challenges ML teams face when trying to build out robust pipelines. Next, I’ll be introducing CICD Templates, the new open source project from Databricks Labs. And then finally, I’ll pass it over to Michael to share a demo of how this all actually works together. He’ll be walking you through how to get started with CICD Templates and then showing you what the end state looks like on an end-to-end ML pipeline.
So I wanted to start with this quote from Sato, Wider, and Windheuser from last year. They had this great article talking about what they called CD4ML, or Continuous Delivery for Machine Learning. And I think it really encapsulates a lot of the problems and goals that teams run into when trying to actually build out machine learning pipelines at scale. So one of the big challenges teams run into is that unlike traditional software engineering, where you want to separate out code, data, and models, with machine learning pipelines you actually have all three of these things combined, and separating them out leads to real problems. A big reason for that is because with machine learning, unlike traditional software engineering, code, data, and models are all coupled, basically as a feature instead of a bug. By changing one, you change all of the other components. So making sure that you have a system that addresses that coupling explicitly is a critical difference between Continuous Delivery for Machine Learning pipelines and traditional CICD. So ultimately, the goal here is how we can get machine learning running on live customer data in a way where we have tests to make sure that the machine learning code is working exactly the way we think it is. And if there are any sort of problems, we can revert back to a working build, as well as add small, safe increments of new code or new models that can be tested and then reproduced over time.
So what are the challenges that ML teams run into when actually trying to implement a robust pipeline?
The first one is that if ML teams are used to traditional tooling, then that tooling usually relies on executing tests, including unit tests and integration tests, in a single virtual machine. But when working with models that use GPUs, or when working with large distributed data sets, oftentimes a single VM doesn’t cut it. So teams will turn to platforms like Databricks that give you distributed, scalable computing in order to work with these large data sets and these more complex models. This works really well for basic prototyping and data engineering pipelines. But as systems get more and more complex, there’s a need to bring back a lot of those traditional CI tools, as well as to incorporate local IDEs above and beyond notebooks. Essentially, teams have this problem where they feel like they have to choose: either they get their traditional CICD tools, but those fail to deal with scale effectively, or they choose things like Databricks notebooks, which are useful, but then don’t give them a lot of the benefits of the traditional CICD tools. So we actually ran into this problem internally, building out ML pipelines, which led us to come up with the idea for CICD Templates.
So how does this actually work?
The idea behind CICD Templates is really to help people merge traditional CI workflows while still being able to use Databricks as an abstract compute resource. So what I mean by that is, CICD Templates allows you to use your existing tooling, but then all of the tests and deployments run directly on Databricks. In general, when people are trying to solve this problem, they run into three issues. The first is how to actually scale and reproduce their code. The second is how to fit things like Databricks in with their existing workflow. And the third is how to put all this together without having to write a ton of glue code and difficult-to-maintain structural code. So CICD Templates really aims to solve all three of those problems by giving you a best-practices template that already has all the integration set up between GitHub Actions and Databricks, as well as an easy way to personalize it for your project. And then it has a bunch of hooks into things like MLflow, so you can track progress either on GitHub or on Databricks.
How this works in practice is that a user would start by running the `cookiecutter` command, followed by the CICD Templates GitHub repo. This would download it from GitHub and then ask you a series of questions, like the project name, the authors, and where you’d like your ML experiment to live, for instance. The second step is that people would put their Databricks host and token into GitHub secrets. The third is that you take your recently created local project, initialize git, add the code, and then commit and push to your new repo. And that’s it. After that, what you’ll see is that as soon as your code touches your GitHub repo, GitHub will automatically kick off some tests, as indicated by a yellow light next to your recent commit. Those tests will run the unit tests locally inside a GitHub VM, and then it will take the integration tests and run those on Databricks. Upon success, it will return a green checkmark, or if there’s a test failure, it will return a red cross.
On the Databricks side, we’ll see the following things happen. So first of all, in that MLflow experiment location that you previously specified (or the default, if you chose that), you’ll see that it has pulled all of the code from your repo, packaged it up as a wheel, and then saved it as an MLflow artifact on the Databricks file system. This allows Databricks jobs to access that custom code as a library. The next thing you’ll see, as indicated in the bottom picture, is that an ephemeral job is started on Databricks, and the wheel containing all of your repo code and pipelines is then installed on that job cluster. Then the pipeline runner inside the CICD template is called and your code is executed, including all of the tests. If those tests pass, then the job run is successful, and that information gets passed back to GitHub so that it can show the green checkmark.
Basically, because we’re using GitHub Actions here, any sort of custom trigger that you’d like to define is totally valid. But to start, we’ve defined two: one is on a push, so when someone pushes a commit to the repo, and the second one is when someone wants to package up the repo for a release. When someone pushes code to the project on GitHub, what happens is, first, all the unit tests are picked up and run within the GitHub VM. If the tests are not successful, then it will stop and just tell you that something went wrong and you need to fix it. But if they are successful, the next step is that Databricks CICD Templates will package up your code as a wheel, send it off to your Databricks workspace, and log the wheel as an artifact to MLflow. Next, it will kick off tests that run through the various DEV tests and integration tests, pulling the wheel artifact and installing it on clusters. If those are all successful, then it will pass that information back to GitHub and you get your green checkmark.
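To make the push flow concrete, a GitHub Actions workflow of roughly this shape could drive it. This is an illustrative sketch, not the exact file the template generates; the step names, Python version, and the deployment script name are assumptions:

```yaml
name: CI on push
on:
  push:
    branches: [master]
jobs:
  ci-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.7'
      # Install the project dependencies inside the GitHub VM
      - run: pip install -r requirements.txt
      # Run the unit tests locally; a failure stops the workflow here
      - run: pytest tests/
      # Package the repo as a wheel and run the DEV test pipelines on
      # Databricks, using the secrets configured on the repo
      - run: python deploy_dev_tests.py   # hypothetical script name
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

The key property is that only the last step talks to Databricks; everything before it is the ordinary local CI stage.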
A similar process happens when you do a release. So the first step is you create a release on GitHub; it will run the local tests. If those are all good, it packages the wheel, pushes it to Databricks, and runs your integration tests on Databricks. And if that’s all successful, then it will package up your job config and push that job to Databricks. Now, in the job config you can specify things like a recurring run, daily or weekly. And once you do that, your job is now running in production.
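For the scheduling part, the deployed job configuration could look something like the JSON below. The field names follow the general shape of a Databricks job configuration, but the job name, cron expression, and cluster values here are illustrative assumptions, not the template’s actual file:

```json
{
  "name": "lendingclub-consumer-pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 5 * * ?",
    "timezone_id": "UTC"
  },
  "new_cluster": {
    "spark_version": "7.3.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2
  }
}
```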
So next, I’d like to pass it over to Michael, who will take you through a demo of how this actually works. And then he’ll drop into an actual working end-to-end ML pipeline to show how you can implement MLOps, including model serving and model monitoring at scale, using these pipelines. – [Michael] And now I’d like to show you a short demo, where I will show how you can bootstrap your data project using CICD Templates, how you can implement CICD pipelines on top of that using GitHub Actions and CICD Templates, and run your integration tests on Databricks. So let’s get started. I will use cookiecutter for bootstrapping the project using CICD Templates.
So we can just type cookiecutter in the command line, and then we can add the GitHub URL of the CICD Templates repository. And now we have to answer a couple of small questions. So I will enter here the name of the project
and then I have to answer a couple of other questions. So I will select here the Amazon cloud, because I would like to test everything on Amazon. And we are pretty much done. So we have created the template.
And as you can see, we have now a new folder, named after our project, where the project skeleton is stored. So let’s go inside,
and let’s see what we have inside. So as you can see, we have here the folder that was created for the Python package. So it’s going to be the place where you can develop all your logic. Here you can create a lot of sub-packages, and you can place the model training code, the data generation pipelines, feature engineering, and all other things. And we have a pipelines folder, where we have two dummy pipelines, so two skeletons for our pipelines. And a pipeline means something like a job, something that you can independently schedule and run. And here we have our entry-point script, pipeline_runner.
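Conceptually, an entry-point script like pipeline_runner just maps a pipeline name to code in the package and runs it. The sketch below is a simplified stand-in under that assumption; the pipeline names and stub functions are hypothetical, not the template’s actual implementation:

```python
# Minimal sketch of a pipeline-runner entry point: it looks up the
# requested pipeline by name and executes it. In the real template the
# pipelines live in the project package; here they are stub functions.

def train_pipeline():
    """Stand-in for the model-training pipeline."""
    return "trained"


def consumer_pipeline():
    """Stand-in for the batch-scoring (consumer) pipeline."""
    return "scored"


PIPELINES = {
    "train": train_pipeline,
    "consumer": consumer_pipeline,
}


def run_pipeline(name):
    # Fail loudly on unknown names so a misconfigured job is easy to spot.
    if name not in PIPELINES:
        raise ValueError(f"Unknown pipeline: {name}")
    return PIPELINES[name]()
```

A job would then invoke something like `run_pipeline("consumer")`, with the name taken from its configuration or command-line arguments.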
And this script can use the code we have developed in the package; it can reference this code and run it. And then we have here a couple of JSON files. For example, for AWS, you can use this file to specify the number of nodes and other cluster settings.
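A cluster-settings JSON of this kind might look roughly as follows; the node type, Spark version, and AWS attributes are illustrative values, not the defaults the template ships with:

```json
{
  "spark_version": "7.3.x-cpu-ml-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "aws_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK"
  }
}
```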
Yeah. For tests, we have here two directories, DEV tests and integration tests. In DEV tests, we have DEV test pipelines that can run on Databricks and test the code we have developed in our package. And in integration tests, we have test pipelines that can be part of the integration tests. The difference between them is that the DEV tests run on each push, and the integration tests run after we create a release. But now, let’s just push everything to GitHub and see how it’s going to work. So I will initialize my git repository here
and I will add all the files.
And now I will commit all the files locally, so git commit.
Now we can go to GitHub.
Let’s say I have created the repository before that, and now we can just push our commit to GitHub.
Okay, it looks like we have done this. We can take a look here; we see that our code is there. And now we also need one more configuration step to make things ready. So, I have done this before, but in your case, you can do it after you have created your repo. The thing is that we have to add two secrets that will allow our code to interact with the Databricks workspace, and here we have the Databricks host and the Databricks token. The host is just a workspace URL, and the token will allow our CICD Templates code to kick off the jobs in this workspace. So let’s go back to our main repo screen, and we can see that our CICD pipeline has already started. So let’s go inside and see what’s happening here. We can see that the pipeline is quite simple: first, it checks out our code, it installs Python, and after that it installs the dependencies. Those are going to be the dependencies we have mentioned in requirements.txt. And after that, it will run the local unit tests with pytest and build an artifact. It will then deploy this artifact on Databricks and run the pipelines that we have in the DEV tests. And it looks like this has already been done, so the pipelines are started, and now we have to wait maybe a couple of minutes for the result. So in the meantime, let me summarize everything we have seen. We have created the skeleton of the project, and then we were able, with just a couple of clicks, to set up CICD pipelines on GitHub Actions that can run on Databricks.
So I think we can come back to this screen after a while and see if our pipelines and DEV tests are successful. And now, I think we can just move to our next point: I’d like to show you a project that was developed using this approach. So, I have developed this project in a Python IDE with Databricks.
So, let me switch to the IDE to show you what it looks like.
So the goal of this project is to develop a model that can predict
if the loans can be repaid. So I am working here with the Lending Club data set, and in the data set we have a bunch of loans. And I’d like to build a model that is going to predict if our loans are going to be repaid. So let’s run this pipeline.
It’s a consumer pipeline, and this pipeline will just use the models that I have already developed, and we will see how it’s working. Another thing I would also like to show you is how easy it is to start the pipelines from the command line. As you have seen, I have just selected here the run configuration and run it. And here you can see the configuration itself; it just references our runner script. And
then you have to specify the folder, the name of the pipeline, and the cluster against which you would like to run this.
And then the script will build the wheel, so build our package, deploy it on Databricks, and run the pipeline.
We have to wait maybe a couple of seconds for the pipeline to finish.
So you can see the script that was started on Databricks. And in the meantime, we can take a look at the code. So what does this pipeline do? Basically, it just reads the data with Spark, then it grabs the latest model from the MLflow model registry and registers it as a pandas UDF, and then it applies the model in SQL and just writes the data back to some location.
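The consumer pipeline’s logic can be sketched in pure Python, with a trivial rule standing in for the model fetched from the MLflow registry and plain dicts standing in for Spark rows; everything here (the column names, the scoring rule) is a hypothetical simplification, not the project’s actual code:

```python
def load_production_model():
    # Stand-in for fetching the latest production model from the MLflow
    # model registry; here it is just a trivial scoring rule that flags
    # loans whose amount is large relative to income.
    return lambda row: 1 if row["annual_income"] >= 3 * row["loan_amount"] else 0


def score_loans(rows):
    # Apply the model to every row and attach a `prediction` column,
    # mirroring what the pandas-UDF-based pipeline does at scale.
    model = load_production_model()
    return [{**row, "prediction": model(row)} for row in rows]
```

For example, `score_loans([{"loan_amount": 10, "annual_income": 40}])` attaches prediction 1 to that row; the Spark version does the same per partition instead of per list.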
And here is the actual run. I have added the code that can output all this, so I’d like to show you the results.
Okay, now we can see the results. It’s just a table.
And we can see that we have here a prediction column with the results that our model has given us. And now let’s discuss how we can train the models and how we can deploy them, so that the consumer pipeline has a model it can use. Let’s go to the training pipeline. I will also run it; here is the training pipeline, and I will run it. As you can see here, this pipeline is just a class, and I have here another class that provides me with data. So I’m doing here some feature engineering: I’m reading my data with Spark, then filtering out the bad data and creating a couple of other feature columns. Nothing really special here. And after that, after I have the data, I can just run the train method that trains the logistic regression model. And after that, of course, I am
logging all my metrics to MLflow. I’m also marking my model as a candidate one, and then I’m logging the model itself.
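The kind of filtering and feature derivation described above can be sketched like this; the column names and thresholds are hypothetical examples (and plain dicts stand in for Spark rows), not the project’s actual Lending Club features:

```python
def engineer_features(rows):
    # Drop obviously bad records and derive a couple of feature columns,
    # mirroring the filtering and feature engineering the training
    # pipeline performs with Spark before calling the train method.
    features = []
    for row in rows:
        if row["loan_amount"] <= 0 or row["annual_income"] <= 0:
            continue  # filter out bad data
        features.append({
            **row,
            # Hypothetical derived features, not the project's actual columns.
            "income_to_loan": row["annual_income"] / row["loan_amount"],
            "high_interest": row["interest_rate"] > 15.0,
        })
    return features
```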
So the idea is that you can run this pipeline on maybe a weekly basis. So you can run it every week, and it will retrain the model using the newer data, the data that you have received during the week. And maybe you can have a couple of such pipelines that train different models using different architectures. And then the question is how you compare those models
and how you make the decision to promote one of these candidate models to production. In the meantime, you can see that our pipeline is ready and has trained the model. So let’s go to MLflow and see what has happened.
So here, you can see the experiment that this pipeline was using, and you can see we have a new run here. And you can see that its candidate tag is true, so it’s a candidate model, and we have here some metrics.
And if we go inside, we will see that we have a model that was serialized here. So now let’s try to use this model during the evaluation.
So, I think we can now start the evaluation pipeline and discuss what it is doing in the meantime.
So, I have started the pipeline. What this pipeline is doing is it will grab all the candidate models from MLflow and compare them. And by compare, I mean that it will run the inference with each model against the latest data set.
After that, we will also evaluate the model that is currently deployed in production. So for this, we will grab the latest production version from the MLflow registry, and after that, we will just compare the results. If we think that our candidate model is better, then we will deploy it: first we will register it as a model, and then transition our model to the production stage.
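The compare-and-promote decision can be sketched as a small pure-Python function; the metric convention (higher is better) and the run identifiers are assumptions for illustration, not the project’s actual evaluation code:

```python
def pick_model_to_promote(candidates, production_score):
    # `candidates` maps a candidate model/run identifier to its evaluation
    # score on the latest data set (higher is better). Return the best
    # candidate's identifier if it beats the currently deployed production
    # model, otherwise None (keep production as-is).
    if not candidates:
        return None
    best_id = max(candidates, key=candidates.get)
    if candidates[best_id] > production_score:
        return best_id
    return None
```

The evaluation pipeline would then perform the registry stage transition for the returned model, or do nothing when None comes back.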
And yeah, and by the way, our pipeline is ready, so it seems we have deployed a new version of the model. Let’s go to the model registry and see this.
So we can find our Lending Club scoring model, and you can see that we have now version three that was deployed.
So here is our version.
And I have shown you before that you can run the consumer pipeline, and the consumer pipeline will use the latest version from the model registry in order to run the inference.
Now, let’s go back to GitHub and check what our pipeline was doing here. So as you can see, it’s green; the pipeline is ready.
The last thing I would like to show you is the actual deployment: how you can bring all those pipelines to real production, not just run them from the IDE, but run them in the workspace on, let’s say, a daily basis. In order to do this, you can go to the releases on GitHub and just draft a new release. So let’s do this together. I think we will not have time to wait for the pipeline to finish, but I can at least show you that it will start. So now you can see that we have created the new release. And if we go back, we will see that the deployment pipeline has started here; by the way, even two pipelines were started: one was just for the simple commit, and the other one was the real release. And let’s now go to the Databricks workspace; I would like to show you what the result is going to look like. So if our integration tests succeed, then each pipeline will be deployed as a job, and you will be able to see them in the jobs screen of the Databricks workspace in this way. So we have our Training pipeline, our Model evaluation pipeline, and we have a Consumer pipeline. And the settings you have specified in the JSON file will be reflected here as (indistinct). So let me summarize what you have seen. I have shown you how you can bootstrap your project and how you can set up CICD easily, basically without changing any lines, on GitHub Actions and Databricks. And after that, I have shown you how you can develop, let’s say, a whole machine learning project with different models and with whole model-deployment features
in an IDE with Databricks.
And then I have also shown you how you can deploy all sorts of pipelines you develop in your project on Databricks. – Okay, thanks for that, Michael. So what did we see today? We started by talking a little bit about the goal of CD4ML and maintaining a stable pipeline for machine learning. In particular, we talked about the unique challenges with machine learning pipelines compared to traditional pipelines, where machine learning pipelines have this tight coupling between data, code, and models, unlike traditional pipelines. Next, we talked about some of the challenges that teams find when they try to either use traditional methods on distributed clusters like Databricks, or try to use Databricks with traditional software, and how Databricks Labs CICD Templates actually solves those challenges. So, for next steps, you can head over to the Databricks Labs CICD Templates repo to get started, and we’ll be releasing a blog post soon; hopefully it will be out by the time this talk is aired on our blog. So the next step is: please try it out. There’s a tutorial, and we’re also looking for pull requests. So feel free to drop us a line or just submit a pull request directly to the repo. Thank you, everybody.
Databricks Senior Solutions Architect and ex-Teradata Data Engineer with a focus on operationalizing Machine Learning workloads in the cloud.
Databricks Solutions Architect and ex-McKinsey Machine Learning Engineer focused on productionizing machine learning at scale.