Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, they cannot be considered finished once they are successfully delivered and deployed: they must be continuously monitored to verify that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time and break our pipelines or degrade model performance. All these qualities of data & ML projects lead us to the necessity of continuous testing and monitoring of our models and pipelines.
In this talk we will show how CI/CD Templates can simplify these tasks: bootstrap a new data project within a minute, set up a CI/CD pipeline using GitHub Actions, and implement integration tests on Databricks. All this is possible because of the conventions introduced by CI/CD Templates, which help automate deployment and testing of arbitrary data pipelines and ML models.
Speakers: Michael Shtelma and Ivan Trusov
– Hi everyone, today Ivan and I will be talking about continuous integration and delivery of your ML projects on Databricks, using our new open source project CI/CD Templates from Databricks Labs. I'm Michael Shtelma, a solutions architect in the engineering team at Databricks, and now I'd like to pass over to Ivan so he can introduce himself.
– [Ivan] Thanks so much, Michael. Hi everyone, my name is Ivan. I'm a solutions architect in the EMEA team at Databricks, helping our customers solve their toughest data problems with Databricks.
– Today we will cover the following three topics. First, I will give you some background on the challenges that machine learning teams face while implementing robust machine learning pipelines and improving their quality. After that, I will introduce the CI/CD Templates project. Finally, I will pass over to Ivan, who will show you an end-to-end demo of the CI/CD Templates project: he will walk us through creating a new data project and show you how to set up continuous delivery for that project. Okay, let's discuss why we need one more project around deployments and CI/CD. A lot of organizations are struggling to drive additional revenue and insights out of their data, and the reason is that a lot of their models never get to production at all, or get stuck at some stage of the ML project lifecycle. The reason for that is that, unlike traditional software engineering, where you can usually decouple the code from the data and implement traditional tests, in ML you cannot easily decouple the code from the data and from the model, and this makes it especially difficult to test. I will give you an example: you need a production dataset to produce a model, and you rarely have the possibility to work with such a dataset in the CI environment, so basically in a container or virtual machine. So, let me sum it up. To address the mentioned challenges, we need to develop our models on live customer data while having the possibility to test our code on live datasets and ensure that our models and code work exactly how we think they must be working. And if there is something wrong with our latest change or commit, we can always revert it and go back to the last successful commit. So why is it so difficult to implement continuous delivery and continuous integration for machine learning? Let's discuss what challenges machine learning teams face when they try to implement continuous delivery for ML.
One of the challenges is that teams often rely on traditional continuous integration tools to test machine learning pipelines. This means that the actual tests run in a virtual machine or a container. This can be challenging because it is not possible to use real-world distributed datasets in such a setup, and this prevents us from testing the pipeline in a real-world scenario. Another issue is that our models often rely on GPUs, and GPUs are also not available in our containers or virtual machines. To tackle this challenge, teams are turning to distributed and scalable platforms like Databricks, so that they can use distributed datasets and complicated models. This helps a lot while implementing fast prototypes of data pipelines or machine learning models. But as pipelines and models get more and more complex, the need for traditional continuous integration tools arises, and this leaves teams with a choice: either use traditional tools, like an IDE with traditional integration tests and traditional CI tools, in which case they will be unable to use real data during development and testing, or switch to something like Databricks and use notebooks, which makes integration testing really, really difficult. We have run into exactly this scenario during the implementation of a couple of real projects. So, what is the solution to the mentioned challenges? I would now like to introduce the CI/CD Templates project, which can help you tackle those challenges. This project helps you bootstrap your new data application or new ML application from scratch. So it's a template that will create a new project for you.
And after that, you just fill it out with your code, your logic, and your models for data pipelines, push it to GitHub or any other git repository, configure, for example, GitHub Actions or Azure DevOps, and in around 10 minutes get your CI running, so get your integration tests running on Databricks. And last but not least, it's really easy to develop using an IDE if you're using CI/CD Templates: you can run your pipelines directly from your IDE on Databricks, use your real-world datasets, and access the same environment that is used in production. Okay, now let's discuss how you can set up a new project in five easy steps. At first, you should create a new conda environment and install the cookiecutter and path modules. Then you run cookiecutter against the CI/CD Templates GitHub repository to bootstrap the project. You will need to answer a couple of really easy questions, and after that you will get a project created from the template. After that you create a new GitHub repository and configure the secrets; you will just need two secrets, the Databricks host and the Databricks token. Those secrets allow the CI pipeline to authenticate and connect to the Databricks workspace you would like to use. And then you can just push all the code and all your changes to the GitHub repository, and that's basically it. After you have pushed your code, you will see that the CI pipeline was automatically triggered. The template already contains one example job and a couple of example tests that will be run automatically, and you can of course change it, add your new real pipelines, and of course add the tests that will test your pipelines. Let's discuss now the project structure. The assumption is that you will place all your pipelines, your logic, and your models in a Python package. This Python package will be created for you, and it will have the same name as the project itself.
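The setup steps above can be sketched roughly as follows. The repository URL is that of the Databricks Labs project; the secret names shown in the comments are assumptions about what the generated workflows expect:

```shell
# Install cookiecutter (and the path helper the template depends on)
pip install cookiecutter path

# Bootstrap a new project from the CI/CD Templates repository;
# cookiecutter asks a few questions (project name, cloud, CI tool)
cookiecutter https://github.com/databrickslabs/cicd-templates

# In the new GitHub repository, configure two secrets so CI can
# reach your workspace (assumed names):
#   DATABRICKS_HOST  - e.g. https://<your-workspace>.cloud.databricks.com
#   DATABRICKS_TOKEN - a personal access token
```

After answering the prompts, a project directory with the structure described below is generated, ready to be pushed to GitHub.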
For example, there is a jobs sub-package, and within it another sub-package, simple: a simple job. Within that sub-package you see an entrypoint Python file. Within this file you implement your code and the logic of your pipeline; that is the file that will be run by Spark on Databricks when you run the pipeline there. There is another really important folder called conf, where you can find deployment.json. That's the file where you configure all your jobs: for example, the cluster configuration, how many nodes you would like to have for the job, and what type of nodes you would like to have. You can also define different environments. Ivan will show you in more detail what this file looks like and what its content is. There are a couple of other files as well: for example, setup.py, the file where our package is configured; a tests folder with a sub-folder for integration tests and one for the local unit tests; and, of course, the code of our wheel package, where all the glue code and all the functionality that runs the job on Databricks is implemented. Let's now discuss how all this runs: what happens when you push new code, push a new commit to git, and how those commits will be tested on Databricks. In the background we have two pipelines that will run in the case of either a simple push or a new release creation on GitHub. They run the local unit tests, so they test the code in the wheel package, and then they deploy this code to Databricks.
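A minimal conf/deployment.json along these lines illustrates the idea; the environment name, job name, node type, and file path here are illustrative, and the job definition follows the Databricks Jobs API format as Ivan explains later:

```json
{
  "default": {
    "jobs": [
      {
        "name": "cicd-demo-sample-integration-test",
        "new_cluster": {
          "spark_version": "7.3.x-cpu-ml-scala2.12",
          "node_type_id": "Standard_F4s",
          "num_workers": 1
        },
        "libraries": [],
        "spark_python_task": {
          "python_file": "tests/integration/sample_test.py"
        }
      }
    ]
  }
}
```

During deployment, locally referenced files such as the python_file above are uploaded to DBFS, and the built wheel of the project is appended to the libraries section automatically.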
So what happens when you deploy the code to Databricks: we log everything to a central MLflow experiment, and every change will be logged as a new run that contains all the configurations and resources your job requires; when you run your job, the run will also be tracked in MLflow. Now I'd like to say a couple of words about our new version. A couple of days ago we published the new version of CI/CD Templates, and this version is powered by the dbx utility. So what can you do now, as opposed to the previous version, and why is dbx really important? With dbx you can customize the project structure. I have shown you the default project structure that the templates use right now, but with dbx you can do way more. With dbx you can use your own custom structure and just supply all the information to dbx: for example, where your entrypoint is, what files you would like to deploy, so what files are needed to run your job, and what dependencies you have, and dbx will take care of deploying and running the job. So now we don't require you to follow our view of the project structure; you can use your own, supply all those parameters to dbx, and implement your own CI pipelines around it. Another thing is that it's now really easy to start using new CI tools. Right now we support GitHub Actions and Azure DevOps, but there are way more CI tools around. So feel free to use your CI tools of choice with dbx: you can create new templates, or you can file a PR to our project and add support for new CI tools. All contributions and PRs are really welcome. And the last thing, which is really important for everybody who loves developing in an IDE: with dbx it's really easy to develop in an IDE. You can use dbx to run your changes and run your pipelines directly from the IDE.
So what does that look like? You create a run configuration within your IDE, and dbx will package all your changes as a wheel, deploy it automatically to Databricks, and run it, and you will just see the results in the console. Now let's discuss in more detail the standard flows that CI/CD Templates already provide. The first one is the push flow. This flow runs automatically when you push new changes to the master branch. When you push the changes, we first install all the dependencies we need, because we need them for running the build and the unit tests, and after that we run our local unit tests. If the unit tests are successful, we build the wheel, deploy it to the Databricks workspace, and then run our integration tests on Databricks. And if our integration tests on Databricks were successful, we mark the commit as successful, with a green check mark. Now let's say that we are done with our development and would like to release the jobs, or bring them to production, to our production workspace. For this we have a release flow: if you create a new release on GitHub, the release flow will run automatically. It's really similar to the previous one, but there is one difference: after running the integration tests, if they are successful, we will create the production jobs on Databricks. So all the jobs you have configured will be created or updated in the Databricks workspace, and they will run automatically according to the schedule you have defined. And now I would like to pass over to Ivan, who will show you a demo.
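The push flow described above can be sketched as a GitHub Actions workflow roughly like the one the template generates; the step names, requirements file, dbx invocation, and secret names here are illustrative assumptions, not the exact generated file:

```yaml
name: CI pipeline

on:
  push:
    branches: [ master ]

jobs:
  test-pipeline:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.7"
      - name: Install dependencies
        run: |
          pip install -r unit-requirements.txt
          pip install -e .
      - name: Run local unit tests
        run: pytest tests/unit
      - name: Deploy integration test job to Databricks
        run: dbx deploy --jobs=cicd-demo-sample-integration-test
      - name: Run integration test on Databricks and trace it
        run: dbx launch --job=cicd-demo-sample-integration-test --trace
```

The release flow looks the same up to the last step, where the production jobs are created or updated instead of the integration test job.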
He will show you CI/CD Templates and dbx in action, and you will see how easily you can set up a new project and the CI pipelines for it.
– Okay. Hi everyone, hopefully you can hear me. In this short demo, I would like to show how the CI/CD Templates project can be used to quickly create a new repository and fill it with a ready-to-use CI/CD pipeline. Let's first of all take a look at my repository; I have just created it. The most important thing to mention is that you need to set up secrets, so you shall provide the Databricks host and the Databricks token, which is your personal access token, for example. And then, from the repository perspective, you're good to go. The next step is to install a couple of dependent libraries: right now we need to install cookiecutter and path, which are already installed in my environment. The second step is to create a new cookiecutter-based project; we are using CI/CD Templates from GitHub straight away. Let's call the project, say, cicd demo v2. I will use the Azure cloud for the demo; we of course support AWS as well. And I will use GitHub Actions as the CI/CD tool. I have a ready-to-use profile with all the configuration provided, and we're good to go. The generated repository follows the structure described by Michael. What I'll do straight away is add a remote branch. The repository is already initialized, so the only thing I need to do is add a remote, switch to the branch, create an initial commit, and push it. Okay. And then I will add a remote origin; it already exists, so I will switch to the new branch and push everything. This push will trigger the on-push action. It will take a look at the content, see that everything is coming from the initial commit, and trigger the test pipeline. Just give it a second. Yep, here it is, here is the initial pipeline. It will perform all the necessary steps to test your code: in this pipeline we include unit tests, we include integration tests, and so on. Let's take a look at the code structure. So basically, here is our directory.
It looks exactly how Michael has described it. The most important part from the CI/CD perspective is this on-push section; basically, it will execute the test pipeline for your project. Let's take a short look at it. It will do the following: install all of the dependencies needed for this pipeline to run; then launch the local unit tests; then configure the profile based on the secrets we have defined in GitHub; and finally deploy the integration test and execute it. And the question is where this deployment definition comes from. It's very straightforward: in the conf directory, you will see the deployment.json file. This file describes the deployment structure that shall be used here. For example, for the integration test we are using a relatively small cluster, and the libraries section will be filled by the built project automatically. All locally referenced files, for example this one, will be uploaded to DBFS, and in the job definition the references will be rewritten accordingly. So from this perspective, this is how it works. And in the CI/CD pipeline, the deploy step will deploy exactly this job and afterwards launch it, using the dbx CLI tool. This tool has a lot of functionality; here we show only a couple of commands, deploy and launch. You can also configure multiple environments, and you can execute your code on an interactive cluster if you use one. For now, let's take a look at our CI/CD pipeline. As we can see, the job was deployed to the Databricks workspace and now it starts the run, basically waiting for the cluster. As you can see, it reports the status to us, and we have given it the trace parameter. This trace parameter traces the job until the end, and if the job fails, it will also report an error to the CI/CD pipeline as well.
Let's take a look at what's happening on the workspace side. Here we can see the big list of jobs we have in our demo environment, and here we can see this job, exactly this one, which is currently running. Let's click on it. We'll see that this job has already been triggered, because this is what dbx launch does. Before that, the job was deployed, and as we can see it has the wheel file with our project and the entrypoint file, the Python task file, directly in the job configuration. Yep. So basically this is the way we can launch integration testing pipelines. But not only that: we can also do this for our release pipelines. For example, the release pipeline does the same set of steps: it creates the Python environment, resolves the dependencies for you, then it launches the local unit tests, then it performs the deployment of the integration tests, runs the integration tests on Databricks, and then finally deploys the new job. And then you are the person who decides when to run the job or how to execute it. The most important part of the whole process is, of course, the deployment.json file. As you can see, for each environment (this is the environment name) you can define a list of jobs, or no jobs, but that makes no sense basically. And inside every job we are compliant with the Databricks Jobs API: the way you define the job shall be compliant with the Databricks Jobs API, except for two things. First of all, a local file reference can be just a reference relative to the root of the project, because when we perform dbx deploy it will automatically upload these files to DBFS and reference them correctly. And second, regarding the libraries section: you can add your own libraries, any of them, but during the deployment you can also provide a requirements file, and all requirements from the requirements file will be added to the libraries section as well.
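The two dbx commands the pipeline relies on look roughly like this; the job name and the exact flag spellings are illustrative of the dbx CLI of that era and assume a configured Databricks profile:

```shell
# Deploy the job definitions from conf/deployment.json to the workspace;
# local file references are uploaded to DBFS and rewritten automatically,
# and the project wheel is appended to the libraries section
dbx deploy --jobs=cicd-demo-sample-integration-test

# Launch the deployed job and trace it until completion, so that a job
# failure also fails the CI/CD pipeline
dbx launch --job=cicd-demo-sample-integration-test --trace
```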
And we will also automatically add the package itself when you perform the deployment; you don't need to do this manually because we do that for you. This behavior can be disabled, but in general the wheel file that comes out when you build your Python package will also be automatically added to the libraries section. Yep. Now we can see the stages of our job here directly; it seems like the cluster is starting, which is completely okay. An important thing to mention is where my files are actually stored. Here is this very long directory name; can we understand where the files are physically stored? Actually, yes we can, because every environment has a correlated dbx project. So this one, and this is basically a normal MLflow experiment with an associated artifact location, and per each experiment we have so-called deployment runs and launch runs. If you click and take a look at a deployment run, you'll actually find everything related to it. First of all, here is our entrypoint file. Secondly, here is our wheel file with the project. And finally, here is the dbx deployment.json; this is the file which maps the job name to the job ID automatically for you. And of course you can see the job status directly in your Databricks UI, which gives you a way to understand what's happening with the job itself. Yeah, as we can see, the job was running and it has succeeded. As the job has succeeded, the same thing shall happen with the CI/CD pipeline. Yep, exactly. So once again, where we are: it's CI/CD Templates, an open source project by Databricks Labs. I'm one of the developers of this project and I will be happy to help any of its users. And of course we have quite sophisticated documentation; take a look if you're looking for a bigger reference and more guidance on how to use the CI/CD pipelines. And that's it, thanks a lot for listening to my story, and see you, bye-bye.
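The per-environment mapping Ivan refers to lives in a small dbx project file; the file location, profile name, and paths below are illustrative assumptions about how dbx laid this out at the time, with each environment pointing at an MLflow experiment used as the artifact store:

```json
{
  "environments": {
    "default": {
      "profile": "cicd-demo",
      "workspace_dir": "/Shared/dbx/projects/cicd_demo_v2",
      "artifact_location": "dbfs:/dbx/cicd_demo_v2"
    }
  }
}
```

Each dbx deploy and dbx launch then shows up as a run under that experiment, which is why the deployment artifacts (entrypoint file, wheel, job-name-to-ID mapping) can be found in the MLflow UI.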
– Thanks, Ivan. Now, after the demo, I'd like to make a short summary. During this presentation, we first discussed the challenges that different teams face while implementing continuous delivery for ML pipelines; after that, I introduced the Databricks Labs CI/CD Templates project that can help you tackle those challenges; and Ivan has shown a demo where he created a new project and then, in just 10 minutes, implemented the CI/CD pipelines for this project and showed you how to run integration tests on Databricks. So feel free to contribute to this project: if there are issues with it, feel free to open new issues, or fix the bugs and send us PRs. Thank you everybody for your attention.
Databricks Senior Solutions Architect and ex-Teradata data engineer, focused on operationalizing machine learning workloads in the cloud.
I'm a Solutions Architect at Databricks, helping our customers solve their toughest data problems using the Unified Data Analytics Platform. My main areas of expertise are Apache Spark, machine learning, and data processing applications.