Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runtastic


Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, they cannot be abandoned after they are successfully delivered and deployed: they must be continuously monitored to check that model performance still satisfies all requirements. We can always get new data with new statistical characteristics that can break our pipelines or influence model performance.

All these qualities of data & ML projects lead us to the necessity of continuous testing and monitoring of our models and pipelines. In this talk we will show how CI/CD Templates can simplify these tasks: bootstrap a new data project within a minute, set up a CI/CD pipeline using GitHub Actions, and implement integration tests on Databricks. All this is possible because of the conventions introduced by CI/CD Templates, which help automate deployment & testing of abstract data pipelines and ML models.

The CI/CD templates are used by Runtastic for automating the deployment processes of their Databricks pipelines. During this webinar, Emanuele Viglianisi, Data Engineer at Runtastic, will show how Runtastic is using CI/CD templates during their day-to-day development to run, test and deploy their pipelines directly from the PyCharm IDE to Databricks. Emanuele will present the challenges Runtastic has faced and how they successfully solved them by integrating the CI/CD template in their workflow.

Speakers: Michael Shtelma and Emanuele Viglianisi


– Hello everybody. Welcome to the session titled Developing ML-enabled Data Pipelines on Databricks Using IDE and CI/CD. The agenda for today is the following. We will start by presenting the two companies, Runtastic and Databricks, then we’ll talk about the challenges we had at Runtastic and the solution to these challenges, which is the CI/CD Template. Then we’ll talk about how we have integrated this CI/CD Template at Runtastic, and at the end there will be a demo showing you the CI/CD Template in action. I’m Emanuele Viglianisi. I’ve been a Data Engineer at Runtastic since January 2020. Previously, I was working as a researcher in the FinTech domain. I’m very passionate about Python and working with data.

– Hi everybody, I’m Michael Shtelma, I’m a Solutions Architect at Databricks. Before that I was working for Teradata. I’m really passionate about all data and ML topics.

– Okay, let’s now start introducing Runtastic. The best way to introduce Runtastic is by having a look at some numbers. Runtastic is an Austrian company founded in 2009, that is now a leader in digital health and fitness. Since 2015, Runtastic has been part of the adidas family, and our 270 employees, coming from more than 40 countries, are working hard developing two of the best apps on the market, Adidas Training and Adidas Running, which count more than 167 million registered users and 309 million downloads. Adidas Training contains a variety of body exercises that require basically no equipment and provides tailored workouts to the user. Adidas Running is an app that allows users to track their activities, for example running or cycling, and provides several reports and statistics. Both apps are based on the concept of community, and you can share your achievements with your friends and other athletes all around the world.

– Databricks is a software company. It’s a managed service that provides a unified data analytics platform. The unified data analytics platform allows three core data personas to come together to collaborate and drive business outcomes using data. Those key personas are data engineers, data scientists and data analysts. Databricks is a global company: they have more than 5,000 customers and more than 450 partners. They are also the original creators of different open source projects. I think Apache Spark is the project that everybody knows, but apart from that, they are also the creators of such projects as Delta Lake, MLflow and Koalas.

– We started our journey by moving our on-premise analytical architecture to the cloud, in collaboration with Databricks and Microsoft Azure. As with all of our code, we want our analytical data to meet high quality standards, and we set out to adopt tooling and procedures to ensure that. CI/CD is fundamental nowadays in every software development workflow to ensure high quality software. So our question was: is there a way to integrate CI/CD in Databricks and use it for our data engineering pipelines? But let’s take a step back. What is CI/CD, and what exactly are its benefits? CI stands for continuous integration, and it is the practice of automating the integration of different code changes from multiple contributors into a single project. The integration can be very complex, because code may work well in isolation but may not work when it is run together with other code. That’s why the CI process usually also involves tools for automatically running tests before and after the integration. CD stands for continuous delivery, and it is the practice of releasing and deploying the product frequently and automatically. CI/CD fits our needs exactly, because it allows us to automate long and error-prone deployment processes, like testing the code before every pull request merge, or deploying the right code into the right environment, for example the dev environment or the production environment. Now I want to show you why you need CI/CD. Let’s imagine that there is one developer that has developed and tested one component, and another developer that has developed and tested a second component. But when they try to run both components together, the result is not what they expect. The goal here is to show you that both components may work as expected when tested in isolation, while the integration is a failure. Okay, now let’s talk about testing our ETL pipelines.
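The two-components situation described here can be sketched in a few lines of Python. This is purely illustrative (the functions, field names and data format are hypothetical, not Runtastic's code): each component passes its own unit test, yet the combination fails because the two disagree on the data interface.

```python
# Illustrative only: two components whose unit tests pass in isolation
# but whose integration fails, because they disagree on the data format.

def extract(raw: str) -> list:
    """Component A: parses a CSV line into a list of fields."""
    return raw.split(",")

def load(record: dict) -> str:
    """Component B: formats a record dict for storage."""
    return f"{record['user_id']}|{record['distance_km']}"

# Unit tests: each component behaves as specified in isolation.
assert extract("42,10.5") == ["42", "10.5"]
assert load({"user_id": "42", "distance_km": "10.5"}) == "42|10.5"

# Integration: A produces a list, B expects a dict, so the combined
# pipeline breaks even though both unit tests are green.
try:
    load(extract("42,10.5"))
except TypeError:
    print("integration failed: components disagree on the interface")
```

This is exactly the class of failure that integration tests in a CI pipeline are meant to catch before a merge.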
Testing ETL pipelines is very difficult, because ETL pipelines are not just about the code, they are also about the data. And we cannot just use mock data, we need production-like data, and production data is usually available only in the cloud. That means that even just for testing our code we need to run the tests in the cloud. Moreover, our ETL pipelines make use of different cloud services. For example, we ingest the data using Azure Event Hubs, we store this data in the Data Lake, access to this data is restricted by some Azure Active Directory rules, and the pipeline makes use of secrets stored in Azure Key Vault. So, other than the complexity added by working in our cloud environment, we also experienced that this mix adds complexity of its own. We tried to implement CI/CD using the two options documented in Databricks. We tried first with Databricks notebooks. They are amazing for testing some queries and visualizing the results, but we also found out that it is very difficult to divide the code into different sub-modules or sub-projects, so versioning the entire project is not possible, only one notebook at a time. And there is no perfect place for the tests and no automatic tool for running them together. Then we tried Databricks Connect, which allows you to develop the code in the IDE while running the Spark jobs on the remote Databricks cluster. That seemed amazing, because we can divide our code into modules, but also here there are some limitations. The first limitation is that it does not support streaming jobs, and that is something we make use of. And then it is not possible to run arbitrary code that is not part of the Spark job on a remote cluster. It was clear that the available options didn’t work for us, and now Michael will explain the solution we came up with.

– So, Databricks notebooks are really great for fast prototyping, for trying out different ML algorithms, or just playing with the data and trying to understand what the data looks like. But it can be quite challenging when we try to productionize something we have developed in notebooks, because as you are developing in notebooks it’s not really easy to follow good software engineering practices. It’s not easy to develop tests and use those tests during development, and it’s also not easy to implement integration tests: basically, to connect notebooks to Git and, after that, trigger the integration tests when we push a new version to our Git provider. Because of all the challenges I have mentioned, teams usually end up in one of the following situations. Either they end up using an IDE, and there they use proper software engineering practices, with local unit tests, integration tests and CI/CD, or they just use notebooks and they don’t have proper tests, and because of that they can have quality issues from time to time. Those issues are something that we have tried to solve with the Databricks Labs CI/CD Templates project. So, let me tell you what this project can help you to achieve. Using CI/CD Templates, you can bootstrap a brand new data project and, straight after that, you can push it to GitHub, and there are already GitHub Actions pipelines there that will start testing your code on Databricks. To put it short, this project allows you to develop in your favorite IDE, run your code on Databricks directly from your IDE, and when you are ready with your features you can push your code to GitHub and GitHub Actions will take care of the testing. So, now let me tell you how you can set all this up. At first, you can use the Cookiecutter Python module to bootstrap the project.
After you answer all the questions that Cookiecutter will ask you, you will get a new directory with the project code, and then you can open it with your favorite IDE, maybe PyCharm or VS Code, and you can start developing your logic. After that you can initialize a Git repo in the project directory Cookiecutter created, and you can also create a new GitHub repo on GitHub and push all the code to it. This will automatically start the tests on Databricks. Before that, of course, you have to set the secrets, the Databricks host and Databricks token secrets, that will allow the templates to determine which Databricks workspace to use. And now let me tell you what happens in the background when you run your code from the IDE, or when you push your code to GitHub and GitHub Actions starts the tests. In the background, the CI/CD Templates will run your local tests, package the Python package you have in your project directory, and push all those artifacts to the Databricks workspace; under the covers we are using MLflow for that, so we are logging all the artifacts in MLflow. After that we are using the Databricks Jobs API to trigger the jobs on Databricks, and we can use those artifacts within the jobs. Now, let me tell you about the CI/CD pipelines that we deliver with CI/CD Templates. If you push something to the GitHub master branch, then the push flow will start. The push flow will at first run the local unit tests, and in case the local unit tests are successful, we will build the Python package as a wheel and upload this wheel to MLflow together with the other artifacts. Then we can kick off the dev test jobs on Databricks, and in case those jobs are successful, so the integration tests are successful, we will mark the push as successful: we will mark it using a green check mark.
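The "trigger the jobs on Databricks" step mentioned above goes through the Databricks Jobs REST API. As a rough sketch of the mechanics (not the template's actual code: cluster spec, DBFS paths and run name below are made-up examples), one can build a payload for the Jobs API 2.0 `runs/submit` endpoint that installs the freshly built wheel as a library and runs the pipeline entry point:

```python
import json

def runs_submit_payload(wheel_dbfs_path: str, runner_dbfs_path: str) -> dict:
    """Build an example payload for the Databricks Jobs API 2.0
    `runs/submit` endpoint: a one-time run on a new cluster that
    installs the project wheel and executes a Python entry point."""
    return {
        "run_name": "cicd-templates-dev-test",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",   # example values only
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 1,
        },
        "libraries": [{"whl": wheel_dbfs_path}],
        "spark_python_task": {"python_file": runner_dbfs_path},
    }

payload = runs_submit_payload(
    "dbfs:/artifacts/analytics_backend-0.1.0-py3-none-any.whl",
    "dbfs:/artifacts/pipeline_runner.py",
)
print(json.dumps(payload, indent=2))

# In the real flow this payload would be POSTed to
#   {DATABRICKS_HOST}/api/2.0/jobs/runs/submit
# authenticated with the DATABRICKS_TOKEN as a bearer token.
```

The wheel itself would first be logged to the MLflow artifact store (e.g. via `mlflow.log_artifact`), which is the "logging all the artifacts in MLflow" step described above.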
– And now let’s discuss how you can release, or how you can put your work to production. For that we have a release flow. In GitHub, you can create a new release, and this will trigger the release flow, which looks really similar to the previous one, with a couple of small differences. At first we run the local unit tests, and if the local unit tests were successful, we will build the wheel and log it to MLflow in the same way we have done before. After that, instead of running the dev tests, we will run integration tests, and here the integration tests are also run on Databricks as jobs. If those tests were successful, then we will create the jobs on Databricks for all your pipelines. If the jobs are already created, so they already exist, then we will just update their definitions. And now Emanuele will tell you how Runtastic is using CI/CD templates.

– Okay, thanks Michael. So let’s first talk about our setup. We have a total of four environments, and for each of these environments we have created a Databricks workspace. We have the dev environment, that is basically a playground for data scientists, data analysts and data engineers, where we run and test queries and code. Then we have the pre-production environment, that runs release candidate code on production data. Then we have the staging environment, that runs the stable code on staging data. And then we have the production environment, that runs the stable code on production data. For each of the workspaces, we have also created a Databricks token that allows us to run programmatic operations on the workspace using the Databricks REST API. Then, having a look at our code, this is how our project folder looks after adopting the CI/CD template. We have analytics_backend, that is the main Python module, and here you should place all of your pipelines’ business logic. Then we have the folder named pipelines, that contains the pipelines’ configuration. Then we have the folders with the dev tests, containing unit and integration tests. And then we have runtime_requirements, the list of Python dependencies shared among all the pipelines. Let’s take an example: let’s imagine a pipeline named anonymization pipeline that anonymizes some important data. In the main module analytics_backend we have a script, anonymizer.py, that contains the main class with the ETL logic. That would be the code that reads the data from the Data Lake, uses maybe some complex UDF to perform the actual anonymization, and then writes the data back to the Data Lake. The folder pipelines contains the configuration for the anonymization pipeline, which we will see later. And then the folder tests will contain a unit test specifically for testing the UDF, and an integration test for testing the ETL logic.
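For orientation, the project layout just described might look roughly like this (folder and file names are approximated from the talk, not taken from the actual repository):

```
analytics_backend/           # main Python module: all pipeline business logic
    anonymizer.py            # main class with the ETL logic (example pipeline)
pipelines/
    anonymization_pipeline/  # one folder per pipeline, with its configuration
tests/                       # unit tests (e.g. for the UDF) and integration tests
runtime_requirements.txt     # Python dependencies shared among all pipelines
```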
Now, the pipeline configuration: here we find databricks-config files, that are JSON files containing the input parameters for our pipeline, for example the source path, the destination path and so on. Here we have chosen to have one databricks-config for each of the environments: dev, pre-production, staging. Then we have the job spec Azure files, that are also JSON files containing the configuration for the job; here we can find, for example, the cluster properties, the Python libraries to be installed, et cetera. And then we have the pipeline runner, that is the pipeline entry point, and it should contain just the few lines of code necessary for calling some class in the main module. Once we have written our code and our configuration, we are ready to run our pipeline. For running the pipeline, we first need to define two environment variables, DATABRICKS_ENV and DATABRICKS_TOKEN. The first one is a string indicating which environment we would like to target; it is very important, since the CI/CD template picks the job spec file with a suffix matching the content of this variable. The second one is the Databricks token that we have created before, and that is used to authenticate when running operations on the Databricks workspace. After the variable definition, we can run our pipeline using a script called run_pipeline, where we specify just the pipeline name. The script will create a new job using the job’s configuration, and then it will execute the pipeline entry point, passing the parameters in the databricks-config file. Deploying the pipeline is similar: first we need to define the two environment variables, DATABRICKS_ENV and DATABRICKS_TOKEN, to target the environment we want, and then we use the Python library of the CI/CD template to do the actual deploy. We call the release function of the CI/CD template, specifying the following parameters: the folder containing the test pipelines,
the folder containing the pipelines to deploy, a flag specifying if we want to run the tests before the deployment, and the target environment we want to use. The integration with GitHub Actions is also very easy. First, we need to define our secrets; in this case, our secrets are the Databricks tokens of the different environments we would like to use. For example, if we want to deploy on different environments, a dev environment and a production environment, we need to define the Databricks tokens for those different environments. Then we are ready to create our GitHub Action. Here we have an example of a GitHub Action with the name release workflow, that is triggered every time we create a new release. I have omitted some boilerplate lines for brevity, but at the end there are the steps to execute, and they are exactly the two steps we have seen before: define the environment variables, and then use the library to do the actual deployment. We have a total of three GitHub Actions that are well integrated in our git flow. Whenever we want to make some changes, we create a feature branch from the master branch, then we make our changes, we do our commits, and then we create the pull request. The pull request is then reviewed, and then we add a label called STG, which triggers a GitHub Action that executes the unit tests and integration tests. After the pull request is tested and approved, it can be merged into the master branch. The merge itself triggers another GitHub Action that runs the tests again and deploys the pipelines to the pre-production environment. At the end, when we create a new release, the pipelines are also deployed to the production environment and the staging environment. Now, there will be a demo showing you a use case scenario.

– So let’s start our demo. In this demo we are going to see how to create a project starting from the CI/CD template, and we will also see a real use case and how to deploy the pipelines automatically using GitHub Actions. So, let’s go first to the GitHub page of the CI/CD template. Here you can find a README with some instructions, and we are going to follow these instructions. First of all, we need to create a new conda environment, so conda create, and let’s create a new one. Okay, once it is created, we are going to activate it with conda activate cicd-template. Now we have the conda environment activated, and the second step is to install Cookiecutter. Cookiecutter is a library that allows us to specify some parameters, and it creates a new project starting from a template, as in this case. We first need to install pip, so we do conda install pip, and then we need to install Cookiecutter, so we pip install Cookiecutter here. Okay, and now we have Cookiecutter. In this folder, we can run the command cookiecutter, specifying the URL of the project, to clone the template and create our project from it. And as you see here, Cookiecutter is asking for some parameters. For example, here the project name: in this case we want to call it cicd-template-demo. The version is okay, the description is okay, the author as well, the license is okay. And then here is the experiment part: the suggested folder is a folder on DBFS, and you are free to choose the one that you want; I choose, for example, a folder under my user, say a cicd-demo experiment. Then the cloud you are using, I’m using Azure, and the CI/CD tool, in this case GitHub Actions. Then it asks for a name for our main module, which in my case is analytics_backend. Okay, now the new folder is created. You see the analytics_backend folder; let’s have a look at its content. So we go into analytics_backend.
And here we see that there are many folders and files, and we’ll go quickly over them. The first one is analytics_backend, and it is the main Python module; you should place your business logic here. Then there is the folder deployment, that contains a Python library called databricks-labs-cicd-templates, and it is the actual library that does the deployment and execution of the pipelines. Then here we have three sets of pipelines. The first one is the set of pipelines executed after each push on the repo, the second one is the set of pipelines executed after each release, and the third one is the set of pipelines that is then deployed in the Databricks dashboard as jobs. Then we have the folder tests, that contains normal Python tests. And then here we have some utility scripts, for example run now, that executes a pipeline using an existing cluster, and run pipeline, that executes a pipeline, or a set of pipelines, creating a new cluster for each of them. Then here we have requirements.txt, that is the list of Python dependencies for your project, and runtime_requirements.txt, that is the list of Python dependencies for your pipelines. Finally, we have the GitHub Actions; there are two of them. The first one is onpush, and as you can imagine, it is triggered after every push; from the last command you can see that it executes the pipelines that are inside the folder dev-tests. The second GitHub Action is similar, but it is triggered after every release, and from the last command you can see that it first executes the pipelines inside the folder integration-tests and then it deploys the pipelines inside the folder pipelines to Databricks. Now, we are going to see a real use case, where you can see some actual business logic, some integration tests and unit tests, and then we will see how we can run the pipeline and deploy it to Databricks.
The example I’m going to present now is available on my GitHub profile under the name anonymizer-cicd-demo. Now I’m going to open this project in my IDE and explain the most important files. This project is a slightly modified version of the project we have created before, and it is a simplified version of the real use case that we have at Runtastic. It contains a pipeline called anonymization pipeline, that reads from a Delta table, anonymizes the data, and then writes the anonymized version back to another table. So let’s have a look at analytics_backend, that is the main module. Here we can find a submodule called utils, that contains utility functions. For example, the file spark_runner has a class called SparkRunner that creates an instance of Spark, and here you can define some configuration; in this case, we are also importing the Delta Lake library. Then, the core of the pipeline is in these two classes. We have the main class, and we have the class anonymization UDF, that contains the function that we are going to use for anonymizing our data. The core of the anonymization function is in these three lines: it takes a string as input, and if the string is None it returns None, otherwise it returns the reverse of the string. We are going to see an example later. And then here we have the class anonymizer pipeline, that instead defines the ETL logic. The class inherits from SparkRunner, so that we can use our Spark instance, and it expects in the constructor a configuration dictionary with some configuration, for example the source table path, the destination table path and so on. And it has a function called run, that reads the input data, applies the anonymization function, and then writes the resulting data frame to a Delta table. So, let’s write some tests now. This UDF is pure Python, so we can write a unit test for it. Inside the folder tests, we can create a new pytest file.
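A minimal sketch of the anonymization function just described, together with the kind of pytest-style checks the demo walks through next. Function and test-data names are illustrative, not Runtastic's actual code; the real function would additionally be wrapped as a Spark UDF (e.g. with `pyspark.sql.functions.udf`) before being applied to a data frame.

```python
# Sketch of the UDF described in the talk: returns None for a missing
# value, otherwise the reversed ("inverse") string.

def anonymize(value):
    if value is None:
        return None
    return value[::-1]          # reverse the string

# Unit-test style checks with inputs and expected outputs, as in the
# pytest file from the demo (example data only):
test_data = [
    ("john.doe@example.com", "moc.elpmaxe@eod.nhoj"),
    ("", ""),
    (None, None),
]
for given, expected in test_data:
    assert anonymize(given) == expected
print("all assertions passed")
```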
I’m not going into detail explaining how pytest works, but here we have some test data with the input and the expected output. Basically, we pass this data, we run the function for all the inputs, and we assert that the result value of the function is the one that we expect, so the one on the right. Now we want to test the ETL class, so the anonymizer pipeline class. For testing this pipeline, we are going to create a new pipeline inside the folder integration-tests. We name this pipeline anonymization pipeline, and we define the databricks-config, that is basically the input that we have to give to the pipeline: the source table path, the destination table path, and some options, some inputs that we give to the test class, basically the paths to the input data and the expected data that are located inside a resources folder. Then we have the job spec Azure file, that contains the job configuration, for example the name, the cluster type and so on. And then we have the pipeline runner, that contains the class with the test. Also in this case, the class inherits from SparkRunner; it prepares the source table, it runs the pipeline, and then it verifies some assertions that we have here. At the end, if everything is fine, we have a print statement saying that all the assertions passed, and we delete some temporary folders that we have created, so as to leave the project in a clean state. Finally, we basically read the configuration file and we run the test. After the unit tests and integration tests, we are now ready to create the pipeline that we want to deploy on Databricks. Also in this case, we have the job spec file and the databricks-config, and the pipeline runner contains just a few lines: basically one line for reading the input dictionary and another one for running the anonymization pipeline. And so, after that, we are ready to run our pipeline.
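The entry point just described (one line to read the config, one to run the pipeline) might look roughly like the following. This is a hypothetical sketch, not the template's actual code: the class here is a stand-in for the real pipeline class imported from the main module, and the config path is assumed to arrive as a command-line argument.

```python
# pipeline_runner.py -- sketch of a minimal pipeline entry point.
import json
import sys

class AnonymizerPipeline:
    """Hypothetical stand-in for the pipeline class in the main module
    (the real one would inherit from SparkRunner and use Spark)."""
    def __init__(self, conf: dict):
        self.conf = conf

    def run(self) -> str:
        return (f"anonymizing {self.conf['source_table_path']} "
                f"into {self.conf['destination_table_path']}")

def main(config_path: str) -> str:
    # read the input parameters from the databricks-config JSON file
    with open(config_path) as f:
        conf = json.load(f)
    # call the class in the main module with those parameters
    return AnonymizerPipeline(conf).run()

if __name__ == "__main__" and len(sys.argv) > 1:
    print(main(sys.argv[1]))
```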
We can run the pipeline in two ways. The first one is running the pipeline locally: for that, we just have to run the script pipeline runner, specifying as input parameters the location of the pipeline runner script and also the location of the databricks-config file. But for running it on Databricks, we first need to set some environment variables. Before running anything, first make sure you have installed all the libraries that you need: you have your conda environment, you have pip installed, you have installed the deployment library, you have installed the requirements, and you have also installed the main module, that is analytics_backend. After that, for running the pipeline on Databricks, you have to set the environment variables, that are the Databricks host, the Databricks token, then the Databricks environment, which in this case is demo, so it will match the job spec Azure file that has the suffix demo. And at the end, the last variable, the MLflow tracking URI, which basically says where we want to store our MLflow artifacts; in this case, we want to store them on Databricks. After we have set all these variables, we are ready to run our pipeline. I have already created a cluster and the cluster is running, so I will run the pipeline using the run now script: I just run it with the name of the pipeline I want to run. This will take a few seconds, so in the meantime let’s see the GitHub Actions. I have slightly modified the GitHub Actions to include the Databricks environment variable, which in this case is demo, and to change the last command, so that this GitHub Action is going to test all the pipelines inside the folder integration-tests. I basically did the same also on the GitHub Action onrelease: I set the Databricks environment variable there as well.
And in the last command I have integration-tests, that is the folder containing the pipelines that are going to run before the deployment, and then pipelines, that contains the pipelines that at the end we are going to deploy to the Databricks dashboard. So, as you can see, everything worked as expected and all the assertions passed. We are now ready to commit our changes and push them to the master branch. So, we run git add, git commit with our changes, and we push them to master. Okay, now that is done. We can go back to the GitHub page and refresh it. Now, as you can see, there is a yellow dot; this means that there is a GitHub Action that is running, and we can see more details by clicking here. Right now it is installing the dependencies, then it is going to run the unit tests and then the integration tests. I expect this to run for about 10 minutes, because during the integration test it is also going to create a new cluster, and that takes about five minutes. But at the end, this is what it looks like when the GitHub Action terminates: there is basically a green check mark here. After that, we can create a new release by clicking there, and we can specify a tag version, a title and a description and so on. Then, by pushing this button here, the pipelines are going to be deployed to the Databricks dashboard. And this is what you can expect from it: under the tab Jobs, you will find the anonymization pipeline, and you can run it by clicking the Run button here, or if there is a schedule it is going to be scheduled and so on.
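To recap the setup step from the demo, the four environment variables might be set like this. In the demo they are exported in the shell; here they are set from Python for illustration, and the host and token values are placeholders, not real credentials.

```python
import os

# Workspace URL and personal access token (placeholder values):
os.environ["DATABRICKS_HOST"] = "https://adb-1234567890.12.azuredatabricks.net"
os.environ["DATABRICKS_TOKEN"] = "dapiXXXXXXXXXXXXXXXX"

# Target environment: the template picks the job spec file whose
# suffix matches this value (here, the one with suffix "demo").
os.environ["DATABRICKS_ENV"] = "demo"

# Tell MLflow to store run artifacts on Databricks:
os.environ["MLFLOW_TRACKING_URI"] = "databricks"

print("targeting environment:", os.environ["DATABRICKS_ENV"])
```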

– Okay, so I hope you enjoyed the demo, and let’s now see the main takeaways. Data engineering pipelines have to be tested like everything in software engineering, and CI/CD is necessary for automating the testing and deployment processes, so as to obtain high quality software. We have seen that CI/CD is not easy to implement, and that the available solutions in Databricks, notebooks and Databricks Connect, are not enough for implementing CI/CD in complex scenarios. We have also seen that the CI/CD template by Databricks Labs fits exactly with our needs: it allows us to organize our code in modules and implement CI/CD, also by integrating with GitHub Actions. So, I hope you enjoyed the session and that you will give the Databricks CI/CD template a try. Thanks.

About Michael Shtelma


Databricks Senior Solutions Architect and ex-Teradata Data Engineer with a focus on operationalizing Machine Learning workloads in the cloud.

About Emanuele Viglianisi


I have been a Data Engineer at Runtastic (Linz, Austria) since January 2020. Previously I worked as a researcher at Fondazione Bruno Kessler (Trento, Italy) on the topic of Security Testing.