As organizations continue to develop their machine learning (ML) practice, there’s a growing need for robust and reliable platforms capable of handling the entire ML lifecycle. The emergence of MLOps is promising but many challenges remain.
Register now to watch the latest developments from Databricks geared towards automating MLOps — including new Git and CI/CD integrations, auto logging of experiments, model explainability and model serving.
We’ll also cover:
Presentations will be enhanced with demos, as well as success stories and learnings from experts who have deployed real-world examples of such pipelines for predictive analytics. Live Q&A and discussion will keep this event engaging for data science practitioners and leaders alike.
MLOps and ML Platforms State of the Industry — Matei Zaharia, CTO and Co-founder, Databricks and Clemens Mewald, Director of Product Management, Databricks
End-to-end MLOps for PyTorch on Databricks using MLflow — Sean Owen, Principal Architect, Databricks.
Matei Zaharia
CTO and Co-founder
Databricks
Keven Wang
Competence Lead, ML Engineer
H&M
Wesly Clark
Chief Architect, Enterprise Analytics and AI
J.B. Hunt Transport
Cara Phillips
Data Science, MLOps Consultant
Artis Consulting
Speaker 1:
Data is big, but its potential is even bigger. When combined with AI, data holds the promise of curing diseases, saving lives, reversing climate change, and changing the way we live. We believe the future depends on data and unlocking its limitless potential. We're here to make that happen.
Speaker 1:
Databricks is the data and AI company. We help data teams, engineers, analysts, and scientists work together to find value inside data and solve the world's toughest problems. Because the challenges we face as businesses, as people, and as a planet aren't easy. They can't be solved in silos. They can't be solved by one person. We need all the data, all the science, all the brain power. We need all hands on deck, in one place, on the only open, unified platform for data management, business analytics and machine learning, and that changes everything. It expands our sense of what's possible. It makes things simple. It turns weeks into minutes. So data teams can innovate faster, because that's how innovation should happen. Collaborative and fast.
Speaker 1:
So let’s defy assumptions. Break the mold. Map every genome. Cure every cancer. Binge watch the cosmos. Spelunk black holes. Make every voice heard. Hack the hackers. [inaudible 00:01:33]. Take more moonshots and land them.
Speaker 2:
The heavens have become a part of man’s world.
Speaker 1:
From now on, nothing stands between you and your data, between you and the answers, because the power of data is the power of knowing. Now you know.
Sylvia Simion:
Welcome everyone, and thank you for joining our MLOps Virtual Event. My name is Sylvia Simion. I do product marketing at Databricks, and I'm very excited to be here with you as your host today, as well as with our speakers, because we have a very exciting lineup and they have prepared some awesome presentations and demos to discuss and show you best practices and techniques on how to better operationalize and automate machine learning at scale. So I really hope you'll enjoy hearing from them and learn more about machine learning practices in a diverse set of settings.
Sylvia Simion:
We will kick things off shortly with our opening keynote and demo, which will be followed by speakers from H&M, J.B. Hunt and Artis Consulting. And then we will wrap things up with a live Q&A at the end of the event. But first, some housekeeping and logistics.
Sylvia Simion:
Your audio connections will be muted for the entire webinar, so we can't hear you. A recording of this event will be available; make sure to visit the Databricks blog for details. And to deliver the best possible experience, the talks have been pre-recorded. With that said, we designed this to be a highly interactive event. Our speakers and a few engineers from Databricks are on hand to field questions. If you have a question at any point during this event, use the chat box, which is part of the platform. We will also have a live Q&A session at the end of the event. If you have a question for a specific presenter, please indicate it in your question and we will direct it to the appropriate person.
Sylvia Simion:
For those of you who are not familiar with who Databricks is, our company was founded by the original creators of Apache Spark seven years ago. Our mission is to help data teams solve the world's toughest problems. And our business is focused on helping companies accelerate innovation by bringing together data engineers, data scientists and analysts across the organization. Lots of people recognize us as the creators and founders of some of the best-of-breed open source technologies, starting with Spark, but also Delta Lake and MLflow. And recently we also acquired Redash.
Sylvia Simion:
So let's get started. Let's jump in. We have a packed agenda for the day. Our opening keynote on MLOps and ML Platforms: State of the Industry will be delivered by Matei Zaharia and Clemens Mewald from Databricks. Matei and Clemens will be followed by Sean Owen, who will give us a demo of end-to-end MLOps for PyTorch using MLflow on Databricks.
Matei Zaharia:
Hi, everyone, and welcome to our MLOps Virtual Event on automating machine learning at scale. I think everyone on this webinar is aware that machine learning is transforming all major industries, from healthcare to logistics to the industrial internet of things. There are already companies deploying thousands, and in some cases millions, of models to manage day-to-day operations. But at the same time, machine learning is very different from traditional software, and developing and operating machine learning applications is complex. So let's look at a few of the ways that it's different. First of all, in terms of goals. The goal of traditional software is usually just to meet some kind of functional specification. For example, when you press this button, you create an account for someone. And basically it's a boolean goal: either you've met the goal or you haven't. Once you've done it, you've got software that works and that's it. It's going to keep working.
Matei Zaharia:
On the other hand, in machine learning, the goal is usually to optimize a metric such as prediction accuracy. And so it's something where you're never completely done. There are always ways to make things better. And it might also change over time as the world around you changes. A second important difference is what affects the quality. In traditional software, the quality of your software depends only on the code that you've written. So you can review that code, you can debug it, and at some point you can say, okay, it's done, and then the application will keep working really well.
Matei Zaharia:
In contrast, machine learning by definition produces programs that generalize from the data sets you give them for training. So the quality depends heavily on the training data, and this data has to change over time as the world around you changes, and so the quality of your application is also going to change. In addition, as you change the data, you might also have to change the tuning parameters for your algorithms, and this is an additional complexity. So it's a lot harder to say that once I've written the code I'm done with the application; you have to keep training it with new data and keep tuning and optimizing it in order to get the best performance.
Matei Zaharia:
And finally, let's look at what kind of software you use to build the applications and how you manage that. In traditional software, you usually just pick one software stack. For example, you pick a database, you pick a web server framework, you pick a UI framework and so on, and then you just build your application with those and that's it.
Matei Zaharia:
In contrast, in machine learning, because your goal is to optimize some kind of metric, you always want to be able to experiment with new libraries and new algorithms for various parts of your pipeline, and maybe combine them in a new way for the same task, because if you can improve the prediction accuracy by half a percent or a quarter of a percent, that potentially makes a very large impact on your business. So you need infrastructure for machine learning that makes it very easy to switch and experiment with different libraries and algorithms, unlike traditional software. As a result of these differences, operating machine learning applications is very complex, and that's why it has given rise to the whole field of MLOps.
Matei Zaharia:
First of all, there are many teams and systems involved, because the application involves not just some code written by an ML engineer, but also a data pipeline that's feeding it, and also some work to integrate the model into an application and then monitor how it's behaving and provide feedback into improving it. So it involves at least these three different teams, sometimes maybe more, that need to cooperate.
Matei Zaharia:
Secondly, the application needs to constantly update its data, and you need to constantly compute and recompute metrics to see how it's doing. So you need not just an application that's running, but a whole set of data pipelines behind it that keep feeding it and keep retraining it and make sure that it's working at peak efficiency.
Matei Zaharia:
And finally, with machine learning, it’s pretty hard to move from development to production environments because in development, you’re experimenting with so many different libraries and approaches and you need to somehow capture it in a reproducible way and run it in production or run it in an application and make sure it’s producing the same results.
Matei Zaharia:
Because of all of these complexities, what we found talking to a lot of machine learning teams in the industry is that they often have to spend half of their time just maintaining the existing models that they've put into production. And they don't have a lot of time to spend on developing new models. So it's really important to come up with an MLOps process that automates as much of this as possible, so that these teams can actually innovate as well, not just spend time babysitting these models and making sure that they work at least as well today as they did yesterday.
Matei Zaharia:
The response to these challenges is a whole new class of software called machine learning platforms, which is software to manage the ML development and operations process all the way from data and experimentation into production. And quite a few companies have built internal ML platforms to date. Some examples include the largest web companies such as Google, Facebook and Uber, but many other enterprises are building them as well. And there is also a lot of work in the open source community to design them; in particular at Databricks, we started MLflow, which is one of the most widely used open source projects in this space. And these ML platforms usually provide a range of functionality. They could include data management, experiment management to track metrics over time, model management to let you share models, and also functions to make it very easy to deploy the models for inference, or to reproduce one, or to test and monitor an application. And they do all this through a consistent interface so that your teams can adopt them and work with different models in the same way, and keep finding and improving the best model for a particular task.
Matei Zaharia:
There are a lot of different components in ML platforms, but one thing I want to talk about is what are the top features that can make an ML platform succeed or fail, and that really make them a lot more successful. So based on our experience working with thousands of organizations that are using ML, we found three really important features that you should think of. The first one is ease of adoption of the ML platform by data scientists, engineers, and the users of the model; basically everyone who will be involved in that process. So you need to ask how much work it takes for them to use the platform, especially if they have existing code or existing data pipelines or existing apps that you want to bring onto this platform. And you also need to ask what machine learning libraries, what deployment environments and so on are supported by it.
Matei Zaharia:
All of these things will make a huge difference. If the platform is easy to adopt, then data scientists, data engineers, and so on will begin using it, and you'll start getting all the benefits of a principled way to manage and operate these applications. And if it's hard to adopt, it will be an uphill battle, and people might view it as too much hassle to actually begin using it. So that's one of the main things that we've tried to optimize for, for example in MLflow.
Matei Zaharia:
The second important factor is integration with your data infrastructure. As I said, machine learning applications all feed on data, and it's very common for the machine learning team to want to go back and change a data pipeline, collect a new type of data and so on, or to manage the data sets themselves, for example by creating versions so that they can have reproducible model training and experiments.
Matei Zaharia:
So it's really important for your MLOps infrastructure to integrate with the data infrastructure through features such as data versioning, monitoring, governance, and APIs and user interfaces that make it easy for data people and machine learning engineers to collaborate. And that's something that we spend a lot of effort on as well, of course.
Matei Zaharia:
And the final thing that we've found very useful is to have collaboration functions that allow teams to share code, data, features, experiments and models in a central place inside a company, and of course to do this securely so that you can actually govern who has access to what. And this is because a lot of machine learning projects can benefit from building on previous projects, and also because there are so many different types of teams and users involved that it's really important for people to be able to find the latest version of a model, or the latest version of a data set and so on, and build on it reliably, as opposed to just emailing files around between themselves.
Matei Zaharia:
These lessons motivate the way that we support MLOps in Databricks and also in the open source projects that our teams develop. So basically our philosophy around MLOps is twofold. First of all, we think that every organization's requirements will be slightly different because of the internals of their business, the data they have, or the expertise they have and so on, and they will change over time. As a result, we aim to provide very general platforms that are easy to integrate with the diverse tools you might have in your company, and that allow you to change the details of how you're doing machine learning over time and still have a principled way to manage it, and a principled way for teams to consume and operate the applications that you've built using ML. And we do this through three pillars.
Matei Zaharia:
The first one is the Databricks Workspace, which is basically a unified development environment where data scientists, ML engineers, data engineers, and analysts can all collaborate on the same data and the same code, so that helps teams work together. And then two open source projects: Delta Lake is a data management layer on top of cloud storage, such as Amazon S3, that provides transactions, versioning, and a whole bunch of rich management features to let you easily work with these large data sets as a team. And MLflow is an open source machine learning platform that you can integrate with many popular programming languages, libraries, deployment tools and so on to do a lot of the functions I talked about, such as experiment management, monitoring, sharing models centrally and so on.
Matei Zaharia:
So in this webinar, we’re going to talk about the requirements for MLOps and some of these technologies in more detail. We’re going to talk about how we and other organizations are performing MLOps at scale. We’ll have some demos to show this as well as experience from two of our large ML customers, about what they’ve learned in the process. And finally we’ll have a live Q&A with the presenters at the end. I hope you enjoy the webinar.
Clemens Mewald:
All right. Thank you, Matei, for that overview. I'll walk you through how we at Databricks address some of these problems. So Databricks provides a lot of different capabilities, but in this talk, I'd like to take a view of Databricks from the perspective of an ML platform. So during this talk, I'm going to go through each one of these boxes and describe how Databricks addresses a lot of these challenges. And we'll start with the data science workspace. The data science workspace is really the environment where data engineers, data scientists, ML engineers and data analysts can come together and collaborate on the world's toughest problems. The core user surface within the data science workspace is notebooks. And Databricks has quite a unique notebook offering.
Clemens Mewald:
First and foremost, it actually has multi-language support. What that means is that each notebook has a default language. You can see this is a Python notebook. That means every cell in that notebook will by default be interpreted as Python, but then each individual cell can actually declare its own language. So you can write Scala, SQL, Python or R all within the same notebook. And that doesn't only give you a lot of flexibility, it also facilitates collaboration. So a data engineer could use Scala in the same notebook and the data scientist could use Python. And then these notebooks also have cloud-native collaborative features that you're used to from other products, such as commenting within a notebook. And also when you share a notebook, we have a feature called Co-Presence, where you have a Co-Presence indicator which shows you that someone else is in the same notebook. And then if the person has edit rights, you can see their cursor and their edits in real time. So you can really collaborate in real time on the same notebook and on the same code.
Clemens Mewald:
Now, notebooks are great for exploration and experimentation, but we also want to make sure that we facilitate putting them into production. So we've introduced a new feature called Git-based projects that allows you to [inaudible 00:17:58] data into Databricks and then check your notebooks into your favorite Git provider of choice, go through the code review process, run some tests, run any CI/CD automation that you can think of, and then bring them back into Databricks to run production jobs. So this really combines the flexibility of notebooks with the rigor of CI/CD and software deployment systems.
Clemens Mewald:
Now, once you have the data science workspace in place, the next thing you really care about is making sure that you have access to all of your data. So Databricks provides a very unique product called Delta Lake, which is an open source project but also integrated into the platform. Delta Lake provides a transaction layer on top of your data lake, so your data stays within your data lake of choice and Delta Lake provides additional benefits on top of it.
Clemens Mewald:
First and foremost, you can ingest any format of data at any scale from any source. So it doesn't matter if it's CSV files or Parquet files, you can ingest all of them and use the Delta Lake format. And Delta Lake provides ACID transactions to actually guarantee data validity. So any ingestion and any change that you make on your Delta tables actually creates a new transaction, and that also facilitates a feature we call time travel. Every time a transaction happens, we increment the version number in the write-ahead log. And then you can always go back in time, hence the name, and look at your data at a specific version, because we just disregard the transactions that happened after that version, which is a very unique feature and actually facilitates reproducibility. And to integrate this with MLflow, we actually created automated logging of what data you used and the version information. And I'll show you that in more detail on a later slide.
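For reference, a minimal sketch of the time travel idea described above, assuming a `spark` session is already in scope (as in a Databricks notebook); the table path, version number, and timestamp are hypothetical.

```python
# Read the current state of a (hypothetical) Delta table
current = spark.read.format("delta").load("/delta/events")

# Read the same table as of an earlier version, or as of a timestamp,
# which is what makes training runs reproducible against versioned data
v5 = spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")
as_of = (spark.read.format("delta")
         .option("timestampAsOf", "2020-10-01")
         .load("/delta/events"))
```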
Clemens Mewald:
So at Databricks we really want to let you focus on training the machine learning models. The Machine Learning Runtime provides a DevOps-free environment that's pre-configured and optimized for machine learning. And we provide it in a few different flavors. On the screenshot you see I picked the GPU runtime, which has all of the drivers and configuration set up for you, so you can just get going. And when you choose that, you just have to pick a [inaudible 00:20:24] type on the cloud of your choice, and then you can go ahead and train your GPU models.
Clemens Mewald:
Now, what do we package up in these runtimes? Well, we package up the most popular ML libraries. That is anything from TensorFlow, Keras, PyTorch, scikit-learn and of course MLflow, pre-configured and pre-installed in the Machine Learning Runtime. We release the Machine Learning Runtime regularly to make sure that all of these are up to date, and we test rigorously to make sure that all of this actually works well together, so you don't have to worry about setting up these environments yourself. We also include libraries to more easily distribute your machine learning and deep learning workloads, and I have a slide on this right after this one. And then of course, we also built in libraries for hyperparameter tuning and AutoML that I'll get back to in a slide or two as well.
Clemens Mewald:
So for distributed training, we actually built support into the Machine Learning Runtime to distribute Keras, TensorFlow and PyTorch models. Of course, you can always train and evaluate models distributed on Spark, but HorovodRunner is a specific library that helps you distribute Keras, TensorFlow and PyTorch models. And then we also introduced support for the TensorFlow native distribution strategies, which were introduced in TensorFlow 2.0.
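As a rough illustration of the HorovodRunner approach mentioned here (it ships with the Databricks Machine Learning Runtime), a sketch might look like the following; the training function body is a placeholder rather than a complete example.

```python
from sparkdl import HorovodRunner   # available in the Databricks ML Runtime
import horovod.torch as hvd

def train_hvd():
    hvd.init()   # one Horovod process per worker
    # ... build the model and data loaders, wrap the optimizer with
    # hvd.DistributedOptimizer(...), shard the data by hvd.rank(),
    # and run the training loop ...

hr = HorovodRunner(np=2)   # distribute training across 2 workers
hr.run(train_hvd)
```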
Clemens Mewald:
And then for hyperparameter tuning, we integrate a library called Hyperopt that has a pretty simple interface. You just configure a search space, and then you call this function fmin, and we've also extended it with this API called SparkTrials. If you use Hyperopt as just the open source library, it will run all these trials in sequence. But with SparkTrials, we actually use a cluster to parallelize the trials, and here you can see it runs six trials in parallel for a total of 96 trials. And again, this doesn't need to be a Spark model, so you can train a scikit-learn model in the function that you pass it, and then we basically just take the scikit-learn model and train six different versions of the model with different hyperparameters at the same time. And you'll see this in a later slide as well, but this integrates with MLflow automatically, so you get all of the tracking of your hyperparameter tuning runs in MLflow for free.
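A minimal sketch of what this looks like in code, assuming a scikit-learn objective and hypothetical training data (`X_train`, `y_train`) already in memory:

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]))
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}   # Hyperopt minimizes the loss

search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 3, 12, 1),
}

# Plain Trials would run trials in sequence; SparkTrials farms them out to the cluster.
best = fmin(fn=objective,
            space=search_space,
            algo=tpe.suggest,
            max_evals=96,
            trials=SparkTrials(parallelism=6))
```

On the Databricks ML Runtime, each of these trials is also tracked in MLflow automatically, as described above.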
Clemens Mewald:
So now that your model is trained, what are the deployment options? MLflow provides a really flexible and rich set of deployment options out of the box. So on this slide, what you can see is just a schematic of some of the supported ML frameworks, how a model is logged as an MLflow model in the MLflow tracking server, and then the deployment being managed by the MLflow Model Registry. And then you really have a lot of different options for deploying these models, whether it be a Docker container, a Spark UDF, a REST endpoint, or using some of the open source libraries. And let me double-click on one of these options to highlight an important part about model deployment with MLflow.
Clemens Mewald:
So this is the core method you would use to deploy an ML model that is logged as an MLflow model as a Spark UDF. As you can see here, you just load it, and here you'd be referencing the model that's in the Model Registry by name and production stage, and then you just apply it to a DataFrame. This works for a Spark MLlib model. Now let's see what this looks like for scikit-learn. This is the line of code that you would use for a scikit-learn model, and this is the line of code that you would use for a TensorFlow model. So just for dramatic effect, let me go back: Spark MLlib, scikit-learn, TensorFlow. As I think you've noticed, yes, they're all the same. And that's one of the benefits of the MLflow format: it has this abstraction called pyfunc, which exposes any ML model as a Python function.
Clemens Mewald:
So all of the deployment options look the same. It doesn't matter whether the model was trained in Spark MLlib, scikit-learn, or TensorFlow, it works the same. And the same statement is true for all of the deployment options. So building a Docker container for an MLflow model: same thing, no matter what ML framework you used. And this is really convenient, especially because I haven't seen a single enterprise that uses only one type of framework.
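A minimal sketch of the pattern being shown, with a hypothetical registered model name and DataFrame:

```python
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Load whatever version is currently in the Production stage of the registry
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

# The same line works whether the underlying model was Spark MLlib,
# scikit-learn, or TensorFlow, because it is exposed through the pyfunc abstraction.
scored = df.withColumn("prediction", predict_udf(struct(*df.columns)))
```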
Clemens Mewald:
So now that we've discussed the entire flow, let's actually look at the foundation of all of this that's provided by MLflow to give you end-to-end MLOps and governance capabilities. So in MLflow we've introduced this capability that we call auto logging. We automatically track as much information as we can about your work, and that starts with something I've alluded to earlier, which is tracking of the data source and its version.
Clemens Mewald:
So if you use a Delta table, we keep track of the table itself that you used and the version number, as described earlier. With this, you can actually go back after the fact and say, I want to take a look at the data that I used for training, at the exact version that I was using back then. And this information is logged automatically. By the way, we do hook into the Spark data source API, so this works for any Spark data source; if you read a CSV file, it works as well. However, of course, if you read a CSV file, we don't have the versioning and time travel feature. Now, we also started capturing the schema for these models.
Clemens Mewald:
So you can see an example where we have the input schema and then the type of the prediction column for the model, and that helps us in a lot of different ways. One of them is in model deployment: we can actually check if a schema is compatible at deployment time for the model. This just shows the basic auto tracking for all of the ML frameworks. So for all of these frameworks, if you train a model and use auto logging in MLflow, you get all of the parameters, the metrics, and all of the artifacts that we can track automatically logged for you. And as you can see, this works for all of the popular ML frameworks.
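To make the auto logging idea concrete, here is a minimal sketch using scikit-learn; `mlflow.autolog()` also covers the other frameworks shown on the slide, and the toy dataset here is just an illustration.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.autolog()   # parameters, metrics, artifacts, and the model get logged automatically

X, y = load_iris(return_X_y=True, as_frame=True)
with mlflow.start_run():
    LogisticRegression(max_iter=200).fit(X, y)
    # The logged model can carry a signature (input schema plus prediction type),
    # which is what enables the compatibility check at deployment time.
```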
Clemens Mewald:
This is again what I showed earlier, the automatic logging for a hyperparameter tuning search, and then we can visualize it in what's called a parallel coordinates plot. The way to read this is that each one of these lines is one trial, and they are color coded by the optimization metric, which is the loss. So you can see which lines have a low loss, and then you can visually see which values of each parameter actually lead to a high-quality model. And again, this is automatically logged if you do hyperparameter tuning on Databricks using the Machine Learning Runtime.
Clemens Mewald:
We've also started implementing automated tracking of model interpretability. Using a very popular library called SHAP under the hood, we calculate feature importance for the model when you train it. And then if you do something like train an image model, we can actually visualize the feature importance based on which areas in the image contribute to the prediction. So this is really powerful, and it gives you insight into how your models behave at training time.
Clemens Mewald:
And then of course, in addition to all of the ML-specific features, we also keep track of the code itself: a snapshot of the version of the code when you trained it, the cluster configuration of the compute that you used when you ran the model, and also the environment configuration in terms of what libraries you used.
Clemens Mewald:
Once a model is ready for deployment, as I mentioned earlier, you can use the Model Registry in Databricks. So this is a screenshot of the managed Model Registry in Databricks, where you can find models, their versions, and which of these versions is in which deployment stage. And we actually have access controls implemented on a per-stage level that facilitate the handoff between different stages, which we'll see on the next slide.
Clemens Mewald:
So you can see we have governance in terms of requesting transitions to specific stages for models. And if someone doesn't have the access rights to actually make the transitions, they only see the option to request them. And then we of course keep an audit log of everything that has happened to a model. And we just recently implemented comments here as well, where you can basically comment and collaborate with your colleagues on managing the deployment process of these models.
Clemens Mewald:
This screenshot is from the serving product in Databricks. This is the simplest way you can deploy an MLflow model as a REST endpoint: you click a button that says Enable Serving, and we bring up a cluster and expose the model as a REST endpoint. It also automatically knows all of the versions and deployment stages, so if you call the model-name/Production endpoint, your request will always be routed to the version that's currently marked as Production. And this is just a screenshot of an output that's actually using Spark streaming for real-time model monitoring. This is computing a streaming RMSE to tell you the quality of the model as new predictions and labels come in.
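As an illustration of the streaming-monitoring idea (not the exact notebook from the talk), a sketch of a streaming RMSE over a hypothetical Delta table of scored records might look like this:

```python
from pyspark.sql import functions as F

# Hypothetical table with columns: prediction, label, scored_at
scored = spark.readStream.format("delta").load("/delta/scored_predictions")

rmse_per_window = (
    scored
    .withWatermark("scored_at", "1 hour")
    .groupBy(F.window("scored_at", "10 minutes"))
    .agg(F.sqrt(F.avg(F.pow(F.col("prediction") - F.col("label"), 2))).alias("rmse"))
)

(rmse_per_window.writeStream
 .outputMode("update")       # emit updated RMSE values as new records arrive
 .format("memory")           # fine for a demo dashboard; a real job might write to Delta
 .queryName("model_quality")
 .start())
```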
Clemens Mewald:
So as you may be able to guess, as we walked through all of this information that we're keeping track of, we're basically checking off all of the items in this reproducibility checklist. We have the code that you used, the data that you used, the cluster configuration, and the environment specification that you used. So we actually implemented a reproducibility feature that allows you, for a run, after the fact, to reproduce it: we have the code that you used; if you used a Delta table, we know the data version that you used; we can recreate the cluster for you with the exact configuration; and we can also recreate the exact environment and library configuration that you had, which is quite unique, and which is enabled by the fact that we actually track all of this information end to end, from data to model deployment. And that's a quick overview of Databricks as an ML platform. With this, I'm going to hand it off to Sean, who will give you a demo of the actual product.
Sean Owen:
Well, hello everyone. This is Sean Owen. I'm a principal solutions architect here at Databricks, and I'm here today to show you a few new features of MLflow in Databricks. These are things like webhooks, serving from the registry, and new auto logging features for PyTorch and SHAP. But along the way, I also want to show you some interesting ways, I think, in which you can apply these tools like PyTorch and SHAP and even Delta Lake to solve a problem. And the problem here, to be specific, is classifying images: chest x-rays. These are provided by the National Institutes of Health, and this data set actually became part of a competition on Kaggle. So along with the images, we have labels; they're one or more of 14 different diagnoses that indicate what is maybe shown in the x-ray, what might be wrong with the chest in those cases. And the task here will be to learn to label these images and maybe explain why they're labeled that way. And of course, along the way, we'll use MLflow to help us.
Sean Owen:
So, word of caution: I'm not a doctor, you're probably not a doctor either, and the model we're going to build here is a simple one. It's probably not good enough for clinical use; it's not necessarily accurate enough. So please don't diagnose from this, don't try this at home, but I hope it shows that this kind of learning and explanation is quite possible. So as in all things, we begin with the data, and I'm not going to spend too much time on the data here, but I do want to show you how a technology like Delta Lake can help process so-called unstructured data like images. So Delta, as you may know, underpins the Lakehouse architecture in Databricks, which means that you can do data warehouse-like operations at scale, but you can also analyze non-tabular data like images in the same place, and Delta can help with both.
Sean Owen:
So we're going to start by reading the images with Spark, and Spark can easily read a directory full of images; we get nice thumbnails here inline when we load these in Databricks. However, we might not want to just read the images with Spark, but go ahead and ETL them into a Delta table. So why? Well, Delta is transactional, for one. So that means we don't have to worry about who's writing to the table of images as we're reading it. It also offers time travel. This lets us maybe go back and query the state of this image data set at a previous point in time; maybe when we built that model three weeks ago, we can go query that table as of that time.
Sean Owen:
But maybe more than that, deep learning often involves some degree of preprocessing. We have to normalize image size, depth and channels, and maybe we want to do that just once, rather than over and over as we make passes through the data to build a model. And also, in the cloud, it can be kind of slow to read a bunch of small files over and over and keep listing those storage buckets. So maybe there's a speed advantage to ETLing once into a table and reading those blobs from a nice compressed, efficient data store like Delta, if we have to read them over and over.
Sean Owen:
So to that end, we're going to start with a little bit of that. We'll read the images, but also quickly ETL them into blobs, and we'll need to also load the metadata associated with these images, the labels, and parse them a little bit to get the various labels for each of the images, and then simply join them. Having joined the image data and the labels, we can simply write this out as a Delta table and register it in the metastore. And we should be able to see that example here if we want. Okay, we're good. We've got the image columns and about 14 different possible diagnoses. Back to the notebook.
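A rough sketch of that ETL step; the paths, glob pattern, and label-parsing details are simplified assumptions rather than the demo's exact code.

```python
from pyspark.sql import functions as F

# Read the raw image bytes as blobs
raw_images = (spark.read.format("binaryFile")
              .option("pathGlobFilter", "*.png")
              .load("/data/nih-xrays/images/"))

# Read the label metadata and split the pipe-delimited multi-label column
labels = (spark.read.option("header", True).csv("/data/nih-xrays/labels.csv")
          .withColumn("findings", F.split("Finding Labels", r"\|")))

# Join blobs with labels on the file name and write once to a Delta table
(raw_images
 .withColumn("file_name", F.element_at(F.split("path", "/"), -1))
 .join(labels, F.col("file_name") == F.col("Image Index"))
 .write.format("delta").saveAsTable("xray_images"))
```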
Sean Owen:
So now let's look a little bit at the modeling issue. In this example, we are going to eventually use MLflow, because I want to show you how MLflow integrates with PyTorch. Actually, MLflow integrates with something called PyTorch Lightning. So for the PyTorch users out there, you'll know that using PyTorch often means writing a fair bit of boilerplate code to describe the training loop and run it manually. PyTorch Lightning abstracts some of that away, so you don't have to write some of the boilerplate; you can just write the key pieces. So we'll get to that in a minute.
Sean Owen:
So to start here, we're just going to read this table of images. And since it's not that big, it's about 2.2 gigabytes, we're going to pull it down to a pandas DataFrame and do a simple train/test split before we start training. Now, I'll say that you could adapt the same approach to much larger data sets; you could distribute the training on top of Spark with PyTorch and a tool called Horovod. But for simplicity, we're not doing that here. We're just going to train this relatively small dataset on one machine.
Sean Owen:
We're going to define a few helper functions here. Torchvision is a companion package to PyTorch that lets you define some additional transformations that are required for the input. For example, here we need to convert the channel ordering of the images a little bit and normalize them in a way that's correct for the pre-trained layer we're going to apply in the model. So a little bit of a detail, but Torchvision will help us here. And again, PyTorch users will know that to access data you typically define a Dataset class that defines how big the data set is and how you get individual elements. And that's where we use these transformers and so on that we defined, to say how we get from the relatively raw data in the Delta table to something that's ready for this particular model in PyTorch. It's really just a matter of reading the image, transforming it, and returning the transformed version.
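A minimal sketch of what such a Dataset might look like, assuming a pandas DataFrame with an image-bytes column (here called `content`) and a 14-element multi-hot label vector (here called `labels`); the normalization constants are the usual ImageNet statistics expected by a pre-trained backbone.

```python
import io
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class XRayDataset(Dataset):
    def __init__(self, df):
        self.df = df
        self.transform = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),                              # HWC uint8 -> CHW float in [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(io.BytesIO(row["content"])).convert("RGB")
        labels = torch.tensor(row["labels"], dtype=torch.float32)
        return self.transform(image), labels
```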
Sean Owen:
Now, before we get into the PyTorch modeling, we are going to enable this new feature of MLflow. MLflow already supported auto logging for a number of popular packages like scikit-learn, Keras and TensorFlow, and now it supports PyTorch. So by enabling auto logging for PyTorch, we don't really even have to write MLflow code in order to get some of that benefit, in order to log what we're doing as we train a PyTorch model.
Sean Owen:
So let's actually build the model. Now, here's where we're going to use PyTorch Lightning, and I hope this is maybe interesting for PyTorch users that haven't seen it. As with PyTorch, we need to define a module that defines the network we are going to train and defines how we train it. But rather than write the training loop directly, we really just fill in some blanks here. For example, we have to define what the network looks like, and here it's really mostly DenseNet-121. We load this pre-trained model, so this is transfer learning, and we're not going to try and retrain that; we add a little bit of dropout on top, and a fully connected dense layer on top of that, in order to ultimately build a classifier that's going to predict from these images one or more of these 14 possible diagnoses.
Sean Owen:
So, pretty straightforward stuff for PyTorch users. And the nice thing is, having defined all the key parts (what a forward pass through the data looks like, what the optimizer is, how to validate, how to train), we simply let PyTorch Lightning do the work down here. It can do some nice things, for example auto-tune our batch size, auto-tune our learning rate, handle early stopping, and things like that. So it's a nice framework, and PyTorch Lightning, because it has all these hooks, is something that MLflow can hook into to automatically log.
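A rough sketch of the kind of module being described, with `mlflow.pytorch.autolog()` enabled before training; this is an illustrative reconstruction rather than the demo's exact code, and the data loaders are assumed to be built from the Dataset sketched earlier.

```python
import mlflow.pytorch
import pytorch_lightning as pl
import torch
import torch.nn as nn
from torchvision import models

class XRayClassifier(pl.LightningModule):
    def __init__(self, num_classes=14, lr=1e-3):
        super().__init__()
        backbone = models.densenet121(pretrained=True)        # transfer learning: pre-trained DenseNet-121
        in_features = backbone.classifier.in_features
        backbone.classifier = nn.Sequential(nn.Dropout(0.2),   # dropout + dense layer on top
                                            nn.Linear(in_features, num_classes))
        self.model = backbone
        self.loss_fn = nn.BCEWithLogitsLoss()                  # multi-label: 14 independent diagnoses
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", self.loss_fn(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

mlflow.pytorch.autolog()                       # hooks into Lightning to log params, metrics, and the model
trainer = pl.Trainer(max_epochs=10, gpus=1)    # gpus arg per the Lightning versions current at the time
trainer.fit(XRayClassifier(), train_loader, val_loader)
```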
Sean Owen:
So I'm not going to run this in real time here, since the model training does take about 17 or 18 minutes on a GPU. But if you did run this, you would find that, magically, with no MLflow code in there, you would get models logged to MLflow if you ran the cell, and in Databricks, MLflow is integrated into this experiment sidebar you see here. And if we were to pop that out, you'd see this model is the one we get from PyTorch auto logging. And it has a fairly rich set of information: who ran the model? I did. When? How long did it take? What was the exact revision of the notebook and the code that made that model? And of course, all the key parameters that were defined for the PyTorch model, key metrics like which epoch was best and what the final validation loss was. And of course we get the model, we get the last checkpoint, and we get a summary of the network architecture: pretty useful information, even some helpful hints about how we might go load this model later from MLflow.
Sean Owen:
Now I'd also like to show you... Well, first I should say that, while I'm not going to show you too much about the model's accuracy, rest assured it actually achieves accuracy that's pretty comparable to the results you might see in the paper that introduced this dataset. So with the pre-trained layer and modern tools, we can do okay out of the box. I'm sure you could do better, though, with some more GPU time and ingenuity.
Sean Owen:
Now, the next thing is interesting. I'd like to show you serving models with MLflow. One key thing MLflow does is hand you back your model in a maybe more useful form. Sometimes that could be a Spark UDF, a function you can then apply with Spark to a bunch of data, but it can also create out of a model a microservice, a REST API. And you've always been able to deploy that to services like Azure ML or Amazon SageMaker, but as of recent versions of MLflow, you can actually serve the model out of the Model Registry in something like Databricks. And that's what I'm going to show you here. So this is actually my registered model for this notebook, and I'm currently working on version three here, and if I like, I can enable serving for this registered model and suddenly I have an endpoint available, even within Databricks, that I can send images to, to get their classification.
Sean Owen:
But I want to say a little more about that. So what these REST APIs do is expose a service that can accept JSON-formatted descriptions of input and return classifications as output. The only problem is the input here is really an image. It's a tensor, and there's not a great way to describe that in JSON, at least not in a form that can be automatically translated into JSON by MLflow. So to make this work, we're going to have to customize a little bit. And this, I think, illustrates in some ways the power of MLflow: if you need to, you can change how it works under the hood. So we're going to define a custom MLflow model that will wrap our PyTorch model and enable us to accept the image as a base64-encoded string of bytes, from which we'll parse the image, apply transformations, apply the PyTorch model, and then return the result.
Sean Owen:
So it's really not too hard. If you need to, you can do something like this. And this now enables us to turn this into a service where we can accept images through JSON and return their classification. So having defined this class, all we need to do is load our actual PyTorch model, wrap it up in this wrapper, and log that. And that's the model we're actually deploying to the MLflow registry to serve here in Databricks. So you may see I've already registered it, but just for reference, you will find that you can register models here by clicking Register Model and selecting the model you're interested in. But I've already registered the one I'm interested in here.
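A minimal sketch of such a wrapper, assuming an input column named `image`, a `transform` like the one defined earlier, and a list of 14 `class_names`; the column name and helper names are illustrative, not the demo's exact code.

```python
import base64
import io

import mlflow.pyfunc
import pandas as pd
import torch
from PIL import Image

class XRayWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model, transform, class_names):
        self.model = model.eval()
        self.transform = transform
        self.class_names = class_names

    def predict(self, context, model_input):
        rows = []
        for encoded in model_input["image"]:                 # base64-encoded image bytes per row
            image = Image.open(io.BytesIO(base64.b64decode(encoded))).convert("RGB")
            batch = self.transform(image).unsqueeze(0)
            with torch.no_grad():
                probs = torch.sigmoid(self.model(batch))[0]  # per-diagnosis probabilities
            rows.append(probs.numpy())
        return pd.DataFrame(rows, columns=self.class_names)

# Log the wrapped model; this is the artifact that gets registered and served
with mlflow.start_run():
    mlflow.pyfunc.log_model("model",
                            python_model=XRayWrapper(trained_model, transform, class_names))
```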
Sean Owen:
Just to prove it works, let's try a little bit of code to load an image and send it to the REST endpoint. So I'm going to load a single image here from storage. And again, one of the nice things about the Delta Lakehouse architecture is that you can deal with data as tables, and you can deal with data as files, and images too; that's no problem. So I load an image from my dataset. This is what it looks like. Maybe the radiologists out there can make sense of that; I don't see anything interesting, but let's see what the model thinks. So to use the service, we simply need to encode it in a proper JSON request, send it to the API, and render the results. And we get that back here as a pandas DataFrame.
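A rough sketch of what that request might look like; the workspace URL, model name, token handling, and file path are placeholders, and the exact JSON payload format depends on the MLflow version.

```python
import base64
import json

import pandas as pd
import requests

with open("/dbfs/data/sample_xray.png", "rb") as f:          # hypothetical path
    encoded = base64.b64encode(f.read()).decode("utf-8")

# pandas "split"-style payload with a single base64-encoded image column
payload = json.dumps({"columns": ["image"], "data": [[encoded]]})

response = requests.post(
    "https://<workspace-url>/model/xray-classifier/Production/invocations",   # placeholder endpoint
    headers={"Authorization": f"Bearer {api_token}",          # api_token assumed to be defined
             "Content-Type": "application/json; format=pandas-split"},
    data=payload,
)
print(pd.DataFrame(response.json()))    # per-diagnosis probabilities
```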
Sean Owen:
What we might see here is that it looks like the model thinks this might be an example of atelectasis, which is a sort of lung collapse, or infiltration, but probably not a hernia. So these are probabilities. And hey, that's useful. I mean, if this were a better model and vetted by professionals, you could imagine this would be a useful assistant for radiologists who want a heads-up about what they might be looking for. Maybe a model can help them figure out what's even likely.
Sean Owen:
Before I go deeper on that point, I want to introduce another new feature of MLflow as of recent versions, and that is webhooks. So the Model Registry, as you might have seen there, part of its role is to manage the state of versions of a model. So you may have a current production version, you create a new version of the model and it's the staging candidate, and you test it. And at some point you promote it to production, if your tests pass and you have permission.
Sean Owen:
So these are important events, and maybe these events, creating a new model, creating a new testing candidate, need to trigger something like a testing job, a CI/CD job. So that's why MLflow now supports webhooks: triggers for actions in response to these events. It's fairly easy to register them. You will have to access the REST API directly, but you can do so like so, and effectively listen for these events. Now, for a simple example, I'm just going to set up a webhook that triggers a message to a Slack channel, just for demo purposes, but you can imagine doing much more than that: triggering a lambda, triggering a CI/CD job. So I've registered this webhook to ping my Slack channel whenever something happens to this registered model, and we should see if this works. So if I, for example, go into a model and comment on it, I should find that it registers a new message in our Slack channel. There it is. Okay. Pretty good.
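For reference, registering such a webhook via the REST API might look roughly like the sketch below; the endpoint path and payload reflect the Databricks registry-webhooks API around the time of this talk, and the host, token, model name, and Slack URL are all placeholders (check the current docs before relying on this).

```python
import json
import requests

host = "https://<databricks-workspace>"          # placeholder
token = "<personal-access-token>"                # placeholder

webhook = {
    "model_name": "xray-classifier",             # hypothetical registered model
    "events": ["MODEL_VERSION_CREATED", "TRANSITION_REQUEST_CREATED", "COMMENT_CREATED"],
    "description": "Notify Slack (or kick off a CI/CD job) on registry events",
    "http_url_spec": {"url": "https://hooks.slack.com/services/T000/B000/XXXX"},
}

resp = requests.post(f"{host}/api/2.0/mlflow/registry-webhooks/create",
                     headers={"Authorization": f"Bearer {token}"},
                     data=json.dumps(webhook))
print(resp.json())
```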
Sean Owen:
Okay. Now I want to get into another new feature here, and that is model explanations. Sometimes we want to know why the model is doing what it's doing, and there's a popular open source tool called SHAP that can do that for you. It can actually explain, at the individual prediction level, what about the input caused the model to make the prediction it does. Using SHAP isn't that hard, but in MLflow 1.12 it can actually be done pretty automatically. With a line or two of code, you can get MLflow to create the model explanations for you and even log plots like the one here. This is really a summary plot that says, overall, what the important features of the model are. Alongside, as you may see here, are the actual SHAP values, the model explanations, so that's available; you can do that with any model. For this particular model, though, I want to take the opportunity to show you something else you can do with SHAP, not through auto logging, but through a little more manual usage of SHAP that may be more interesting for this dataset.
Sean Owen:
So it turns out that SHAP can explain image classifications in an interesting way. With a little bit of code, you can try to overlay an image with a heat map showing what about the image caused the model to classify it the way it did. So in this case, I actually load an image from my data set that's definitely classified as infiltration, and I create an explainer from SHAP that can explain this model and ask it to explain. It's actually explaining against one of the middle layers of ResNet, excuse me, of DenseNet. And this is what you get out of it. So you can see, perhaps, in this x-ray there's a dark spot here on the left shoulder and a dark spot here near the bottom of the right lung, and for whatever reason, the model thinks those are particularly important in its classification. And its classification is here. Okay, good: 92% chance it's infiltration, maybe edema, maybe atelectasis.
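Two hedged sketches of the SHAP usages described here: the auto-explanation added in MLflow 1.12, and the more manual image explanation. The layer index, variable names, and tensor shapes are illustrative assumptions, and the image plot may need a channel transpose depending on how the batch is laid out.

```python
import mlflow
import mlflow.shap
import shap

# 1) Automatic: log SHAP values and a summary plot as run artifacts (tabular model assumed)
with mlflow.start_run():
    mlflow.shap.log_explanation(sk_model.predict, X_sample)   # sk_model / X_sample assumed in scope

# 2) Manual: explain an image classification against a mid-level layer of the network
explainer = shap.GradientExplainer(
    (torch_model, torch_model.features[7]),    # (model, intermediate layer); layer choice is illustrative
    background_batch,                          # a small tensor batch of background images
)
shap_values, indexes = explainer.shap_values(image_batch, ranked_outputs=2)
shap.image_plot(shap_values, image_batch.numpy())   # overlay heat maps on the x-rays
```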
Sean Owen:
So I think this expands on the idea that these models can be not just black-box predictors, but explainers too. And maybe if we did build a better model, this sort of thing could make it even more useful for professionals who are looking for hints, maybe looking for where to look overall. Maybe the model can more easily see something that isn't immediately obvious to the human eye. And these predictions and explanations could be created in parallel with Spark, written to Delta tables, and some of them could even be logged within MLflow, so this all comes full circle.
Sean Owen:
The last thing I want to show you is a new feature in Databricks called Projects. So those of you that know Databricks know that you work in the workspace, and we typically work with individual notebooks, but oftentimes we really want to work with groups of notebooks because we need to version them together. And while you've always been able to get a revision history of individual notebooks, until Projects came along you really couldn't version and commit changes to groups of notebooks at once. And that's why Projects exists.
Sean Owen:
So it doesn't change how you work with things. I'm actually working in a notebook that's inside a Project, not just a floating notebook in the workspace. But what this basically means is there's deeper Git integration here. For one, I can take a look at the Git branch, sorry, the Git repo that this notebook and these several notebooks are backed by, and I can edit and commit multiple notebooks at once, setting a commit message if I like. And so this makes it maybe more natural for people that are used to versioning larger projects, consisting of multiple notebooks, to work with these backed by a Git repo and not simply built-in version control. And we hope to expand this feature over time to let you version alongside notebooks things like maybe small configuration files or small data files as well, because sometimes that's just the more natural thing to do.
Sean Owen:
So I hope you've heard and seen some new features of MLflow here: webhooks, integration with PyTorch and SHAP for auto logging, some ways in which Delta can be useful for particular machine learning problems, serving models with the registry and how that might work with an image classifier, and finally, a quick note on Projects and what they might mean for your workflow.
Sylvia Simion:
Thank you, Matei, Clemens and Sean. A quick reminder: if you want to try some of the things you saw in the previous talk and demo, go to databricks.com/trial, where these tools are available. You can also continue to submit your questions via the Q&A panel as we carry on with the presentations.
Sylvia Simion:
So now let's hear directly from our customers about how they are handling the growing importance of machine learning in their products and systems, and more importantly, how they manage their ML initiatives, related assets, and the entire lifecycle within their organizations.
Sylvia Simion:
Our next speaker is Keven Wang from H&M. Keven will give an overview of H&M's reference architecture and ML stack, designed to address a few common challenges in AI and machine learning products, like development efficiency, interoperability, speed to production, et cetera. Keven will also give a demo of the production workflow. Please welcome Keven.
Keven Wang:
Hello everyone. My name is Keven. I work as a competence lead at H&M Group, and today I'm going to talk about how we apply MLOps at large scale. I work in an organization called AI Foundation at H&M Group. In AI Foundation, we work on a number of different use cases for different business problems, from design and buying to logistics and sales management, and also how we engage with our customers, pretty much covering the entire H&M value chain. Each of these use cases is driven by a multidisciplinary agile product team, developing and deploying the product end to end. When developing these different machine learning products, there are a number of common challenges. For example, how do you automate a machine learning training pipeline at large scale, not just for a single model but for hundreds or thousands of models? And how do you provide reproducibility for your machine learning models?
Keven Wang:
Let's say half a year back: you want to be able to take the same code and the same data and retrain the exact same model. And what does your model approval process look like, so you can bring a new model into production with enough confidence? We try to address these different challenges holistically by leveraging our reference architecture and platform. Even if teams are solving different business problems, they share a common process. For example, model training is about taking data, applying some transformation, training the model, and in the end putting the model into a model repository. Model deployment is also about taking the data, applying the exact same transformation, making a prediction, and then saving the results and delivering them to the end user. In the middle, you have some common key concerns, for example how to speed up the end-to-end feedback loop, and how to monitor your model performance, your data drift, and also your infrastructure.
Keven Wang:
Also, how do you do version control of not only your model but even your data, so you can manage them the same way as a software artifact? Based on this, we came up with a number of technical components, for example model training, model management, and also model deployment. It's important to have some of these abstractions here so we can pick a different tool for each of them and evolve them independently, because machine learning is an emerging area; there are new tools coming out every day, every week, and every month, and we want to leverage the best of them to simplify our work.
Keven Wang:
After a year of exploration and trials, we converged on a number of tools. For example, for model training we have three stacks. For newly started use cases, we tend to leverage the Databricks-centric architecture. Then for the more mature use cases, where scaling and automation are key, we tend to use either Airflow or Kubeflow as the main machine learning orchestrator. For model management, the same as many other companies, we love MLflow. And for model deployment, specifically online models, we love Kubernetes as well, and open-source tools like Seldon bring lots of ML-specific features.
Keven Wang:
Besides this, system observability is quite important. We love the Azure stack, so we tend to use default tools like Azure Monitor and Power BI for most cases, and for Kubernetes-based applications, Grafana and Prometheus are a great choice. A machine learning product is [inaudible 00:53:33] software product, so we can leverage best practices like continuous integration and continuous delivery to automate the process. Also, in the end, looking at this complex stack, we can't do it without proper infrastructure automation.
Keven Wang:
Let's talk about model training. For newly started use cases, we love interactive model development in notebooks. However, notebooks are not a very scalable way to do product development as your codebase grows and your team size grows; you tend to accumulate lots of technical debt if you do notebook-only development. Our process now is basically to extract the complex logic from the notebook into separate, isolated modules that we can develop locally, and then use continuous integration to put all the pieces together and train on Databricks.
Keven Wang:
Let's take a look at how exactly it works. This is a demo project where I will download some data from the internet, train a random forest model, and then save the model into MLflow. Let's take a look at the project structure. In this src folder, I have a number of Python modules that capture the different functions: for example, the config management, the evaluation module, which contains a number of plotting and evaluation methods, and the prepare-data module, which includes a number of [inaudible 00:55:11] to clean the data.
Keven Wang:
In the tests folder, I have my pytest cases to test these Python modules. In the notebook folder, I have my notebook, which uses a number of those Python files. And you can see here that even though it's a notebook, I can leverage the IDE to validate the syntax.
Keven Wang:
Now let's say we have done all the local development, the coding and also the pytest runs, and we want to upload this notebook to Databricks and run it.
Keven Wang:
So I will call this script. What this script does is package all my Python modules as an egg file and upload it to Databricks, and then also upload my notebook into Databricks as well. Now it's done; let's take a look. This is my Databricks workspace. Here my notebook has been placed into a specific folder structure: my project name, my branch name, my user ID, and my notebook. So in the first cell of this notebook, it will install the egg file just uploaded by the script. The egg file is also placed in a specific structure, with respect to the branch I'm working on and also my user ID, so I will not override my colleagues' work.
Keven Wang:
Let's run this notebook now. Here it initializes a random seed as well. Preparing the training dataset is just one method call. Afterwards, let's start training the model. Here, let's say I want to change the number of estimators from 100 to 150 for this random forest model. Besides training the model, it will also use MLflow to keep track of important parameters and other metrics. That's it.
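As a rough sketch of what this training cell is doing (the real data preparation lives in the project's own modules, so synthetic data and the model name "my-model" stand in here), the MLflow logging pattern might look like this:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in for the project's prepare-data step
X, y = make_regression(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 150  # the parameter changed from 100 to 150 in the demo
    mlflow.log_param("n_estimators", n_estimators)

    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_metric("mae", mae)

    # Register the trained model; each run produces a new registry version
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="my-model")
```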
Keven Wang:
Okay. We can see some key error metrics, and in the end it will also evaluate the model and plot some nice charts. Looks like a nice chart. Let's say we are happy with this model with the parameter change, and now we want to commit the code. So go back to the IDE. Here I will apply the same change from 100 to 150, and then I will just push my code.
Keven Wang:
Now my code is pushed. Let's take a look at the CI pipeline. So this is my CI pipeline for my model, and a new pipeline run has been triggered. Let's take a detailed look. This pipeline includes two jobs. The first job does the so-called quality check: basically, it runs all the pylint checks and unit tests, publishes the test results, and publishes a test coverage report. Meanwhile, the second job packages my modules, uploads my notebook together with the egg file to Databricks, runs the notebook on Databricks, and monitors the notebook run until it finishes.
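The "run the notebook on Databricks and wait" step in CI can be sketched with the Jobs Runs API (runs/submit plus polling runs/get); the host, token, cluster spec, and notebook path below are placeholders, not the actual pipeline code.

```python
# run_notebook_on_databricks.py - sketch of submitting a one-time notebook run
# via the Databricks Jobs API and polling until it finishes (all values are placeholders).
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]          # workspace URL
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

payload = {
    "run_name": "ci-notebook-run",
    "new_cluster": {
        "spark_version": "10.4.x-cpu-ml-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
    },
    "notebook_task": {"notebook_path": "/Projects/my-model/main/keven/train_notebook"},
}

run_id = requests.post(f"{HOST}/api/2.0/jobs/runs/submit",
                       headers=HEADERS, json=payload).json()["run_id"]

# Poll until the run reaches a terminal state, then fail the CI job on error.
while True:
    state = requests.get(f"{HOST}/api/2.0/jobs/runs/get",
                         headers=HEADERS, params={"run_id": run_id}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

assert state.get("result_state") == "SUCCESS", f"Notebook run failed: {state}"
```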
Keven Wang:
Now it's done. Let's take a look. This is the result of the notebook run; let's see all the output. And in the MLflow model registry, we should see a new model. Okay, version 12. This is the new model we just trained, and its stage is still None.
Keven Wang:
Okay. In this first demo we can see that local development can be nicely integrated with notebook development on Databricks. And by leveraging continuous integration, we can automate the process, keep track of model metadata, and manage the model the same way as any other software artifact. Besides newly started product teams, we also have a number of product teams which are more mature, and their key concern is automation and scalability, because they need to train many models. Instead of training a single model, these product teams tend to train a specific model for a specific geographic area, like a country, for a specific type of product, like men's t-shirts, and for a specific time period.
Keven Wang:
Each combination is a scenario, and thinking about our size, we can easily have a large number of different scenarios, and some of these models need to be retrained every day. For this type of use case, we tend to leverage an Airflow- or Kubeflow-based architecture on top of Kubernetes, so that we can scale the cluster up and down, leverage external computation power like Databricks clusters, and also run local computation inside Docker containers in the same cluster.
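As an illustration of that fan-out pattern (not the actual H&M code), an Airflow DAG that creates one training task per country/product scenario could look like the sketch below; the train_scenario function and the scenario list are hypothetical.

```python
# scenario_training_dag.py - a hedged sketch of per-scenario training with Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SCENARIOS = [
    {"country": "SE", "category": "mens-tshirts"},
    {"country": "DE", "category": "mens-tshirts"},
    {"country": "SE", "category": "womens-dresses"},
]


def train_scenario(country: str, category: str) -> None:
    """Hypothetical training entry point; in practice this might submit a
    Databricks job or run inside a Docker container on the same cluster."""
    print(f"Training model for {country}/{category}")


with DAG(
    dag_id="scenario_training",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",   # some scenarios retrain every day
    catchup=False,
) as dag:
    for s in SCENARIOS:
        PythonOperator(
            task_id=f"train_{s['country']}_{s['category']}",
            python_callable=train_scenario,
            op_kwargs=s,
        )
```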
Keven Wang:
Let's move on and talk about model management and model serving. At a high level, you can divide the entire lifecycle into five stages: model development, backtest, model approval, deployment to staging, and deployment to production. We have discussed how to do model development automatically, and the result is that we generate a new model in MLflow. Backtest is about taking that model, deploying it into the dev environment, and running all the backtests. Afterwards, in the model approval step, someone in your team can approve the model; that bumps the model version from dev to staging in MLflow and also deploys the new model into the staging environment. Then you can run some system tests, and again someone can approve your model, bump the version from staging to production in MLflow, and deploy your model into production.
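The "bump the version" steps map naturally onto the MLflow Model Registry API; a minimal sketch (the model name, version, and target stage are placeholders) might be:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()  # assumes MLFLOW_TRACKING_URI points at the registry

# After approval, promote version 12 to Staging (and later to Production),
# archiving whatever version was previously in that stage.
client.transition_model_version_stage(
    name="my-model",
    version="12",
    stage="Staging",               # later: "Production"
    archive_existing_versions=True,
)
```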
Keven Wang:
Let's take a look at how it works in the second demo. But before we start the demo, I also want to say a few words about Seldon, or Seldon Core. Seldon Core is an open-source library built on top of Kubernetes, and it brings in a set of features for machine learning-specific tasks. For example, it enables you to easily package a model as a microservice, which you can expose either as a gRPC or REST API. Seldon also introduces a core concept called the inference graph. Think about how your model prediction can be more than just a single step: it may include outlier detection and feature transformation, and maybe you want to send your request to a router, which will select the model most relevant to that specific request and then make the prediction.
Keven Wang:
Each of these steps can run in its own pod or container, and you can scale and reuse it independently. Now, back to MLflow: we can see version 11 of this model. My model, version 11, is in production right now. Let's say we are happy with version 12 and want to deploy it into production. So we go back and open another Git repository, my model-serving repo. Here I have a metadata file, my model-info YAML, where I specify where my Databricks workspace is, which is where my MLflow instance lives, and also the name of my model. Now I want to change the version from 11 to 12.
Keven Wang:
So to deploy it, I just need to do a simple git push. Now it's done; it automatically triggers the model deployment pipeline. But before we look into the pipeline, let's take a look at the REST API.
Keven Wang:
This is Postman; I often use it to run REST calls. Here is the endpoint of my model in the dev environment. I can make a call and receive a response from my model. Looks good. Seldon also comes with a model metadata API, which you can query to get metadata about your model. Here you can see that the name of the model is my-model, it's version 11, and it was created yesterday. You can also see the input schema, and here the output schema as well.
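Outside Postman, the same calls are easy to script; the sketch below assumes the standard Seldon Core v1 REST protocol, with a placeholder ingress host and payload.

```python
import requests

BASE = "http://my-model.dev.example.com"  # placeholder ingress for the dev deployment

# Prediction request using Seldon's ndarray payload format
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}
pred = requests.post(f"{BASE}/api/v1.0/predictions", json=payload, timeout=5)
print(pred.json())

# Model metadata (name, version, input/output schema)
meta = requests.get(f"{BASE}/api/v1.0/metadata", timeout=5)
print(meta.json())
```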
Keven Wang:
After the model deployment, we should see this model version bumped up to 12. Let's go back to the CI pipeline. This is another continuous integration pipeline, for the model deployment. After I do the git push, the pipeline gets triggered, and this one includes four stages. In the first stage, it picks up the model from the MLflow model registry based on the version we give, builds the Docker image, and then pushes it to the container registry.
Keven Wang:
Afterwards, it deploys the model into the dev environment, and then it's ready for testing. Here I'm using an open-source tool, Helm, to templatize my deployment, so I can use the same setup scripts to deploy into dev, staging, and production. Now it's deployed in the dev environment. Afterwards it will also be deployed into staging if I approve it, and then eventually go to production, which also requires human approval. Now let's go back to Postman and make another REST call to see if it's ready. Ideally this would be a rolling upgrade, but it isn't yet, which is why there is some downtime. Okay, now it's up because we get an answer, so let's check the metadata. Now it's version 12. Looks good. Let's say we have done some tests in the dev environment and decide to promote it toward the production environment.
Keven Wang:
Review. Approve. Now it's deploying to the staging environment. In the staging environment we want to run some system tests, pushing some load to see how well the system performs. I'm going to use a tool called JMeter. In JMeter, I can simulate up to 100 clients sending requests, and these clients will be ramped up over two minutes. Each client will send the same REST call to my model prediction service, and JMeter will evaluate the results here. I also enabled some cool features like auto-scaling for this model. This is my Seldon deployment file. Here I specify my inference graph; in this case I only have a single container with a single microservice. I also specify a horizontal pod autoscaling policy: basically, it will monitor my CPU usage, and if it's above 40% it will start to create more replicas of my model prediction service, up to 500 instances. Let's watch how many instances there are today.
Keven Wang:
So now we have two replicas. Let's start the performance test. Now that it's started, we can see that in the beginning the latency is pretty good, 50, 60, 70 milliseconds, but it starts to ramp up as the number of clients grows. I also added some cool features like a timeout. Basically, we don't want our end users to wait forever, so in case the model cannot make a prediction within a second, it will just return an HTTP 500, and we can put some default logic on the client side to show a default result, for example. Here we can see that a new replica has been added because the load started to increase. We can also see that once requests exceed the timeout we start to see some red flags and some errors returned from the model, and now even more replicas have been added. I will not wait until it finishes; let's assume we are happy with this result.
Keven Wang:
Now the final step: I'm going back to Azure DevOps to approve this model and deploy it into production. In this second demo, we have seen that by leveraging tools like Azure DevOps, we can easily automate the model deployment process while keeping a human in the loop, and a tool like Seldon really makes exposing a model as a prediction service pretty easy.
Keven Wang:
Final takeaways: MLOps is very complex. Before you start looking into the technical stack, it's important to take one step back and think about what kind of problem you are trying to solve and what your desired process looks like, then define your architecture, and end up with your tools. Secondly, if you have a number of product teams all working on machine learning products in production, maybe it's time to think about a platform approach: have a central team develop some key components and offer them as a service to your product teams. Last but not least, by leveraging cloud services you can really evolve very fast. Thank you very much.
Speaker 3:
Thanks, Keven. Our next speakers are Wesly Clark from J.B. Hunt and Cara Phillips from Artis Consulting. They will cover how they've implemented a framework for self-service experimentation and deployment at enterprise scale at J.B. Hunt. This talk will cover the core values, concepts, and conventions of the framework, followed by a technical demo of how to implement self-service automation of Databricks resources, code, and jobs deployment in Azure DevOps CI/CD pipelines. Please welcome Wesly and Cara.
Wesly Clark:
Thank you for joining us. We appreciate your time, and I hope you'll find our work interesting and informative, and that it will help you in establishing your own automated CI/CD pipelines. My name is Wesly Clark, and I'm the chief architect of enterprise analytics and AI at J.B. Hunt. And my colleague, Cara Phillips from Artis Consulting, is here with me to present a technical demo. J.B. Hunt was founded in 1961 and has grown to become one of the largest transportation and logistics companies in North America. We're currently number 346 on the Fortune 500, and our digital brokerage marketplace, J.B. Hunt 360, has received widespread recognition for innovation and technology. We consider machine learning and advanced analytics as key to our future success, and I'm honored to be involved in establishing these disciplines at J.B. Hunt. Artis Consulting was founded in 2002, and they focus on four pillars: data and analytics, AI and machine learning, the internet of things, and intelligent applications. They've been a key partner in helping turn our vision into a real production process.
Wesly Clark:
I want to begin by orienting you to the role analytics plays in creating business value, highlighting which parts of the MLOps lifecycle this framework focuses on, and discussing the guiding principles we adhered to when implementing our solution. Then we'll get to the part you really came for: the technical demo and the practical steps you can take to create your own solution.
Wesly Clark:
We want to emphasize that everything we do in analytics, data science, and machine learning should focus on creating business value and accomplishing the objectives of our organizations. As scientists, we could easily lose ourselves in the numbers. So we intentionally refocus ourselves on the people we’re trying to empower and the processes in which our solutions will be embedded.
Wesly Clark:
I've watched quite a few fantastic talks about data engineering, hydrating the Delta Lake, and creating feature stores. Likewise, there are plenty of presentations that focus on containerization, serving, and production performance monitoring. Today, I want to focus on a secure, self-service framework for automating the creation and deployment of compute environments linked to specific project branches of a product's code repository. Before I show you the framework architecture and implementation, I want to speak briefly to the guiding principles on which our solution was established.
Wesly Clark:
We were aiming for predictability, so we chose convention over configuration. We wanted it to be automated, because the real power of these conventions is realized when the user doesn't have to remember the rules to see them in action or to implement them. We wanted to strike a balance between making it secure and making it self-service: we sought to empower our users while simultaneously providing boundaries to keep them safe. We wanted this framework to emphasize repeatability; by creating configuration artifacts that follow the code through the entire ML lifecycle, we ensure repeatable deployments and environment creation.
Wesly Clark:
We wanted to introduce clearly defined environments. We weren't just seeking to automate the workflow we already had, but to create new possibilities through the tools we were giving to our teams. We introduced our analysts, engineers, and scientists to some of the most robust concepts from enterprise-scale software engineering development lifecycles. We wanted to be platform and cloud agnostic. We weren't multi-cloud when we started this, but Databricks was, and it was important to us to choose solutions that could run anywhere. I don't have time to cover all the decisions we made in detail, but I want to give you a high-level understanding of the framework.
Wesly Clark:
The first thing I want to draw your attention to is a set of config files that are stored in the user's code branch. The environment config file is where you would store values that change based on which environment your code is running in. Next, the cluster and library config lets you define the dependencies and compute resources needed for your code to run; it also lets you specify who should have access to your project. Lastly, the jobs config is where you would store the instructions for how your code should be deployed as it moves towards production. Next, I want to explicitly define what it means for your code to be deployed to a specific environment.
Wesly Clark:
It means that a fresh copy of your code has been pulled into the Databricks project folder and is synced with your repository branch. It also means that your code runs on a dedicated cluster meeting the specific requirements defined in your branch's config files. Lastly, it means that the cluster operates under the authority of a service principal that only has access to the appropriate resources in the corresponding infrastructure environment: local, dev, test, and prod. That means you're using different secrets, storing your data and files in different containers, and accessing different versions of web services depending on which environment you're running in.
Wesly Clark:
Now, let’s talk a little bit more about how the CI/CD pipeline works alongside your product code. We’ll walk through a self-service cluster management scenario. Let’s imagine that a user with an existing repository for their product code was going to start using the CI/CD framework for the first time. Thankfully, the CI/CD repository has a setup pipeline to help get them started. Step one illustrates the first phase of the setup pipeline which will transfer the config files we just talked about on the previous diagram into the user’s product repo.
Wesly Clark:
Step two shows that the second phase of the setup pipeline will transfer a few YAML files and create child pipelines in the product repo. One of these child pipelines is used to initiate job deployments to new environments, and the other pipeline, which is relevant to this scenario, listens for committed changes made to the config files. Once the initial setup pipeline has finished transferring files and creating child pipelines in the product repository, let's imagine that in step three of this diagram the user modifies the cluster and library config to change the maximum number of nodes for their cluster, add a third-party library to their list of dependencies, and modify the access control list to add a teammate to their project.
Wesly Clark:
In step four, after the user commits their change to the cluster and library config file, the listener pipeline triggers and initiates a callback to the CI/CD repository. Step five emphasizes that the majority of the functionality lives in the CI/CD repository, where a YAML pipeline executes a series of PowerShell scripts that validate the config file changes and convert them into three separate JSON files, which are then sent separately to the Databricks cluster, library, and permissions APIs.
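J.B. Hunt implements these calls in PowerShell; purely as an illustration of the same idea, a Python sketch hitting the three public Databricks REST APIs (clusters, libraries, permissions) could look like the following, with placeholder host, token, names, and values.

```python
# Sketch only - the actual pipeline uses PowerShell; the endpoints are the public
# Databricks REST APIs, and everything else here is a placeholder.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# 1. Clusters API - create (or edit) the cluster described in the config file.
cluster = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json={
    "cluster_name": "product-x-feature-branch",
    "spark_version": "10.4.x-cpu-ml-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 1, "max_workers": 7},
}).json()
cluster_id = cluster["cluster_id"]

# 2. Libraries API - install the requested dependencies on that cluster.
requests.post(f"{HOST}/api/2.0/libraries/install", headers=HEADERS, json={
    "cluster_id": cluster_id,
    "libraries": [{"pypi": {"package": "xgboost"}}],
})

# 3. Permissions API - restrict the cluster to the team members in the ACL.
requests.put(f"{HOST}/api/2.0/permissions/clusters/{cluster_id}", headers=HEADERS, json={
    "access_control_list": [
        {"user_name": "teammate@example.com", "permission_level": "CAN_RESTART"},
    ],
})
```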
Wesly Clark:
Finally, in step six, you see that after a few moments these changes are completed by the Databricks APIs, and the user sees the updated cluster in the Databricks UI with the new library loaded onto it, accessible only by the team members specified in the access control list. All right, one last diagram before we get to the technical demo. I want to re-emphasize the workflow this enables: continuously creating enhancements to your product without environment conflicts. In this image you see two separate branches of the same product being worked on by two different users in the local environment.
Wesly Clark:
Each user has a separate copy of the product notebooks and config files stored in their local projects folder under their username. Each runs on a dedicated cluster set up according to the user's config; each cluster could have completely different properties and run different versions of libraries without conflict. Once you get past local development, only one branch of a product can be deployed at any given time. Over in the dev environment, an older version of the product already committed to master is on its way to being productionalized.
Wesly Clark:
It's already passed the first quality gate enforced by our CI/CD job deployment pipeline; it also has a separate copy of the notebooks and config files stored in the dev projects folder. It is using an older copy of the library and runs on a dedicated job cluster meeting the specifications of an older version of the config files. Unfortunately, that code failed to pass the stricter quality gate to get into the more closely guarded test environment. Both test and prod are still retraining using code from a previous master release tag. Both environments have a separate copy of the notebook and config files stored in the test and prod projects folders and run on dedicated job clusters still using an even older version of the library.
Wesly Clark:
In all environments, local, dev, test, and prod, each cluster acts under the authority of an appropriate environment service principal to interact with environment-specific instances of tables, file storage, external webhook and service integrations, and to publish job events. These environment-specific service principals have also been granted permissions to invoke the deployed jobs so that they can be initiated from outside of Databricks using the service principal credentials. I know that was a lot of conceptual material to cover. Thank you for bearing with me through all of the abstractions and theory behind the framework. Now I'm going to turn it over to my friend, Cara, to show you how you can implement something like this one step at a time.
Cara Phillips:
Thanks Wesley. Hi everyone. My name is Cara Phillips and I'm a data science and MLOps consultant at Artis Consulting. As Wesley just mentioned, I'm going to show you these pipelines in action. Let's start by taking a look at the file structure in the CI/CD repo. The first thing to notice are these YAML files, which are the templates that contain the steps that our pipeline will execute. Right above these files we have a scripts folder. These are all of the scripts that are going to be run in each step of the pipeline. We decided to use PowerShell scripts, but you can use any language that can parse JSON files and send data to the Databricks APIs.
Cara Phillips:
At the top here we have two folders, one for data science and one for data engineering. Both the data science and data engineering teams are using these pipelines, and these folders contain configuration default values that can be set at the team level. For example, the data science team can use these files to set default values for their cluster or jobs configurations, to set default permissions, or to set libraries that will always be installed on their clusters. The last folder to look at here is the files-for-remote-repository folder.
Cara Phillips:
This folder contains all the files we'll need to copy into the product repo. Once those files are copied over, the pipelines can be built in the product repo. The process of copying these files and creating the pipelines can be done manually or, as Wesley reviewed earlier, by another pipeline we call the setup pipeline. Let's go over to the product repo and take a closer look at that file structure. The first folder we have is the notebooks folder. This folder contains all of the notebooks that are linked to the Databricks workspace. The remainder of the files are the ones that were copied from the CI/CD repo.
Cara Phillips:
First, we have the pipeline definitions folder, and this folder contains YAML files that will trigger the execution of the pipeline steps stored in the CI/CD repo. The last files we're going to look at are the cluster and libraries config JSON file, as well as the jobs config JSON file. The user will use these two files to create and manage the configuration for their clusters and jobs. The first thing the user needs to do once the repo is set up is to create a new user branch from master. Let's take a closer look at the cluster and libraries config JSON file, which is the first one the user will configure.
Cara Phillips:
Once they've created their user branch, they can begin editing the config files. Let's take a closer look at the cluster and libraries config file. The first parameter they're going to configure is the workspace name. This value determines which folder in the CI/CD repo our default config values will come from, as well as which Databricks workspace the pipeline will authenticate to. The cluster section here contains a subset of parameters that the user can set for their cluster configuration. The structure of most of these values is directly equivalent to the required JSON structure for the API, with the exception of the Spark version.
Cara Phillips:
Here, we created a structure where the user only needs to specify the version number, whether or not they want an ML runtime, which contains many machine learning packages by default, and whether or not they need GPU compute on their cluster. The pipeline maps these values to the correct key that's required by the API. The last set of parameters in this section I want to call attention to is the custom tags. These tags are very important for tracking and managing your Databricks spend. In addition to the J.B. Hunt-specific tags you see here, the pipeline automatically sets tags for the environment, which is either dev, test, or prod, as well as the repo and branch that triggered the creation of the cluster.
Cara Phillips:
The next section is the access control list. Here the user lists the emails for everyone working on their code in the branch. The pipeline will use this list of emails to set the permissions on the cluster. This ensures that the created cluster is reserved exclusively for the team members working on that branch. The last section in this config file is for the libraries. For each library type, the user specifies a list of libraries they want installed on their cluster. For Python and CRAN packages, if the required package repo is different from the default repo, they can specify that using the package-and-repo structure here.
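The exact schema is J.B. Hunt's own, so the following is only a hypothetical sketch of what such a cluster-and-libraries config might contain, based on the fields just described (workspace, simplified Spark version, custom tags, access control list, libraries); in the repo it would live as a JSON file.

```python
# Hypothetical shape of the cluster-and-libraries config; field names are illustrative,
# not J.B. Hunt's actual schema (in the product repo this is stored as JSON).
cluster_and_libraries_config = {
    "workspace_name": "data-science",
    "cluster": {
        "spark_version": {"version": "10.4", "ml_runtime": True, "gpu": False},
        "autoscale": {"min_workers": 1, "max_workers": 7},
        "node_type_id": "Standard_DS3_v2",
        "custom_tags": {"cost_center": "analytics", "product": "demand-forecast"},
    },
    "access_control_list": [
        "data.scientist@example.com",
        "teammate@example.com",
    ],
    "libraries": {
        "pypi": [{"package": "xgboost"}],
        "cran": [{"package": "forecast", "repo": "https://cran.example.com"}],
    },
}
```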
Cara Phillips:
Now let’s make a couple changes here, save the file and watch the cluster be created. We’re going to change the max workers to seven and save the file back. Now let’s go take a look at the pipeline run. You see here that the pipeline was triggered when I saved the file and it is automatically running. While we wait for that to finish let’s talk a bit about what’s going on behind the scenes. Back in the product repo, there’s a file called pipeline cluster. This file listens for changes to the cluster and libraries config JSON and calls back to the CI/CD repo to execute the pipeline steps that are stored there.
Cara Phillips:
Let's go over to the CI/CD repo now and look at those steps. At the bottom of the repo here we have a YAML file called pipeline cluster config. This file contains all the steps that are going to create and manage our Databricks clusters. The first couple of steps here do some administrative work to set up the pipeline environment. Then we get to this generate cluster config step. This step parses the values provided by the user in their config file, combines them with the cluster config default template, and creates the JSON to send to the clusters API in the next step.
Cara Phillips:
The create-or-edit cluster step uses that parsed JSON file to either create a new cluster, if one doesn't already exist, or to edit an already existing cluster. Once our cluster is created, all we have to do is add or update the permissions. The set cluster permissions step takes the list of users from the config file, parses them into the proper JSON structure, and sets the permissions on the cluster using the permissions API. Once the permissions are set, we have to install the libraries. The parse requested libraries step takes the libraries in the config file and parses them, and the next step takes that parsed JSON, installs any new libraries, and uninstalls any libraries that were removed from the config file.
Cara Phillips:
Now let’s take a look at our completed pipeline. Once we go into the run you can see that all of the steps completed successfully. That means we should see a new cluster running in Databricks with the config specified in the config file. Let’s go over to Databricks and look at that cluster. As you can see here, we now have a cluster running in Databricks. The name for this cluster was defined in the pipeline by the organizational naming convention. As you can see the configuration here is exactly what we had specified in our config file.
Cara Phillips:
Likewise, the users specified in our access control list are listed here under the permissions. Lastly, when we go into our libraries, we see that the libraries we requested have been installed, in addition to a couple of libraries that were specified in the default libraries file. Wrapping it all up: step one is to copy the pipeline and config files into the product repo and create those pipelines. Once that's complete, the user fills out their cluster and libraries config file and saves it. The pipeline then runs automatically, and the user has their own cluster to use within minutes.
Cara Phillips:
The next pipeline we're going to look at is the jobs deployment pipeline. This pipeline provides automated and secure jobs deployment and editing. Let's start in our product repo with the user's workflow. When the user is ready for their code and jobs to be deployed in the dev environment, they will come to the jobs config JSON file. Here they will specify how many jobs will be created and the configuration for each. The first parameter they'll have to fill out is the name of the notebook. Next, they will specify if they want to use a high concurrency cluster or a jobs cluster to execute their job.
Cara Phillips:
Using a high concurrency cluster instead of a jobs cluster will reduce the cluster startup latency when a job is deployed and allow jobs to be run in parallel. Next, they will specify the cluster configuration and a set of libraries. If they're using a high concurrency cluster, as in the first example here, the cluster config and libraries are set based on the configuration in the cluster and libraries config JSON, and the values in the job config for these parameters will be null. If the job is going to be run on a jobs cluster, like in the second and third examples here, there are several options for these two parameters.
Cara Phillips:
If the user specifies default for either the new cluster or libraries parameters, the config for that respective parameter will be taken from the cluster and libraries config JSON. Additionally, the user can specify none for the libraries parameter if no libraries are required. In the second job, you can see the default value is specified for the cluster configuration and the libraries parameter is set to none, indicating no libraries are required for this jobs cluster. The last option for these two parameters is to specify a new cluster config or set of libraries, as shown here in the third job. The configs are specified using the same structure as is used in the cluster and libraries config.
Cara Phillips:
We have the cluster configuration here and, below it, the libraries configuration. Once we've saved our changes back to our branch and committed all our code to master, we can run the jobs deployment pipeline. I'm going to do that right now, and I'm just going to run the dev stage. While this runs, let's take a closer look at what the pipeline is doing. Back in the product repo, the pipeline jobs YAML file invokes the pipeline steps from the pipeline jobs config YAML file stored in the CI/CD repo, which we'll open right now. We have three different environments that the jobs are going to be deployed to: dev, test, and prod.
Cara Phillips:
The same general steps are going to be repeated for each of these environments, so we're going to use a build steps parameter, which allows us to write the code for the steps only once; those steps will then be repeated in each environment. You'll notice a lot of these steps are the same as what you saw in the cluster pipeline, since much of the code is shared between these two pipelines. Let's take a closer look at these steps. Like with the cluster pipeline, these first couple of steps are just doing some administrative work to set up the pipeline environment. Then we get to the generate cluster config step.
Cara Phillips:
This script is going to do two things. First, it generates the default cluster config that's going to be used when the user opts for a high concurrency cluster or when they designate default for their jobs cluster definition. The second thing it does is parse a new jobs cluster configuration when provided by the user. Once those configs are parsed, the create-or-edit high concurrency cluster step checks to see if a high concurrency cluster is required for the set of jobs being deployed and creates or updates the cluster if needed. If a high concurrency cluster is required, the next step is going to set the proper permissions for that cluster.
Cara Phillips:
This includes adding permissions for the dev, test, or prod service principals, which are going to allow the cluster to be managed by other applications like Azure Data Factory or Airflow. The next step is to parse the libraries that are going to be installed on the cluster. Like with the cluster config parsing step, this step is going to parse both the default set of libraries and any custom set specified by the user. After we have all those configs ready, we can deploy our new jobs or edit our existing jobs in the next step.
Cara Phillips:
Finally, once our jobs have been deployed, we need to set or update the permissions. You can define permissions for a custom set of users or groups in the default values for jobs stored in the teams folder up here in the CI/CD repo. In addition to these permissions, the dev, test, or prod service principals will be given access so they can orchestrate these jobs from external applications. Once we have all the steps defined, we can execute them in each of the stages below, which include security and quality gates in between each environment deployment. Let's take a look at the jobs pipeline.
Cara Phillips:
When we go into the run, we see that all the steps for the dev stage completed successfully, which means we'll have three new jobs created in our Databricks workspace. Let's take a look at those. You'll see we have three jobs created for our dev environment, since we configured three jobs in our jobs config file. Notice the cluster definitions for each job here. The first job is using a high concurrency cluster, the second job is using our default cluster config with one to eight workers, and the third is using a custom cluster config with one to four workers.
Cara Phillips:
Let's go into the second job here and take a look at the configuration. In the configuration tab you'll see the specs for the jobs cluster that was created for this run. Then below, in the advanced section, we can see the pipeline gave the correct permissions to the two service principals, the admin and dev ones, as well as the three ACL groups we specified in the default access control template in our teams folder. To summarize the jobs pipeline: the first step is for the user to fill out their jobs config JSON in their user branch.
Cara Phillips:
Next, they merge all their code to master. Then, once the code is ready to go, the jobs pipeline can be run and the jobs will be created and run. Today we created two pipelines to automatically create and manage our Databricks clusters and jobs with an easy self-service workflow for the user. Thanks everyone for listening to my tech demo. I hope you learned some interesting and useful things, and I'm going to pass it back to Wesley to wrap everything up.
Wesly Clark:
Thank you, Cara, for giving such a thoughtful technical demo and for all the hard work you've put into bringing the vision of this framework into reality. Now I'd like to briefly recap what we discussed today. We told you what J.B. Hunt is trying to accomplish as an organization and how our team has focused its work to ensure we are contributing to the realization of that mission. We drew your focus to the section of the MLOps lifecycle where our framework operates, and where we feel most of the iterative improvement occurs.
Wesly Clark:
We described the motivating principles that guided us while we implemented our solution. We visually represented the framework with a series of architectural diagrams to help you see the big picture. Then Cara gave you a closer look through a step-by-step technical demo. Finally, the only thing left to say is thank you for investing your time with us. I hope this was helpful and additional resources and details will be available to download after the presentation to help get you started. Thank you.
Speaker 3:
Thank you, Cara and Wesley, and thanks to all of our previous speakers as well. We now have a few minutes for Q&As with our speakers and we’ll be right back.