Introducing MLflow for End-to-End Machine Learning on Databricks

Solving a data science problem is about more than making a model. It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps. In this simple example, we’ll take a look at how health data can be used to predict life expectancy. It will start with data engineering in Apache Spark, data exploration, model tuning and logging with hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and simple deployment to production with MLflow as a job or dashboard.


Video Transcript

– [Tutor] Everyone, welcome to Spark Summit.

This is Sean Owen. I’m a Principal Solutions Architect here at Databricks, and along with a couple of other folks, I focus on data science and machine learning. So, every week we’re talking to customers who are trying to figure out how to do data science at scale, maybe with a bunch of tools out there, maybe including Spark and maybe including this newer project, MLflow. It’s an open source project from Databricks, but really you can use it anywhere. And today I want to show you MLflow, for those that maybe haven’t seen it and are curious what it does. And along the way, I’ll show you a little bit about Databricks too, for those that haven’t seen it, but really I want to talk to you about MLflow. It’s the glue that helps hold together a couple of pieces of the data science lifecycle that can be kinda hard sometimes. And that includes how you get from models to tracked experiments, and how you get from those experiments out to production. So, first, this is Databricks, for those that haven’t seen it. It’s a web-based environment. And most of what you see when you use Databricks is something like this: a notebook-like environment. Let me start at the beginning here, with what you don’t have to mess with too much.

Predicting Life Expectancy from Health Data

What you don’t see here is a bunch of dealing with resources and cluster setup, and so on. Clusters can be preconfigured. They can spin up, and they can scale up and scale down. And typically as a data scientist, you’re just gonna start out choosing a cluster that’s perhaps already configured, maybe already running, but if it’s not, it’ll turn on. Spark will be available for you and it’ll turn off or scale down when it’s not needed. So really we can just get right to work. And in Databricks, as in many environments, we are often working in a notebook-like environment like this, where you can write code, of course, which we’ll get to in a minute, and also Markdown to document what you’re doing, and see the execution and output of your code inline here. Now, the data science lifecycle really involves many parts, but I like to think of it as falling into three categories, really. There’s going to be some data engineering up front. We’ve got to take raw data, and someone’s got to clean it and standardize its representations and get it ready in an organized way for other people, maybe analysts or data scientists, to use to do some modeling. And then of course, there’s data science. There’s some exploratory data analysis, there’s some modeling of course, and model selection. And that needs some organization too. We don’t just build models. We wanna keep track of them and do that in a principled way. And last but not least is that hop from a model to production, for some definition of production. It’s often harder than it sounds to know how that’s supposed to look. How do we get this artifact out to something that I can run at scale in a production job? And that’s a big part of what MLflow helps with too, as we’ll see here. Now, we’re gonna do this in two parts. I’m going to talk for maybe 20, 25 minutes here and get through some of the data engineering and the data science aspects. And then we’ll pause and take some questions.
And then we’ll come back after a break and cover some more issues, really leading up to model productionization and deployment. Now, in this example, I’m going to pick an interesting problem. It’s probably not the problem you’re solving, of course, but you can put your problem here in its place. Imagine, as I go through this, that we’re solving your data science problem. And I can tell you that I think most of what we’re seeing here maps to a lot of jobs you’re probably creating and running today already. But the problem at hand here is the question of life expectancy. What determines life expectancy in, say, developed countries over the last couple of decades? We know a lot about these countries’ demographics and their health indicators, but we might wonder what it is that really drives this life expectancy. So it’s a predictive analytics problem, but in another sense the question here is also about explaining a model, which is an interesting angle. Now, to do this, we’re going to grab some data from a couple of sources. One is the World Health Organization. The other is the World Bank. And finally, another data set concerning drug overdose deaths. So, of course, every process begins with data engineering. Now these data sets are actually fairly small, but nothing about this would change if the data sets were very large, or if they weren’t simply CSV files, as is the case here. So we start here in the world of the data engineer, and he might wanna work in languages like Scala, for example, and use Spark directly to express queries and transformations of the data. And that’s fine. Here we can use Scala. We can use Python or R, even within one notebook, as I’ll do here. But the point is that the choice of language doesn’t necessarily determine what all these different roles have to use. They can make their own choices separately. And in the first part here, we load the CSV files.
We have to tweak a few things about the input that are a little nonstandard. And we can take, for example, a first look at the data after we’ve loaded it. And it displays inline here as a nice table. This is a pretty wide data set. I believe it’s over 1600 features, for countries over the last, I think, 10, 20, 30 years at least. Now, along the way, we might need to save off some different tables from this data. For example, a lot of the data has codes for different health indicators, and a description. And we’re gonna work in terms of the codes, but we might need these descriptions later for lookup, to make our output a little more interpretable. So we’re gonna save those off. And I’m going to save them as a Delta table, as a minor example of what Delta might be able to do for you. But let me come back to this lookup table in just a second, once we’ve loaded more into it.

So, although this code isn’t terribly important, what it’s doing is just more filtering and transformation. I think a lot of the elements here may be familiar to any data engineer that’s written code like this to process data. For example, here, we’re using Spark’s Scala API directly to do some filtering, but we can also mix in SQL syntax to express those filters, as down here. Or maybe in some cases, it’s just easier to express a transformation as a language-native UDF, or user-defined function, rather than expressing it in SQL. And that’s possible too. So we’ve got all those different approaches mixed and matched here to further process this data. So we can register this data frame as a table, and then switch over to SQL to query it, for example. So maybe we wanna look at this data ordered by country and year. And as you can see, it’s quite a wide data set here. We could do the same, but maybe use some of the built-in plotting tools here to take a quick initial look at the data. So this is life expectancy as a function of time, and also grouped by country here. And you can immediately see there’s a bit of an outlier here: there’s one country that over time has a lower life expectancy, and one that’s even seemingly decreasing, unlike other countries, in the time period 2014 to 2016 or so. Now this one happens to be the United States. And so one thing this might cause us to wonder is: what’s different about the US? Can we figure out what’s different about this country that’s causing life expectancy to be lower?
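The mix of approaches described here (a DataFrame-API filter, a registered user-defined function, and a switch to SQL) can be sketched roughly as follows. The notebook in the talk uses Scala for this step; this sketch uses PySpark instead, and the column names (`country`, `year`, `life_expectancy`), the view name `health_filtered`, and the `normalize_country` helper are all hypothetical.

```python
def normalize_country(name):
    """Plain Python helper, later registered as a UDF (hypothetical cleanup rule)."""
    return None if name is None else name.strip().upper()


def prepare_health_table(spark, raw_df):
    """Filter raw_df, expose it as a temp view, and query it with SQL.

    `spark` is an existing SparkSession and `raw_df` a Spark DataFrame.
    The column names used here are assumptions for illustration.
    """
    # 1. DataFrame API filter, written as a SQL expression string.
    filtered = raw_df.filter("year >= 2000 AND life_expectancy IS NOT NULL")

    # 2. Register the Python function so SQL statements can call it.
    spark.udf.register("normalize_country", normalize_country, "string")

    # 3. Register the DataFrame as a temporary view, then switch to SQL.
    filtered.createOrReplaceTempView("health_filtered")
    return spark.sql(
        "SELECT normalize_country(country) AS country, year, life_expectancy "
        "FROM health_filtered ORDER BY country, year"
    )
```

Keeping the cleanup rule as an ordinary function means it can be unit-tested without a cluster before it is registered as a UDF.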

So moving on, we might load another dataset, which is different data but the same idea: country and year and indicators, for many decades, in most developed countries.

Here again, I’m going to read off some of these description codes, so the codes and their descriptions, and save them as a Delta table. And you notice with things like Delta tables, I can do things like append to them, and that’s different than maybe how some other data sources, like, I dunno, a CSV file or a Parquet file, work. And the Delta storage format is really just a layer on top of Parquet as a representation, but it adds some interesting new properties and attributes. Number one, I can do upserts and appends efficiently, but I can also, as I modify this table, track the history of the table and see who modified it and when, and even, if I needed to, go roll back and query the table as of a previous point in time. So for example, as I write these tables, they become visible in the metastore here. And I can access them through the data tab if I need to.

And I can see in the metastore here, for example, the schema of this little lookup table I created and some of its values, but I can also look at the history of the table. Here, I wrote the table and then I updated it. And I can see who did it, when, and so on. Now that’s not so important for this particular table, but I’ll make a comment in a minute about where this might become more important, when we’re dealing with the main data set.
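The three Delta capabilities mentioned here (appending, inspecting history, and querying an earlier version) might look roughly like this in PySpark. This is a minimal sketch that assumes a Spark session with Delta Lake available; the table path passed in is a placeholder chosen by the caller.

```python
def append_rows(df, path):
    """Append a Spark DataFrame to the Delta table stored at `path`."""
    df.write.format("delta").mode("append").save(path)


def table_history(spark, path):
    """Return one row per commit: version, timestamp, user, operation, and so on."""
    return spark.sql(f"DESCRIBE HISTORY delta.`{path}`")


def read_version(spark, path, version):
    """Load the table as it looked at an earlier version number."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)
```

To query by time rather than by version number, the `timestampAsOf` read option can be used in place of `versionAsOf`.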

So we’re gonna do little more filtering on this data as before. Same idea, just different properties here. Take a look at the data. And maybe move on now to load a third data set concerning drug overdoses.

Same idea. We’re going to filter and standardize some of the column names to be a little more useful.

And then in the end, join these three data sets on country and year, write them as a Delta table and even maybe switch to SQL here to register these as tables in the metastore for other people to use.

So this input table here will become our main input to the data science process. And this is where Delta might actually come in a little more handy. Number one, Delta table writes are transactional. So, if someone’s running this data engineering job and updating my table, and I’m reading it at the same time, no problem, I won’t get phantom reads or an inconsistent state. But of course later, when I’m doing my data science and I’ve built a model, maybe I need to go back and figure out what the data was like at the time I built my model. And with Delta, I could do that. I could go back and query as of a version or as of a timestamp.

But the representation here probably isn’t that important to the data science per se. And here we’re going to move on to the world of the data scientist. So, her work might begin something like this. Now, again, she’s now choosing to work in Python, which is fine. You can still read all this data, and use Spark via PySpark to query and manipulate it. And as we’ll see below, indeed use a lot of different Python packages, not just Spark. So she might read the table and look at some summary statistics here. And we see that, for example, some of these features actually have no values in places, or have very few values that are non-null. So we might need to do some missing value imputation. And as a simple approach here, I’m going to simply forward fill and backfill missing values by country and year. And because the data set is small, I’m gonna do something maybe kind of too simplistic. I’m gonna pull this data down to the driver as a pandas DataFrame, use pandas to perform that manipulation, and then send it right back to the cluster. Now, if the dataset was large this wouldn’t be a great idea, but for small datasets, it’s perfectly fine. And if this is the simplest way to get done what you need doing, you can do it. This is just one of a few ways that pandas and pandas syntax is usable and useful, if you know it, when you’re working in PySpark, for example. And I’ll get to a couple of others later, but this is one simplistic example here.
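The imputation step described above could be sketched like this with pandas. It is a minimal illustration, not the notebook's code: the `country` and `year` column names are assumed, and every other column is treated as a value to fill.

```python
import pandas as pd


def fill_missing_by_country(pdf):
    """Forward-fill, then back-fill, gaps within each country's time series."""
    pdf = pdf.sort_values(["country", "year"])  # returns a sorted copy
    value_cols = [c for c in pdf.columns if c not in ("country", "year")]
    # Group by country so values never leak from one country into another.
    pdf[value_cols] = pdf.groupby("country")[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    return pdf.reset_index(drop=True)


# On a cluster (assumes an existing Spark DataFrame `input_df` and session `spark`):
#   filled = spark.createDataFrame(fill_missing_by_country(input_df.toPandas()))
```

As the talk notes, collecting to the driver with `toPandas()` only makes sense while the data comfortably fits in one machine's memory.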

Next comes a little more EDA, or exploratory data analysis. So, maybe she wants to take a look at correlations between a couple of key features here. So we might make a nice pair plot like this. And this is using a package called seaborn, which is built on matplotlib. It’s a common visualization library, and it is already installed in the machine learning runtime here, along with most popular data science packages. If it wasn’t, no problem, it can easily be installed afterwards or updated. But for most of the common packages you may want to use, the runtime itself probably already has them installed for you. So we can see through this nice visualization that there’s correlation between, for example, GDP here, and this, which is spending on healthcare. And of course that makes sense. Those are kind of correlated, of course. And we can also see in the last row here, this is opioid deaths versus everything else. And there’s a clear set of outliers here from one country, and no surprise, these dots are all the United States over the years. So this is further evidence that something’s different about this one country here. And maybe we need to look into it as we build our model and go to explain the relationship between these features and life expectancy.

So, as part of the data science process, sometimes a little more engineering’s necessary. So take that raw, or rather, somewhat processed data from the data engineers: that data is filtered and it’s cleaned, but it’s not edited or aggregated for a particular purpose. Often individual modeling jobs or analytics jobs will need to further transform it for their purpose, whether it’s to aggregate it or add more features or simply transform the representation to something more useful for modeling. And here we do a sort of simplistic one-hot encoding. That’s all we’re going to need to do to this particular dataset, but you can imagine this could be something more. And we will then use some SQL here to create a new Delta table out of that featurized dataset. And this is going to be another interesting handoff point, not just the basis for building the models, but the basis for our production job later, which needs featurized data to score on as well. So we may also end up treating this as some kind of production pipeline job that’s constantly translating new data that’s arrived from the data engineering team into a form that the model can understand.
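A one-hot encoding of this kind might be sketched in pandas as below. The choice of `country` as the encoded column is an assumption made for illustration, and `align_columns` is a hypothetical helper added because a later scoring job has to present the same feature columns the model was trained on.

```python
import pandas as pd


def featurize(pdf):
    """One-hot encode the country column; leave the other columns unchanged."""
    return pd.get_dummies(pdf, columns=["country"], prefix="country", dtype=float)


def align_columns(features, expected_columns):
    """Give new data the same columns, in the same order, as the training data."""
    return features.reindex(columns=expected_columns, fill_value=0.0)
```

For example, a new batch that happens to contain only one country would be passed through `featurize` and then `align_columns(batch, train_features.columns)`, so that the indicator columns for the absent countries exist and are zero.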

So now let’s get to maybe the more interesting part here, modeling. Before we get into that, I want to introduce very briefly another package called Koalas. So, obviously if you’re working in Python, you’re probably familiar with pandas. It’s a really widely used tool for data manipulation. You can manipulate data as data frames, of course. And if you need to scale that up and you need to manipulate data at large scale as data frames, there’s of course Spark and there’s PySpark, but not everyone knows PySpark. And maybe it’d be nice to be able to take some code that works in pandas and have it just run on Spark, without changing it to use PySpark syntax. So that’s the idea behind Koalas. It’s a reimplementation of most of the pandas API, but it’s backed by Spark. So if you read data from Spark via Koalas, you get an object here that acts like a pandas data frame. So for example here, we can use pandas-style indexing to just select all the data up to 2016. And this is actually going to be carried out by Spark, even though it looks like we’re using pandas. Now, I don’t do much with Koalas in the simple example here. I’m gonna use it to select the data up to 2014 as the train set, and data after 2014 for test. And we could have done this easily in Spark as well, but I wanted to show you that you can also do this with pandas-style syntax, even at scale, via something like Koalas.
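The split described here uses nothing beyond pandas-style boolean indexing, so it can be sketched against any pandas-like frame. The function below is illustrative (the `year` column name and the default cut-off years follow the talk), and it runs on a plain pandas DataFrame; a Koalas DataFrame, or a pandas-on-Spark DataFrame from `pyspark.pandas` into which Koalas was later folded, is designed to accept the same syntax.

```python
def split_by_year(df, last_train_year=2014, last_year=2016):
    """Split on the `year` column using pandas-style boolean indexing."""
    df = df[df["year"] <= last_year]           # keep data up to 2016
    train = df[df["year"] <= last_train_year]  # up to and including 2014
    test = df[df["year"] > last_train_year]    # 2015 and 2016
    return train, test
```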

But actually, we’re gonna work with these in terms of pandas data frames, just because that’s going to turn out to be a little bit easier for the small dataset for this particular modeling problem.

Right, onto some modeling.

Now we could spend days talking about all the different choices you can make, ways to approach this modeling problem. It’s a regression, because we’re trying to predict life expectancy as a function of a bunch of features. And there’s a lot of approaches, a lot of tools you could apply to it. One thing that’s clear to me though, is this data set’s not that big. If your data set’s on the order of gigabytes, that easily still fits in memory on a modern machine. And so you don’t have to necessarily directly distribute the model training on a dataset of that size or so. Now that’s great. That leaves us a lot of options. Like for example, we can simply use XGBoost. It’s a popular gradient boosted trees package to perform this regression, and it’ll run just like any other XGBoost code you use anywhere else. But we may wonder, how can we get Spark involved here? I mean, we’ve got a cluster, we’ve got a nice environment here that can give us access to more resources. How can we take advantage of that even if we don’t need it directly? Well, let’s see. So this code here is probably like the core of any modeling code you’re writing. At some level, all we’re doing is training an XGBoost booster. We’re fitting to the training set and evaluating it and reporting the accuracy of that model. Of course, XGBoost, like just about any package, has a bunch of hyperparameters to tune. Like, what’s the max depth of the tree? What’s the learning rate, what’s the regularization parameter, and so on. It’s not obvious how to fit these, of course, or select these. So commonly we’ll run cross validation and run that through a grid search or a random search over these parameters, to try and find some that seem to work pretty well. Now that’s okay, but I wanna be a little bit smarter about that. I mean, if we’re gonna build not just one model, but hundreds of models to try out all these possibilities, maybe we wanna be a little smarter.
And that’s where a tool like Hyperopt comes in. It’s baked into the runtime, you can use it here. It’s an open source Bayesian optimizer. This is gonna help us try and minimize this loss as a function of all these hyperparameters, but it’s gonna do so in a slightly smarter way. So, as it builds models and learns about their losses, it’ll figure out which combinations seem to be promising and explore those combinations more in depth, and tend to ignore combinations that just don’t seem to be working out very well. So that can save a lot of time, wasted time exploring combinations that just don’t work. But as a bonus, if you use it here, you get really two bonuses. If you define the search space like this, and tell Hyperopt the range of values we’re interested in trying out, and turn it loose here, number one, this is integrated with Spark. So these modeling jobs, which are in themselves unrelated, can run in parallel, some of them at a time, on the Spark cluster rather than serially on one machine. And obviously that has some advantages for speeding up your search. So if we run this code, Hyperopt will use a Spark cluster to build models, and learn, and try more models, and over time the loss of the best models is clearly getting lower, which is great. But there’s another thing going on here that I think is interesting. So if you run this in Databricks, at least, you will see that, via MLflow, all the models this is creating are getting tracked for you in this notebook, or rather the experiment that’s built into this notebook. And you can open the runs sidebar to see those. But I think it’s actually more interesting to see this popped out here.
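A search of this shape could be sketched as follows, assuming `hyperopt` and `xgboost` are installed and a Spark cluster is attached. The search ranges, the number of boosting rounds, `max_evals`, and `parallelism` are illustrative guesses rather than the values used in the talk, and the loss is computed on a held-out validation split passed in by the caller.

```python
def to_xgb_params(sampled):
    """Convert Hyperopt samples to XGBoost parameters (quniform yields floats)."""
    params = dict(sampled)
    params["max_depth"] = int(params["max_depth"])
    params["objective"] = "reg:squarederror"
    return params


def tune(X_train, y_train, X_val, y_val, max_evals=96, parallelism=8):
    """Search hyperparameters with Hyperopt, running trials on Spark workers."""
    # Imported here so this module still loads where the libraries are absent.
    import xgboost as xgb
    from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

    # Assumed search space; adjust the ranges to the problem at hand.
    space = {
        "max_depth": hp.quniform("max_depth", 2, 10, 1),
        "learning_rate": hp.loguniform("learning_rate", -5, 0),  # e^-5 to 1
        "min_child_weight": hp.uniform("min_child_weight", 1, 10),
        "reg_lambda": hp.loguniform("reg_lambda", -3, 3),
    }

    def objective(sampled):
        booster = xgb.train(
            to_xgb_params(sampled),
            xgb.DMatrix(X_train, label=y_train),
            num_boost_round=200,
        )
        preds = booster.predict(xgb.DMatrix(X_val))
        rmse = float(((preds - y_val) ** 2).mean() ** 0.5)
        return {"loss": rmse, "status": STATUS_OK}

    # SparkTrials sends each trial to a Spark worker; TPE proposes the next
    # candidates based on the losses seen so far.
    return fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=SparkTrials(parallelism=parallelism),
    )
```

Higher `parallelism` finishes sooner but gives the optimizer fewer completed results to learn from before it proposes each new batch, so it is a trade-off rather than a free speed-up.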

So what we’re looking at right now is actually the MLflow tracking server, which you can run yourself. You can run this on your laptop, on a machine, and use it and log to it. Here in Databricks it’s built in, it’s run for you, but you can get this outside Databricks too, no problem. So all these possibilities here, these were all generated automatically by Hyperopt. And with each one, I see, if I click through, what hyperparameters it chose and what the loss was. So I’m getting some automatic logging here from MLflow, because Hyperopt and MLflow are themselves integrated. Now, of course, that’s one thing that we can do, we can drill into individual runs, but maybe it’s more interesting to compare a bunch of runs. So if I select, I think these are, well yeah, let’s compare all of these.

It might be a little more useful to you if you look at these all together. So we can see all the runs and compare their numbers side by side, but often it’s more useful to take a look at it in a parallel coordinates plot. So here I might, for example, select just the ones that had the lowest loss, the best ones, and look at what their hyperparameters were like, and maybe learn a little bit about what’s working well. That might further drive my exploration and experimentation. Like, it looks like a higher ‘min child weight’ is tending to work better for those models that are doing well.


Hyperopt’s now helped us find the best combination of hyperparameters. So we might proceed then to build one last model, using MLflow, using those best settings, and build on all the data here. This is another look at how you can use MLflow. So this is the more manual or direct usage of MLflow, which still isn’t very complicated. So here’s my modeling code. And to use MLflow, I really just need to instrument it with MLflow. So for example, I start a run, which is the basic unit of model logging and artifact logging, do my work here, and then tell MLflow what I want to record about this. For example, I wanna record the hyperparameters, and I want to record, of course, the model itself. I need to save that, but you can do other things too. Like here, I’m using a package called SHAP to create an explanation of the model and why it predicted what it did in certain cases, and having it generate a feature importance plot, which I can also log here with the model. So it’s not just models you’re logging. It can be arbitrary artifacts. It can be metrics, parameters. It can be small data files if you want, but that’s less common. Certainly the models themselves. And MLflow has direct support for most common model types or modeling packages. So if we quickly take a look at the docs, you’ll see there’s direct support for, for example, Keras, TensorFlow, PyTorch, scikit-learn of course, XGBoost here, Spark MLlib of course, and a couple of other modeling libraries, including most R functions as well. I should mention that in this example we’re working in Python, but there are APIs for MLflow in Java, Scala and also R. So when we log this, we get another final run here.
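The manual instrumentation described above might look roughly like this. It is a sketch under assumptions: the artifact path `"model"`, the metric name `"rmse"`, and the idea that the explanation plot has already been saved to a local file are illustrative choices, not details from the talk's notebook.

```python
def log_final_model(booster, params, rmse, plot_path=None):
    """Record one MLflow run holding parameters, a metric, the model and a plot."""
    # Imported here so this module still loads where MLflow is absent.
    import mlflow
    import mlflow.xgboost

    with mlflow.start_run() as run:
        mlflow.log_params(params)                   # the chosen hyperparameters
        mlflow.log_metric("rmse", rmse)             # how well the model did
        mlflow.xgboost.log_model(booster, "model")  # the serialized booster
        if plot_path is not None:
            mlflow.log_artifact(plot_path)          # e.g. a saved SHAP summary plot
        return run.info.run_id
```

The returned run ID is what later steps use to refer back to this exact model, for instance when registering it.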

It looks like this. And we see, for example, who ran it, me, when, and a link to the exact revision of the notebook that created it. I can add notes here if I want. I see all the hyperparameters I logged. And most importantly, I see, for example, the model I logged here and its artifacts. There’s the booster, serialized. Here’s a bit of information that it logged about the necessary environment, including MLflow and XGBoost, 0.9 in this case. And that summary plot too, which I’m going to come back and show you in a minute, but you can look at these plots inline too, as artifacts that got logged together.

Now, the next step in the workflow here is going to be to get this model to production. And that’s not just a question of literally creating the production job, but the workflow that leads up to that, and that’s gonna entail looking at the model registry. So I’m gonna pause here. That ends part one, and we’ll pause for a minute, take a break. I’ll be happy to answer some of your questions about what we’ve seen so far. Then we’ll come back and we’ll look at how we’d use the model registry to track the promotion process, and then MLflow to get a production job out of this as a batch job or a REST API, or maybe even a dashboard. So we’ll be back in a minute.

Okay, welcome back. Let’s recap. So, we’ve started with some data engineering to process some raw data concerning health indicators and life expectancy. And we’ve built an initial model that we think relates those health indicators to life expectancy in a meaningful way. Like, we could use it to predict life expectancy given the future state of a country, which is great. So we need to get this model to production, let’s say. And before we say what it means to get something to production, let’s talk about the workflow. So, right now, as you recall, we have this run here in MLflow and we’ve logged the model we’ve built. That’s great. So we know where it is, and we can easily retrieve this programmatically or just go grab it off distributed storage. But, you know, getting a model to production is more than just handing over a file to someone, or even handing over this run ID here. Obviously there’s typically a little more workflow around it, right? We don’t just stick something in production straight away. And that’s where the model registry comes in. It’s this Models tab here at the left. And if you open that and search, you’ll find among other models the one I’ve created here for our demo, called life expectancy. So this is like a logical model. So this is a model that might evolve over time. Today, it might be backed by this XGBoost model we created; maybe in a month we’ll find a new way to build a better model with TensorFlow. And so there’s going to be versions, different versions of this logical model over time. And the model registry exists to help you track those versions and also track what state they’re in. So is it the production model right now, or is it a testing candidate or staging candidate? And who is allowed to move them through those lifecycle stages? So in this instance, here’s where we are now. So this is the data scientist that’s created this model. And maybe she wants to suggest that this could be the new production model, it’s looking good. So in MLflow, you could, for example, use the client to register this run as a new version of this model called life expectancy, and then declare that that version is the new staging candidate. And so after executing that, we’d be in this state right here.
So we’ve got a current production model at version two. And the candidate we’re looking at now is version three.
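The registration and staging step described above could be sketched with the MLflow client as follows. The registered model name `life_expectancy` and the artifact path `model` are assumed names. The stage-based calls shown match the registry workflow in the talk; newer MLflow releases deprecate stages in favor of aliases, so treat this as illustrative of the idea rather than current best practice.

```python
def register_as_staging(run_id, model_name="life_expectancy"):
    """Register a run's logged model as a new version and mark it as Staging."""
    # Imported here so this module still loads where MLflow is absent.
    import mlflow
    from mlflow.tracking import MlflowClient

    # Creates the next version of the logical model from the run's artifact.
    version = mlflow.register_model(f"runs:/{run_id}/model", model_name).version

    # Bookkeeping only: nothing is deployed by changing the stage.
    MlflowClient().transition_model_version_stage(
        name=model_name, version=version, stage="Staging"
    )
    return version
```

Promotion to production later in the workflow is the same `transition_model_version_stage` call with `stage="Production"`.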

So what happens from here? Well, it depends. In many cases, maybe a CI/CD job takes over, maybe another notebook runs, and it loads that latest staging candidate and runs automated tests on it. So maybe it verifies that the model is no less accurate than the current production model, that it does okay on some golden set, for example, and just runs other tests on it. So that could be all you wanna do. And maybe that process then signs off and promotes to production. But I wanna look at possibly a more manual sign-off process, which may be necessary or appropriate for some workflows or organizations. In part, because I want to take a little time to actually talk about the problem we were solving and what one actually finds when you explore the model. So here maybe a data science manager takes over, and her job is to sign off on this model. So maybe she does go back and look at the details here and verifies that all the metrics look sound, and it looks like all the right things were done. But she might also unpack that run, and go pull out those plot artifacts here, like that feature explanation I logged with SHAP. So SHAP, by the way, stands for SHapley Additive exPlanations. It’s a great set of ideas in a great Python package, which you can use to explain your models. Now it’s not built into the Databricks runtime, but as I say, as with any other package that you can load from PyPI or Maven Central, you can simply attach that to a cluster. So in this case, for example, I created the SHAP library in Databricks and just attached it to my cluster so I can use it. It’s not harder than that.
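An automated gate of the kind mentioned here (no less accurate than production, and acceptable on a golden set) could be as small as the following sketch. The function name, the error metric, and both thresholds are hypothetical; a real CI/CD job would compute the inputs by loading the staging and production versions and scoring them on the same data.

```python
def should_promote(candidate_rmse, production_rmse, golden_errors,
                   max_golden_error=2.0, tolerance=0.0):
    """Return True if the staging candidate passes both automated checks."""
    # Check 1: the candidate is no less accurate than the production model.
    no_worse = candidate_rmse <= production_rmse + tolerance
    # Check 2: every prediction on the golden set is within the allowed error.
    golden_ok = all(abs(err) <= max_golden_error for err in golden_errors)
    return no_worse and golden_ok
```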

So what does SHAP do here? This is one of the principal plots that SHAP can create. Here are its top features, the ones that it thinks most influenced the prediction of the model, which is life expectancy, but it goes a little farther than that. So over here, dots are countries and years. And in this case, what this first row is saying is that low values of this feature, this is mortality from, for example, cancer, diabetes, low values in blue, seem to explain higher predicted life expectancy, and vice versa, higher mortality rates go with lower predicted life expectancy. Now that’s not really making a causal assertion, although SHAP does do a pretty good job of trying to account for things like correlated features and confounders and so on. But it’s at least directionally interesting. It seems to say pretty clearly that, of just about any of the inputs, this value has the most influence on the predicted life expectancy. So you might certainly take this as some sign that this is actually the most important feature. And if you look down the list here, there’s certainly some interesting ideas here, but none of them are as big. And none of these include drug-related overdoses, which is maybe the hypothesis we walked in with. That’s interesting.

I’ll just briefly show you a couple of other things you could do with SHAP. You can generate feature dependence plots that try to explain not just the relationship between this important feature and the effect on predicted life expectancy, but its sort of interaction with another feature it chose, which is year here. So you can see how these effects change over time too. And lastly, you can also use SHAP to do things like look at all the predictions that were made for the US and all the predictions that were made on the data set for countries that aren’t the US, and look at how their SHAP values differ. Like, how did their explanations differ? How do they get to the predicted life expectancy differently? And take a difference here. And here again, there seems to be overwhelming evidence that it really is mortality from cancer and diabetes that explains most of the difference in what we see in the US versus the rest of the world.
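The US-versus-everyone-else comparison described above can be sketched as a difference of mean SHAP values per feature. This is one plausible reading of "take a difference", not necessarily the computation in the talk's notebook; the function names are hypothetical, and `explain` assumes the `shap` package is installed and that the model is a tree model such as an XGBoost booster.

```python
import pandas as pd


def explain(booster, X):
    """Compute per-row SHAP values for a tree model as a DataFrame."""
    import shap  # imported lazily; attach or install the package separately
    values = shap.TreeExplainer(booster).shap_values(X)
    return pd.DataFrame(values, columns=X.columns, index=X.index)


def mean_shap_difference(shap_df, is_us):
    """Mean SHAP value per feature for US rows minus all other rows."""
    diff = shap_df[is_us].mean() - shap_df[~is_us].mean()
    # Largest absolute differences first.
    return diff.reindex(diff.abs().sort_values(ascending=False).index)
```

A strongly negative entry means that feature pulls predicted life expectancy down more for US rows than for the rest, which is the pattern the talk reports for mortality from cancer and diabetes.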

Well, interesting. And you can imagine applying this to your own model. It’s worth noting that SHAP can be applied to not just XGBoost models, but deep learning models and really any model as well. But whatever it may be, whatever analysis you may wanna run or CI/CD jobs you may wanna run, that’s probably an important and necessary step before deciding to promote to production. But once you’re ready for production, that’s where the model registry comes back in as well. So in this case, programmatically, you could move this version of the model to production. No problem. You can also do it through the UI. So I could take this version, and maybe I have power just to request that it become the production version. I can add a comment, ready to go.

And then maybe someone else with power to do the final sign-off, like a DevOps person or MLOps person, comes through and approves it. Here I have the power to; I’m an admin. And once we do that, we end up with version three as the new production model. This by itself doesn’t cause anything to happen. This is still just bookkeeping, but this will still be useful, because other systems that need to know what the current production model is can then just point to the model registry and look at what the current production model is and pull it, rather than go figure out which version of what files seems to be the latest one. So it’s helping organize that workflow.

Right. Finally, we move to production. I think it’s worth saying that moving to production has always been a little bit hard, and why? Well, number one, the output of the data science workflow, of the laboratory so to speak, comes in a pretty different context than where it’s going to be deployed. Often production might mean, I don’t know, it might have been a Hadoop MapReduce job in past years. It could be some kind of SQL statement that’s running in production to score things. It could be, I dunno, some Java-based environment too, which is just pretty different from the environment in which the model was created. And that’s one of the biggest obstacles to getting to production. Often you’ll find people still take, I mean, literally a set of coefficients from some model and reimplement that logic, that logistic regression model or whatever it is, in something like Java code, just to move that to production. And obviously that’s time consuming, and obviously that’s error prone too. So one of the key things MLflow tries to do is translate or wrap or transcode the model you’ve logged into a form that’s more useful for production.

Now it’s worth saying that you can always get the model you logged back out. At any time, I can go pull the raw XGBoost regressor, the booster I built, and use that directly. But maybe for me, let’s say production means a batch job. I need to go take that model and score it at scale on a bunch of new data that’s arrived in the past day. Now, obviously I can do that with Spark, but I’d have to write my own UDF, for example, and I’d have to figure out how to get the packages right and so on. And that just seems like more overhead. So one key thing that MLflow can do is give you back most models, most that you can log with MLflow, as a Spark UDF, a user-defined function. So in this code we ask MLflow, ask the model registry, for the latest production version of this registered model, and then ask it to load the underlying model, which happens to be an XGBoost booster, as a Spark UDF. So this becomes a function that I can apply with Spark. And so here I go get some new data, some new featurized data. Maybe this is more recent data that I wanna draw predictions for. I’ll tell you that in the original data, there were not, at the time, actual life expectancy figures after 2016. So maybe it’s coherent to try to infer those. And then all I have to do is apply that function to the data using Spark. So here, I’m gonna add a column, the predicted life expectancy, to the data frame, and I’m done. This could be my entire production job.
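As a rough sketch of that batch job: ask the registry for the production model, wrap it as a Spark UDF (user-defined function) with `mlflow.pyfunc.spark_udf`, and add a prediction column. The model name, column names, and helper names here are hypothetical, and the notebook’s actual code may differ.

```python
def production_model_uri(model_name):
    """Registry URI that resolves to the current Production version."""
    return f"models:/{model_name}/Production"


def add_predictions(spark, features_df, model_name, feature_cols,
                    output_col="predicted_life_expectancy"):
    """Score a Spark DataFrame with the current Production model."""
    # Imported lazily so this module can be loaded where MLflow is absent.
    import mlflow.pyfunc

    # The UDF wraps the logged model and handles packaging it for executors.
    predict = mlflow.pyfunc.spark_udf(spark, production_model_uri(model_name))
    return features_df.withColumn(output_col, predict(*feature_cols))


# Typical use on a cluster (needs Spark, MLflow, and a registered model):
#   scored = add_predictions(spark, new_features_df, "life_expectancy",
#                            feature_cols=new_features_df.columns)
```

The same function should apply unchanged whether `features_df` is a batch DataFrame or a streaming one.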

Now, for use cases where lower latency is required, you may wanna use Spark Streaming and load a streaming data source with Spark and process it that way. The good news is this part would look pretty much exactly the same. That same Spark UDF would work equally well in a Spark Streaming job. So even if you need lower latency, this could be the extent of your production job too, which is really quite amazing when you compare it with the work and the effort it sometimes takes to get to production otherwise.

Now, as I say, we can take the results of the predictions here and join them up with the previous data as well. Here, we just get them into the same form, and here are the predictions. But we can also plot those again, just like above, and look at what 2017 and 2018 look like according to the model. And as you can see, it’s filled in here nicely. They’re a little bit flat for the most part, and I chalk that up to the data being mostly missing. Most of the features here are missing in 2017 and 2018, so filling in the missing values tends to just kind of make a lot of the predictions look pretty similar. But maybe that’s not so important. That’s specific to this particular problem.

Now, one thing I don’t show here is another option for deployment, and that’s a REST API. So sometimes production means publishing a microservice, something where you can send a request, like a REST-formatted request, and have the model make a prediction and send it back as a response. That’s also something that MLflow can do for you. Again, for most model types, it can export a Docker container that contains within it a server that’s serving requests for your model, and that Docker container can be taken and published and run on any infrastructure you like. It could be your own Docker cluster manager, or MLflow also has some direct support for deploying to Microsoft Azure ML or Amazon SageMaker. Both of these services are built around serving models as Docker containers. And they can provide some additional value, like scaling the serving up and down and providing monitoring and other support like that. So in addition to being able to output the Docker container, there are also direct APIs for integrating with things like Azure ML and SageMaker. I don’t show that here in the interest of time, but what you would get is pretty much what you’d expect: a REST-based service, where you send JSON-formatted requests with all the column values and you get back a result that has the model’s predictions, like a number in this case.
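For a sense of what such a JSON request might look like, here is a small sketch that builds one with the standard library. The host, port, column names, and values are hypothetical. The `dataframe_split` payload shape is what MLflow 2.x scoring servers accept at `/invocations`; older 1.x servers took pandas split-orient JSON directly, so treat the exact format as an assumption to check against your version.

```python
import json
from urllib import request


def build_scoring_request(url, columns, rows):
    """Build a POST request carrying feature rows as JSON."""
    payload = {
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(row) for row in rows],
        }
    }
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scoring_request(
    "http://127.0.0.1:5000/invocations",  # hypothetical local model server
    ["year", "cancer_diabetes_mortality"],  # hypothetical feature columns
    [[2017, 14.6]],  # made-up values
)

# Sending it requires a running model server:
#   with request.urlopen(req) as resp:
#       predictions = json.loads(resp.read())
```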

Now, one last form I think I sometimes see production take is a dashboard. So sometimes we need to provide the results of the model as, maybe among other things, a tool that can be used interactively. So maybe we wanna turn this into, like, a what-if scenario dashboard. That’s also possible with MLflow, and we can leverage the simple dashboard support built into Databricks to help build something like that. So maybe we want to provide a dashboard or a widget that looks something like this. So here we have predicted life expectancy in the US over time, where the value of some key feature is being varied from what it actually is. And in this case, by default, the feature we’re going to vary is mortality rate from cancer and diabetes. And so maybe you want to see how changing that value seems to affect predicted life expectancy. Now, of course, we can pair this with MLflow. We can write code to load the model, and then, in response to the selections here, go load some data, send it through the model, and plot the result. But we probably don’t want to share literally this whole large notebook with, perhaps, a business user who doesn’t want to deal with the code, doesn’t necessarily know how to run it, and so on.

Instead, we can export this part, this visualization, as a dashboard and instrument it with simple widgets here that let us, in this case, vary this range. So I’ve selected zero to 18 to start, but maybe I wanna zoom in more on this range here. So I’m gonna change this to 12, give it a second to recompute, and there we go. Now the model’s predictions have been recomputed across this whole range, and I can again see a little more detail there. I can zoom in further, and so on.

So this is a simplistic toy dashboard, but you can imagine this could be more complex, and this might open up a number of other possibilities for you. If the goal of producing the model and pushing it to production is to power something like a dashboard or an interactive tool, that’s also possible. Not everything’s a batch job or a REST service.
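The recomputation behind a what-if dashboard like this can be sketched in a few lines: for each candidate value in the selected range, override the feature in every row and re-run the model. Everything below is hypothetical, including the stand-in `toy_predict` rule, which is a made-up linear formula and not the model from the talk. In Databricks, the range bounds would typically come from notebook widgets rather than being hard-coded.

```python
def value_range(low, high, steps):
    """Evenly spaced values from low to high, inclusive."""
    if steps < 2:
        return [low]
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def what_if_predictions(predict, rows, feature, candidate_values):
    """For each candidate value, override `feature` in every row and predict."""
    results = {}
    for value in candidate_values:
        # Copy each row so the original data is left untouched.
        modified = [{**row, feature: value} for row in rows]
        results[value] = [predict(row) for row in modified]
    return results


def toy_predict(row):
    """Stand-in for a real model: an invented rule, for illustration only."""
    return 85.0 - 0.5 * row["cancer_diabetes_mortality"]


us_rows = [
    {"year": 2015, "cancer_diabetes_mortality": 14.0},
    {"year": 2016, "cancer_diabetes_mortality": 14.0},
]
curves = what_if_predictions(
    toy_predict, us_rows, "cancer_diabetes_mortality", value_range(0, 12, 4)
)
```

Each entry in `curves` is one line on the chart: predicted life expectancy over the years for one assumed value of the feature.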

So, thank you for joining me for this two-part walkthrough of an end-to-end data science workflow, from data engineering all the way through to production, for several different definitions of production. And I showed you along the way some of the basic ways in which MLflow can help that process: not just logging your models and organizing your experimentation, but also managing that sign-off workflow, getting that model out to production in a principled way, and finally actually powering some of the production jobs you may need to run, whether it’s getting a model as a UDF or as a REST service, or indeed powering a dashboard like this. If you are interested in this, if you’re interested in hearing more about MLflow and maybe having a more hands-on interactive workshop to work through, I would encourage you to take a look at Databricks’ YouTube channel, where we do have a three-part workshop, Managing the Complete Machine Learning Lifecycle with MLflow. It’s also gonna go through MLflow, maybe in a little more depth, with a different set of worked examples. And it’s one you can follow along with as well. So with that, thank you again, and I hope you enjoy the rest of the summit. But before I go, I do want to stop, take a break here with you, and pause to answer some of your questions.

About Sean Owen


Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.