Machine Learning Data Lineage with MLflow and Delta Lake


Many organizations using machine learning face challenges storing and versioning their complex ML data, as well as the large number of models generated from that data. To simplify this process, organizations tend to build their own customized ‘ML platforms.’ However, even such platforms are limited to a few supported algorithms, and they tend to be strongly coupled with each company’s internal infrastructure. MLflow, an open-source project designed to standardize and unify the machine learning process, and Delta Lake, an open-source storage layer that brings reliability to data lakes, both originated from Databricks and can be used together to provide reliable, full data lineage across the machine learning life cycle.

In this talk, we will give a detailed introduction to two popular features: MLflow Model Registry and Delta Lake Time Travel, as well as how they can work together to help create a full data lineage in machine learning pipelines.

MLflow Model Registry provides a suite of APIs and intuitive UI for organizations to register and share new versions of models as well as perform lifecycle management on their existing models. It is seamlessly integrated with the existing MLflow tracking component, allowing it to be used to trace back the original run where the model artifacts were generated as well as the version of source code for that run, giving a complete lineage of the lifecycle for all models. It can also be integrated with existing ML pipelines to deploy the latest version of a model to production.

Delta Lake Time Travel capabilities automatically version the big data that you store in your data lake as you write into a Delta table or directory. You can access any historical version of the data with a version number or a timestamp. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports.

A live demo will be provided to show how the above features from MLflow and Delta Lake can work together to help create a full data lineage through life cycles of a machine learning pipeline.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Welcome to Machine Learning Data Lineage with MLflow and Delta Lake. Hi. My name is Richard Zang. I’m a software engineer on the Databricks ML platform team.

Prior to Databricks, I was a Software Engineer at Hortonworks working on Apache Ambari.

Before that I was a Software Engineer at Opentext Analytics building the company’s BIRT iHub visualization. – Hi everybody. My name is Denny Lee. I’m a Developer Advocate here at Databricks. I was previously a Senior Director of Data Science at SAP Concur and also a Principal Program Manager at Microsoft, where I was known for Project Isotope, which is also known as Azure HDInsight. So thanks very much for joining us. Back to you, Richard, for the agenda. – Today Denny and I will provide a brief introduction to MLflow and its Model Registry feature, as well as Delta Lake and its time travel feature.

Then we will show a live demo on how to use various versioning features from these two frameworks to achieve data lineage in the machine learning process.

We know that Machine Learning Development is complex.

ML Lifecycle

To give a sense of it, this is a typical machine learning pipeline. You take your raw data and you do some ETL, featurization or data prep.

Then you want to do some training with this data to produce a model and deploy this model to production: score it, put it behind a REST API serving layer, run it through Spark, et cetera.

And then when you get new data, you will reiterate the process again.

Many organizations using machine learning face challenges storing and versioning their complex ML data, as well as the large number of models generated from that data. To simplify this process, organizations tend to build their own customized machine learning platforms. However, even such platforms are limited to a few supported algorithms, and they tend to be strongly coupled with the company’s internal infrastructure.

MLflow is an open-source project designed to standardize and unify this machine learning process.

As you can see, MLflow has four major components. MLflow Projects is a packaging format for reproducible runs on any compute platform. MLflow Models gives you a general model format with standardized deployment options. MLflow Tracking allows you to record and query experiments and log metrics and parameters. And finally the Model Registry gives you a centralized repository to collaborate on model lifecycle management. Most machine learning libraries let you save the model file, but there isn’t any good software to share and collaborate on these files, especially with a team.

Challenges in Model Management

If you’re working alone, you can probably check the file into a Git repository. You may need to name the files somehow to keep track of your model versions, and hopefully it’s still manageable, because you need to actually remember what you did to come up with each version.

If you’re working in a large organization with hundreds or thousands of models, each with different versions created for many different reasons, this management becomes the major challenge. You may ask: where can I find the best version of this model? How was it trained? How do I add documentation for it? And how can I collaborate with my colleagues to review the model?

MLflow Model Registry

Inspired by collaborative software development tools like GitHub, we launched MLflow Model Registry, which is a repository and a collaborative environment where you can share and work on your models. You can register named models and create new model versions for your registered models. You can comment on and tag your registered models and model versions, so people can collaborate with you and quickly find the latest version of a model and the relevant information about it.

It also has a built-in concept of lifecycle stages. For each model you can have versions that are in Staging, Production or Archived, and it provides a series of APIs for you to easily interface with the Model Registry, so you can do this automatically and test it with your CI/CD pipeline.
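A minimal sketch of driving a stage transition from code, for example from a CI/CD job. It assumes the MLflow Python client's `MlflowClient.transition_model_version_stage` method; newer MLflow releases deprecate stages in favor of aliases, so treat this as illustrative rather than as the speakers' code. The helper name `promote_to_production` and the model name in the usage comment are hypothetical.

```python
def promote_to_production(client, model_name, version):
    """Move one registered model version to the Production stage.

    `client` is expected to be an mlflow.tracking.MlflowClient (or any object
    exposing the same method). It is passed in so the helper needs no live
    tracking server to be exercised.
    """
    return client.transition_model_version_stage(
        name=model_name,
        version=version,
        stage="Production",
        # Archive whatever was in Production before this call.
        archive_existing_versions=True,
    )


# Hypothetical usage against a real tracking server (assumption, not from the talk):
#   from mlflow.tracking import MlflowClient
#   promote_to_production(MlflowClient(), "boston-housing-demo", 1)
```

Passing the client in as an argument is only a design convenience for this sketch; calling the method directly on an `MlflowClient` instance works the same way.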

So the new workflow is that, as a developer, you log your model into the Model Registry, and you can work with any type of model as long as you can package it there. Then your collaborators can go to the model and manually review it, or use automated tools to test it with the MLflow Model Registry API. Then downstream users can safely pull the latest model after it’s been reviewed and checked that it works, and you can also use automated jobs or serving services of your choice with your latest model to do some inference.

MLflow Model Lifecycle Data Lineage

And we can see that the data lineage through the MLflow model lifecycle is as follows. It starts from the training data set that you ETL from the raw data. You may have different versions of the training data, and the relevant parameters and metrics, as well as the model file, can be logged with the tracking API.

Then, in the MLflow Tracking component’s run details page, you can register a new model or create a new version of an existing model. Finally you can manage the different versions of the model and their lifecycle stages in the MLflow Model Registry component.

That is my part. Let’s welcome Denny to talk more about Delta Lake. – Hey, thanks very much Richard.

A Data Engineer’s Dream..

So we have many sessions throughout the summit that talk about Delta Lake, so let’s just focus on some of the key components here. What you really need for proper model data lineage is reliable data, and that’s a data engineer’s dream: being able to process data continuously and incrementally as new data arrives, in a very cost-efficient way, without actually needing to choose between batch and streaming.

Delta On Disk

So underneath the covers, when we’re talking about Delta, what’s Delta on disk? See, it’s a transaction log that sits with your table. You see the _delta_log directory and the JSON files of actions in it, plus the Parquet files themselves. As you’ll note, there are the table versions and also the optional partition directories that you’re working with.

The data files are the original Parquet files that you’re used to working with; packaged together with the log, they are now your Delta table, which ensures ACID transactions. So that way not only do you have reliable data, but you also have a transaction log, so we can go back and look at what the old data looked like when you’re modeling and watching your ML models as well.

Implementing Atomicity

And so the key aspect of implementing atomicity is that changes to your table are stored as ordered, atomic units called commits, all right. You have your first file, 000.json, here.

Then you have a second file, 001.json. If I’m adding the first and second Parquet files, that’s recorded in the first JSON, the zeroth one. The second JSON, 001.json, records the removal of the first and second Parquet files and the adding of the third Parquet file, all right.
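The two commits described here can be replayed with a toy model of the log. This is only a sketch of the idea, not Delta Lake's implementation: real log entries also carry metadata and commit information, and the file names below are hypothetical. It assumes each commit is a list of actions shaped like `{"add": {"path": ...}}` or `{"remove": {"path": ...}}`.

```python
def live_files(commits, version):
    """Replay commits 0..version and return the data files that make up the table."""
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Commit 0 adds two Parquet files; commit 1 removes both and adds a third.
commits = [
    [{"add": {"path": "1.parquet"}}, {"add": {"path": "2.parquet"}}],
    [
        {"remove": {"path": "1.parquet"}},
        {"remove": {"path": "2.parquet"}},
        {"add": {"path": "3.parquet"}},
    ],
]

print(sorted(live_files(commits, 0)))  # ['1.parquet', '2.parquet']
print(sorted(live_files(commits, 1)))  # ['3.parquet']
```

Because the removed files are only dropped from the log's view of the table, replaying up to an earlier commit still yields the earlier table contents, which is the basis for time travel.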

Solving Conflicts Optimistically

What we wanna be able to do is solve these conflicts optimistically if two clients are trying to write to the table at the exact same time. For example, you wanna be able to record the start version, record the reads and writes, attempt a commit and, if someone else wins, check whether anything that you read has changed, as you can see diagrammatically here. So that’s it for the slide portion of the session. Let’s dive into the demos.

– In this demo we’re going to show you how to use MLflow Model Registry and Delta Lake time travel to handle data lineage in the machine learning process. We will also show you how to use various versioning features from these two frameworks to troubleshoot data versioning problems and achieve reproducibility for your experiments.

Here is a notebook where we are going to run some machine learning code with the Boston housing data set prepared by Denny. The Boston housing data set contains a bunch of columns like crime rate, number of rooms and percentage of lower status population. Our objective is to use this data set to train a linear regression model and use it to predict home values. We have a few pre-run cells doing some data preparation and visualization. You can see that we create a data frame by querying the Delta table and then convert it to a pandas data frame. From the scatter plot matrix here you can see that the number of rooms and the percentage of lower status population have a positive and a negative linear correlation, respectively, with the median value of the house, as shown here and here. We can see it even more clearly in the following two separate scatter plots, here and here, as well as in the bar chart showing the correlation of every column with the median home value.

We then define a list of more readable column names and we drop all the rows without median home value for data cleanup. After reviewing the correlation coefficient matrix and scatter plots, let’s choose features that have a strong correlation to the median value.

Let’s say we choose the columns whose correlation coefficient has an absolute value greater than or equal to 0.4.
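A small sketch of that selection rule, written with the standard library only so it stays self-contained; the notebook itself works on a pandas data frame. The column names and the toy numbers are hypothetical, and `pearson` and `select_features` are illustrative helpers, not the notebook's code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_features(columns, target, threshold=0.4):
    """Keep the columns whose |correlation| with the target is >= threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Toy data: two strongly correlated columns and one that is mostly noise.
median_value = [12, 18, 22, 30, 41]
columns = {
    "rooms": [4, 5, 6, 7, 8],
    "lower_status_pct": [30, 22, 15, 9, 4],
    "noise": [1, -1, 0, -1, 1],
}

print(select_features(columns, median_value))  # ['rooms', 'lower_status_pct']
```

Taking the absolute value is what lets a strongly negative correlation, like the lower status percentage, pass the same threshold as a strongly positive one.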

And then we do an 80/20 train/test split for our training and testing data sets.

And here you can see that we’re gonna try different learning rates and choose the one that yields the lowest RMSE.

And we have two training sessions, with ridge and lasso regression respectively, so let’s run the training sessions.

As the training session is running, let’s take a look at our training function. It takes our training and testing data sets, the regression type and the learning rate alpha. First it creates an MLflow run and initializes the linear regression object based on the regression type. Then it fits the training data set and collects all the prediction outputs. Then it calculates our RMSE and R2 metrics and uses the MLflow API to log all the parameters and metrics. It also logs the linear regression object as an sklearn-flavored model. Finally it creates a prediction error plot plus a residual plot and logs both of them as run artifacts using the MLflow API.
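For reference, the two metrics the training function logs can be written out directly. This is a standard-library sketch of the usual definitions, assuming plain lists of numbers; the notebook presumably relies on a library implementation, and the sample values below are made up.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]

print(round(rmse(y_true, y_pred), 4))  # 0.8165
print(r2(y_true, y_pred))              # 0.75

# Inside an active MLflow run these values would be logged with, for example,
# mlflow.log_metric("rmse", rmse(y_true, y_pred)).
```

A lower RMSE means predictions sit closer to the true values, which is why the demo sorts runs ascending by RMSE to pick the best one.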

After the training process, we can see a list of runs in the notebook’s run sidebar. Let’s choose RMSE, sort ascending by RMSE, choose the lowest RMSE and go to that run.

Now we can see the run details page. Here we can see the parameters and the metrics that we logged in the notebook, and in the artifacts section we can see the MLmodel file, indicating the sklearn-flavored model we logged, and the PNG files for the plots that we logged.

Since this is the best run we have, let’s register a model using this run. To register a model, we first select the model folder in the run’s artifact section, click Register Model, choose to create a new model, use Boston housing demo as the new model’s name and click Register.

And we can see that the first version of this model is being created.

Let’s wait for it to finish creating.

Now it’s finished creating the new version of this model, which is basically version 1 of this model. Let’s go to the model version page. So this is the model version page. We can go back to the registered model page and see that version 1 is the only model version we currently have. Since I want to collaborate with Denny on this registered model, I’ll give Denny manage permission on this model. So I add Denny here, choose Can Manage for him and click Save.

If you want to load this model in our notebook, we’ll need to switch this model to the production stage. So let’s say ship it, okay. Now our model is in the production stage. We can go back to the notebook, where there is a cell at the bottom in which we pick a row from our training data set to test the model and see the prediction. Let’s run the cell.

As we can see, the prediction is 23.7925, which is pretty good.

– So one thing I had noticed when working with this notebook is that, as you can see from Richard’s model if I go ahead and dive into it a little bit, he was actually using a different version of scikit-learn. He was actually using 0.22.1 and I actually wanna use a different version of it. So what I’m gonna do is go back and rerun the whole thing, and you’ll notice that I can jump right into here and find the lowest RMSE. So I’m just gonna pop that open and quickly go ahead and deploy a different version of the MLflow model.

All right, give it a couple seconds here. So I’m gonna grab this one, this model. I’m gonna register this model similar to how Richard had done before. I’m gonna register it and then now I’m on v2. It’ll take a couple seconds for it to go through, just to make sure. I’ll just take a look at it real quick and transition that to production, which will automatically…

Perfect. So now that I’m good to go, I’m gonna go back to my notebook, close up the runs and rerun this particular cell again. And as it goes through you’ll again see a value of 23.79, based on the original median value of 17.8. But let’s say now I want to go ahead and replace the null values that we actually had. So remember that in this particular Boston Housing Kaggle data set there are 506 rows, but only 333 of them actually have a median value. The remaining 170 or so are null, okay? So I’m gonna go ahead and instead use this particular model to update my table. I’m gonna actually fill this with new values, and so here you go. Right now what I’m doing is updating our Delta table, in this case matching by ID, with the new values that were calculated using this particular model. So if I scroll down and take a look at the values, you’ll notice that I have not only the original values inserted in…

Perfect. I just have to scroll to the right here, okay. So you have the original values, with either one or two decimals or none for the median value, but also the new values if I scroll down.

There you go. These are the new values that were predicted by our model and that we’ve inserted back in. Now, for somebody like myself this is probably a bad idea, but fortunately, because I’m using Delta Lake, I’ve actually saved all this information, because we kept the transaction log, which Richard’s gonna show. But meanwhile I’m just gonna go ahead and hide the fact that I did this and delete the cells, okay. So now I’ve got the updated data, which is sort of incorrect, produced using the updated model, all right. Oh! And then let me go ahead and run this all over again, and I’m gonna register the new model based on the updated data, which probably isn’t the best idea, but again I’ve got a new RMSE value, so let’s just let it run through and we’ll see what ends up happening. I’ll have the new results coming in and I’m gonna register this as a third model.

All right, perfect.

So we’re almost finished.

Excellent. So now let’s go back to this. I’ll go back and choose RMSE. Now I have an even lower one of 4.331. I’m gonna make this my new model. So I’ll go back here, register this as a third model of our Boston housing demo, transition this to production as well and click on okay. So if you look at the models, you’ll actually see the three models. There’s the first one, which Richard created using a more recent version of scikit-learn. There’s the second version, which I ran with an older version of scikit-learn, and the third one, in which I went ahead and updated the median value data. So if I go back and scroll down, I’m gonna keep this cell for the purpose of understanding what’s going on. I’m gonna run this one, but this time it’s against the new model and the new data. And it helps if I actually write the correct name.

And here you go! You’ll notice that it’s actually the same data point, but instead of 23.79 I have a value of 23.13 now. All right. So that’s it for this part.

– Now I’m back to the notebook. What I want to do this time is to retrain the Boston housing model and see if we can reproduce the exact same result. However, I noticed that Denny has rerun the notebook and created a new prediction. The code looks the same but it generates a different prediction value. Is he using a new model? Let’s go ahead and check.

Let’s go to the model page and search for Boston. Here we can see that there are already three versions in the Boston housing demo model and that version 3 is now in production. No wonder the prediction is different. Let’s check what’s different in version 2 and version 3. So in version 2 Denny says that he switched to 0.20.3. This looks like a scikit-learn library version. I think if we want to reproduce the same result we just need to use the previous scikit-learn version, which is the newer one, and that’s fine. Let’s check version 3. In version 3, Denny says he is updating to include predicted values for the median value. This doesn’t look very correct. Wow, if I understand it correctly, he has probably updated our training data set in the Delta table with some values from the prediction output. That doesn’t sound like a good practice in machine learning. So let’s check our Delta table and see what’s happened to it. Let’s go back to our notebook, where you can see that I’m now using a cluster with the latest scikit-learn version, and let’s check our Delta table here.

And in the Delta table history, we can see that there’s a new version created by Denny, and from the operation metrics we can see that 173 rows were updated and the total number of output rows is 506, okay. Let’s check version 0 and version 1 respectively and see what the content of the two versions is. So V0 looks pretty legit, and it has a bunch of rows with the median home value being null. In the second one everything’s filled, and wow, these look like what Denny described, the predicted values, right? This doesn’t sound right. So if I want to reproduce my training and my experiment, I’ll need to overwrite the table back to V0, okay. So I’m gonna go ahead and do that.
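The inspection and rollback just described could be sketched as follows with the Spark DataFrame reader and writer. It assumes a path-based Delta table and the `versionAsOf` read option; the helper names and the path in the usage comment are hypothetical, and the notebook's actual cells may differ.

```python
def read_table_version(spark, table_path, version):
    """Load the Delta table at `table_path` as it looked at `version`."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
    )


def roll_back_to_version(spark, table_path, version):
    """Overwrite the table with an older snapshot.

    The overwrite is itself a new commit, so history is kept and the
    rollback shows up as one more version in the table history.
    """
    snapshot = read_table_version(spark, table_path, version)
    snapshot.write.format("delta").mode("overwrite").save(table_path)
    return snapshot


# Hypothetical usage in a notebook where `spark` is an active SparkSession:
#   path = "/tmp/boston_housing_delta"   # assumed location, not from the talk
#   spark.sql(f"DESCRIBE HISTORY delta.`{path}`").show()
#   read_table_version(spark, path, 0).show()
#   roll_back_to_version(spark, path, 0)
```

Rolling back by overwrite rather than deleting the bad version keeps the faulty update in the log, so it can still be audited later.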

Okay, done. Then let’s check the Delta table again. We can see that we have another new version, in which I rolled back to the previous V0. That reverts our data set version and gives us the data lineage on the data set, okay. This is pretty much what we need to do to get the data set aligned. Now we have the same data set and the same library environment, so let’s run the training and see what the output is. To have a clean slate for the training, I’ll need to clean up the previous experiment runs. So those are the previous runs; just for the convenience of reproducing with a clean slate, let me delete all those old runs. Okay, deleted. Let’s go back to the notebook and refresh the run sidebar. All the runs are gone, click save. And then let’s go ahead and rerun the training process, all the way to here. So let’s run everything above the prediction, run all above.

After this new training session with exactly the same data set and exactly the same scikit-learn version, I will register a new model version the same way we did earlier: select the run with the lowest RMSE and then use the run artifact of that run to register a new model version. Looks like it finished running. Let’s open the run sidebar, choose RMSE and sort ascending. We’ll see that this is the lowest RMSE, so let’s go to this run

and register it as the new version of our Boston housing demo model.

And you can see that the v4 of the model has been created. Okay, it’s finished.

This is v4 of the model, so let’s move it to production and do some prediction based on v4, which is now in the production stage. Let’s copy the prediction code here and run it down here.

Mm-hm, 23.7925. Looks like the same value we got before. To confirm, I will go back to version 1, make version 1 production and then run the prediction again, because the prediction is always based on the current production model version. So I wanna see if version 4 and version 1 output the same prediction value. Let’s run it. Nice, exactly the same, 23.79. Cool. So to recap: in this demo, we first trained a linear regression model on the Boston Housing data set and created model version v1. Then we messed up our library version and the original training data set in the Delta table. After we found the problem, we used Delta Lake’s time travel feature to switch back to the original version of our training data set and reran the training process with a consistent scikit-learn library version. We ended up reproducing the same result we had in our previous training session. This is the end of our demo. Thank you. – Well, thanks very much, Richard, for those awesome demos. Thank you very much for attending our awesome session today. If you want to go ahead and dive in more, please join us or for more information.

About Richard Zang


Richard Zang is a software engineer on the ML Platform team at Databricks. Richard has great interest and extensive experience building data-intensive enterprise applications. Before Databricks he worked at Hortonworks on Apache Ambari, and prior to that he worked at Opentext Analytics building its BI visualization suite. Richard holds an MS in Computer Science from the University of Chicago and a BE in Software Engineering from Sun Yat-Sen University.

About Denny Lee


Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments. He also has a Masters of Biomedical Informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise Healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.