Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific


Thermo Fisher Scientific has one of the most extensive product portfolios in the industry, ranging from reagents to capital instruments, serving customers in biotechnology, pharmaceuticals, academia, and more. The amount of data needed to construct a comprehensive view of customer needs is massive. To build a data science ecosystem capable of dealing with this amount of data, each step in the ecosystem, from data engineering to model development to delivery, has to be scalable.

With scalability in mind, Thermo Fisher partnered with Databricks to build an end-to-end data science pipeline with CI/CD standards, and further augmented our capabilities through the use of the latest technologies such as MLflow, Spark ML, and Delta Lake. This talk is a summary of our journey from past to current state, as well as a look ahead to the future of our platform.

Key takeaways:

  • Utilizing big data for machine learning requires not just machine learning knowledge but also the technical infrastructure to support continuous development, deployment, and delivery of machine learning models.
  • How you can build a scalable data science pipeline with the latest Databricks technologies.


Video Transcript

– Thank you all for joining my session today. My name is Allison Wu and I’m a Data Scientist in the Data Science Center of Excellence at Thermo Fisher. Today I’m going to talk about how we enable scalable data science pipelines with MLflow and Model Registry in our company.

Key Summary

So, before we go into any details, I’d like to give you a high-level summary of what we have achieved in this process. We have standardized the development of machine learning models by integrating MLflow tracking into the development pipeline, and we have also improved the reproducibility of machine learning models by integrating GitHub and Delta Lake into the development and deployment pipelines. We have also streamlined our development and deployment processes for machine learning models on different platforms through MLflow and a Centralized Model Registry. Why this was so important for our team to set up is closely tied to what we data scientists do in our Data Science Center of Excellence.

So what do data scientists at our Data Science Center of Excellence do? We generate a lot of novel algorithms that can be applied across different divisions, and we work with cross-divisional teams to migrate models. In these cases, model standardization is actually very important for both productivity and reproducibility. A lot of model migration and standardization is often needed to bring new data science into a new division. While doing all this, we are also responsible for establishing data science best practices across the company. There are multiple areas where data science is rapidly growing, for example operations, human resources, R&D, and commercial marketing. We’re actively engaged in all of these areas, but today I’m gonna focus on commercial marketing.


Commercial & Marketing Data Science Life Cycle

So what does the commercial marketing data science life cycle look like? First, we have all the different data pipelines piping in all kinds of data, including transaction, web app, web activity, or install base data from customer interactions. All these data pipelines pipe into the same data lake that data scientists consume for model development and deployment. We have both machine learning models and rule-based legacy models currently running in production. Model development and deployment is the phase we will be focusing on a lot today. For this process, we involve a lot of new technology in Databricks such as Spark, MLflow, and Delta Lake, and we also have GitHub and (speaks faintly) throughout this process to help us with reproducibility.

After we develop all these models, the models deliver their results and recommendations through different channels, such as email campaigns or the website, and we also generate very prescriptive recommendations for our sales reps through Salesforce or Adobe Analytics. All of these are meant to provide the most relevant offers for our customers. How the model performs is measured by the revenue generated or the engagement rate for each recommendation. All of this is eventually fed back to the model development and deployment phase, or potentially to the data processing pipeline in order to, for example, bring in new data to help us understand our customers.

Model Development and Deployment Cycle

As we are focusing on the Model Development and Deployment Cycle, let’s take a closer look at what each phase involves. The Development phase means exploratory analysis and model development, such as feature engineering, feature selection, model optimization, all that good stuff. When data scientists have developed their model to the point that we feel it’s ready for deployment, we move the model from the development environment into the production environment. This is the process we call Deployment. During the Deployment phase, a model that lives in production can go through a few different processes. For example, it has to be run every day for scoring, and it may also need to be retrained or retuned every certain period of time. This can all be done in production. To monitor all these processes we have another process we call management, which monitors all these production runs to make sure our models are producing accurate results, and it can also alert us whenever there are weird events going on for models running in production. A lot of times we also track feedback through this process, and this feedback is either fed back to the deployed models in production or all the way back to the Development phase for things such as new model development.

One thing I’m not gonna talk about today, but which is also a very important stage, is Delivery. A lot of delivery is done using the recommendations generated in the production environment. So production environment results feed into different delivery channels like web recommendations or email campaigns.

An Example Model Development / Deployment Cycle

So let’s take a closer look at an example model in our pipeline. This is a model that generates product recommendations based on different customer behaviors such as web activity or sales transactions.

This model spent about six to eight weeks in exploratory analysis, model development, and prototyping. After development, it was moved to production, where it mostly runs in two ways. First, it does daily scoring, which means it generates a new input matrix based on new data and runs it through the same model every day to make sure we get the most accurate predictions based on new data. This model is also retrained and retuned with the latest data every two weeks.

The recommendations produced by this model are delivered through email campaigns and also through our sales reps’ dashboard, to recommend which customers are best to engage with for this specific product.

The very last part is the management part: these production processes are also monitored in our production environment through MLflow.

What we used to do in development is that we had a bunch of Databricks notebooks, with no version control for any of the notebooks and no unit testing for our feature functions or anything. Regression testing was very hard to do, especially if you were inheriting a notebook from a previous colleague. This is a situation that we are probably all pretty familiar with: we have multiple versions of the final notebook, (speaks faintly) and we confuse ourselves about which exactly is the one we should be using.

What we do now is that, in the Development environment, we still use Databricks notebooks for exploratory analysis and things like feature engineering, but once we get more comfortable and feel a feature has grown mature enough, we write it into Python modules and Python functions that are version controlled in GitHub. Each machine learning feature is independently testable and shareable. Aside from integrating GitHub, we also integrate Delta Lake to version control all the data used to train our models. Combined with MLflow, we are also able to track ML development, so we can track all the hyperparameter tuning and how feature selection is going across different experiments.

What we now do…

By the time we feel more comfortable with the model in development, we can register it into the Development Model Registry. This makes regression testing against previous versions of the model much easier, because it provides a cleaner interface for us to pull out the previous version and look at how the two models compare. I will have a few demos on that as well.

So here are a few scenarios that we can look at to see how exactly this process improves our development.

Tracking Feature Improvements Becomes Easy

First, tracking feature improvements becomes a lot easier. How does that go? We have all run into this: our boss comes over and asks, oh, what are the important features in this version versus the previous version? What we used to do is, oh, let me find out how the features do in my….uh…. model_version number 10 notebook. And I (mumbles) sometimes I wish I had saved a screenshot of the feature importance figure, and I didn’t. Too bad. I believe this is what we are all familiar with: we have so many different final versions that we don’t even know which one we should pull to get the data from the previous version. How do we improve on that? Now what we can do is say, sure, we’ll just pull it from MLflow. This is what it looks like. So this is the video.

Here we can see there are multiple MLflow runs logged into an MLflow experiment. Since I want to compare with a previous version, I can just click and choose a few different models that I want to compare. Then we can go into each one and see the feature importance figure logged together with the model. This way we no longer need to worry that, oh, I did not log the feature importance figure this time, because we can log it all through MLflow, and it will (mumbles) always be tracked. Sharing machine learning features also becomes a lot easier.
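As a rough illustration of logging feature importances together with a run, here is a minimal sketch. It is not the code from the talk: the function names `rank_features` and `log_training_run` are hypothetical, the ranking is stored as a JSON artifact rather than as a figure, and `mlflow.log_dict` is assumed to be available in the installed MLflow version.

```python
def rank_features(importances):
    """Return (feature, importance) pairs sorted from most to least important."""
    return sorted(importances.items(), key=lambda kv: kv[1], reverse=True)


def log_training_run(params, metrics, importances, run_name=None):
    """Log hyperparameters, metrics, and feature importances to one MLflow run.

    Hypothetical helper; returns the run ID so the run can be found again.
    """
    # Imported lazily so rank_features stays usable without MLflow installed.
    import mlflow

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)    # e.g. {"max_depth": 6}
        mlflow.log_metrics(metrics)  # e.g. {"val_auprc": 0.41}
        # Keep the ranking next to the model so later versions can be compared.
        mlflow.log_dict(
            {"ranked_features": rank_features(importances)},
            "feature_importance.json",
        )
        return run.info.run_id
```

Because the ranking lives in the run's artifacts, comparing two model versions only requires opening the two runs, not finding the right notebook.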

Sharing ML features Becomes Easy

A common scenario is that a colleague comes over and asks me, oh, I really liked the feature you used in your last model. Can I use that as well? What we used to do is, oh, just copy and paste this part of the notebook. But wait, I have a slightly different version for this other problem, I think I might have used this one.

This is usually not the greatest idea, because we often confuse ourselves and are not sure which one to actually share. It’s also really hard to track the changes to these functions over time.

Now, with GitHub integrated into our development workflow, what we can do is say, sure, I added that feature to the shared ML repo. Feel free to use it by importing the module. As you can see on the side here, magnet is our internal shared machine learning repo, and all these feature functions are testable. We can write a lot of unit tests for them to make sure they fit all different kinds of situations, and also to make sure that if anyone modifies a function in the future, it doesn’t break other people’s code. What’s even cooler is that you can also log the exact version of the repo you used into MLflow. Here you can see that we log not just the environment setup, the conda YAML, but also the source packages. That’s how we make sure we always log the exact version. So even if the repo continues to evolve after your model development, you can still trace back to know which version you used for your own model. This way you can always reproduce it with the exact dependencies and environment setup.
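To make the idea of a testable, shareable feature function concrete, here is a small sketch. The feature `days_since_last_order` and its test class are invented for illustration and are not taken from the internal repo mentioned in the talk.

```python
import unittest
from datetime import date


def days_since_last_order(order_dates, as_of):
    """Days between the most recent order on or before `as_of` and `as_of`.

    Returns None when the customer has no order on or before `as_of`.
    """
    past = [d for d in order_dates if d <= as_of]
    if not past:
        return None
    return (as_of - max(past)).days


class DaysSinceLastOrderTest(unittest.TestCase):
    """Unit tests that guard the feature against accidental changes."""

    def test_uses_most_recent_order(self):
        orders = [date(2020, 1, 1), date(2020, 1, 20)]
        self.assertEqual(days_since_last_order(orders, date(2020, 1, 25)), 5)

    def test_ignores_future_orders(self):
        orders = [date(2020, 2, 1)]
        self.assertIsNone(days_since_last_order(orders, date(2020, 1, 25)))

    def test_no_orders(self):
        self.assertIsNone(days_since_last_order([], date(2020, 1, 25)))
```

Tests like these can be run with `python -m unittest`, so a colleague who modifies the function finds out immediately if the change breaks existing behavior.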

So what did we learn from improving this development process? Reproducing model results relies not just on version control of code or notebooks but also on version control of the training data, environments, and dependencies. MLflow and Delta Lake allow for tracking all these things needed to reproduce the model results. Integrating GitHub also allowed us to establish best practices for accessing our data warehouses and to standardize our machine learning models. It also really encourages collaboration and review among different data scientists. Personally, I think the last one is a big plus, since data scientists a lot of times work in silos and having a collaborative platform is so important for data scientists to grow.

Let’s talk about deployment….

Okay, so let’s talk about Deployment. What happens all the time is that things work in development and everything breaks in production. How can we streamline that process? What we used to do is manually export and import all the Databricks notebooks needed for a deployment. We also manually set up all the different clusters in production based on the clusters in development. This made troubleshooting super hard. What ended up happening a lot of times is (speaks faintly) coming in on Saturday to sit alongside data engineers to make sure the model could be deployed correctly without error. This is not something we want.

So we have tried really hard to streamline this process. What we now can do is register our model into the Development Model Registry, and then move all the artifacts needed to deploy the model from the Development environment into the Production environment by registering the model and copying all the artifacts into our Centralized Model Registry in the Production environment. This also makes regression testing within the Production environment very easy, because you can first move the model into staging in production, and now that the staging and production models are in the same environment, you can easily do comparison and testing within that environment. This also gives us an easier way to track and monitor all models. You can see that in the Centralized Model Registry we now have different versions of models, and we can clearly see which ones are in production and which ones are in staging. We can still use notebooks to execute all these model pipelines and use MLflow to track how the model actually performs in the Production environment. It can track specific (mumbles) of the model output, so whenever there are weird events going on we get alerted first.

Another thing that is very important is that the Centralized Model Registry allows us to manage models across different environments and different platforms. Some of our data scientists on different teams like to use Databricks, some like to use SageMaker. Either way, they can all register through the same Centralized Model Registry, and models can also be deployed to different environments based on the original input. So here I’m gonna have a short demo on how we can register a model from the Development environment into the Production environment.

Deploying and Managing Models Across Different Platforms through a Centralized Model Registry

– [Instructor] Here’s a demo of how we can register models from the Development shard to the Centralized Model Registry on the production shard. For this demo, I deploy to our test shard instead of our actual production shard. In reality, this is a way to use Model Registry as a hub to manage models developed on different platforms. This is a new feature in MLflow and only works with MLflow 1.8 and above.

Before we start anything, we need to set up the credentials so that when you reach out to the production shard, it can authenticate that it’s you. You’ll need to create a token on your destination (mumbles) and use this token for the handshake from your source shard. You can store these credentials in Databricks secrets and create a Databricks profile here on your local shard. Now you’re all set up for the connection between the source and destination shards. See here.

Now we can start finding the model that you actually want to register from your source shard to your destination shard. As you can see here, in order to find the run ID, you can either pull it from a Model Registry or from an MLflow experiment.

However, it’s usually better practice to deploy to production using models that are already registered and approved in Model Registry, so this is what I’m going to demo today. Here is how you can create a client for the MLflow tracking server, and this is how you can specify which model you want to look for. This directly pulls the latest version of the Production model in this Model Registry. And this code gives you the absolute path for all the artifacts that you registered for the model. By parsing this artifact path, you can get the experiment ID and run ID as well. The experiment ID is usually this one and the run ID is this one.
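The notebook code is not shown in the transcript, so the following is only a sketch of the lookup and parsing step described above. It assumes the Databricks-style artifact layout `dbfs:/databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<path>`; the function names are hypothetical, while `MlflowClient.get_latest_versions` and the `source` attribute of a model version are existing MLflow APIs.

```python
def parse_databricks_artifact_uri(source):
    """Split an artifact URI into (experiment_id, run_id, artifact_path).

    Assumes the layout
    dbfs:/databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<path>.
    """
    marker = "/mlflow-tracking/"
    if marker not in source:
        raise ValueError(f"unexpected artifact URI layout: {source}")
    parts = source.split(marker, 1)[1].strip("/").split("/")
    if len(parts) < 3 or parts[2] != "artifacts":
        raise ValueError(f"unexpected artifact URI layout: {source}")
    experiment_id, run_id = parts[0], parts[1]
    artifact_path = "/".join(parts[3:])
    return experiment_id, run_id, artifact_path


def latest_production_source(model_name):
    """Return the artifact URI of the latest Production version of a model."""
    # Imported lazily so the parser above works without MLflow installed.
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    versions = client.get_latest_versions(model_name, stages=["Production"])
    if not versions:
        raise LookupError(f"no Production version found for {model_name}")
    return versions[0].source
```

For a source such as `dbfs:/databricks/mlflow-tracking/1234/abcd5678/artifacts/model`, the parser yields the experiment ID `1234`, the run ID `abcd5678`, and the artifact path `model`.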

After we have the experiment ID and run ID and know exactly where the model is, we can transfer the model from the Local Workspace to the Destination Workspace. Here’s how we can do it. You can see here that we just need to give it a run ID, artifact path, and URI, and then it can initiate the copying. Here you can see it’s copying over the model that we specified. In this case it preserves the exact same artifact path between the destination and the source shard, so it also preserves the experiment ID and run ID.

So, this is the step where you actually create the registry entry on the Central Model Registry and point it to the actual model, by registering the model with the source that you just transferred. Here you just need to specify the run ID you derived from the Local Model Registry.
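A minimal sketch of this registration step might look like the following. The helper names are hypothetical, the registered model name is assumed to already exist in the central registry, and the `run_link` and `description` arguments of `create_model_version` are assumed to be supported by the installed MLflow version. The `databricks://<scope>:<prefix>` URI form refers to credentials stored in Databricks secrets.

```python
def registry_uri_from_secrets(scope, prefix):
    """Build the URI that points MLflow at a remote Databricks workspace.

    `scope` and `prefix` name the Databricks secrets holding the remote
    workspace credentials (both values are placeholders here).
    """
    return f"databricks://{scope}:{prefix}"


def register_in_central_registry(client, model_name, source, run_id,
                                 source_workspace, run_link=None):
    """Create a new version of `model_name` in the registry behind `client`.

    `client` is expected to be an mlflow.tracking.MlflowClient that was
    created for the remote registry; the registered model name is assumed
    to exist there already.
    """
    # Record where the model came from so the original run can be traced.
    description = (
        f"Source workspace: {source_workspace}; source run ID: {run_id}"
    )
    return client.create_model_version(
        name=model_name,
        source=source,
        run_id=run_id,
        run_link=run_link,
        description=description,
    )
```

One way to build the remote client under these assumptions is `MlflowClient(registry_uri=registry_uri_from_secrets("my-scope", "my-prefix"))`. A version created this way could later be promoted with `client.transition_model_version_stage(name, version, stage)`.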

As I said, we’re preserving the same run ID on the source shard and the destination shard, so on the destination shard the run ID is exactly the same. In order to create that model registration on the remote server, you have to create a remote client here, using the tracking URI you just set up. This is how you can create a model version using the model you just transferred from the source to the destination shard. You can also update the metadata on the Remote Central Registry. I usually save the Source Workspace information and the run ID, and if security allows you can also include the Run URL. This is what it looks like in the destination shard. You can see it has the DEV Source Workspace and the run ID, and you can also see the Run URL, which points to the original experiment run. I can click on that to see what’s over there. And (faintly speaking) local, now I could reproduce this model, and I also included the feature importance there (faintly speaks) model performed with. You can also update the model version to production if you want, all through this remote client you set up on the local shard.

So this is how you can register a model to the Central Model Registry. What I’m gonna demo next is how we can actually use a model. Remember what I mentioned: for a model in production, a lot of times we use the model for daily scoring, and sometimes we need to retrain or retune it, maybe every two weeks or every month. This is a demo of how I can use the model in the Production Model Registry to do daily scoring and use MLflow to monitor the output.

First, here is how we can get the model. It’s similar to what we just demoed in the other notebook: how we can get the artifact path for the model. One thing that’s very important, as we mentioned before, is that we want to make sure we install all the right dependencies in order to run the model. Here is how I can unpack all the source packages that were logged together with the model. This is how I pull them from the artifact path and then pip install all the packages in that folder.

Once all the packages are installed, my environment is all set up and I can run the daily scoring. Here I start an experiment run so that I can log all the metrics I want to monitor for this model. In this case I’m only doing scoring, so I bring in the modules that I installed with the source packages, which means I’m using the exact same functions to produce the input matrix. After I’ve produced the input matrix, I can version it with Delta Lake. So here I overwrite and save the input matrix into a Delta Lake file.

After I generate the input matrix, I can load the model directly from the Model Registry by specifying that I want to run the production model. (faintly speaking) need to specify which version I want to run and which registry I want to get the model from. This way I get the model, and then I can run it using the new input matrix I generated that day, and get all my predictions.
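Loading a registered model by stage and scoring with it can be sketched as below. This is an assumption-laden illustration rather than the notebook from the demo: the helper names are hypothetical, while `models:/<name>/<stage>` URIs, `mlflow.set_registry_uri`, and `mlflow.pyfunc.load_model` are existing MLflow APIs.

```python
def registry_model_uri(name, stage_or_version):
    """Build a `models:/` URI for a registered model stage or version."""
    return f"models:/{name}/{stage_or_version}"


def score_with_registered_model(name, stage, features, registry_uri=None):
    """Load a model from the registry and return its predictions.

    `features` is the input matrix for the day, for example a pandas
    DataFrame produced by the shared feature functions.
    """
    # Imported lazily so registry_model_uri works without MLflow installed.
    import mlflow
    import mlflow.pyfunc

    if registry_uri is not None:
        # Point MLflow at the central registry instead of the local one.
        mlflow.set_registry_uri(registry_uri)
    model = mlflow.pyfunc.load_model(registry_model_uri(name, stage))
    return model.predict(features)
```

Passing a stage such as `"Production"` always resolves to the latest version in that stage, while passing a version number pins the run to one specific model version.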

At the end, I log all the parameters that I used for this run, for example the artifact path, which stage of the model I pulled from, (mumbles) the registered model name, and the prediction date I used to generate the input matrix. Sometimes I also log metrics on the scoring, for example the total row count or, for a binary classification model, how many positives and how many negatives I get. Sometimes I see a huge spike of positives and I know something is wrong.

– So this whole process also makes regression testing a lot easier.
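The scoring metrics mentioned above (row count, positives, negatives) can be computed and logged along these lines. The helper names and the spike rule (positive rate above a multiple of a baseline) are assumptions made for illustration; the talk does not say how a spike is detected.

```python
def scoring_summary(predictions):
    """Count rows, positives, and negatives in a list of 0/1 predictions."""
    total = len(predictions)
    positives = sum(1 for p in predictions if p == 1)
    return {
        "row_count": total,
        "positive_count": positives,
        "negative_count": total - positives,
        "positive_rate": positives / total if total else 0.0,
    }


def is_positive_spike(rate, baseline_rate, tolerance=2.0):
    """Flag a run whose positive rate exceeds `tolerance` times the baseline.

    The threshold rule is a hypothetical example, not the talk's method.
    """
    return baseline_rate > 0 and rate > tolerance * baseline_rate


def log_scoring_run(predictions, params):
    """Log run parameters and the scoring summary to a new MLflow run."""
    # Imported lazily so the pure helpers work without MLflow installed.
    import mlflow

    summary = scoring_summary(predictions)
    with mlflow.start_run():
        mlflow.log_params(params)    # e.g. model name, stage, prediction date
        mlflow.log_metrics(summary)  # one numeric value per metric
    return summary
```

Logging the same few numbers on every daily run gives a history in MLflow against which an unusual day stands out.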

Regression Testing Becomes Easy

A lot of times we want to compare two different models. What we used to do is dig through all the previous colleagues’ notebooks, and maybe find that the performance metrics were not even logged. What we now can do is pull this directly from MLflow and compare them. Here there are two versions that we can choose and compare in the Model Registry, and then we can see all the metrics that I logged side by side for these two models, for example the validation set’s area under the precision-recall curve. We can also compare them this way and see how different features affect these different models and how their performance relates to different settings.

Troubleshooting transient data discrepancies also becomes (mumbles). A lot of times data engineers come and ask us why the run yesterday yielded a weird number of predictions that’s not usual. It used to be really hard to troubleshoot this kind of problem, because maybe today’s run had already overwritten the input tables. What we now can do, because we version control all our source data through Delta Lake, is load the specific version that was used for yesterday’s model to troubleshoot this kind of problem. That makes it a lot easier, and you can see on the right side there that, even though it’s the same Delta Lake file, we can load this specific version of it.

So what we learned from improving this Deployment process is that data scientists really like the freedom of trying out new platforms and new tools, and allowing for that freedom of platforms and tools can be a nightmare for deployment in a production environment. However, the MLflow tracking server and Model Registry allow logging a wide range of machine learning model flavors, from Spark ML to scikit-learn to SageMaker, and this really makes management across different platforms in the same centralized workspace possible and easy.
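The Delta Lake troubleshooting step described above, loading the specific table version that a past run used, can be sketched as follows. The function name is hypothetical; it takes an existing SparkSession with Delta Lake support and relies on Delta Lake's `versionAsOf` read option.

```python
def load_delta_version(spark, path, version):
    """Read one historical version of a Delta table.

    `spark` is an active SparkSession with Delta Lake available, `path`
    is the table location, and `version` is the Delta version number
    that the run under investigation used.
    """
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )
```

If the version number used by each scoring run is logged as an MLflow parameter, yesterday's input can be reloaded even after today's run has overwritten the table.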

Thank you all for joining our session today, and I would really appreciate any feedback from you all.

About Allison Wu

Thermo Fisher

Allison is a data scientist in the Intelligence Generation team within Thermo Fisher’s Data Science Center of Excellence. The Thermo Fisher Data Science Center of Excellence establishes data science best practices and drives end-to-end data science model development. She graduated from UCSD with a Ph.D. in Bioinformatics and Systems Biology in 2016 and started her data science journey in Global Strategic Pricing in 2018 at Thermo Fisher. Specialized in machine learning, she has developed models in various fields such as imaging analysis, pricing optimization, and customer behavior prediction. Aside from developing machine learning models, her current focus is enabling end-to-end data science pipelines from development and deployment to delivery and management in production environment using technologies such as Mlflow, PySpark, Delta Lake, and Git.