Humana strives to help the communities we serve and our individual members achieve their best health – no small task in the past year! We had the opportunity to rethink our existing operations and reimagine what a collaborative ML platform for hundreds of data scientists might look like. The primary goal of our ML Platform, named FlorenceAI, is to automate and accelerate the delivery lifecycle of data science solutions at scale. In this presentation, we will walk through an end-to-end example of how to build a model at scale on FlorenceAI and deploy it to production. Tools highlighted include Azure Databricks, MLFlow, AppInsights, and Azure Data Factory.
We will employ slides, notebooks and code snippets covering problem framing and design, initial feature selection, model design and experimentation, and a framework of centralized production code to streamline implementation. Hundreds of data scientists now use our feature store, which has tens of thousands of features refreshed on daily and monthly cadences across several years of historical data. We already have dozens of models in production, and we also provide fresh insights daily for our Enterprise Clinical Operating Model. Each day, billions of rows of data are generated to give us timely information.
We already have examples of teams operating orders of magnitude faster and at a scale not within reach using fixed on-premise resources. Given rapid adoption from a dozen pilot users to over 100 MAU in the first 5 months, we will also share some anecdotes about key early wins created by the platform. We want FlorenceAI to enable Humana’s data scientists to focus their efforts where they add the most value so we can continue to deliver high-quality solutions that remain fresh, relevant and fair in an ever changing world.
Speaker 1: Hello, and welcome to the session entitled FlorenceAI: Reinventing Data Science at Humana. My name is Dave Mack, and I’m a cognitive and machine learning principal at Humana in the digital health and analytics organization. Today I’ll be walking you through some key aspects of how easy it is to build a model on the platform I’ve helped create over the past 18 months. I hope that you can learn something new to take back to your organization. Humana’s bold goal is to address the needs of the whole person. This member-centric view is part of our commitment to help our millions of members achieve their best health. A Fortune 50 company, Humana has a breadth of resources available, and we recently invested significant time and money into fighting the COVID-19 pandemic, food insecurity, loneliness and social isolation, and inequities in healthcare, to name a few.
Humana has also made a significant commitment to digital health and analytics, forming a top level organization led by Heather Cox in 2018, so that through advanced analytics, experiential design, data and technology, we’re working to meet our associates, members and the communities we serve, anytime, anywhere, and anyhow. I’ve been privileged to be part of Humana since late 2015, when I started in clinical data science before helping lead Humana’s transition to the Cloud. Our platform now supports hundreds of users, taking advantage of the flexibility that Cloud brings.
FlorenceAI is a cloud platform for automating and accelerating the delivery life cycle of data science solutions at scale in Azure. Its name was inspired by the historic city of Florence, Italy, and all the amazing things that went on during the Renaissance there. While we’re not building innovative contraptions to lift bricks and stones to new heights like was done for the Florence Cathedral’s dome, we do get to build some cool stuff with Databricks, PySpark, MLflow, and Azure Data Factory, while supporting many of the most popular libraries for ML today.
The ecosystem empowers data scientists to solve complex problems, promotes access to open source innovation, and simplifies model consumption with a single interface, all while trying to transform workflows to improve performance. Today, I’ll be touching on many of the key foundational pillars you see listed here.
A model is really nothing without good ingredients, so we have to have some good things to start from, and for that we have our feature store. It has tens of thousands of features available for training and scoring, with hundreds of instances across multiple years, ready to use for data scientists. It’s flexible enough to cover most use cases, but specific enough that custom situations can be handled through self-service operations using the same code. We put a lot of domain expertise into the feature design; everything has been pre-computed for the entire population and is refreshed regularly at monthly and daily cadences. We have extensive metadata ready for the data scientists to understand and learn about the features, how they were created, and what they mean.
We support them through the whole life cycle, from cohort design, to the initial feature selection, and then into the model training experiments, where they cycle through and find the best model. Once that’s done, they score and register that model, and record all of their training artifacts to make sure that their work is reproducible. We then work with them to develop the scoring code and testing needed for their model, before we promote it to production and automate the scoring on a regular basis.
I want to use an example problem today to help trace the workflow through this platform, so that you can see how a model is built today. We’re going to predict the most severe stage of chronic kidney disease in the next six months. This is an important chronic condition because, as members progress through the disease, if they reach the later stages, they can end up on dialysis much sooner and have a lower quality of life. We want to delay that as long as possible, and that’s why this is an important problem to solve. You can see some of the criteria that I used to define the cohort here, and I’ve built up a set of members labeled with the different stages. You can see that we have a pretty imbalanced class set here, which is a common problem in healthcare today. Let’s see how we can build this model on the platform today.
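As a rough illustration of how an imbalanced multi-class cohort like this is often handled, here is a minimal sketch of inverse-frequency class weighting. The stage names and counts below are hypothetical stand-ins, not Humana's actual distribution:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count), so that rare
    stages contribute as much to the loss as the common ones."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical, heavily imbalanced stage labels for illustration only.
labels = ["stage1"] * 800 + ["stage3"] * 150 + ["stage5"] * 50
weights = inverse_frequency_weights(labels)
# The rarest stage gets the largest weight, offsetting the imbalance.
```

These weights can then be passed to most classifiers (for example, as per-sample weights), which is the kind of class weighting referenced later when reviewing the model's heat map.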
So first we’ll start with our initial feature selection and do some basic model training. I’m going to be walking you through a notebook that we have, that’s a template that the data scientists can start from to do this feature selection. Our goal is to get down from the tens of thousands of features we have in the feature store and any custom features they create, down to a few hundred features.
So the initial feature selection template here is a Databricks notebook, and we have some of the initial things installed and the imports made. We also have a few different parameters set up. From there, they can load a combination of feature store data and their custom features. We’ll split it into a train and test set, and then compare the distributions of the two to make sure that they’re similar. After that, we’ll do some basic prep to categorize the features and look at a few basic things to eliminate some features early. So we’ll take a look and exclude any that have a single value or a high percentage of missing values. And then we’ll do some final formatting before we get ready to build the model pipeline. Before Spark 3.0, there wasn’t really an easy way to do the string indexing with all the columns at once, and with tens of thousands of columns potentially, we had to roll our own to make that happen for our users.
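The single-value and high-missing-percentage checks described above can be sketched in pandas. The thresholds here are hypothetical placeholders, not the template's actual defaults:

```python
import pandas as pd

def drop_low_information_columns(df, max_null_frac=0.95, max_top_frac=0.99):
    """Drop columns that are almost entirely null, or where a single value
    dominates, since they carry little signal for feature selection."""
    keep = []
    for col in df.columns:
        if df[col].isna().mean() >= max_null_frac:
            continue  # mostly missing
        non_null = df[col].dropna()
        if non_null.empty:
            continue  # nothing usable at all
        top_frac = non_null.value_counts(normalize=True).iloc[0]
        if top_frac >= max_top_frac:
            continue  # effectively a single value
        keep.append(col)
    return df[keep]

# Toy frame: one constant column, one mostly-null column, one useful one.
df = pd.DataFrame({
    "const": [1] * 100,
    "mostly_null": [None] * 98 + [1, 2],
    "useful": list(range(100)),
})
filtered = drop_low_information_columns(df)
```

In practice the same logic would run as Spark aggregations over the full feature store rather than in pandas, but the cutoffs work the same way.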
After the indexing step, we get to a typical Spark pipeline to do the prep before we get into the feature selection. We index our target, one-hot encode our categorical variables, impute our numeric variables, and then bring them all together into a feature vector. From there, we give data scientists the option to use different aspects of feature selection. I chose a random forest classifier here with 25 trees, and fit that model to my indexed data. After that, we have a helper function to extract those feature importances out of the random forest classifier and sort them in a [inaudible] data frame so that they go from most important to least important.
Then they can display that and walk through those features to see if they make sense. We can then use a pretty low cutoff to get down to hundreds of features for our first round of modeling. The processed and transformed features have slightly different names, so we use this block of code to get back to the original feature descriptions. Then we can save those data sets out and go on to the next step.
From there, we’ll do our first round of modeling using SparkML. We have a simple helper function to make the creation of the MLflow experiment consistent for each user. We’ve also created a separate helper function for the Spark training. We do the prep in a similar fashion to the initial feature selection. And we can see here that we’re inputting a cross-validated estimator and a parameter grid to do the search. This gives us a wide variety of options to parametrize these for any type of cross-validated Spark model.
After we do our experimentation, we’ll arrive at a best model. We can see at the top here that we have the different parameters from that best model, along with its estimator; in my case, a logistic regression came out to be the best model. We then have a different helper function that lets us record a few more detailed things about our model, and it even generates a heat map. This is important because we could have just had a majority-class classifier with a high accuracy, but this heat map shows us that when we used our class weighting, we ended up getting pretty good predictions along the diagonal. We still have a little bit of over-prediction in our majority class, and there was a little bit of a problem in the middle here with some of these mid-range stages.
Let’s take a look at the details of these Spark helper functions to see what’s going on behind the scenes. This experiment utility has a couple of the helper functions I referred to earlier in the notebook: MLflow experiment creation and feature importance extraction. Here’s the SparkML setup with our cross-validated parameter grid search. We have this function here where we start our MLflow run, grab our model, and grab our test predictions. From there, we’ll log the metrics that come out of our test set, and then we’ll also grab our training metrics from each of the cross-validated parameter rounds. We’ll grab those parameters as well, so that we can graph them in a bar graph format to plot the different accuracies for each round. We’ll then save that figure out to the MLflow run, along with the best parameters for the model and the model object itself. We’ll also write that to ADLS for backup.
For the best model, we go through and do similar things here, starting our MLflow run and fitting our model. But then we’ll score it on the full training set and test set, and log those metrics and the best parameters of our model. If this were a binary classifier, we might create a ROC curve, but in this case it’s a multi-class classifier, so we’re going to create that heat map you saw earlier. From there, we’ll log that model out and back it up to ADLS as well.
So with that, we can really encourage reproducibility with this reusable code. We automatically save many items in the MLflow run, and that’s all workspace-scoped and can be seen within that particular user’s workspace. We also ask them to save several artifacts out to help make their model reproducible, like input schemas, training and test sets, and the scores associated with those, along with anything they did outside the pipeline for [inaudible] prep. These can be used for scoring later.
Now, we’re not done yet. We’ve just finished the first round of modeling, and now we want to see if deep neural networks can help us improve the model at all. For those who may not be familiar with deep neural networks: they take similar inputs and features, but then they have a series of hidden layers that work together to explain the relationship between the inputs and the targets. That generates our predictions, and it does so over repeated passes called epochs. These continue to learn each time we go through, making the model better.
There are some extra things we need to think about when using deep neural networks. We may want to use early stopping to minimize our training time, since diminishing returns set in over multiple epochs. We can use callbacks to log things after each epoch, and we can also start on smaller chunks of data to make sure it’s working the way we want, and scale up later once we refine our parameters.
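Conceptually, early stopping is just a patience counter over validation loss; the sketch below mirrors what a Keras EarlyStopping callback does, without any framework dependency. The loss sequences are made-up examples:

```python
def early_stop_epoch(epoch_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would halt: we stop once
    validation loss has failed to improve for `patience` epochs."""
    best = float("inf")
    since_improved = 0
    for epoch, loss in enumerate(epoch_losses):
        if loss < best - min_delta:
            best = loss
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= patience:
                return epoch  # diminishing returns: quit here
    return len(epoch_losses) - 1  # never triggered; ran all epochs

# Improvement stalls after epoch 2, so training stops at epoch 5.
stopped_at = early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6])
```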
We can do this initial searching with a tool called Hyperopt. It allows us to define a parameter space that’s pretty generous for us to search, and we have a couple of different ways to go about that search. We can do one at a time with a full [inaudible], or do full random, or something in between. In my case, we’re going to do the full [inaudible] to learn from each round. So you can see we did 20 trials of our model, and the models improved each time. But we probably want to look at those to get a sense of what kinds of models are working best, since we’re using a sample of our training data.
Thankfully, MLflow has a handy parallel coordinates plot, where we can take each of our parameters and graph them against the best validation loss for each round. In this case, highlighting the first layer, we can see that a complex set of nodes works well there. But if we look at layer two, a complex set of nodes doesn’t work very well for our model, while a simpler set of nodes does work pretty well across a lot of runs. We also see that a low learning rate and a high batch size are important.
Now that we’ve done some initial exploration on the driver, we see that we were running pretty fast for each epoch, but we were only using a small fraction of our data. If we scale up to 10 times the amount of data using Petastorm, we’re running at about 10 times the length of time, and that can take a long time. So let’s bring in Horovod to help. Horovod allows us to use all of the data, and with 16 workers here, we can get back down to a pretty similar time per epoch as we saw before, helping us train on all the data much more quickly. It’s four times faster than just using Petastorm alone.
Let’s take a look at how we can set up Petastorm and Horovod to run in a template. For this template, we have all the imports here, and we have a similar setup to our initial feature selection and SparkML training. We create a train and validation set and compare those distributions, and then do a lot of the same feature prep that we used in Spark. In fact, we use that SparkML pipeline, and we’ll transform the data with it after creating our MLflow experiment. From there, we can separately train the rest of our model more easily. We’ll apply our class weights and do a little more feature prep by encoding our target, so that we have data in the right format for TensorFlow. We’ll also set up our Petastorm files, and then we’ll create our actual neural network model here, with a couple of hidden layers and a couple of other parameters, and the user can customize that to their needs.
From here, we have our helper function to set up our multi-class classification with TensorFlow and Horovod. You can see that I’ve chosen parameters from what we learned before, where we have a complex first layer and a simpler second layer. We have a small learning rate and a bigger batch size. They can also manipulate these pretty easily for each run. Let’s take a look at what this helper function does for us here. I’m going to jump down here.
And this helper function is set up similarly to what we saw before. We’ll have our MLflow run start, and we’ll log some basic parameters for our hyperparameters. From there, we’ll use the HorovodRunner to train our model. Let’s jump down here to see how that works. Here we have to bring in some imports due to some serialization requirements. Then you can see we’re converting our data into the format that we need for TensorFlow to work. We’re also scaling our features and using the encoded label here, so that we can get that data set up to run. We’ll add in the early stopping and our optimizer, and then here we’ll add some metrics for multi-class classification, like precision and recall, before we compile them.
After that, we’ll go through and create our checkpoints and then our callbacks. Our callbacks are important because they allow us to log things after each round. It’s important to set this to happen on only one of the workers, because otherwise it’s going to be logged 16 times, and we don’t want that duplication. We’ll then go through and fit our model, and you can see we’re only writing out intermediate information once per epoch. We’ll also save our checkpoint weights only once, and copy that model over to DBFS.
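The "only on one worker" point is essentially a rank-0 guard (in Horovod, `if hvd.rank() == 0`). Here is a pure-Python simulation of that pattern, with no Horovod dependency; the logger class and the toy metric are illustrative stand-ins:

```python
class MetricLogger:
    """Stand-in for an MLflow-logging Keras callback."""
    def __init__(self):
        self.logged = []

    def on_epoch_end(self, epoch, metrics):
        self.logged.append((epoch, metrics))

def run_epochs(n_epochs, n_workers):
    """Simulate a distributed training loop: every worker runs the same
    epochs, but only rank 0 gets a logger attached, so metrics aren't
    recorded once per worker (16 times here)."""
    loggers = []
    for rank in range(n_workers):
        logger = MetricLogger() if rank == 0 else None
        loggers.append(logger)
        for epoch in range(n_epochs):
            metrics = {"loss": 1.0 / (epoch + 1)}  # toy metric
            if logger is not None:  # the rank-0 guard
                logger.on_epoch_end(epoch, metrics)
    return loggers

loggers = run_epochs(n_epochs=4, n_workers=16)
```

The same guard applies to checkpointing: only rank 0 writes weights, and the model is copied to DBFS once.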
Jumping back up, you can see that after we fit our model, we can go through and grab that metrics history and log all of our individual metrics from each epoch. That allows us to generate graphs pretty easily in MLflow, to look at our learning curve or other things you’d want to see from the model training. We’ll then log a few other things, like the model structure and the model object itself.
Switching back to the other template here, you can see that we’ll continue on after the run is completed, read that model back in to score it, and then get the predictions and our metrics. We can even generate that heat map we saw earlier and apply it back to our run after the fact. This will save a lot of headaches and get people who maybe aren’t as familiar with PySpark and deep learning started much more easily.
After we go through this exercise, you can see that we actually improved our model just a little bit. We haven’t changed the F1 score, but you can see in the two highlighted areas that we’re not over-predicting the majority class as much, and we’re seeing a lot better precision in some of the later stages. So that gives us more confidence that we’ve got a better model. Okay, well now we’re done, right? The data scientist’s work is all done, and we can move on to the next project. Well, wait a second. Maybe not.
We have to register, score, and preserve the model to make it repeatable, and also help with deploying it to production. The nice thing about MLflow is that we can score the model using a Spark UDF and read it in that way. That way, we can bring in a PySpark data frame and score it pretty easily with just a few lines of code that you see here. Because our team is huge, we have many Databricks workspaces, and we need to be able to set that up properly. So the dev work is done in the data scientist’s workspace, and they register the model there. They’ll promote it to production status after that, when it’s ready for review. From there, we’ll take the MLflow run associated with that registered model and use it to register it in our production workspace, where we hold all of our automated jobs together in one place. This gives us a single registry that can be used for all of our data scientists. We then use that as the official version to do our scoring.
We also have a helper function that allows them to generate a markdown set of descriptions of the metadata for each of the models. We’ll include the version and the path to the model so that we can have multiple versions running at the same time. To centralize things and make them more reusable, we’ve set up a simple framework in ADF to deploy these models. We have three different notebooks: a feature engineering notebook, a scoring notebook, and a score validation notebook.
We also check the upstream dependencies here, to see if any other processes need to complete before these even start. This prevents the flow of bad data, and errors from missing data, from happening. We’ll also log any failures or successes of this process using a SQL server. By having the data scientists commit their model to a [inaudible], and allowing us to use ADO to deploy these models, we can have them in various environments and have a lot more control over the versions and the parameters we input into them, which allows them to run on different types of data.
The partnership between data scientists and AI engineers is critical. All models are peer reviewed for both domain and technical accuracy prior to production deployment. We have the teams inform us of new models so we can schedule time with them to make sure they’re on the right trajectory. As they work to prepare their model for production, we have a checklist for them to complete before deployment, and then we’ll work with them to create the pull request and put that into the repo. From there, the engineers will help deploy it through the pipelines and make sure everything works before it goes into production. Each model is also initially reviewed and subsequently monitored by [inaudible]. We’re the first major health insurance [inaudible], and this is an important aspect of our monitoring in production.
I’ll document a few early wins from our time on the platform so far. We scaled and automated a clunky on-prem manual process that involved creating 40 different condition flags. We now create over three times as many flags in the cloud, and have contributors from multiple teams following our templates. We now update over a billion rows daily, in just an hour and a half, for our entire member population. Another team experienced faster prep, more iterations, and better tuning and collaboration. We were able to reduce a feature engineering step from hours to minutes, which enabled their data science team to iterate on models faster, with one run that took five hours on-prem now taking half an hour or less for complex models. We also reduced the scoring step on prospective members from a week to 30 minutes. This touched nearly every aspect of the data science process, from feature engineering to training and scoring.
We also have a lot of shared resources that accelerate everyone. Hundreds of shared features mean less process duplication and more time to improve the model. We also have the flexibility to score at scale using those UDFs, our [inaudible] UDFs, regardless of the algorithm package.
Thanks so much for listening. I hope you’ve learned something new that you can take back to improve model training and scoring at your organization, like we’ve done at Humana. Thanks so much, I appreciate your time, and I hope you have time to review the session and enjoy the rest of the conference.
David is a Cognitive and Machine Learning Principal within Humana’s Digital Health and Analytics organization. He joined Humana in late 2015 and spent his first few years focused on solving business...