With data as a valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on the shoulders of both point in time snapshots of the data and a governing mechanism to regulate, track, and make best use of associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions as well as posing solutions built on open source technologies – namely Delta Lake for data versioning and MLflow for efficiency and governance.
Mary Gracy Moes…: Good morning, good afternoon or good evening, depending on when and where you’re watching this. My name is Mary Grace Moesta, I’m here with my peer Gray Gwizdz and we’re here to discuss reproducible machine learning, specifically with MLflow. So, as I mentioned, my name is Mary Grace Moesta, I’m a data science consultant here at Databricks. I have been at Databricks for about two and a half years now. I have a bachelor’s in mathematics from a school called Xavier University and I’ve been working in the data science and Spark ecosystem for well over three years now. Most of my previous work at Databricks and beyond has been focused on the retail and CPG space. I’ve also had the opportunity to work on a couple of Databricks Labs projects, namely the AutoML Toolkit. I’m also based in Detroit, Michigan, so I’m a big Detroit sports fan and go Tigers.
Gray Gwizdz: And hi everyone, my name is Gray Gwizdz, I’m a resident solutions architect with Databricks. I received my master’s in computer science from Grand Valley State University, with an emphasis in biomedical informatics and distributed computing. From there, I went to Ford Motor Company, had a variety of roles, most recently ending in the big data space as a Hadoop System Architect, and I’m based out of Chicago, Illinois.
So, as part of today’s agenda, we’re going to be discussing the reproducibility crisis, specifically reproducibility in machine learning, some of the technical components that we use to help build out this process and then, some of the approaches and underlying tooling. So, this seminal research paper was kind of the one that kicked off the reproducibility crisis. It was in 2005 that John Ioannidis published this paper, talking about why most published research findings are false. Quickly scanning it, you might see that it talks about confidence intervals, p-values less than 0.05 and applicability to different fields but actually, there are much wider findings in here. And he really made the bold claim that it can be proven that most research findings are false.
So, as part of this paper, he actually published six corollaries for false research findings and I actually just pulled out three of these, that kind of made sense for the machine learning space. So, the first one that made sense to me was, the smaller the studies conducted in a scientific field, the less likely the research findings are to be true. And the specific example that Ioannidis gave was randomized controlled trials in cardiology, where you’ll have thousands of people participating in these studies, compared to studies with molecular predictors, where you’ll usually need someone with a fairly unique genome as part of that study and the sample sizes are a hundredfold smaller. So, this one’s really almost a discussion around sample sizing and making sure that you have the appropriate data size.
But, I think, one of the things that’s really important here is that, the data is as important as the analysis itself. If you don’t have really good, high quality, production quality data, ready for machine learning, that you use to create your machine learning models, it’s kind of one of those garbage in, garbage out scenarios. Your machine learning algorithms can only be as accurate as the data that was used to create them. The second one is, the greater the flexibility in designs, definitions, outcomes and analytical modes in a scientific field, the less likely the research findings are to be true. And the example that Ioannidis gave here was kind of comparing two different medical studies. One is when you have a finding that is absolutely unequivocal and universally agreed, death, a classic binary classification, where you are either living or dead and there really is not a lot of wiggle room in between.
And the second one is really around schizophrenia, where you’re going to have different scales of schizophrenia and even discussion around what those scales mean. So, when trying to build a model to predict some of these things, you have to really be sure about what you’re trying to predict. And what this states is, if you can’t explain the analysis, you can’t trust the results. So, not even having a really accurate predictor, I think this even speaks to the machine learning models that we create, if I have some type of crazy neural network that I can’t even explain in a PowerPoint or walk people through, how can I be so certain that it doesn’t have bias, that it’s doing the right things, that I can actually rely on it for critical business function?
Now, the last one I wanted to talk about was kind of related to financial and other interests and prejudices in the scientific field. So, what it states here is just, the greater the financial interests, the less likely the research findings are to be true. Ioannidis here actually kind of blasted biomedical research and noted that conflicts of interest are inadequately and sparsely reported. I’m not quite sure I’m willing to go that far but I was able to pull up two historical examples. On the left side here you’ll see a doctor recommending a cigarette and here in 2021, I think, we’re all pretty aware that smoking is not something that’s really a healthy thing to do but there is more nuance in this one. And here on the right, you’re going to see Elizabeth Holmes, she was the chief executive officer of Theranos, and Theranos had a big vision for very simple finger-prick blood tests that were really, really accurate and were quick to analyze.
There were some challenges around reproducibility; she was challenged by the medical community because her studies had not been peer reviewed. And once they had been reviewed, it was obvious that some of the claims that she was making on the technical side, her testing, along with the financial side, how much money she said that her company was making, really weren’t adding up. And there’s a lot at stake in this one because it affected not only the patients pricking their fingers and hoping for accurate results, and the insurance companies, but the investors too. If we look even further back historically to the cigarette example, people really have been affected by this and it’s very simple to say that, unverified scientific claims have real life consequences here.
I really like this slide, the scientific method demands reproducibility. So, in the bottom right corner, we see the scientific method. You start off with an observation or question, research that area, come up with a hypothesis, test it with an experiment, analyze the data and then report conclusions. And this is a cyclical process. So, if we don’t have data available, we’re only going off of conclusions. It’s almost as if five sixths of the wheel is in motion but the data that’s providing all of this stuff is unavailable. We could be building conclusions off of false data and not really doing meaningful scientific work. And next, what I’m going to do is kind of pass it to Mary Grace, to talk about machine learning specifically.
Mary Gracy Moes…: So, continuing to root a lot of this in the ML and AI space, we see that reproducibility is actually a systematic problem. So, the 2020 State of AI Report stated that, only 15% of papers published in the data and AI space were published with code. And while that number itself, that 15% number, is alarming, there was no statistic published about any supplementary or required data that was published along with that code. Which means that, it’s incredibly difficult to reproduce the results that are presented within these papers, which means that we’re kind of left to believe what’s written on the page, rather than validating for ourselves across our own experiments. And so, there’s actually an effort to address this, there’s a website called paperswithoutcode.com, which is a place where you can submit a paper that isn’t published with code and the website and the folks behind the website will actually reach out to the author to give them an opportunity to respond, to present the code and required data, to reproduce the results in the paper, in order to validate.
And so, now let’s talk about the required components, right? We’ve kind of talked about how this is a systematic problem, this is a challenging problem, so let’s start to talk about the required components and then we can move towards a proposed solution. So, there’s really four main required components for reproducibility in machine learning applications. And first, and I think one of the most commonly forgotten, is the data itself. So, we live in a world where data is constantly changing, it could be updated daily, weekly, hourly, whatever the cadence is, there’s always more data that’s being appended, inserted or corrected in the different datasets that we’re all working on. And so, while this provides us a lot of really rich information to give our models to learn from, if we have changing data, that also means we have changing results. Which sometimes is intentional but sometimes it’s not. And so, we need to account for the data to be reproducible, in order to reproduce the model itself.
Now, the second component is the code. And I always think of this as the recipe and required ingredients and steps needed to reproduce a given model. And so, as we iterate on our code, the code itself is versioned, which needs to be tracked but also, code refers to the main point of entry or reference for the required pre-processing logic, feature engineering steps and the model itself. And so, we need to be able to recreate this recipe, per se, as well as the exact version of the recipe that was used, to generate a given model. So, the code is our required component and then, the accompanying environment as well and this is the environment that your code requires to run. And this goes all the way down to the specific library versions that are required for a certain set of models or hyperparameters to execute in the same exact way.
I think, oftentimes, we hear the phrase, oh, it runs on my machine, which is great but when it comes to reproducing that, you have to be able to easily mirror the environment that the model was created in, into a secondary environment to reproduce the results. But it’s also important when we think about the promotion of a model from a development stage to production stage, if we’ve selected the model, we’ve built it, we’re ready to push it to prod, in order to push it to prod and deploy that model, there has to be an exact mirroring or oftentimes, nearly exact mirroring of that dev environment in production. And so, having a reproducible environment, not only supports the ability to reproduce and validate results but it also will make life a lot easier, when you think about the promotion of your model from a dev, to staging to prod environment.
And the last piece is compute. And this answers the question of, what hardware was used to train my model? And so, this is really an important piece, especially as we start to think about using distributed algorithms, as well as GPUs themselves, to ensure that we can exactly reproduce the underlying partitions of the data required to generate the given metrics for a certain model that we’ve created. And while I’ve listed the four main components here, I think it’s also important to note, there are some things that we cannot control. So, for example, the non-deterministic nature of GPUs, we know that exists, we know we can’t control it and that’s a variable that we just have to acknowledge and account for when we’re building these solutions. And so, we’ll cover the four main components through the rest of this talk but we’ll focus on the things that we can control because there’s a whole lot of things across data, code, environment and compute that we can make sure we control and log properly, so that we can reproduce things to the best and fullest extent.
So, let’s actually dive into each one of these components, one by one. And so, we’ll start with the data itself. And so, we’ve got kind of a pseudo timeline here of a model being built. So, you start with the initial model version, so model version zero, which is built on some initial snapshot of data. And as you iterate through model version one, two and three, you see that you’re changing things, like hyperparameters, you’re testing out different algorithms and re-tuning the hyperparameters based on different algorithms, et cetera. However, what we can also see in this timeline is that, the data is being updated too, right? So, say we have a nightly batch job that runs to update our data overnight, this means that we’re changing two variables in our system. If we go back to the basics of the scientific method, if we want to create a valid experiment, we need to have valid controls.
And so, in this scenario here, not only are we changing things that pertain to the model that we’re intentionally trying to test but the data itself is also changing. Which means, we’re changing two variables and so, how can we be sure that the change in our model performance is due to the changes that we are making to the model itself? Whether it be through hyperparameters, algorithms, et cetera. Or how can we attribute that change to maybe being the change in the underlying data? And so, in order to address this, we’re actually going to leverage two pieces of open-source technology to control for this. And so, the first is Delta Lake. And if you have been attending any of the other sessions at Data + AI Summit, I’m sure you’ve heard the term Delta probably half a million times by now. But if you’re not familiar with it, it’s an open-source transactional layer that sits on top of your data lake.
And so, it’s designed to add reliability, quality and performance to your data lake, as well as taking the best of both worlds from the data warehousing space, as well as the data lake space and combining them together. Additionally, it’s based on an open-source technology. So, Delta Lake, it’s open-source but it’s also based on an open-source format, namely Parquet. So, while you’re able to get these features like, reliability, quality, data quality and performance, there is no concern for vendor lock-in, which makes it a really, really robust solution for reproducibility, in our data itself for machine learning applications.
And then, the second piece of open-source technology that we’ll use to address these four components of reproducibility is MLflow. And, again, MLflow is another open-source library that’s designed to support the machine learning life cycle. It’s framework agnostic, so if you’re using scikit-learn, Keras, Spark ML, whatever it may be, you can use MLflow to support those varying frameworks. And MLflow consists of four different components, there’s MLflow tracking, MLflow projects, models and model registry. Now, for the context of this talk, we’ll mainly be focusing on tracking and projects. And so, MLflow tracking is essentially a logging API, it’s a place where you can log your code, your experiments, your data, your hyperparameters associated with a given model that you’ve created. And then, an MLflow project is the way that MLflow actually packages up all of the required components of your data science code, to be able to run it on any machine, whether it’s remote or local.
And so, now that we’ve kind of talked about the technologies we’ll use to address these problems, let’s actually talk about how we address them. So, going back to that issue of data consistency. In order to ensure that we are actually testing for the changes in model hyperparameters and model families themselves, we need to create a valid control, right? Which is what we talked about a couple slides back. And in order to create a valid control, we have to have that consistency in data. And so, the first way you can do that is, you can actually write out your training and test sets to a persistent location. So, you’ll split your data once, write out a training and test set to a persistent location and continually read in those training and test sets day over day, throughout the entire model development cycle. That way, you can ensure that, regardless of any changes to your data or changes to your cluster config, the training data and test data remain constant.
Additionally, Delta has this feature called time travel, which allows you to specify a specific data version and keep that data version fixed. And so, you can say, I want to use version number two of my data to split into training and test sets, write that out to a persistent location and then, you can continually read in that training and test set, over and over again, to ensure, again, that you have a valid control created by using consistent data. Additionally too, you can log this information, so this version number and the resulting path to your training and test set, to the MLflow tracking server, to that logging API, to not only account for the fixing of your data but also to track and log, hey, here’s where you can find it, here’s how somebody could recreate it in the future, if they needed to.
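As a rough illustration of that pattern, here is a minimal PySpark sketch. It assumes an existing SparkSession with Delta Lake available; the function names, the paths, the 80/20 ratio and the seed are all hypothetical choices for illustration and were not shown in the talk.

```python
def snapshot_reader_options(version=None, timestamp=None):
    """Return the Delta reader option that pins one table snapshot."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of `version` or `timestamp`")
    if version is not None:
        return {"versionAsOf": str(version)}
    return {"timestampAsOf": str(timestamp)}


def write_fixed_split(spark, table_path, out_dir, version, seed=42):
    """Split one pinned snapshot once and persist both halves.

    `spark` is assumed to be a SparkSession with Delta Lake configured.
    """
    reader = spark.read.format("delta")
    for key, value in snapshot_reader_options(version=version).items():
        reader = reader.option(key, value)
    snapshot = reader.load(table_path)

    # Split once; every later training run re-reads the persisted halves.
    train, test = snapshot.randomSplit([0.8, 0.2], seed=seed)
    train_path, test_path = f"{out_dir}/train", f"{out_dir}/test"
    train.write.format("delta").mode("overwrite").save(train_path)
    test.write.format("delta").mode("overwrite").save(test_path)

    # Lineage worth logging next to the model (for example as MLflow params).
    return {"data_version": version, "train_path": train_path, "test_path": test_path}
```

The returned dictionary is the "here’s where you can find it" information: the pinned version plus the two paths. Persisting the split, rather than re-splitting with the same seed each day, is the point of the pattern, since it keeps the rows fixed even if the cluster layout changes.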
And so, now that we’ve addressed the data piece, let’s move on and start to address the code piece. So, if we consider the same scenario, that we had gone through before the same timeline, we talked about how the data is changing, right? But also, within this timeline, the code itself is changing. As we test different hyperparameters, as we change to model families, the code itself is being changed as well. And so, in order to account for the code changing, we can leverage the pieces of MLflow to help do that. So, the first way to ensure reproducibility within your code is, to ensure that your code is organized into different pipelines. And so, normally, we’d see this be a feature engineering pipeline, a training pipeline and then an inference pipeline. That way, you can reproduce each of these steps and isolate them to ensure that you’re getting consistent results over the different pipelines.
Secondly, MLflow tracking, that logging API, allows you to track code versions as well. And this looks a little bit different depending on how and where you’re running MLflow. If you’re running MLflow in Databricks or PyCharm, it will actually automatically log the exact version of the code that was used to generate a given model. If you’re using something like Jupyter or VS Code, you can still make use of the MLflow logging API to log code versions but you’d have to combine it with something like a git version control tool and log something like the git hash to ensure that you’re consistently tracking the version of code that was used to generate a given model. So, using MLflow in combination with traditional version control, you’re able to recover the exact code required to reproduce the model itself. Now, I’ll turn it over to Gray to talk a little bit more about reproducing environments and the compute.
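For the Jupyter or VS Code case, logging the git hash by hand could look something like the sketch below. The helper names and the `git_commit` tag name are assumptions made for illustration; only `git rev-parse HEAD` and `mlflow.set_tag` are existing commands and calls.

```python
import subprocess


def current_git_commit(repo_dir="."):
    """Return the checked-out commit hash, or None outside a git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory does not exist, or it is not a repo
        return None
    return result.stdout.strip()


def tag_active_run_with_commit(repo_dir="."):
    """Attach the commit hash to the active MLflow run as a tag."""
    import mlflow  # imported lazily so the helper above needs only the stdlib

    commit = current_git_commit(repo_dir)
    if commit is not None:
        mlflow.set_tag("git_commit", commit)  # tag name is an arbitrary choice
    return commit
```

A hash only identifies the code if the working tree is clean, so committing before each training run is part of the discipline this assumes.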
Gray Gwizdz: So, next let’s look at the environment that we use. You’ll see kind of a smattering of questions here. What operating system did I use? What version of Python, and which libraries inside of Python? What version of Pandas? Even library version combinations. Which version of scikit-learn was used with Pandas? And even some of the underlying environment itself, lots of times with these libraries, you can set an environment variable or configuration and that will actually affect how some of these libraries will do their execution. So, how do we track all of these different elements, as part of reproducing machine learning? MLflow has environments built into it. So, just along the right side here, you’re going to see a specification for the environment itself. Along the top, you’re going to see which URLs it will search for these packages, so you aren’t getting different versions from different locations.
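The slide itself is not reproduced in this transcript. As a stand-in, here is a hedged sketch of such a specification written as a Python dictionary in the Conda environment layout (channels, then dependencies, then a nested pip section). The environment name and the PySpark pin are hypothetical; the Python pin is simply taken from whatever interpreter runs the code.

```python
import platform


def build_conda_env(pyspark_version="3.1.1"):
    """Describe the training environment as a Conda-style dictionary."""
    return {
        "name": "repro-ml-env",    # hypothetical environment name
        "channels": ["defaults"],  # where Conda is allowed to look for packages
        "dependencies": [
            f"python={platform.python_version()}",  # pin the running interpreter
            f"pyspark={pyspark_version}",           # illustrative pin only
            "pip",
            {"pip": ["mlflow"]},                    # packages pulled from PyPI
        ],
    }


def pip_requirements(env):
    """Collect the PyPI packages listed in the environment's pip section."""
    requirements = []
    for dependency in env["dependencies"]:
        if isinstance(dependency, dict):
            requirements.extend(dependency.get("pip", []))
    return requirements
```

The same structure is what a `conda.yaml` file holds once serialized, so it can be versioned next to the training code.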
From here, under dependencies, you’ll see that Python is named as a particular version, PySpark is named as a particular version. And even further than that, you can pull in dependencies from pypi.org, using pip and you’ll see that MLflow is listed as a dependency here. And then, the last portion here is actually the compute, the actual horsepower that was used to create our model. And as a thought exercise, I have kind of a diagram that walks through how training and test data gets split up. So, I have my full data set here, maybe I do a 70/30 split or a 90/10 split, regardless, I have 30 rows that are part of my training set here. Because I’m in a distributed environment, I have three workers that are part of my cluster, maybe 10 rows go to this worker, 10 rows go to this worker, 10 rows go to this worker. In parallel, they run the statistics, run the machine learning model and then send the results to a master, where they are combined.
Lots of times, this works pretty well. In this example, you’ll see the sum operation, and sum is actually something that you can do in a distributed fashion. But machine learning models, which may be taking hyperparameters, may actually be trying to create some priors, some predicates, things like that. Let’s imagine that we’ve removed one of these nodes from our distributed compute engine, so instead of each of these workers having 10 rows, the top worker has 15, the second worker has 15 and from there, the results that are generated inside of each of these workers are slightly different. Which means that, the data that is loaded to our master is slightly different, meaning that our model is slightly different. So, how do we track this? Luckily, there are tons of tools out there, as part of the cloud computing world, that make this really, really easy.
If you’re part of the Azure ecosystem, ARM templates can be used to define an Azure resource with given settings, which will then provision the appropriate resource underneath the covers. If you’re part of the Google Cloud Platform, it’s Cloud Deployment Manager; for AWS it would be CloudFormation. Terraform’s another really interesting piece of open-source technology that allows you to provision computing infrastructure in a deterministic way. So, Terraform is another one of those things where you could define your computing infrastructure in a Terraform file and use that to provision it. And, of course, Databricks itself has all of your cluster configurations saved in a JSON file, that could be versioned and saved to MLflow and further than that, inside of the Databricks platform, there is a Reproduce Run button that will automatically track all of these things for you. So, what does this actually look like in practice? Let’s send it back to Mary Grace for the first couple of components.
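The thought exercise can be made concrete with a small, standard-library-only sketch. Averaging per-worker medians is used here purely as a stand-in for any step that is not additive across partitions; it is an assumption for illustration, not what a particular distributed algorithm does.

```python
import statistics


def partition(rows, n_workers):
    """Split rows into contiguous, near-equal chunks, one per worker."""
    size = -(-len(rows) // n_workers)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def distributed_sum(rows, n_workers):
    """Sum of per-worker sums: does not depend on the partitioning."""
    return sum(sum(chunk) for chunk in partition(rows, n_workers))


def averaged_local_medians(rows, n_workers):
    """Average of per-worker medians: depends on the partitioning."""
    local = [statistics.median(chunk) for chunk in partition(rows, n_workers)]
    return statistics.mean(local)


rows = [i * i for i in range(30)]  # 30 training rows, as in the diagram

# Same answer with three workers (10 rows each) or two workers (15 rows each).
print(distributed_sum(rows, 3), distributed_sum(rows, 2))

# Different answers once the per-worker result depends on which rows it saw.
print(averaged_local_medians(rows, 3), averaged_local_medians(rows, 2))
```

With the sum, dropping a worker changes nothing; with the per-worker statistic, the combined result shifts even though the data and the code are identical. That is why the cluster shape belongs in the record of a training run.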
Mary Gracy Moes…: Right. So, say we are building a model and we’ll use a couple of data sources, we’ll use one from Kaggle, to see if we can predict job changes among data scientists, amidst the COVID-19 pandemic. So, we really want to be able to answer the question, is there any relationship between a data scientist changing jobs and the number of COVID-19 cases in their given city? And so, let’s start with the data. So, the Kaggle data is pretty stable, it’s not changing too much but if we think about the COVID-19 data, the CDC is reporting numbers on COVID-19 daily, across the country. And so, that could be inserting new rows from the day before, or we’ve seen how the CDC can correct records from previous days, depending on corrections in case counting, et cetera. But the moral of the story is that our data is changing daily.
And so, we can leverage Delta Lake for this, by specifying a specific version number for our COVID data. So, in this case, say, we want to use version one of our COVID data. And when we read in our COVID-19 Delta table, all we have to do is specify this versionAsOf option, pass it a specific version number and then, we can go ahead and split our training and test data and write it out to a persistent location. Now, here we’re pinning the data based on a version number, you also have the ability to pin it based on a timestamp, so you could select via a version number or a given timestamp. So, now we have our fixed data, let’s go ahead and start training our model and build and develop a model and keep track of all of the hyperparameter configurations with MLflow tracking. And so, again, because MLflow tracking is this logging API, we can keep track of all the different hyperparameters, metrics and resulting artifacts associated with different runs of our data science code.
And so, this code [inaudible] here just shows how to actually call the MLflow APIs, to say, log a parameter, log metrics and log the model itself. And, again, this is all going to be logged to the MLflow tracking server. Now, one of the nice parts of this as well is that, it also limits unintentional run repetition. So, because all of this is logged through tracking server, because you can easily compare things, run over run, you can see what you’ve already tried, what types of models you’ve already tried, what types of hyperparameters you’ve already tried. That way, you’re not repeating runs of really similar models and hyperparameter configs that you’ve already tried. And so, not only does tracking everything to MLflow provide a lineage of, hey, here’s the way that all these different experiments were executed, it also can help increase the efficiency of your actual development process, so you’re not repeating steps over and over and not wasting compute, essentially. And so, I’ll toss it back to Gray to talk a little bit more about how to manage the environments and compute and then, give an example.
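The snippet from the slide is not reproduced in this transcript; a minimal sketch of the same idea follows. `mlflow.start_run`, `mlflow.log_param` and `mlflow.log_metric` are existing MLflow calls. Everything else is hypothetical: the function name, the parameter and metric names, the path, and the `InMemoryTracker` class, which exists only so the sketch can run without MLflow installed.

```python
import contextlib


class InMemoryTracker:
    """Stand-in exposing the three MLflow calls used below, for dry runs."""

    def __init__(self):
        self.params, self.metrics = {}, {}

    @contextlib.contextmanager
    def start_run(self):
        yield self

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics[key] = value


def log_training_run(params, metrics, data_info, tracker=None):
    """Record hyperparameters, metrics and data lineage for one run."""
    if tracker is None:
        import mlflow as tracker  # real tracking; needs `pip install mlflow`
    with tracker.start_run():
        for key, value in {**data_info, **params}.items():
            tracker.log_param(key, value)
        for key, value in metrics.items():
            tracker.log_metric(key, value)


dry_run = InMemoryTracker()
log_training_run(
    params={"max_depth": 5},
    metrics={"auc": 0.5},  # placeholder number, not a real result
    data_info={"data_version": 1, "train_path": "/tmp/covid_jobs/train"},
    tracker=dry_run,
)
```

Leaving `tracker` unset sends the same values to whichever MLflow tracking server is configured. Logging the model itself depends on the framework; for scikit-learn the call is `mlflow.sklearn.log_model(model, "model")` inside the same run.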
Gray Gwizdz: All right. And so, for environments, reproducing environments is something that’s pretty easy to do with MLflow. As part of walking through, I showed you the Conda YAML file and how it tracked different library versions and dependencies and where they were actually installed from. One of the things that’s really nice is, MLflow supports three different project based environments. There is the Conda-based environment that I’ve talked a little bit about. You can also bundle everything as a Docker container. So, if you want to integrate everything, from operating system version to installed libraries, even things that you can’t source from publicly available repositories, using a Docker container is a really clever approach for you to be able to do this. And then, from there, you can obviously just take what is available as part of your current system. So, perhaps, you have a testing environment that’s been set up a very, very specific way.
You might not have Docker or Conda available, so you can just say, I’m running in the same environment, everything will be the same. And what’s really nice is that, this can be kind of containerized, built out and whether you’re doing real time serving or batch inference, you’ll be able to ensure that those same dependencies are available. And then, the last component here is the compute. As I mentioned before, managed MLflow does have a Reproduce Run button up here, that allows you to bring up the compute, run the same code and use the hyperparameters that were run before, with a single click. And then, from there, along with this, Databricks clusters, as I had noted, have a JSON configuration. So, for my machine learning cluster here, you’ll see that I had some auto scaling going from four to eight workers and a whole bunch of other attributes along with PySpark versioning here.
So, to kind of bring all of these things together, whether it comes to data, code, environment or compute, some of the features of Delta Lake, ACID compliance, performance and reliability, the ability to specifically time travel and say, I would like to look at this table as of version 500 or what did this table look like last Tuesday? Really, really help with data reproducibility. When it comes to code, MLflow tracks the code that you wrote, it tracks the hyperparameters that were input to the code, it tracks the metrics associated with the models that you created for easy comparability. Environments, we walked through the couple of different flavors, Conda flavors, Docker flavors, current system environment. And then, from there, compute, you can easily track using Terraform, version that with your code, input it to MLflow and really bring all these components together, to ensure that your machine learning models are reproducible. So, with that, I’d like to say, thank you very much for attending this session. We are available for further questions and answers, so please drop them in the chat and otherwise, thank you very much for your attention.
Gray is a member of the Resident Solutions Architect team at Databricks. He works directly with Databricks’ strategic customers to build big data cloud architectures, Spark solutions and everything else in ...
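To make the "versioned cluster configuration" idea concrete, here is a small sketch that serializes a cluster specification in a canonical form and fingerprints it. The field names are modeled on the Databricks cluster JSON and should be checked against an actual export; the values are placeholders apart from the four-to-eight autoscaling range mentioned above, and the helper names are hypothetical.

```python
import hashlib
import json

# Field names modeled on the Databricks cluster JSON; values are placeholders.
cluster_spec = {
    "cluster_name": "ml-training",
    "spark_version": "<runtime-version>",
    "node_type_id": "<node-type>",
    "autoscale": {"min_workers": 4, "max_workers": 8},
}


def canonical_json(spec):
    """Serialize with sorted keys so equal specs produce identical text."""
    return json.dumps(spec, sort_keys=True, indent=2)


def spec_fingerprint(spec):
    """Short hash of the canonical form, usable as a tag on a training run."""
    digest = hashlib.sha256(canonical_json(spec).encode("utf-8"))
    return digest.hexdigest()[:12]
```

Committing the canonical JSON next to the training code, or attaching it to the run as an artifact, records which compute produced a model. Note that with autoscaling the worker count can still vary between runs, so a fixed worker count is the stricter choice when exact partitioning matters.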
Mary Grace Moesta is currently a Data Science Consultant at Databricks working with our commercial and mid market customers. As a former data scientist, she worked with Apache Spark on projects focuse...