To get good results from Machine Learning (ML) models, data scientists almost always tune hyperparameters—learning rate, regularization, etc. This tuning can be critical for performance and accuracy, but it is also routine and laborious to do manually. This talk discusses automation for tuning, scaling via Apache Spark, and best practices for tuning workflows and architecture. We will use a running demo of Hyperopt, one of the most popular open-source tools for tuning ML in Python. Our team contributed a Spark-powered backend for scaling out Hyperopt, and we will use this tool to discuss challenges and demonstrate best practices. After a quick introduction to hyperparameter tuning and Hyperopt, we will discuss workflows for tuning.
How should a data scientist begin, selecting what to tune and how? How should they track their work, evaluate progress, and iterate? We will demo using MLflow for tracking and visualization. We will then discuss architectural patterns for tuning. How can a data scientist tune single-machine ML workflows vs. distributed? How can data ingest be optimized with Spark, and how should the Spark cluster be configured? We will wrap up with mentions of other efforts around scaling out tuning in the Spark and AI ecosystem. Our team’s recent release of joblib-spark, a Joblib Apache Spark Backend, simplifies distributing scikit-learn tuning jobs across a Spark cluster. This talk will be generally accessible, though knowledge of ML and Spark will help.
– Well, thanks everyone for listening today. I want to talk about tuning machine learning models, focusing on hyperparameter tuning. This will be very relevant for those interested in speeding up performance, since hyperparameter tuning is often very computationally expensive, and also of course for tuning the accuracy or metrics of models. This is also critical. So, just quickly about where I’m coming from: I spent most of my time at Databricks as a software engineer working on ML and advanced analytics, and am now a solutions architect. One requisite slide about the company: Databricks is proud to have been fundamental in building out these important open source projects, Apache Spark, Delta Lake, and MLflow. Cool, so to get into it, I want to spend two minutes quickly going over what hyperparameters are. If you’re not familiar with the material, hopefully this will help; if you are, I think it will be nice in terms of setting some perspectives.
Cool, so I’m going to be using a demo later, looking at Fashion-MNIST, which has a bunch of images of clothing. So let’s take a look at this one; it looks like a shoe. My first model that I fit said, “this is a dress”, probably not correct. The next model I fit said, “this is a sneaker”, and that’s actually pretty accurate. And what’s the difference here? Well, model two actually had better hyperparameter settings: the learning rate, model structure, and so forth. So stepping back, what is a hyperparameter? Well, the statistical sort of definition or perspective which I’d like to give is: assumptions you make about your model or data to make learning easier.
From a practical perspective, these are inputs your ML library does not learn from data. There are a bunch of parameters in a model which the library does learn from data, but it also takes manual inputs, and we call those hyperparameters because they’re knobs which can be tweaked. Some of these I would call algorithmic: these are more problem-dependent configs which may affect, say, computational performance or speed, but might not really be related to the statistics or modeling perspective.
Cool, so how do we tune these? Well, there are three perspectives I’ll give, matching those from before. If you’re a statistician or application domain expert, you might bring knowledge which lets you set some of these settings a priori. Looking at that practical perspective: there’s an ML library, it takes some inputs, it gives an output model which can be tested, and we can optimize it like a black box. This is a pretty common way to do tuning, and quite effective if you do it well.
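To make the black-box view concrete, here is a minimal pure-Python sketch of random search over two hyperparameters. The `train_model` function is a hypothetical stand-in for a real library call (fit plus evaluate); the names and toy objective are assumptions for illustration, not from the talk.

```python
import random

def train_model(learning_rate, num_trees):
    # Stand-in for an ML library call that fits a model and returns a
    # validation score. (Hypothetical toy objective; a real workflow
    # would train and evaluate an actual model here.)
    return -(learning_rate - 0.1) ** 2 - (num_trees - 50) ** 2 / 10000

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(0.001, 1.0),
            "num_trees": rng.randint(10, 200),
        }
        score = train_model(**params)  # black-box evaluation
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(n_trials=100)
print(best_params, best_score)
```

The tuner only sees inputs and a score, which is exactly why any model from any library can be tuned this way.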
And the final way is to basically ignore a hyperparameter until needed. I think this is a common approach, and one I take with some of the more algorithmic aspects: if a config really only affects the speed of learning, then I may not bother tweaking it until needed. Cool, so I’m not going to cover statistical best practices or survey different methods for tuning in this talk, because there were talks at Spark + AI Summit last year covering some of these; I’ll give references at the end. In this talk, I want to focus on data science architecture and workflow best practices and tips around the big data and cloud computing space.
Cool, so hyperparameter tuning of course is difficult, else I wouldn’t bother giving a talk on it, and I want to outline a few reasons why. The first is that the settings are often unintuitive. If you’ve taken an intro ML class, the first hyperparameter you probably touched on was regularization. This one is sort of intuitive: avoiding overfitting by limiting model complexity. But in any given application, it’s really hard to say a priori how you should set regularization. And of course, let’s not even get into neural net structure.
The next challenge is that this involves non-convex optimization. Here I took an example problem where the x-axis is the learning rate and the y-axis is test accuracy, and you can see it jumps all over the place. This is a bit contrived in that a lot of this stochasticity is from the optimization process itself, some randomization there, but it drives home the point that this isn’t a nice simple curve you can use traditional optimization methods on; you really need some specialized techniques. The final element is the curse of dimensionality. Since this is a non-convex problem, the space grows exponentially as we increase the number of hyperparameters. For an example problem with seven possible settings per hyperparameter, the fraction of the hyperparameter space covered by a given budget of, say, a hundred hyperparameter settings drops exponentially as we move right along the x-axis. By the time we hit three hyperparameters, we can cover about a third of the space, and after that we can hardly test any of the actual settings for hyperparameters. This leads, of course, to high computational costs if you try to push that coverage up.
So given these challenges, I want to step back and see how we can address them. I won’t be able to give neat solutions for all of them, but I do want to talk about useful architectural patterns, workflows, and sort of a bag of tips at the end.
So starting off with architectural patterns: at a high level, a lot of this boils down to single-machine versus distributed training, and I’ll break it into three workflows which 99% of the customers or use cases I’ve seen in the field fall into: single-machine training, distributed training, and training one model per group, where a group might be a customer, a product, or what have you.
So looking at the first one: single-machine training is often the simplest, involving these or other popular libraries where you can fit your data and train a model on a single machine. In this case, if I’m working on my laptop, I’ll just take, say, scikit-learn, wrap it in tuning (that tuning could be scikit-learn’s own tuning algorithms or another method), and run it. But if I want to scale out via distributed computing, it’s also pretty simple: I can train one model per Spark task, like in this diagram, where the driver isn’t doing much, but each of the workers is running Spark tasks, each one fitting one model with one possible set of hyperparameters, and I wrap the entire thing in a tuning workflow. This allows pretty simple scaling out and is implemented in a number of tools. Hyperopt, which I’m going to demo later, has a Spark integration we built, allowing the driver to do this coordination across workers. We also built a joblib integration with a Spark backend, which can power scikit-learn’s native tuning algorithms. And finally, this can be done manually via Pandas UDFs in Spark.
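The "one model per task" pattern can be sketched with a standard-library thread pool standing in for Spark tasks; in practice Hyperopt's Spark integration or joblib-spark does this coordination over a real cluster, but the shape of the computation is the same. The objective and parameter names here are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_one(params):
    # Stand-in for fitting one single-machine model (e.g. scikit-learn)
    # with one hyperparameter setting; returns (params, validation score).
    lr, reg = params
    score = -(lr - 0.1) ** 2 - reg  # hypothetical toy objective
    return params, score

# One candidate setting per task, mirroring one model per Spark task.
candidates = [(lr, reg) for lr in (0.01, 0.1, 1.0) for reg in (0.0, 0.5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_one, candidates))

best = max(results, key=lambda r: r[1])
print(best)  # the setting with the highest validation score
```

Because each trial is independent, the driver only has to scatter candidate settings and gather scores, which is why this pattern scales out so cleanly.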
The next kind of paradigm is distributed training. There, if your data or model or the combination thereof is too large to train on one machine, you may need to use Spark ML, or something like distributed XGBoost, which can train a model using a full cluster. If you do want to scale this out further, then rather than training one model at a time, you can train multiple ones in parallel. That might look something like this, where you wrap the training process, which is orchestrated from the driver, say by Spark ML, with tuning. The possible tools which come into play here are, for example, Apache Spark ML, which has its own tuning algorithms; those actually support a parallelism parameter allowing you to fit multiple models in parallel to scale out further. You can also note that, from the driver’s perspective, you can actually use any sort of black-box tuning algorithm or library such as Hyperopt, because it makes calls from the driver and never really needs to know that those calls are using a full Spark cluster.
The final kind of paradigm is training one model per group. Here, the interesting cases are where there are enough groups that we really get back to that single-machine training case, where we can scale out by distributing over groups and train each group’s model in a Spark task.
So here’s kind of a diagram of it. In this, I’m recommending using Spark to distribute over groups and, within each Spark task, doing tuning for that group’s model. You can do tuning jointly over groups if it makes sense, though. Important tools to know about here are of course Apache Spark and Pandas UDFs to do this aggregation and coordination, and within each worker you could use scikit-learn’s native tuning, Hyperopt, or whatever.
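A small pure-Python sketch of the per-group pattern: records are grouped by key, and a tiny "tuning" loop runs independently inside each group. The data, the slope-only model, and the candidate grid are all hypothetical; in practice the grouping would be a Spark groupBy with a Pandas UDF doing the per-group work.

```python
from collections import defaultdict
from statistics import mean

# Toy records: (group, feature, label). In practice each group might be
# a customer or product, with Spark distributing groups across tasks.
records = [
    ("store_a", 1.0, 2.1), ("store_a", 2.0, 3.9),
    ("store_b", 1.0, 5.2), ("store_b", 2.0, 9.8),
]

groups = defaultdict(list)
for group, x, y in records:
    groups[group].append((x, y))

def tune_group(data):
    # Per-group "tuning": pick the slope (a stand-in hyperparameter)
    # minimizing mean squared error for y ~ slope * x on this group.
    candidates = [1.0, 2.0, 3.0, 4.0, 5.0]
    return min(candidates,
               key=lambda s: mean((y - s * x) ** 2 for x, y in data))

models = {g: tune_group(d) for g, d in groups.items()}
print(models)  # each group ends up with its own best setting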
Cool, so that touches on the main architectural patterns I’ve seen in the field. Getting into workflows, I’d really like to touch on common workflows and particular tips for each.
So in order to get started, my first piece of advice is to start small: bias toward smaller models, fewer iterations, and so forth. This is partly being lazy (small is cheaper, and it may suffice), but it also gives a baseline, and some libraries, such as tf.keras which I’ll demo later, support early stopping, and some algorithms can take that baseline into account.
It’s also important to think before you tune and really by this, I mean a collection of kind of good practices.
So make sure you observe good data science hygiene: separate your training, validation, and test sets. Also use early stopping or smart tuning wherever possible. I often see people use, say, Keras where they are tuning the number of epochs, but it’s often better practice to fix the number of epochs at a large setting and use early stopping to stop adaptively and be more efficient.
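The early-stopping idea above can be sketched in a few lines: fix a large epoch budget and stop once the validation loss has not improved for a "patience" number of epochs. This is a minimal sketch of the logic behind callbacks like tf.keras's EarlyStopping, not the library implementation itself.

```python
def early_stop_index(val_losses, patience=3):
    """Return the epoch index at which training would stop: when the
    validation loss hasn't improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop here instead of running all epochs
    return len(val_losses) - 1

# Hypothetical per-epoch validation losses: improvement stalls after 0.6.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64, 0.3]
print(early_stop_index(losses, patience=3))
```

Note the trade-off: with patience 3, training stops before reaching the late 0.3, which is exactly why patience is itself a setting to think about rather than tune blindly.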
Also, pick your hyperparameters carefully. Now, this is pretty vague, but I’ll give at least an example of a common mistake I’ve seen made with tree models. They have two hyperparameters which are important for controlling the depth or size of the tree: max depth and min instances per node. These serve somewhat overlapping functions, and I’ve seen them tuned jointly, but it’s often better to fix one and tune the other, depending on what you’re going for.
Finally, I’ll mention picking ranges carefully. This is hard to do a priori, but I think it speaks to the need for tracking carefully during initial tuning and improving the ranges later on. The next set of workflows I’ll talk through is models versus pipelines.
So for this, one important best practice is to set up the full pipeline before tuning. The key question to ask is: at what point does your pipeline compute the metric which you actually care about? Because that’s really the metric you should tune on. Related to that is whether you should wrap tuning around a single model or around the entire pipeline, and my advice there is to generally go bigger. That’s because featurization is really critical to getting good results in ML. I had a computer science professor in an AI course back in the early two thousands make the joke that if you had the best features possible, then ML would be easy; for example, if one of your feature columns were the label you’re trying to predict, you’d be done. Of course that’s facetious, but it does make the good point that taking time to do featurization, and in particular to tune it, is pretty critical.
So optimizing tuning for pipelines comes into play when you start wrapping tuning around the entire pipeline, and my main advice there is to take care to cache intermediate results when needed. The final set of workflows is really around evaluating and iterating efficiently.
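The caching advice can be sketched with a memoized featurization step: when many trials share the same upstream pipeline settings, the expensive intermediate result is computed once and reused. The function names and toy computation are hypothetical; in a Spark pipeline the analogue would be persisting the featurized DataFrame or writing it out once.

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def featurize(dataset_version):
    # Stand-in for an expensive featurization step.
    calls["count"] += 1
    return tuple(x * 2.0 for x in range(5))  # hypothetical features

def run_trial(dataset_version, learning_rate):
    features = featurize(dataset_version)   # cached across trials
    return sum(features) * learning_rate    # stand-in for train + eval

# Three tuning trials over learning rate share one featurization pass.
scores = [run_trial("v1", lr) for lr in (0.01, 0.1, 1.0)]
print(scores, calls["count"])
```

When trials vary only the downstream model's hyperparameters, the savings scale with the number of trials; if a trial changes a featurization hyperparameter, that becomes a new cache key and is recomputed, which is the behavior you want.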
And there I’ll say validation data and metrics are critical to take advantage of. Just a priori, record as many metrics as you can think of on both training and validation data, because they’re often useful later. Tuning hyperparameters independently versus jointly is a pretty interesting question. Since you do have this curse of dimensionality, it is tempting to say, well, let me tune one hyperparameter, fix it at a good value, then the next, then the next. That’s sort of more efficient, but some of those hyperparameters depend on each other, and smarter hyperparameter search algorithms, like those in Hyperopt, which we’ll demo later, can take advantage of that and be pretty efficient. The final thing is around tracking and reproducibility. Along the way of producing all of these models and tuning, you generate a lot of data: code, params, metrics, et cetera. Taking time to record these, using a good tool like MLflow, can save a lot of time and grief later. One tip is to parameterize code to facilitate tracking; that way, as you’re tweaking what you’re running, you’re not tweaking complex code, you’re tweaking simple inputs to that code. Cool, so I’d really like to demo this, so I’ll switch over to a Databricks notebook.
Here I am in a Databricks notebook. Keep in mind this code and the tools I’ll be using are open source, so you could run them in whatever venue.
And in this, I’m going to demo scaling out a single-machine ML workflow, the goal being to show some of the best practices which we talked through in the slides. I’ll be working with Fashion-MNIST, classifying images of clothing using tf.keras, and the tuning tool I’ll use is Hyperopt. This is open source, provides black-box hyperparameter tuning for any Python ML library, has smart adaptive search algorithms, and also has the Spark integration for distributing and scaling out which my team contributed a while back. So I’m going to skip through some of this initial code for loading the data. There are 60K images in training, 10K in test; labels are integers zero through nine, corresponding to different clothing items. Just to get a sense, here are the types of images we’d seen in the slides earlier; here’s our sneaker, not a dress. So I pulled an open source example from the TensorFlow website and looked at what hyperparameters I might tweak. Well, model.fit had a few interesting ones: batch size and epochs jumped out at me. Of course, for epochs I’m going to use early stopping instead, and that way I don’t have to bother tuning it. The optimizer Adam takes a learning rate, which I’ll tweak, and for the model structure I picked three example structures, small, medium, and large in terms of size and complexity; there are others which are future work. So taking a look at tuning these: I’m not going to go through the code in detail, but I do want to highlight the beginning, where I’ve taken care to parameterize this workflow. This way, as I tweak batch size, learning rate, and model structure, I can just pass those in as parameters rather than changing my code itself. I’m going to skip through most of this code. I will note that I’m taking care to log things with MLflow at the end, and note that I can run this training code with some parameters; this is just an example run. So now I’ll skip down to tuning with Hyperopt.
So my search space here is going to be specified using the Hyperopt API. If you’re not familiar with it, don’t worry about it; here I’m saying there are three hyperparameters I want to tweak, and each is given a range and a prior over that range where I want to search. And here’s a good example of taking care to choose those ranges carefully: I’ll call out that I used a log-uniform distribution for the prior for the learning rate, biased towards smaller learning rates, because I’ve seen success with those in the past. Then I call Hyperopt’s minimize function, where I minimize the loss defined by this training code over the search space, using a smart search algorithm. Here’s the Spark backend, saying let’s fit at most eight models, testing possible hyperparameter settings in parallel at a time, 64 possible models total, and run it. I already ran this because, on this toy cluster which had two workers and no GPUs, it took about an hour. A good example of why it’s important to make sure you record beforehand the metrics you really care about.
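The log-uniform prior mentioned above can be sketched with the standard library: sample uniformly in log space, so each order of magnitude is equally likely. This mirrors the intent of Hyperopt's log-uniform prior but is a standalone sketch, not Hyperopt itself; the range here is a hypothetical learning-rate range.

```python
import math
import random

def sample_loguniform(low, high, rng):
    # Uniform in log space: values like 1e-4 and 1e-2 are equally
    # likely orders of magnitude, unlike a plain uniform sample.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_loguniform(1e-5, 1e-1, rng) for _ in range(1000)]
frac_small = sum(s < 1e-3 for s in samples) / len(samples)
print(frac_small)
```

With a plain uniform prior over [1e-5, 1e-1], almost no samples would fall below 1e-3; with the log-uniform prior, about half do (2 of the 4 decades), which is what biases the search toward small learning rates.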
It’s also nice to take a look at those metrics, of course, and here I’m using the MLflow experiment data source, where I’m loading the runs for this notebook’s experiment and adding in a duration column. We can take a look at this: here’s a histogram of the duration of tuning models. Most ran pretty quickly, but a few took the better part of an hour, and so you might want to consider whether you want to spend that time. So I’m going to open up the MLflow UI for this notebook’s experiment. And here we see I have this one run for Hyperopt, with a bunch of child runs for each of the models we fit under the hood, with different batch sizes, learning rates, and model structures, different metrics, and also of course the model artifact. In the MLflow UI, I’m going to compare these; I’ve already set this up, and it’s pretty interesting to take a step back and look at the correlations of batch size, learning rate, and model structure with our loss. If we look at some of the best losses, we can see those came from this medium model structure, small learning rates, and so forth. But if we widen the selection, we can see pretty good losses actually came from that largest model structure.
But what’s interesting is when we look at the loss or metric that we really care about, the test accuracy. There, if we look at the best possible model, it actually had kind of a middling loss, and it was from this largest model structure. So it really speaks to the need to take time to record extra metrics, especially the ones you really care about, so that you can go back later and maybe explore that model structure further. So I’m going to flip back to the slides. Cool, so that demo gave you a bit of a taste of what we’ve been talking about in architecture and workflows. I’m going to go quickly through these last slides, because they’re really a grab bag of tips and details and largely have quite a few references for looking at later on. So as far as handling code: getting code to workers is an interesting problem, where it’s generally simple to use Pandas UDFs or integrations, but debugging problems can be tricky.
And there, my main recommendations are to look in worker logs and, in Python, to import libraries within closures. Passing configs and credentials can also be tricky; I didn’t show it here because Databricks is handling the MLflow runs and credentials under the hood, but this helpful resource has some info on these topics.
Moving data is also an interesting problem. For single-machine ML, broadcasting the data or loading from blob storage can both be good options, depending on the data size. Caching data can also come into play, and it becomes even more important in distributed ML. Blob storage data prep is really important as data sizes start to grow, and Delta Lake, Petastorm, and TFRecords are key tech to be aware of. These links to resources are nice walkthroughs of these different options.
Then finally, for configuring clusters, here are the main discussion points (not tips, really, but discussion points). Around single-machine ML, sharing machine resources and selecting machine types can be pretty important. By sharing resources, I really mean the question of: if you have multiple Spark tasks fitting different models on the same worker machine, how are they sharing resources? Thinking about that beforehand can be critical. For distributed ML, right-sizing clusters and sharing clusters are important topics, and for the latter, I’d really say take advantage of the cloud model and spin up resources as needed, so that you don’t need to share clusters and complicate things. Again, this resource has a bit more info.
Cool, so to get started, I’d recommend first taking a look at the different technologies you’ve been working with. Tools to know about are listed on the right for each of the technologies on the left. I won’t go through these, but definitely take a look at them if any of these call out to you.
And finally, the slides and notebook are at the tiny URLs up there in the upper right, and those slides will have the links to the many resources I’ve listed. There are a few more listed here, and I hope they’re useful, including those talks from last year which go into a bit more of the fundamentals of hyperparameter tuning. With that, I’d like to say thank you very much for listening.
Joseph Bradley works as a Sr. Solutions Architect at Databricks, specializing in Machine Learning, and is an Apache Spark committer and PMC member. Previously, he was a Staff Software Engineer at Databricks and a postdoc at UC Berkeley, after receiving his Ph.D. in Machine Learning from Carnegie Mellon.