Hyperparameter tuning is critical in model development. And its general form: parameter tuning with an objective function is also widely used in industry. On the other hand, Apache Spark can handle massive parallelism, and Apache Spark ML is a solid machine learning solution.
But we have not seen a general and intuitive distributed parameter tuning solution based on Apache Spark, why?
In this talk, we are going to show how using Fugue-Tune and Apache Spark together can eliminate these painpoints
In the demo, you will see how to do any type of tuning in a consistent, intuitive, scalable and minimal way. And you will see a live demo of the amazing performance.
Han Wang: Hello everyone. My name is Han Wang, today I’m going to talk about the parameter tuning problems and how to make them distributed using Spark and Fugue. I will give a brief introduction of the tuning problems and how we categorize them. Then I will go through the common algorithms for non-iterative problems. In the demo, I’m going to show how to do hybrid searches using the tune package. After that, I will go through the common algorithms for iterative problems. In the demo, I will show how to make these algorithms work on Spark using tune and Fugue. Today’s focus will be tune and obstruction layer for parameter tuning. It relies on Fugue and Spark to achieve its functionality, but it encapsulates most of the complexities. So for many use cases, you may not need to know Spark or Fugue at all. Please notice that both tune and the Fugue are open sourced and you can directly pip install both of them.
Let’s start. Here are some common questions or confusions about parameter tuning. Is this a machine learning problem? Are there common ways for tuning? Why is it so hard to leverage a distributed framework to do perimeter tuning? So, assume this is a general parameter tuning problem. For machine learning, we call it hyper parameter tuning and it is a sub problem of the general problem. Insight machine learning, we have classical models and deep learning models, both need a tuning. So assume this is a general parameter tuning problem. For machine learning, we call it hyper parameter tuning and it is a sub problem of the general problem. Inside machine learning, we have classical models and deep learning models, both need tuning. Let us take a step back, let us categorize the general tuning problems in another way. Non-iterative problems refer to the objective that will run just once to get a metric that stops. Iterative problems, refer to the objectives were run multiple iterations and each iteration, it will generate a metric.
If we categorize in this way, we find tuning for most of the classical models, are non iterative problems. The tuning for all deep learning models can be considered as iterative problems. Some classical models, such as boosted decision tree, can be treated as either one. With different categories, let’s see why distributed parameter tuning is so hard. First of all, obviously not everything can be parallelized, iterative problems count. And some search algorithms have to be sequential too. Secondly, to tune even the simplest of problems, you have to describe the searching space, define the objective, care about distributed [inaudible] status, how to collect results, how to prune. It seems you have to define everything. Popular tuning frameworks initially do not consider distributed cases. So later on the designed for distributed cases becomes awkward and complicated for normal users. Retune, maybe an exception, but it’s locked with ray, not as popular as Spark. Spark is not designed for tuning particularly. For iterative problems, real-time synchronous communication becomes the biggest challenge. It does not even align with Spark’s design philosophy.
Now let’s talk about how we build abstraction layers to help solve these challenges. There are different computing from us Spark, Dask, Pandas and on GPU Rapids and Blazing SQL. Fugue is an abstraction layer unifying the computation. On top of Fugue, we already have Fugue SQL and the validation, they are not today’s topic. On the other hand, there are existing hyper parameter optimization tools such as Hyperopt and the Opotuna. The cure framework is built upon Fugue and exists in HPO tools. It redefines the space concept and adapts to the HPO frameworks and also leverage Fugue for distributed computing. So for users, they only need to use a small set of unified APIs to interact with all the great underlying frameworks and to switch freely.
Here are the high level goals for the tune framework. For non-iterative problems, we unify grid and random search. And for all other algos, we make them pluggable. For iterative problems we generalize state-of-the-art algorithms such as Hyperband and ASHA. Bring them to all execution engines, Fugue supports. So users will have much higher flexibility to configure their search. For both, we will make the search execution seamless between local and distributed environment. Tuning is difficult, so we make the development Iterable and the testable. We minimize the moving parts, we don’t need a database or a queue or ray or Spark. We minimize the interface from majority of use cases. You only do single API calls and Spark and Fugue are totally hidden from you.
Now let’s talk about non-iterative problems. Grid search is probably the most commonly used tuning method, it is straightforward, cross-product all choices are all parameters to get all combinations. It’s deterministic and it can cover each value of a parameter with equal probability. But the search space size for complex problems can be very large and sometimes unnecessary. Random search relies on how many samples needed from a search space, so we can control the size of the compute, and it is also great for continuous variables. However, it may not be deterministic and for complex problems, we still need large sample size to cover combinations.
Bayesian Optimization is one of the other master searching algorithms. The search starts with random guesses, but for each step it takes into account the previous search results. So the guesses will be more reasonable and faster to converge. It normally takes much less compute size to achieve the comparable results compare with random and research. However, due to the nature of the algorithm, it has to be sequential. So it’s not necessarily faster than grid and the random searches when you have a distributed system, none of the algorithms is perfect. The best way is to flexibly, combine them together as a single search task. In order to do that, we redefined the concept of space. We treat grid search and the random search as the first level optimization in which they simply provide a collection of configurations and we parallelize on this collection of configurations and we treat all other algorithms such as Bayesian Optimization as Second Level Optimization.
Each of the configuration and we can plug in third party solutions for the Second Level Optimization. Here is an example. This search contains, Models sweeping, Grid search, Random search, and a Bayesian Optimization. Grid and the Random configurations are generated before execution and the Bayesian Optimization is done in their own time.
Now let’s discuss the iterative problems and we are going to use Keras modal tuning as our examples. Because Kaggle machine is now so powerful, we are going to use the Boston housing data set in this example. If you have ever trained a deep learning model, you would know that it is more complex than the classic home models. You have to implement different things for a deep learning model. For example, you have to construct a model, you have to care about how to compile it, how to fit it and how to handle output metric.
In order to make kids more organized, we create a Keras training spec template class. And this class does not really add anything, but just to help you reorganize your Keras training code. Now let’s go through the logic. For the initialization, we can consume the data frame from the data frames passing by the distributed system. Or on the other hand, in this case, we just load the data from the Kara’s data sets and the full constructing model. We build two dense layers and for the first level we set the number of cells to be L one and L two is the number of cells on the second for the second layer. Our objective is to optimize L one and L two in order to produce the best validation metric. And full compile, we have specified our final metric. What we care is the mean absolute error. Here for the fifth metric, this is how we get the mean absolute error from the validation set.
For sort metric, notice that for Keras output the metrics can be larger better or smaller better. But for sort metric, it must be smaller better. For errors, we can just return original metric, but if it is accuracy, then we need to return to negative value of the metric. Was this training Spec? We are able to do hyper parameter tuning for the Keras models, but before tuning. It will be great if we can validate everything we have implemented here is working locally.
So here is how we do that, we just gave some dummy hyper parameters and data frames, and then we just call this function, compute sort metric. This function is designed for testing purpose, it will for sure cover everything you have written here and to generate the final sort metric. It is an end to an execution that is very similar to the execution of a single run. Now let us write, as you can see, the result is not good, but that does not matter. The thing we care is if it can run smoothly from end to end and this seems it is working. Deep learning training is very expensive, so it is better to do validations as early as we can, before we use a lot of computer resource to do expensive jobs.
Now we have validated that our training spec is working as expected. Now let’s just input the utility functions for Keras model tuning. It is very similar to how we tune the secular models. Okay, now let’s start with the simplest, the case for Keras model tuning.
First of all, we just need to find the space, right? Space sequel to Keras, space and housing spec L one. Let’s do a random search. Okay, so we want randint from eight to 32. L two from randint from eight to 32. So this is our space and how can we run the simplest as successive halving just need to do this.
We need the space and we need a plan. You remember, we took in the slides, we need a plan for reducing the candidates and how many epochs to run. So let’s try the simplest one. So this means for all the candidates from space, we will run two epochs and then I will select just one of them to promote because we have only one wrong. This one results will be the final result. I forgot one thing, which is we need to make this a random search space. So we need a sample let’s use 16. Okay, this is everything you need to do, to do a successive halving on top of a Keras model. If you just want to do things locally and we can do an additional thing is we can set restricted top to just two values. In the end we just output the results, print the results here.
So as you can see, actually, we also think this is another type of validation after this validation, this validation has no knowledge of tuning, and this is the simplest tuning case. So we value, you can see that we are able to validate things locally step-by-step and to get the best to resort, as you can see, even was this time type of simple run, we get much better result than the previous one. Right? Okay. So now let us run a more traditional successive halving. As you can see, for this tradition of success of halving, we increase the epochs by the factor of two, and we reduce the number of remaining candidates by the factor of two, this is a lot of compute. So we will see that it will take a while to finish. Now let us think about it is a lot of compute, how can we bring this to Spark? So again, it’s just like before we only need to specify the execution engine include of Spark.
And because of fugal, this Spark is a shortcut to start a local Spark cluster on Kegel. Oh. As you can see, the previous result is already here. It is pretty good it is much better than the sample plan. This plan, this traditional plan is great. Okay, now this is the way we can run things on top of Spark. There is another thing. What if we also want to do real time monitoring? Again, it’s just like before we only need to enable the callback for this execution engine. And we tell the system, we want to monitor “rungs”. Let us see what will happen here. Here, what we will monitor for each trial, or just have the metrics for each rungs. And here we just plot all the trials we have collected so far. You will see a typical pattern for successive halving in real time, it’s a live shot.
And then as you can see that for each run, we reduced the number of candidates to half. So you see less, less and less. And finally we get a result. This is how successive halving is working. We have visualized it. What if we want to do hyper band? As I have said, the hyper band is just to rung a collection of plans. We only need to change the plan to plans. And each one is successive halving like. Also we can increase the number of samples because there are more searches. So we need, it’s better to feed, feed the system with more samples here and all other things will be the same.
And another thing I want to mention is that the monitor, although we have a shortcut here, you can fully customize the monitor by implementing the monitor clips. You can connect this system with [inaudible] . You can push all the metrics to [inaudible] , to visualize or to save. As you can see, the hyper band is quite similar to successive halving, but it was just to try more things because we have a collection of plans. We use 24 samples will be assigned proportionally to all these plans.
Now what if we want to do continuous or synchronous successive halving, again, you only need to change this function. And then that’s changed back the plan to a single array. As you can see, as you can see, we have a warmup here. You can set warmup for any of these three algorithms. It makes more sense to set warmup for a synchronous success of halving and success of halving a for hyper band itself. It has some warmup idea already. So probably it’s not necessary. Also, I need to mention that practically for continuous a synchronous success of halving it is way more efficient than the other two algorithms. So here you can sample way more candidates for it to start.
Now, let’s talk about iterative problems. The most, the typical examples are deep learning models tuning. Solving iterative problems is all about how to quickly stop the bad experiments and to move computer resource to the promising ones. So the first challenge is how to do early stopping if for each configuration and each iteration, it tries to get an immediate decision. Then different trials will need to communicate in real time, a synchronously. How can we make this happen on traditional distributed frameworks, such as, Spark? If for each configuration, and each iteration, we use checkpoint in the file system, then overhead of saving and the loading and idling can cost inefficiency. Especially when you have a large computer resource.
Other challenges include each single problem can be parallelized and a lot of boilerplate code to make it work, on any framework. Now let’s quickly go through some well-known early stopping algorithms. The most basic one is successive halving in this particular case, we start with eight configurations, either randomly generated or specified by user. Then after running white epoch, we keep the top four, then run two ephocs keep two then run four epochs you pass one. In the end, we run eight epochs on this configuration to get the final model, we can use the initial number of trials, initial number of epochs, and a factor for reduction to describe the traditional successive halving. And in this case, it is eight, one, two. But in my opinion, this problem is oversimplified. In our framework, you can fully customize the process with number of trials and the number of epochs in each rung, here is an example.
We start with eight configurations rung four epochs, keep five, keep six rung two epochs, keep two rung six epochs, keep one. With this design, the traditional success of halving it’s just an array of top hosts. And even with the warmup, it is still an array of top hosts and you can invent your own array. Hyper band is roughly a grid search of successive halving, for a single success of halving, if it’s rungs, just a one epoch to discuss candidates, it could be too early, but if it rungs full epochs to discard candidates, it could be a waste. So hyper band is to try different starting points while keeping each attempt using roughly the same amount of computer resource. Again, we have generalized the hyper band to be an array of array of couples. The traditional one is just a special case. You can also invent your own arrays, no need to follow the original formula.
A synchronous successive halving is a big improvement over successive halving based on a simple idea, if we have eight workers for each rung, each worker does not wait for all others to finish the same rung, but just to compare itself with the already finished ones and to decide whether to promote to the next or rung, or to stop the current configuration and pick up a new configuration to start over. We will see a few things. One, trials finished earlier, we’ll have higher channels to get promoted. Two all workers, keep busy without waiting. Three great trials, won’t be promoted to the end, even if early trials have reached the end. So the final result is still good. They’re not definitely waste in this case, but practically speaking do utilization of computer resource is much higher and this is more important.
As you can see that a synchronous success of halving will never wait for other rungs. It was just any Walker would just get promoted or get killed in real time. And the continuous success of a synchronous successive halving means that we do not do any checkpoint. So for the other two algorithms, we’re always rely on checkpoint. We save the models to disk, wait, retrieved the model, and continue to training [inaudible] continuous synchronous success of halving. We don’t check point any of the training during the process for each end of the rung, the worker would just ask the driver, should I continue on this trial? Or should I just quit? And the start with a totally new trial. It will get a decision immediately. So it does not have to check point anything in this case.
And as you can see that the result is also a little bit better than the other two algorithms. Actually, I didn’t show the time. So practically speaking, if you have a lot of computes resource, a synchronous success of halving will outperform hyper band and the successive harming are almost everything, the speed, the cost and the quality of the output model. But if you have very limited resource, for example, you have only one machine, you can only train the models on your local machine using a single GPU, then probably hyper ban and success of halving. They are the better choices for you. And for those cases, you just to remove this, [inaudible] to run everything. Using the single GPU.
To sum up [inaudible] Is an obstruction layer, unifying the core concepts for parameter optimization. We divided into non-iterative and iterative problems. Based on the two categories, we’re working on specialized solutions for all popular machine learning frameworks. Our goal is to provide non-invasive and minimal functional APIs for all major tuning scenarios. And the scientists will be able to do tuning without requiring the knowledge of distributed systems. Meanwhile, we will enable advanced users to create fully customized and platform agnostic and the skill of the agnostic tuning pipelines using our lower level APIs. Please let us know if you are interested in collaborating with us to democratize distributed tuning. Thank you.
Han Wang is the tech lead of Lyft Machine Learning Platform, focusing on distributed computing and machine learning solutions. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon...