Tuning a Spark ML model with cross-validation can be an extremely computationally expensive process. As the number of hyperparameter combinations increases, so does the number of models being evaluated. The default configuration in Spark is to evaluate each of these models one-by-one to select the best performing. When running this process with a large number of models, if the training and evaluation of a model does not fully utilize the available cluster resources then that waste will be compounded for each model and lead to long run times.
Enabling model parallelism in Spark cross-validation, from Spark 2.3, will allow for more than one model to be trained and evaluated at the same time and make better use of cluster resources. We will go over how to enable this setting in Spark, what effect this will have on an example ML pipeline and best practices to keep in mind when using this feature.
Additionally, we will discuss ongoing work in progress to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines.
Session hashtag: #DS6SAIS
Nick Pentreath is a principal engineer in IBM's Center for Open-source Data & AI Technology (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of "Machine Learning with Spark". Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.
Nick has presented at over 30 conferences, webinars, meetups and other events around the world including many previous Spark Summits.
Bryan Cutler is a software engineer at IBM's Spark Technology Center, where he works on big data analytics and machine learning systems. He is a contributor to Apache Spark in the areas of ML, SQL, Core and Python and a committer for the Apache Arrow project. His interests are in pushing the boundaries of software to build high performance tools that are also a snap to use.