SESSION

Spark RAPIDS ML: GPU Accelerated Distributed ML in Spark Clusters

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Data Science and Machine Learning
INDUSTRY: Enterprise Technology
TECHNOLOGIES: AI/Machine Learning, Apache Spark
SKILL LEVEL: Intermediate
DURATION: 40 min

Spark MLlib is a key component of Apache Spark™ for large-scale machine learning and provides built-in implementations of many popular machine learning algorithms. These implementations were created a decade ago and do not leverage modern computing accelerators such as GPUs. In this talk, we present Spark RAPIDS ML (https://github.com/spark-rapids-ml), an open source Python package that enables GPU acceleration of distributed machine learning applications on Spark. It is built on the proven RAPIDS cuML C++/Python library (https://github.com/rapidsai/cuml), which implements GPU-accelerated versions of classical ML algorithms for regression, classification, clustering, and dimensionality reduction. For algorithms that are also available in Spark MLlib, Spark RAPIDS ML offers Spark MLlib DataFrame API compatibility, so existing applications need essentially no code changes. We share benchmark results demonstrating up to 100x speedup and 50x cost savings over baseline Spark MLlib in compute-intensive regimes.
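To make the API-compatibility claim concrete, here is a minimal sketch, not taken from the talk, of how the same KMeans code could target either backend by changing only the module it is imported from. The helper names (kmeans_module, fit_kmeans, demo) and the toy data are hypothetical. The sketch assumes that the spark_rapids_ml package exposes a KMeans estimator in spark_rapids_ml.clustering that accepts the same k and featuresCol parameters as pyspark.ml.clustering.KMeans.

```python
import importlib

# Module paths for the two backends. Everything else in this sketch is
# identical for CPU and GPU; only the import location changes.
CPU_MODULE = "pyspark.ml.clustering"       # baseline Spark MLlib
GPU_MODULE = "spark_rapids_ml.clustering"  # Spark RAPIDS ML counterpart


def kmeans_module(use_gpu: bool) -> str:
    """Return the module path that provides the KMeans estimator."""
    return GPU_MODULE if use_gpu else CPU_MODULE


def fit_kmeans(df, use_gpu: bool, k: int = 2):
    """Fit KMeans on a DataFrame that has a 'features' vector column.

    The estimator is imported lazily so this file can be loaded on a
    machine where only one of the two packages is installed.
    """
    module = importlib.import_module(kmeans_module(use_gpu))
    estimator = module.KMeans(k=k, featuresCol="features")
    return estimator.fit(df)


def demo(use_gpu: bool = False):
    """Hypothetical end-to-end run; needs pyspark (and a GPU if use_gpu)."""
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kmeans-demo").getOrCreate()
    rows = [
        (Vectors.dense([0.0, 0.0]),),
        (Vectors.dense([0.1, 0.1]),),
        (Vectors.dense([9.0, 9.0]),),
        (Vectors.dense([9.1, 9.1]),),
    ]
    df = spark.createDataFrame(rows, ["features"])
    model = fit_kmeans(df, use_gpu=use_gpu, k=2)
    # The fitted model is used the same way on either backend.
    model.transform(df).show()
```

Calling demo() runs against Spark MLlib, and demo(use_gpu=True) would route the same call through Spark RAPIDS ML on a GPU-equipped cluster. In an existing application the equivalent change is simply replacing the import line.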

SESSION SPEAKERS

Erik Ordentlich

Sr. Manager
NVIDIA

Jinfeng Li

Senior Engineer, Machine Learning
NVIDIA