Spark RAPIDS ML: GPU Accelerated Distributed ML in Spark Clusters
OVERVIEW
EXPERIENCE | In Person |
---|---|
TYPE | Breakout |
TRACK | Data Science and Machine Learning |
INDUSTRY | Enterprise Technology |
TECHNOLOGIES | AI/Machine Learning, Apache Spark |
SKILL LEVEL | Intermediate |
DURATION | 40 min |
Spark MLlib is a key component of Apache Spark™ for large-scale machine learning and provides built-in implementations of many popular machine learning algorithms. These implementations were created a decade ago and do not leverage modern computing accelerators such as GPUs. In this talk, we present Spark RAPIDS ML (https://github.com/spark-rapids-ml), an open source Python package that enables GPU acceleration of distributed machine learning applications on Spark. It is built on the proven RAPIDS cuML C++/Python library (https://github.com/rapidsai/cuml), which implements GPU-accelerated versions of classical ML algorithms for regression, classification, clustering, and dimensionality reduction. For algorithms that are also available in Spark MLlib, Spark RAPIDS ML provides Spark MLlib DataFrame API compatibility that requires essentially no code changes. We share benchmark results demonstrating up to 100x speedup and 50x cost savings over baseline Spark MLlib in compute-intensive regimes.
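As a rough illustration of the API compatibility described above, the sketch below switches between the CPU and GPU KMeans estimators by changing only the module they are imported from. The module path `spark_rapids_ml.clustering`, the helper names `kmeans_module` and `fit_kmeans`, and the `features` column name are assumptions made for illustration, not details taken from the talk. Running the GPU path additionally assumes the package is installed on a GPU-enabled Spark cluster.

```python
import importlib

# Spark MLlib's DataFrame-based clustering module (CPU baseline).
CPU_MODULE = "pyspark.ml.clustering"
# Assumed module path of the GPU counterpart in Spark RAPIDS ML.
GPU_MODULE = "spark_rapids_ml.clustering"


def kmeans_module(use_gpu: bool) -> str:
    """Return the module to import KMeans from (hypothetical helper)."""
    return GPU_MODULE if use_gpu else CPU_MODULE


def fit_kmeans(df, k: int, use_gpu: bool = False):
    """Fit KMeans on a Spark DataFrame that has a 'features' vector column.

    Only the import location differs between the two paths; the estimator
    is configured and fitted through the same MLlib-style calls.
    """
    module = importlib.import_module(kmeans_module(use_gpu))
    estimator = module.KMeans().setK(k).setFeaturesCol("features")
    return estimator.fit(df)
```

Given a DataFrame `df` prepared for Spark MLlib, `fit_kmeans(df, k=8, use_gpu=True)` would take the GPU path, and everything else in the application would stay as written for `pyspark.ml`.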
SESSION SPEAKERS
Erik Ordentlich
Sr. Manager, NVIDIA
Jinfeng Li
Senior Engineer, Machine Learning, NVIDIA