ROCm and Distributed Deep Learning on Spark and TensorFlow

ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning.
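As a quick sanity check (a minimal sketch, not code from the talk), a ROCm build of TensorFlow exposes AMD GPUs through the standard device APIs, so confirming that the MIOpen-backed kernels will be used looks the same as on any other GPU build; the TensorFlow 2-style calls below are illustrative:

```python
import tensorflow as tf

# On a ROCm build of TensorFlow, AMD GPUs appear as ordinary GPU devices;
# MIOpen provides the underlying convolution and other DNN kernels.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

if gpus:
    # Place a small computation explicitly on the first GPU to confirm
    # that kernels actually execute there.
    with tf.device("/GPU:0"):
        x = tf.random.normal((1024, 1024))
        y = tf.matmul(x, x)
    print("Matmul ran on:", y.device)
```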

In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm: it allows different deep learning frameworks to be embedded in Spark workflows within a secure, end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with TensorFlow on ROCm, from Horovod to HopsML to Databricks' Project Hydrogen.
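As one concrete point of comparison (a minimal sketch assuming Spark 2.4+, not code from the talk), Project Hydrogen's barrier execution mode schedules all tasks of a stage simultaneously, which gives the gang-scheduled set of workers that a distributed TensorFlow or Horovod job needs; the training body below is a placeholder:

```python
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-mode-sketch").getOrCreate()

def train_partition(_):
    # Barrier mode guarantees that every task in this stage runs at the
    # same time, so the workers can discover each other up front.
    ctx = BarrierTaskContext.get()
    workers = [info.address for info in ctx.getTaskInfos()]
    ctx.barrier()  # wait until all tasks reach this point
    # ... start the distributed TensorFlow/Horovod worker here ...
    yield (ctx.partitionId(), workers)

num_workers = 4
rdd = spark.sparkContext.parallelize(range(num_workers), num_workers)
print(rdd.barrier().mapPartitions(train_partition).collect())
```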

We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the data scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline, written in a Jupyter notebook on Hopsworks with ROCm.
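For the inference half of such a pipeline, one common pattern (sketched below; the model path, input path and column name are hypothetical, and this is not the demo code itself) is to score a Spark DataFrame with a pandas UDF that loads the exported TensorFlow model on each executor:

```python
import pandas as pd
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

spark = SparkSession.builder.appName("tf-batch-inference-sketch").getOrCreate()

# Hypothetical location of a model exported by the training step,
# e.g. to HopsFS or an object store reachable from every executor.
MODEL_PATH = "/Models/my_model/1"

@pandas_udf(FloatType())
def predict(features):
    # Load the SavedModel on the executor and score one batch of rows.
    # A real pipeline would cache the loaded model between batches.
    model = tf.keras.models.load_model(MODEL_PATH)
    x = features.to_numpy(dtype="float32").reshape(-1, 1)
    return pd.Series(model.predict(x).reshape(-1))

df = spark.read.parquet("/Projects/demo/features.parquet")  # hypothetical input
df.withColumn("score", predict(df["feature"])).show()
```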

About Jim Dowling

Jim Dowling is CEO of Logical Clocks and an Associate Professor at KTH Royal Institute of Technology. He is lead architect of the open-source Hopsworks platform, a horizontally scalable data platform for machine learning that includes the industry’s first Feature Store.

About Ajit Mathews

As the Corporate Vice President of Machine Learning Software Engineering, Ajit is the engineering leader responsible for the design and development of ROCm (Radeon Open Compute) machine intelligence software, spanning deep learning frameworks, compilers, language runtimes, libraries and the Linux compute kernel. Ajit is also responsible for the machine learning software roadmap and strategy. He is passionate about distributed machine learning and high-performance computing, and holds a Master's in Computer Science and an MBA from Kellogg.