Travis Addair is a software engineer at Uber and technical lead for the Deep Learning Training team as part of the Michelangelo AI platform. In the open source community, he serves as lead maintainer for the Horovod distributed deep learning framework and is a co-maintainer of the Ludwig auto ML framework. In the past, he’s worked on scaling machine learning systems at Google and Lawrence Livermore National Lab.
November 17, 2020 04:00 PM PT
Ray (https://github.com/ray-project/ray) is a framework developed at UC Berkeley and maintained by Anyscale for building distributed AI applications. Over the last year, the broader machine learning ecosystem has been rapidly adopting Ray as the primary framework for distributed execution. In this talk, we will overview how libraries such as Horovod (https://horovod.ai/), XGBoost, and Hugging Face Transformers, have integrated with Ray. We will then showcase how Uber leverages Ray and these ecosystem integrations to simplify critical production workloads at Uber. This is a joint talk between Anyscale and Uber.
Speakers: Travis Addair and Richard Liaw
June 23, 2020 05:00 PM PT
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training. The newly introduced Horovod Spark Estimator API enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod's ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.
This breaks down the barriers between ETL and continuous model training. Operational and management tasks are lower, and data processing and cleansing is more directly connected to model training. This talk covers an end to end pipeline, demonstrating ETL and DL as separate pipelines, and Apache Spark 3.0 ETL with the Horovod Spark Estimator API to enable a single pipeline. We will demonstrate 2 pipelines - one using Databricks with Jupyter notebooks to run ETL and Horovod, the second on YARN to run a single application to transition from ETL to DL using Horovod. The use of accelerators across both pipelines and Horovod features will be discussed.