Rong Ou

Principal Engineer, NVIDIA

Rong Ou is a Principal Engineer at Nvidia, working on machine learning and deep learning infrastructure. He introduced mpi-job support into Kubeflow for distributed training on Kubernetes. Prior to Nvidia, Rong was a Staff Engineer at Google.

Past sessions

XGBoost is one of the most popular machine learning library, and its Spark integration enables distributed training on a cluster of servers. In Spark+AI Summit 2019, we shared GPU acceleration of Spark XGBoost for classification and regression model training on Spark 2.x cluster. This talk will cover the recent progress on XGBoost and its GPU acceleration via Jupyter notebooks on Databricks. Spark XGBoost has been enhanced to training large datasets with GPUs. Training data could now be loaded in chunks, and XGBoost DMatrix will be built up incrementally with compressions. The compressed DMatrix data could be stored in GPU memory or external memory/disk. These changes enable us to train models with datasets beyond GPU size limit. A gradient based sampling algorithm with external memory is also been introduced to achieve comparable accuracy and improved training performance on GPUs. XGBoost has recently added a new kernel for learning to rank (LTR) tasks. It provides several algorithms: pairwise rank, lambda rank with NDC or MAP. These GOU kernels enables 5x speedup on LTR model training with the largest public LTR dataset (MSLR-Web). We have integrated Spark XGBoost with RAPIDS cudf library to achieve end-to-end GPU acceleration on Spark 2.x and Spark 3.0. We achieved a significant end-to-end speedup when training on GPUs compared to CPUs. Accelerated XGBoost turns hours of training into minutes with a relatively lower cost. We will share our latest benchmark results with large datasets including the publicly available 1TB Criteo click logs.

Data science workflows can benefit tremendously from being accelerated, to enable data scientists to explore more and larger datasets. This allows data scientist to drive towards their business goals, faster, and more reliably. Accelerating Apache Spark with GPU is the next step for data science. In this talk, we will share our work in accelerating Spark applications via CUDA and NCCL.

We have identified several bottleneck in Spark 2.4 in the areas of data serialization and data scalability. To address this we accelerated Spark based data analytics with enhancements to allow large columnar datasets to be analyzed directly in CUDA with Python. The GPU dataframe library, cuDF (, can be used to express advanced analytics easily. Through applying Apache Arrow and cuDF, we have achieved over 20x speedup over regular RDDs.

For distributed machine learning, Spark 2.4 introduced a barrier execution mode to support MPI allreduce style algorithms. We will demonstrate how the latest Nvidia NCCL library, NCCL2, could further scale out distributed learning algorithms, such as XGBoost.

Finally, an enhancement of Spark kubernetes scheduler will be introduced so that GPU resources can be scheduled from a kubernetes cluster for Spark applications. We will share our experience deploying Spark on Nvidia Tesla T4 server clusters. Based on the new NVIDIA Turing architecture, the T4, an energy-efficient 70-watt small PCIe form factor GPU, is optimized for scale-out computing environments and features multi-precision Turing Tensor Cores and new RT Cores.