Min Shen is a tech lead at LinkedIn. His team’s focus is to build and scale LinkedIn’s general purpose batch compute engine based on Apache Spark. The team empowers multiple use cases at LinkedIn ranging from data explorations, data engineering, to ML model training. Prior to this, Min mainly worked on Apache YARN. He holds a PhD degree in Computer Science from University of Illinois at Chicago.
May 26, 2021 12:05 PM PT
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, is processing PBs of data and billions of blocks daily in our clusters. With such a rapid increase of Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workloads efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
June 23, 2020 05:00 PM PT
Over the past 3 years, Apache Spark has transitioned from an experiment to the dominant production compute engine at LinkedIn. Within the past year, we have seen a 3X growth of daily Spark applications. Nowadays, it powers many use cases ranging from AI to data engineering, to analytics. 1000+ active Spark users launch 10s of thousands of Spark jobs on our clusters processing PBs of data on a daily basis. Throughout this journey, we have faced multiple challenges in scaling our Spark compute infrastructure and empowering our fast-growing users to develop working and efficient Spark applications: Remove the major infrastructure scaling bottlenecks by optimizing core Spark components such as shuffle and Spark History Server Balance between the limited compute resources and users' ever increasing compute demands by improving cluster resource scheduler Improve users' development productivity without falling deep into the 'support trap' by automating job failure root cause analysis Boost users' Spark jobs efficiency without hurdling their development agility that comes with repeated tuning of the jobs. In this talk, we will share the work we have done that tackles these challenges and what we have learnt during this process.