Guanzhong is a Software Engineer on Facebook's Big Compute team. At Facebook, he has worked on Cosco, a service for large-scale data shuffling, for half a year. Before Facebook, he worked at Google Ads and YouTube on large-scale data processing. He received his master's degree in computer science from Shanghai Jiao Tong University.
November 18, 2020 04:00 PM PT
Uneven distribution of input (or intermediate) data can often cause skew in joins. In Spark, this leads to very slow join stages where a few straggling tasks may take forever to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, data skew exacerbates runtime latencies further, to the order of multiple hours and even days. Over the course of the last year, we introduced several state-of-the-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40% and expanded Spark adoption for numerous latency-sensitive pipelines. In this talk, we'll take a deep dive into Spark's execution engine and share how we're gradually solving the data skew problem at scale. To this end, we'll discuss several Catalyst optimizations around implementing a hybrid skew join in Spark (one that broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach of extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco, Facebook's petabyte-scale shuffle service (https://maxmind-databricks.pantheonsite.io/session/cosco-an-efficient-facebook-scale-shuffle-service).
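The hybrid skew join described above can be illustrated with a minimal, Spark-free Python sketch. This is not the talk's actual implementation; the frequency cutoff `SKEW_THRESHOLD` and the function name are illustrative assumptions. Rows whose keys exceed the frequency threshold are joined via a broadcast-style hash lookup (avoiding repartitioning the heavy keys), while the remaining rows take a grouped path that stands in for a shuffle join.

```python
from collections import Counter, defaultdict

SKEW_THRESHOLD = 3  # illustrative cutoff: keys appearing more often are treated as skewed


def hybrid_skew_join(left, right):
    """Join two lists of (key, value) pairs, handling skewed keys separately.

    Keys that are frequent in `left` are joined against a broadcast-style
    hash map of `right`; non-skewed keys take a grouped (shuffle-like) path.
    A simplified sketch of the idea, not Spark's implementation.
    """
    # Identify skewed keys by their frequency on the probe side.
    freq = Counter(k for k, _ in left)
    skewed = {k for k, c in freq.items() if c > SKEW_THRESHOLD}

    # "Broadcast" side: an in-memory hash map of the right relation.
    right_map = defaultdict(list)
    for k, v in right:
        right_map[k].append(v)

    out = []
    # Skewed rows: direct hash lookup, no repartitioning of the heavy keys.
    for k, v in left:
        if k in skewed:
            out.extend((k, v, w) for w in right_map[k])

    # Non-skewed rows: group by key first, standing in for a shuffle join.
    left_groups = defaultdict(list)
    for k, v in left:
        if k not in skewed:
            left_groups[k].append(v)
    for k, vs in left_groups.items():
        for v in vs:
            out.extend((k, v, w) for w in right_map[k])
    return out
```

In a real Spark plan, the two paths run as separate join operators (broadcast-hash and sort-merge) whose results are unioned, so the skewed keys never pass through the shuffle.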
Speakers: Suganthi Dewakar and Tony Xu