Suganthi is a software engineer at Facebook, Menlo Park. She is part of the data infra team focusing on the Spark core and SQL engine. Before Facebook, she worked at Google, NetApp, and Ericsson. She received her Master's in Computer Science from North Carolina State University.
November 18, 2020 04:00 PM PT
Uneven distribution of input (or intermediate) data can often cause skew in joins. In Spark, this leads to very slow join stages where a few straggling tasks may take forever to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, data skew exacerbates runtime latencies further, to the order of multiple hours and even days. Over the course of the last year, we introduced several state-of-the-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40% and expanded Spark adoption for numerous latency-sensitive pipelines. In this talk, we'll take a deep dive into Spark's execution engine and share how we're gradually solving the data skew problem at scale. To this end, we'll discuss several Catalyst optimizations around implementing a hybrid skew join in Spark (one that broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach of extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco, Facebook's petabyte-scale shuffle service (https://maxmind-databricks.pantheonsite.io/session/cosco-an-efficient-facebook-scale-shuffle-service).
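To make the hybrid idea concrete, here is a minimal pure-Python sketch of the split the abstract describes: keys above a frequency threshold are joined via a broadcast-style hash map, while the remaining keys are hash-partitioned and joined per partition. All names (`hybrid_skew_join`, `skew_threshold`, `num_partitions`) are illustrative assumptions, not Spark's or Facebook's actual implementation.

```python
# Illustrative sketch only -- NOT Facebook's or Spark's implementation.
# Skewed keys take a broadcast (hash-map) path; everything else takes a
# hash-partitioned "shuffle" path, mimicking a hybrid skew join.
from collections import Counter, defaultdict

def hybrid_skew_join(left, right, skew_threshold=3, num_partitions=4):
    """Inner-join two lists of (key, value) pairs, mitigating skew."""
    # 1. Identify skewed keys from left-side key frequencies
    #    (in the talk's adaptive variant, this happens at runtime).
    freq = Counter(k for k, _ in left)
    skewed = {k for k, count in freq.items() if count > skew_threshold}

    # 2. Broadcast path: collect right-side rows for skewed keys into one
    #    hash map, so skewed left rows never need to be shuffled together.
    broadcast = defaultdict(list)
    for k, rv in right:
        if k in skewed:
            broadcast[k].append(rv)
    out = [(k, lv, rv) for k, lv in left if k in skewed
                       for rv in broadcast[k]]

    # 3. Shuffle path: hash-partition non-skewed left rows, then probe
    #    each partition's hash table with the matching right rows.
    parts = [defaultdict(list) for _ in range(num_partitions)]
    for k, lv in left:
        if k not in skewed:
            parts[hash(k) % num_partitions][k].append(lv)
    for k, rv in right:
        if k not in skewed:
            for lv in parts[hash(k) % num_partitions].get(k, []):
                out.append((k, lv, rv))
    return out

# Key "a" appears 4 times on the left (> threshold), so it is skewed
# and handled on the broadcast path; "b" goes through the shuffle path.
left = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 5)]
right = [("a", "x"), ("b", "y")]
result = hybrid_skew_join(left, right)
```

The design choice mirrored here is the one the abstract names: broadcasting only the (uncorrelated) skewed keys keeps the broadcast small, while the non-skewed majority still benefits from an ordinary partitioned shuffle join.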
Speakers: Suganthi Dewakar and Tony Xu