Data + AI Summit 2022

Apache Spark AQE SkewedJoin Optimization and Practice in ByteDance

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  •  Moscone South | Level 2 | 215

Duration

  • 35 min

Overview

Almost no distributed computing system can avoid data skew. If skew is left unhandled, long-tail tasks seriously slow down a job or even cause it to fail.
In this talk, we introduce how Spark AQE handles SkewedJoin and how we optimized the implementation for the workloads at ByteDance.
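
As background, the open-source Spark documentation describes a partition as skewed when its size exceeds both a multiple of the median partition size (`spark.sql.adaptive.skewJoin.skewedPartitionFactor`) and an absolute threshold (`spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`). The following is a minimal sketch of that rule only, not ByteDance's modified implementation; the function name `find_skewed_partitions` and the sample sizes are hypothetical, and the default values are assumed to match those documented for Spark 3.x.

```python
from statistics import median

MB = 1024 * 1024


def find_skewed_partitions(sizes, factor=5.0, threshold_bytes=256 * MB):
    """Return the indices of partitions that count as skewed.

    Hypothetical helper: a partition is flagged only when it is larger than
    factor * median partition size AND larger than the absolute threshold.
    """
    if not sizes:
        return []
    med = median(sizes)
    return [
        i for i, size in enumerate(sizes)
        if size > factor * med and size > threshold_bytes
    ]


# Illustrative shuffle partition sizes (made-up numbers, in bytes).
sizes = [64 * MB, 80 * MB, 70 * MB, 2048 * MB, 75 * MB]
skewed = find_skewed_partitions(sizes)
print(skewed)
```

Because the rule depends entirely on the reported partition sizes, inaccurate size statistics can hide a skewed partition, which is the identification problem the first main point of the talk refers to.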



The main points are as follows:

• Increase the accuracy of the statistics so that previously undetected data skew can be identified, and address the risks that this increase brings

• Improve the logic that splits skewed data to obtain a better optimization result

• Support more complex optimization scenarios than the community implementation, covering essentially all SkewedJoin scenarios
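
To make the idea of splitting skewed data concrete, here is a minimal sketch of one generic approach: greedily grouping consecutive map-output blocks of a skewed partition into chunks close to a target size. This is an illustration under stated assumptions, not the logic presented in the talk and not Spark's exact algorithm; the function name `split_by_target_size`, the block sizes, and the target are all hypothetical.

```python
def split_by_target_size(block_sizes, target_size):
    """Group consecutive blocks into chunks of roughly target_size.

    Hypothetical helper. Returns a list of (start, end) index ranges,
    with end exclusive, that together cover every block exactly once.
    """
    ranges = []
    start = 0
    current = 0
    for i, size in enumerate(block_sizes):
        # Close the current chunk if adding this block would exceed the target.
        if current > 0 and current + size > target_size:
            ranges.append((start, i))
            start = i
            current = 0
        current += size
    # Emit the final, possibly smaller, chunk.
    if start < len(block_sizes):
        ranges.append((start, len(block_sizes)))
    return ranges


# Illustrative per-mapper block sizes for one skewed partition (made-up units).
blocks = [30, 40, 50, 20, 90, 10]
chunks = split_by_target_size(blocks, target_size=100)
print(chunks)
```

In a skew join, each resulting range would typically be processed by its own task and joined against a full copy of the matching partition from the other side, so how evenly the chunks come out directly affects how much of the long tail is removed.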



As of February 2021, the Spark AQE SkewedJoin optimization covered 13,000+ Spark jobs per day at ByteDance, and the average performance of the optimized jobs improved by 35%.

Session Speakers

Liu Thomas

Software Engineer

ByteDance
