Cost-Based Optimizer in Apache Spark 2.2

Download Slides

Apache Spark 2.2 ships with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Leveraging these reliable statistics helps Spark to make better decisions in picking the most optimal query plan. Examples of these optimizations include selecting the correct build side in a hash-join, choosing the right join type (broadcast hash-join vs. shuffled hash-join) or adjusting a multi-way join order, among others. In this talk, we’ll take a deep dive into Spark’s cost based optimizer and discuss how we collect/store these statistics, the query optimizations it enables, and its performance impact on TPC-DS benchmark queries. Talk contributors: Zhenhua Wang (Huawei Technologies) and Wenchen Fan (Databricks) Session hashtag: #SFdd2

« back