Juliusz Sompolski joined Databricks in January 2017, as one of the founding software engineers of Databricks Amsterdam European Development Centre. He is working on optimizing the SQL performance of Databricks Runtime. Most recently, he concentrates on performance of Business Intelligence workloads.
October 15, 2019 05:00 PM PT
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.