Data + AI Summit 2022
Spark Data Source V2 Performance Improvement: Aggregate Push Down

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  • Moscone South | Level 2 | 205

Duration

  • 35 min

Overview

Spark applications often need to query external data sources, such as file-based or relational data sources. To support this, Spark provides Data Source APIs for accessing structured data through Spark SQL.



Data Source APIs apply optimization rules such as filter push down and column pruning, which reduce the amount of data Spark has to process and so improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down for both JDBC and Parquet.
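To make the idea of partial aggregate push down concrete, here is a minimal conceptual sketch in plain Python. It is not Spark's implementation and does not use the Data Source V2 API. As an assumption for illustration, two in-memory SQLite databases stand in for partitions of an external source, and the hypothetical helpers `make_source`, `avg_without_pushdown`, and `avg_with_partial_pushdown` stand in for the engine side. Each source returns partial `SUM` and `COUNT` values per group, and the engine merges them into the final average, so fewer rows cross the boundary.

```python
import sqlite3
from collections import defaultdict


def make_source(rows):
    """Hypothetical helper: an in-memory SQLite table standing in for one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def avg_without_pushdown(sources):
    """Pull every row from each source and aggregate on the engine side."""
    totals = defaultdict(lambda: [0, 0])  # region -> [sum, count]
    transferred = 0
    for conn in sources:
        for region, amount in conn.execute("SELECT region, amount FROM sales"):
            transferred += 1
            totals[region][0] += amount
            totals[region][1] += 1
    return {r: s / c for r, (s, c) in totals.items()}, transferred


def avg_with_partial_pushdown(sources):
    """Each source computes partial SUM and COUNT per group; the engine merges them."""
    totals = defaultdict(lambda: [0, 0])  # region -> [sum, count]
    transferred = 0
    query = "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region"
    for conn in sources:
        for region, part_sum, part_count in conn.execute(query):
            transferred += 1
            totals[region][0] += part_sum
            totals[region][1] += part_count
    return {r: s / c for r, (s, c) in totals.items()}, transferred


# Two partitions of made-up sample data (illustrative values only).
sources = [
    make_source([("east", 10), ("east", 20), ("west", 5)]),
    make_source([("east", 30), ("west", 15), ("west", 25)]),
]
full_result, full_rows = avg_without_pushdown(sources)
pushed_result, pushed_rows = avg_with_partial_pushdown(sources)
print(full_result == pushed_result, full_rows, pushed_rows)
```

Both functions produce the same averages, but the pushed-down version transfers one row per group per source rather than one row per record. The average is split into a sum and a count because partial averages cannot be merged directly, which is why the aggregation is described as partial.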

Session Speakers

Huaxin Gao

Software Engineer

Apple

DB Tsai

Software Engineering Manager

Apple
