Data + AI Summit 2022

Spark Data Source V2 Performance Improvement: Aggregate Push Down

On Demand

  • Session
  • Hybrid
  • Data Engineering
  • Intermediate
  • Moscone South | Level 2 | 205
  • 35 min


Spark applications often need to query external data sources, such as file-based or relational data sources. To support this, Spark provides Data Source APIs for accessing structured data through Spark SQL.

Data Source APIs apply optimization rules such as filter push down and column pruning, which reduce the amount of data Spark has to process and so improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down for both JDBC and Parquet.
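The idea behind partial aggregate push down can be illustrated with a small, Spark-free sketch. This is only a toy model under stated assumptions, not the Data Source V2 implementation: the `PARTITIONS` data and the three helper functions are hypothetical names invented for illustration. Each partition of the source computes a partial SUM and COUNT per key, and the engine merges those partials into the final result, so fewer rows cross the source boundary.

```python
from collections import defaultdict

# Hypothetical data: each inner list stands for one partition held by the
# external data source; rows are (group_key, value) pairs.
PARTITIONS = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
    [("c", 6), ("c", 7), ("b", 8)],
]


def scan_without_pushdown(partitions):
    """Source returns every row; all aggregation happens on the engine side."""
    return [row for part in partitions for row in part]


def scan_with_partial_aggregate(partitions):
    """Source returns one (key, partial_sum, partial_count) per key per partition."""
    out = []
    for part in partitions:
        acc = defaultdict(lambda: [0, 0])
        for key, value in part:
            acc[key][0] += value
            acc[key][1] += 1
        out.extend((k, s, c) for k, (s, c) in acc.items())
    return out


def final_aggregate(partials):
    """Engine-side merge of partial results into a final (SUM, COUNT) per key."""
    acc = defaultdict(lambda: [0, 0])
    for key, partial_sum, partial_count in partials:
        acc[key][0] += partial_sum
        acc[key][1] += partial_count
    return {k: (s, c) for k, (s, c) in acc.items()}


# Without push down: every raw row is transferred, each counted once.
raw_rows = scan_without_pushdown(PARTITIONS)
baseline = final_aggregate((k, v, 1) for k, v in raw_rows)

# With partial aggregate push down: only per-partition partials are transferred.
partials = scan_with_partial_aggregate(PARTITIONS)
pushed = final_aggregate(partials)

print(len(raw_rows), "rows transferred without push down")
print(len(partials), "rows transferred with partial aggregate push down")
print(pushed == baseline)
```

Both paths produce the same per-key result; only the number of rows handed to the engine differs. In Spark itself, as far as we are aware, this behavior is opt-in: the `spark.sql.parquet.aggregatePushdown` configuration for Parquet and the `pushDownAggregate` option for JDBC sources. Check the documentation for your Spark version before relying on either.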

Session Speakers

Huaxin Gao

Software Engineer


DB Tsai

Software Engineering Manager

