Data + AI Summit 2022

Spark Data Source V2 Performance Improvement: Aggregate Push Down

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  • Moscone South | Level 2 | 205

Duration

  • 35 min

Overview

Spark applications often need to query external data sources, such as file-based or relational data sources. To support this, Spark provides Data Source APIs for accessing structured data through Spark SQL.

The Data Source APIs support optimizations such as filter push down and column pruning, which improve query performance by reducing the amount of data that must be processed. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up Spark jobs by reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down for both JDBC and Parquet.
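To make the idea of partial aggregate push down concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use the Data Source V2 API; the partition data and the function names `source_partial_aggregate` and `engine_final_aggregate` are hypothetical, chosen only for illustration. Each partition of the source computes a partial (sum, count) per key, and the engine merges those partials into the final result, so fewer rows cross the boundary between source and engine.

```python
from collections import defaultdict

# Hypothetical data: each inner list stands for one partition held by the external source.
PARTITIONS = [
    [("a", 10), ("b", 5), ("a", 7)],
    [("b", 1), ("c", 4)],
    [("a", 3), ("c", 6), ("c", 2)],
]


def source_partial_aggregate(partition):
    """Runs at the data source: one (sum, count) pair per key in this partition."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in partition:
        partial[key][0] += value
        partial[key][1] += 1
    return {key: (s, c) for key, (s, c) in partial.items()}


def engine_final_aggregate(partials):
    """Runs in the query engine: merges the partial results into SUM, COUNT and AVG."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: {"sum": s, "count": c, "avg": s / c} for key, (s, c) in merged.items()}


partials = [source_partial_aggregate(p) for p in PARTITIONS]
result = engine_final_aggregate(partials)

# Rows that would cross the source/engine boundary in each case.
rows_without_pushdown = sum(len(p) for p in PARTITIONS)
rows_with_pushdown = sum(len(p) for p in partials)
```

Note that AVG is not merged directly: the source returns a sum and a count, and the engine divides at the end. This is why the push down is called partial, since the final aggregation step still runs in the engine.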

Session Speakers

Huaxin Gao

Software Engineer

Apple

DB Tsai

Software Engineering Manager

Apple
