ホームData + AI Summit 2022 のロゴ
Watch on demand

Apache Spark SQL Aggregate Improvement at Meta (Facebook)

On Demand

Type

  • Session

フォーマット

  • Hybrid

Track

  • データエンジニアリング

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 155

Duration

  • 35 min
Download session slides

概要

Aggregate (group-by) is one of most important SQL operations in data warehouses. It is required when we want to get aggregated insights from input datasets. Over the last year, we added a series of aggregate optimizations internally at Facebook Spark SQL, and we started to contribute back to Apache Spark recently.

(1).sort aggregate (SPARK-32461): add code generation to improve query performance, replace hash with sort aggregate when child is sorted, etc.
(2).object hash aggregate (SPARK-34286): adaptive sort-based fallback based on JVM heap memory usage during query execution.
(3).hash aggregate (SPARK-31973): adaptive bypass partial aggregate when aggregate reduction ratio is low.
(4).data source aggregate push down (SPARK-34960): aggregate push down to ORC data source by utilizing column statistics
(5).files statistics aggregate: aggregate output files (and all columns) statistics distributively when writing query output

we’ll take deep dive of above features and lessons learned.

Session Speakers

Cheng Su

ソフトウェアエンジニア

Meta (Facebook)

Shipra Agrawal

ソフトウェアエンジニア

Meta

Data+AI サミットの様子をご覧いただけます

Watch on demand