HomepageData + AI Summit 2022 Logo
Watch on demand

Apache Spark SQL Aggregate Improvement at Meta (Facebook)

On Demand


  • Session


  • Hybrid


  • Data Engineering


  • Intermediate


  • Moscone South | Upper Mezzanine | 155


  • 35 min
Download session slides


Aggregate (group-by) is one of most important SQL operations in data warehouses. It is required when we want to get aggregated insights from input datasets. Over the last year, we added a series of aggregate optimizations internally at Facebook Spark SQL, and we started to contribute back to Apache Spark recently.

(1).sort aggregate (SPARK-32461): add code generation to improve query performance, replace hash with sort aggregate when child is sorted, etc.
(2).object hash aggregate (SPARK-34286): adaptive sort-based fallback based on JVM heap memory usage during query execution.
(3).hash aggregate (SPARK-31973): adaptive bypass partial aggregate when aggregate reduction ratio is low.
(4).data source aggregate push down (SPARK-34960): aggregate push down to ORC data source by utilizing column statistics
(5).files statistics aggregate: aggregate output files (and all columns) statistics distributively when writing query output

we’ll take deep dive of above features and lessons learned.

Session Speakers

Cheng Su

Software Engineer

Meta (Facebook)

Shipra Agrawal

Software Engineer


See the best of Data+AI Summit

Watch on demand