Apache Spark SQL Aggregate Improvement at Meta (Facebook)
- Data Engineering
- Moscone South | Upper Mezzanine | 155
- 35 min
Aggregate (group-by) is one of the most important SQL operations in data warehouses. It is required whenever we want aggregated insights from input datasets. Over the last year, we added a series of aggregate optimizations to Spark SQL internally at Facebook, and we recently started contributing them back to Apache Spark.
(1) Sort aggregate (SPARK-32461): add code generation to improve query performance, replace hash aggregate with sort aggregate when the child is already sorted, etc.
(2) Object hash aggregate (SPARK-34286): adaptive sort-based fallback driven by JVM heap memory usage during query execution.
(3) Hash aggregate (SPARK-31973): adaptively bypass partial aggregation when the aggregate reduction ratio is low.
(4) Data source aggregate push down (SPARK-34960): push aggregates down to the ORC data source by using column statistics.
(5) File statistics aggregate: aggregate statistics for output files (and all their columns) in a distributed manner when writing query output.
We'll take a deep dive into the features above and the lessons learned.
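
To make item (3) concrete, here is a minimal Python sketch of the general idea behind adaptively bypassing partial aggregation. It is not Spark's implementation: the function names (`partial_sum`, `final_sum`) and the parameters `sample_size` and `min_reduction` are hypothetical, and the threshold values are chosen only for illustration. The map-side step aggregates a sample of rows, measures how much the sample shrank, and passes the remaining rows through unaggregated if the reduction is too small to be worth the hashing cost.

```python
from collections import defaultdict


def partial_sum(rows, sample_size=4, min_reduction=0.5):
    """Map-side SUM per key that gives up if aggregation barely reduces the rows.

    rows is an iterable of (key, value) pairs. sample_size and min_reduction
    are hypothetical tuning knobs for this sketch, not Spark settings.
    """
    buffer = {}
    seen = 0
    it = iter(rows)
    for key, value in it:
        seen += 1
        buffer[key] = buffer.get(key, 0) + value
        if seen == sample_size:
            # Fraction of sampled rows eliminated by grouping.
            reduction = 1.0 - len(buffer) / seen
            if reduction < min_reduction:
                # Low reduction: flush what was aggregated so far, then
                # pass the remaining rows through without aggregating.
                yield from buffer.items()
                yield from it
                return
    # Reduction was good enough (or input was short): emit the aggregates.
    yield from buffer.items()


def final_sum(partials):
    """Reduce-side SUM; correct whether or not the map side bypassed."""
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)


few_keys = [("a", 1)] * 6 + [("b", 2)] * 2
many_keys = [(i, 1) for i in range(8)]

few_partials = list(partial_sum(few_keys))    # aggregated on the map side
many_partials = list(partial_sum(many_keys))  # bypassed after the sample
```

The final aggregate produces the same result in both cases, so the bypass only changes how much work the map side does, not the query output.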