A Deep Dive into the Catalyst Optimizer

Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames and Datasets to streaming. At its core, Catalyst is a general library for manipulating trees. Based on this library, we have built a modular compiler frontend for Spark, including a query analyzer, an optimizer, and an execution planner. In this talk, I will introduce the core concepts of Catalyst by working through a few examples. I will also show how new and upcoming features are implemented using Catalyst. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
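
To make the idea of "a general library for manipulating trees" concrete, here is a minimal toy sketch in Python. It is not Catalyst's actual API (Catalyst is written in Scala); the names `Expr`, `Literal`, `Attribute`, `Add`, `transform_up`, and `fold_constants` are hypothetical and chosen only for illustration. It shows the general technique of an optimizer rule: a function that pattern-matches on a tree node and returns a rewritten node, applied bottom-up across an expression tree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class Expr:
    """Base class for nodes in a toy expression tree (not Catalyst's real classes)."""


@dataclass(frozen=True)
class Literal(Expr):
    value: int


@dataclass(frozen=True)
class Attribute(Expr):
    name: str


@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr


# A rule returns a replacement node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def transform_up(node: Expr, rule: Rule) -> Expr:
    """Rewrite the children first, then offer the rebuilt node to the rule."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    replacement = rule(node)
    return node if replacement is None else replacement


def fold_constants(node: Expr) -> Optional[Expr]:
    """Hypothetical rule: collapse the sum of two literals into one literal."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


# The expression x + (1 + 2), built as an immutable tree.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))

# Applying the rule rewrites the inner (1 + 2) and leaves x untouched.
optimized = transform_up(tree, fold_constants)
print(optimized)
```

In this sketch the tree is immutable and each rule is a small, independent function, so several rules can be composed or applied repeatedly without interfering with one another. That is the general shape a rule-based optimizer takes, under the assumptions stated above.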
