Catalyst Optimizer

At the core of Spark SQL is the Catalyst optimizer, which uses advanced programming language features (e.g., Scala’s pattern matching and quasiquotes) to build an extensible query optimizer. Catalyst is based on functional programming constructs in Scala and was designed with two key purposes:

Easily add new optimization techniques and features to Spark SQL
Enable external developers to extend the optimizer (e.g. adding data source specific rules, support for new data types, etc.)

Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework, it has libraries specific to relational query processing (e.g., expressions and logical query plans) and several sets of rules that handle different phases of query execution: analysis, logical optimization, physical planning, and code generation, which compiles parts of queries to Java bytecode. For code generation, it uses another Scala feature, quasiquotes, which makes it easy to generate code at runtime from composable expressions. Catalyst also offers several public extension points, including external data sources and user-defined types. In addition, Catalyst supports both rule-based and cost-based optimization.
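The tree-and-rule idea can be illustrated with a small, self-contained sketch. This is a toy model written in Python for readability, not Catalyst's actual Scala API: the class names (`Literal`, `Attribute`, `Add`), the helper `transform_up`, the two rules, and the `optimize` loop are all hypothetical names chosen for this example. It assumes a rule is a function that returns a replacement node or `None`, and that rules are rerun until the tree stops changing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class Expr:
    """Base class for nodes in a toy expression tree (hypothetical, not Catalyst)."""


@dataclass(frozen=True)
class Literal(Expr):
    value: int


@dataclass(frozen=True)
class Attribute(Expr):
    name: str


@dataclass(frozen=True)
class Add(Expr):
    left: Expr
    right: Expr


# A rule returns a replacement node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def transform_up(node: Expr, rule: Rule) -> Expr:
    """Apply `rule` bottom-up: rewrite the children first, then the node itself."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    replaced = rule(node)
    return node if replaced is None else replaced


def fold_constants(node: Expr) -> Optional[Expr]:
    """Rule: replace an Add of two literals with a single literal."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


def drop_zero(node: Expr) -> Optional[Expr]:
    """Rule: simplify x + 0 and 0 + x to x."""
    if isinstance(node, Add):
        if node.left == Literal(0):
            return node.right
        if node.right == Literal(0):
            return node.left
    return None


def optimize(tree: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Run every rule over the tree repeatedly until nothing changes."""
    for _ in range(max_iterations):
        new_tree = tree
        for rule in rules:
            new_tree = transform_up(new_tree, rule)
        if new_tree == tree:  # fixed point reached
            return tree
        tree = new_tree
    return tree


# The expression x + (1 + 2)
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(tree, [fold_constants, drop_zero])
print(optimized)
```

Each rule only describes the local pattern it cares about, and the traversal helper handles the rest of the tree. That separation is what makes it cheap to add a new optimization: a new rule is one more small function in the list.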
