Catalyst: A Query Optimization Framework for Spark and Shark

Query optimization can greatly improve both the productivity of developers and the performance of the queries that they write. A good query optimizer is capable of automatically rewriting relational queries to execute more efficiently, using techniques such as filtering data early, utilizing available indexes, and even ensuring different data sources are joined in the most efficient order. By performing these transformations, the optimizer not only improves the execution times of relational queries, but also frees the developer to focus on the semantics of their application instead of its performance. Unfortunately, building an optimizer is a incredibly complex engineering task and thus many open source systems perform only very simple optimizations. Past research[1][2] has attempted to combat this complexity by providing frameworks that allow the creators of optimizers to write possible optimizations as a set of declarative rules. However, the use of such frameworks has required the creation and maintenance of special “optimizer compilers” and forced the burden of learning a complex domain specific language upon those wishing to add features to the optimizer. Instead, we propose Catalyst, a query optimization framework embedded in Scala. Catalyst takes advantage of Scala’s powerful language features such as pattern matching and runtime metaprogramming to allow developers to concisely specify complex relational optimizations. In this talk I will describe the framework and how it allows developers to express complex query transformations in very few lines of code. I will also describe our initial efforts at improving the execution time of Shark queries by greatly improving its query optimization capabilities.
[1] Graefe, G. The Cascades Framework for Query Optimization. In Data Engineering Bulletin. Sept. 1995.
[2] Goetz Graefe , David J. DeWitt, The EXODUS optimizer generator, Proceedings of the 1987 ACM SIGMOD international conference on Management of data, p.160-172, May 27-29, 1987, San Francisco, California, United States

Additional Reading:

  • Deep Dive into Spark SQL’s Catalyst Optimizer

    « back
  • About Michael Armbrust

    Michael Armbrust is committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization. [daisna21-speakers]