This week, many of the most influential engineers and researchers in the data management community are convening in-person in Philadelphia for the ACM SIGMOD conference, after two years of meeting virtually. As part of the event, we were thrilled to see the following two awards:
- Apache Spark was awarded the SIGMOD Systems Award
- Databricks Photon was awarded the Best Industry Paper award
We thought we would take this opportunity to discuss the background to this and how we got here.
What is ACM SIGMOD and what are the awards?
ACM SIGMOD stands for Association of Computing Machinery’s Special Interest Group in the Management of Data. We know, long name. Everybody just says SIGMOD. It is the most prestigious conference for database researchers and engineers, as many of the most seminal ideas in the field of databases, from column stores to query optimizations, have been published in this venue.
The SIGMOD Systems Award is given annually to one “system whose technical contributions have had significant impact on the theory or practice of large-scale data management systems.” These systems tend to have large-scale real-world applications as well as having influenced how future database systems are designed. The past winners include Postgres, SQLite, BerkeleyDB, and Aurora.
The Best Industry Paper Award is awarded annually to one paper based on the combination of real-world impact, innovation, and quality of the presentation.
Apache Spark’s Data and AI Origin
About a decade ago, Netflix started a competition called Netflix Prize, in which they anonymized their vast collection of user movie ratings and asked competitors to come up with algorithms to predict how users would rate movies. The $1m USD trophy would go to the team with the best machine learning model.
A group of PhD students at UC Berkeley decided to compete. The first challenge they ran into was that the tooling simply wasn’t good enough. In order to build better models, they needed a fast, iterative way to clean, analyze, process large amounts of data (that didn’t fit on a student laptop), and they needed a framework expressive enough to compose experimental ML algorithms on.
Data warehouses, which were the standard for enterprise data, could not deal with the unstructured data and lacked expressiveness. They discussed this challenge with another PhD student, Matei Zaharia. Together, they designed a new parallel computing framework called Spark, with a new innovative distributed data structure called RDDs. Spark enabled its users to run data parallel operations quickly and concisely.
Or put it differently, it’s fast to write code in and fast to run. Fast to write is important because it makes the program more understandable, and can be used to compose more complex algorithms easily. Fast to run means users can get feedback faster, and build their models using ever-growing data.
It turned out the students were not alone. These were the early days of data and AI applications in the industry, and everybody faced similar challenges. With popular demand, the project moved to the Apache Software Foundation and grew into a massive community.
Today, Spark is the de facto standard for data processing, and growing:
- It has been downloaded 45 million times last month, in PyPI and Maven Central alone. This represents a 90% year-over-year growth in downloads.
- It is used in at least 204 countries and regions.
- It is ranked the #1 in top paying technologies in Stack Overflow’s 2021 developer survey.
The SIGMOD Systems Award is a validation of the project’s adoption as well as its influence over the generations of systems to come to think of data and AI as a unified package.
Photon: New Workloads and Lakehouse
As Apache Spark grew in popularity, we found that organizations wanted to do more than large-scale data processing and machine learning with it: they wanted to run traditional interactive data warehousing applications on the same datasets they were using elsewhere in their business, eliminating the need to manage multiple data systems. This led to the concept of lakehouse systems: a single data store that can do large-scale processing and interactive SQL queries, combining the benefits of data warehouse and data lake systems.
To support these types of use cases, we developed Photon, a fast C++, vectorized execution engine for Spark and SQL workloads that runs behind Spark's existing programming interfaces. Photon enables much faster interactive queries as well as much higher concurrency than Spark, while supporting the same APIs and workloads, including SQL, Python and Java applications. We've seen great results with Photon on workloads of all sizes, from setting the world record in the large-scale TPC-DS data warehouse benchmark last year to offering 3x higher performance on small, concurrent queries.
Designing and implementing Photon was challenging because we needed the engine to retain the expressiveness and flexibility of Spark (to support the wide range of applications), never slower (to avoid performance regressions), and significantly faster in our target workloads. In addition, unlike a traditional data warehouse engine that assumes all the data has been loaded into a proprietary format, Photon needed to work in the lakehouse environment, processing data in open formats such as Delta Lake and Apache Parquet, with minimal assumptions about the ingestion process (e.g., availability of indexes or data statistics). Our SIGMOD paper describes how we tackled these challenges and many of the technical details of Photon's implementation.
We were thrilled to see this work recognized as the Best Industry Paper and we hope it gives database engineers and researchers good ideas about what's challenging in this new model of lakehouse systems. Of course, we have also been very excited about what our customers have done with Photon so far -- the new engine has already grown to a significant fraction of our workload.
If you are attending SIGMOD, drop by the Databricks booth and say hi. We would love to chat about the future of data systems together. In return, we will give you a “the best data warehouse is a lakehouse” t-shirt!