Spark + AI Summit 2020: Wednesday Morning Keynotes

Ali Ghodsi – Intro to Lakehouse, Delta Lake (Databricks) – 46:40
Matei Zaharia – Spark 3.0, Koalas 1.0 (Databricks) – 17:03
Brooke Wenig – DEMO: Koalas 1.0, Spark 3.0 (Databricks) – 35:46
Reynold Xin – Introducing Delta Engine (Databricks) – 1:01:50
Arik Fraimovich – Redash Overview & DEMO (Databricks) – 1:27:25
Vish Subramanian – Brewing Data at Scale (Starbucks) – 1:39:50

Realizing the Vision of the Data Lakehouse
Ali Ghodsi

Data warehouses have a long history in decision support and business intelligence applications. But, data warehouses were not well suited to dealing with the unstructured, semi-structured, and streaming data common in modern enterprises. This led to organizations building data lakes of raw data about a decade ago. But, they also lacked important capabilities. The need for a better solution has given rise to the data lakehouse, which implements similar data structures and data management features to those in a data warehouse, directly on the kind of low cost storage used for data lakes.

This keynote by Databricks CEO, Ali Ghodsi, explains why the open source Delta Lake project takes the industry closer to realizing the full potential of the data lakehouse, including new capabilities within the Databricks Unified Data Analytics platform to significantly accelerate performance. In addition, Ali will announce new open source capabilities to collaboratively run SQL queries against your data lake, build live dashboards, and alert on important changes to make it easier for all data teams to analyze and understand their data.

Introducing Apache Spark 3.0:
A retrospective of the Last 10 Years, and a Look Forward to the Next 10 Years to Come.
Matei Zaharia and Brooke Wenig

In this keynote from Matei Zaharia, the original creator of Apache Spark, we will highlight major community developments with the release of Apache Spark 3.0 to make Spark easier to use, faster, and compatible with more data sources and runtime environments. Apache Spark 3.0 continues the project’s original goal to make data processing more accessible through major improvements to the SQL and Python APIs and automatic tuning and optimization features to minimize manual configuration. This year is also the 10-year anniversary of Spark’s initial open source release, and we’ll reflect on how the project and its user base has grown, as well as how the ecosystem around Spark (e.g. Koalas, Delta Lake and visualization tools) is evolving to make large-scale data processing simpler and more powerful.

Delta Engine: High Performance Query Engine for Delta Lake
Reynold Xin

How Starbucks is Achieving its ‘Enterprise Data Mission’ to Enable Data and ML at Scale and Provide World-Class Customer Experiences
Vish Subramanian

Starbucks makes sure that everything we do is through the lens of humanity – from our commitment to the highest quality coffee in the world, to the way we engage with our customers and communities to do business responsibly. A key aspect to ensuring those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks that helps making decisions powered by data at tremendous scale. This includes everything ranging from processing data at petabyte scale with governed processes, deploying platforms at the speed-of-business and enabling ML across the enterprise. This session will detail how Starbucks has built world-class Enterprise data platforms to drive world-class customer experiences.

Talk Transcript

– On behalf of Databricks, welcome to The Spark and AI Summit 2020. My name is Ali Ghodsi, I’m the CEO of Databricks, and I’m looking forward to being your host, over the next couple of days. We’re excited to bring together 60,000 people, from around the globe, for what has quickly become one of the biggest data and AI events in the world. Now more than ever, Spark and AI Summit is truly a global community event. We’ve got an incredible lineup of speakers, content, and experiences planned for you this week. We have over 250 trainings, tutorials, lots of different sessions, both live and on demand, and I’m especially excited about our guest keynote lineup, featuring people like Nate Silver, from 538, Professor Jennifer Chayse, from UC Berkeley, Kim Hazelwood from Facebook, and Dr. Phillip Atiba Goff, co-founder of the Center for Policing Equity.

Also, please explore everything else, that this virtual summit has to offer, including; our developer hub, advisory lounge, and much more. You can find it all in the left nav, of your virtual summit dashboard. Also, I wanna take a moment to thank all the organizations, who are helping put this event on, we wouldn’t be here, and we couldn’t pull this off, without these important sponsors, please make sure to pay them a visit, in the Dev Hub of the Expo. With that, I want to spend some time talking about, why this event is so important to us. At Databricks, our mission has always been to help data teams solve the world’s toughest problems, this is a mission that we share, with the entire data community. Nobody understands the power and vast potential of data and AI, like this group, and this event is a great manifestation of that, it’s an incredible opportunity to learn from each other, share new ideas and move the industry forward, and today, I’m sure you’ll all agree, that solving the world’s toughest problems, has never felt more urgent, as an example, racial injustice continues to be a systemic issue, that is impacting the lives of so many. Recent events and protests around the world, remind us, that individuals, organizations and governments, must continue to do more, to bring awareness to racial injustice, and drive meaningful change. While it’s of course, so much more than a data problem, some organizations are using data science, to tackle some of its systemic roots, a great example is the Center for Policing Equity.

Thank you to our sponsors!

The Center for Policing Equity uses data science, to reduce the cause of racial disparities, in law enforcement, they use advanced analytics to diagnose gaps in policing, and help police build healthier relationships, with the communities they serve. I hope you’ll all attend the afternoon keynote sessions, especially the one where CPE co-founder, Dr. Philip Atiba Goff, will talk about the important role that data community can play, in addressing racial and social challenges. We are inspired by the work that the Center for Policing Equity is doing, and have decided to sponsor them, here at the Spark and AI Summit.

In addition to CPE, we’ve also decided to sponsor the NAACP Legal Defense and Educational Fund. The NAACP was formed over 100 years ago, and its Legal Defense and Educational Fund, has emerged as one of the country’s best legal organizations that fights for justice, on behalf of the African-American community. Through litigation, advocacy, and public education, LDF seeks structural changes, to expand democracy, eliminate disparities and achieve racial justice, so, in order to support both of these organizations, we’re setting an ambitious fundraising goal for this event. With your help, we want to raise $100,000, by the end of this event, we made it super easy for you to donate at any given time, by clicking ‘Donate Now’ button on your navigation panel, but I’m also really happy to share, that database will match all donations up to $100,000. So, I sincerely hope, you’ll consider making a contribution, it would be amazing if we, together, could raise $200,000 over the next few days.

When we talk about solving the world’s toughest problems, it’s also hard not to think about the disruptive impact of COVID-19.

The virus itself, is of course a massive problem, but so, are the many issues associated with mitigating its impact, not to mention; the lasting impact it will have on entire industries, on where and how we live, and the global economy, we are living in an unprecedented time.

What inspires me personally, and hopefully it inspires all of you, is the role that we can all play, in solving problems like the ones we face today, more than ever before, this is our time.

It’s the data community’s time, to unlock the full potential of data and AI. But why now? Well, one, because the world understands and acknowledges that potential more than ever before, two, because technology has evolved, and things are possible now, that were unimaginable only a couple years ago, and third, because as a data community, we’re more than ready, willing and capable, than we’ve ever been before.

As it relates specifically to the COVID crisis, it has been inspiring to see, how data teams are answering the call. As an example, Rush University Medical Center, is one of the top-ranked hospitals in the United States, they are doing data analytic services, for capacity-tracking, across multiple hospitals in Chicago. What stands out about their story, is how they were able to build this so quickly, across multiple hospitals, with multiple data teams, and multiple data sets, this probably wouldn’t have been possible a couple years ago. And while technology advancements have a lot to do with this, it also has a lot to do, with how projects like this are tackled, and at Databricks, what we’ve learned from working with organizations like Rush, is that data and AI is a team sport. Every company wants to be a data company, and if you think about it, what that actually means, it requires a new way of empowering people working with data, enabling them to organize around the data they need, collaborate, and get to the answers they need more quickly, it’s about the right people coming together, with the right data, as quickly as possible. Unfortunately, most organizations aren’t able to operate this way today, most data teams aren’t set up for success, because they’re stuck in silos, defined by closed and proprietary systems. They’re working with fractions of the data that they need, often incomplete or inconsistent, they spend more time chasing data, looking for latest versions, figuring out who can give them access, and not nearly enough time exploring, analyzing, or deriving value from it. So, in order to succeed, data teams need to evolve, they need to evolve from being separated by data, to being unified by data, agile, dynamic, connected to each other, so that they can move fast and do their best work, and that’s why our theme for this year’s conference, is “Data Teams Unite.”

Data is a team sport

But today, most data teams aren’t setup for success

It’s a call to unite around the data, unite to do our best work, and help others do theirs, unite to innovate faster, and to do so, at a time when the world needs it the most. In fact, we technically launched this theme, a couple months ago, with our Data Teams Unite Hackathon, in it, we asked data teams to come together, and innovate for social good by focusing on solutions, to help address the COVID-19 crisis, climate change, and social issues in their communities. As prizes, Databricks, we’ll be making donations to the charities of the winners’ choice, $5,000 will go to the third place, $10,000 to the second, and $20,000 to the first. We’ve received a number of great submissions, and I will announce the winners, in tomorrow morning’s opening keynote, please don’t miss it.

Data Teams Unite, also captures the spirit of this event, and of the open-source data community. Bringing data teams together, has long been the vision of Apache Spark, and it’s been really exciting, to watch the evolution of this community, in fact, this year, Apache Spark turned 10 years old, and there’s nobody better suited to talk about it, than the guy who created it, Matei Zaharia.

Matei, happy 10 years old!

This is a Special Year for Apache Spark

– Oh, thanks, Ali, this looks great! So, today’s a very special year for Apache Spark, Spark 3.0 is out, but it’s also 10 years, since the first open-source release of Spark, so I wanna talk to you today, about what got us here with the open source project, and what did we learn in the process, that we’re using to contribute to Apache Spark development in the future. And to do that, I’ll start by telling you my story with big data, how I got into this space, when I started my PhD at UC Berkeley. So, I started my PhD in 2007, and I was very interested in distributed systems, I had actually been working on peer to peer networks before this, but I wanted to try doing something different, and as I looked around at what’s happening in the industry, I got very excited about data center-scale computing, these large web companies, were starting to do computations on thousands of machines, and store petabytes of data, and do interesting things with them. And I wanted to learn more about these, and so, I actually worked with some of the big data teams, at Yahoo, and Facebook early on, and in working with these, I realized that, there’s a lot of potential for this technology, beyond web companies, I thought this would be very useful for scientific data sets, industrial data sets, many other things. But it was also very clear, that working with them, these technologies were too hard to use, for anyone who was not at a large web company, in particular, all the data pipelines to use these technologies, had to be written by professional software engineers, and also, all the technologies, were really focused on batch processing. There was no support for interactive queries, which is very important in data analysis, no support for machine learning and advanced algorithms. And when I came back to Berkeley, after spending some time working on these, I also found that there were local research groups, who wanted a more scalable engine for machine learning in particular, so I thought this was a great opportunity to try building something in this space, and, in particular, I worked early on with Lester Mackey, who was one of the top machine-learning students, at Berkeley, in my year, who was on a team, for the Netflix Price Competition, this million-dollar competition, to improve Netflix’s algorithm, and so, by seeing the kind of applications that he wanted to run, I started to design a programming model, that would make it possible for people like Lester, to develop these applications, and I started working on the Spark Engine in 2009. Now, once I started the work, the feedback on it was great, these machine-learning researchers at Berkeley, were actually using it, and so we decided to open source the project in 2010, and this is what the initial version of it looked like. So, it was a very small open source project, really just focused on, MapReduce-style computing, with a cleaner and faster API, but, the exciting thing is that, within two months, of us open sourcing it, we started getting Pull Requests, from other community members. So, there were really people picking this up, trying to use this very early project, and doing interesting things with it, and so, we spent the next two years, working with early users in the bay. Yeah, I spent a lot of time, actually visiting these companies, and organizing meetups, and trying to to make this a great platform, for large-scale data processing. And in working with these early users, we were blown away by some of the early use cases, people were doing things that we, as researchers, had never anticipated, and we thought that, there’s the potential to do a lot more here. So, for example, several of the early users, were powering interactive operational apps, like a user interface, using Spark on the back end, for example, there was a group of neuroscientists at Janelia Farm, that were monitoring data from the brains of animals, and other sensors, in real time, while they’re running neuroscience experiments, and building that up using Spark. That was something we had never anticipated, it’s obviously very cool, to be able to build these operational apps for exploring large datasets. Another interesting group, was from this startup company, Quantifying, that had built a product for its customers, to analyze social media and other data sources, and they actually were using Spark, to update data sets in memory, and implement streaming computation, this was before we added a streaming engine to Spark, so they had just built streaming computation on this platform. And finally, there were groups in the industry, for example, Yahoo’s data warehousing groups, that were running SQL queries over Spark, or our data science workloads, and other workloads, and they were opening up the Spark engines to tens, or even hundreds of more users, than we initially had been targeting. So, based on these use cases, and what Spark was able to do for these companies, we spent the next few years, really working on expanding access to Spark, making sure that anyone who works with data, can somehow connect to this engine, and be able to run computations at scale. And there were three major efforts there; the first one, was around programming languages, so we developed Python, R and SQL interfaces to Spark, that today, are by far, the most popular ways of using the engine. The second was around libraries, we wanted to have great built-in libraries, for machine learning, graphs and stream processing, and these provide a lot of the value of the platform today. And finally, we also got a lot of feedback on the API, what users found difficult, for example, how to set up mapInReduce, and other operators, to do a distributed computation, and we decided to make this big change, to the high-level API, called DataFrames, which runs most of the APIs in Spark, on top of the Spark SQL engine, so you get the same kind of optimizations, and query planning that a SQL engine will do, but in these easy-to-use programming language interfaces, and that quickly became the dominant way to use Apache Spark. So, if you look at Apache Spark today, these changes have all had a huge impact on the project, let’s start with support for Python.

Apache Spark Today: Python

Among the users of interactive notebooks at Databricks, where we can measure this, 68 of the commands, coming into the platform, are in Python. This is more than six times the amount of Scala, it’s by far the most widely used programming language, on Databricks today, and it means a wide range of software developers, are able to connect their code to Spark, and execute it at scale. Interestingly, the next most common language in notebooks, is SQL, there’s also a similar amount of commands, to the total that we have in notebooks, is just coming in through SQL connectors or directly, so that’s also very widely used.

Apache Spark Today: SQL

Next thing is SQL, so even when developers are you using Python or Scala, about 90% of the Spark API calls are actually running on Spark SQL, so these DataFrame APIs and other APIs, which means they benefit from all the optimizations, in the SQL engine, and just on Databricks alone, we’re seeing exabytes of data queried per day, using Spark SQL, as a result, the community has invested a lot in the Spark SQL engine, and making it one of the best open-source SQL engines on the planet. So for example, this block here shows the improvement, in performance on the TPC-DS benchmark, the leading benchmark for data warehousing workloads. And you can see with Spark 3.0, Spark is now running two times faster, than the previous version, and also about 1.5 times faster than Presto, very highly-regarded SQL engine, for large-scale data sets.

And another exciting piece of news, that happened with SQL this year, is that earlier this year, Alibaba has set a new record for the TPC-DS benchmark, using the Spark SQL engine, to exceed the cross performance of all the other engines, that have been submitted to this benchmark, so it’s really the basis of a very efficient, and powerful SQL engines today.

And the final one I wanna talk about, is streaming, just on Databricks alone, we see about 5 trillion records per day processed, with structured streaming, which makes it super easy, to just take a DataFrame or SQL computation, and turn it into a streaming one, and this number has been growing, by close to a factor of four, year over year, so, we’re seeing very fast adoption of this high-level Streaming API.

Major Lessons

Okay, so given these changes of the project, what are the major lessons that we learned? For me at least there were two big lessons. The first one is to focus on ease of use, prioritize that, for both data exploration and production, we found that a lot of the applications, that people build quickly became these are operational apps, or streaming apps, or repeated reports, and we wanted to add a lot of features into the engine, to make it easy for these to keep going, and to tell you if something changes, or something breaks, in a way that’s easy to fix, so a lot of the work in Spark now, is to make that super easy. The second big lessons we had, was to really design the system around APIs, that enable software development best practices, and integrate with the broad ecosystem. So, we designed all the APIs in Spark, so that they fit into standard programming environments, like Python and Java, and so that you can use best practices like composition of libraries, into an application, testability and modularity, and you can build packages that a user in your company can safely use, to do a computation, or have this wide open source community around them, and this is something where we’ve done a lot of improvements over time as well.

Apache Spark 3.0

Okay, so given these lessons, what’s happening in Apache Spark 3.0? This is our largest release yet, with over 3000 patches to the community, and it’s actually designed to be easy to switch to, from Spark 2.x, so we definitely encourage you to check it out, and switch to it when you can. This chart here shows where the patches have gone, and you can actually see almost half the patches, are in the Spark SQL engine, both for SQL support itself, and because it’s the underlying engine, for all these DataFrame API calls, so, this is the most active piece of development, but there’s, of course, a lot else going on as well. So, I just wanna highlight a few of the features that I’m excited about, focusing specifically on SQL and Python, but of course there’s quite a bit of other stuff going on, in 3.0, as well. And I’m gonna start with a major change, to the Spark SQL engine, the largest change in recent years, which is called Adaptive Query Execution. So, this is a change where the engine can actually update the execution plan for a computation at runtime, based on observed properties of the data, for example, it can automatically tune the number of reducers when doing an aggregation, or either join algorithms, or it can even adapt to skewing the data, to plan the computation as it’s going. and this makes it much easier to run Spark, because you don’t need to configure these things in advance, it will actually adapt and optimize based on your data, and also leads to better performance in many cases. So, to give you a sense of how it works, let’s look at a simple example, of setting the number of reducers. We found that today about 60% of clusters, the users tune the number of reducers in them. so that’s a lot of manual configuration, that you wanna eliminate and automate. And so, with AQE, what happens is, after Spark runs the initial phase of an aggregation, it actually observes their result size, you can have some number of partitions there, and it can set a different number of partitions for the reduce, for example, coalesce everything down, to five partitions, and optimize it for the best performance, based on what kind of data came out of that aggregation. Now, even more interesting things happen with joins, and more complicated operators. For example, when you’re joining two tables, even if you have high-quality statistics about the data, it’s hard to know how many records will end up before the join, and using AQE, Spark can actually observe this, after the initial stages of the join have happened, and then choose the join algorithm, that best optimizes performance downstream. And, in fact, it can adapt both to the size of the data, and to the skewer on different keys, so you don’t have to worry about treating skewed keys, in a special way anymore, and this results in really big speed ups on SQL workloads. So, for example, on TPC-DS queries, we’ve seen up to a factor of eight speed up with AQE, and even better, it means you can actually run Spark on a lot of data sets, even without big computing statistics, and get great performance, because it discovers the statistics as you go along.

3.0: SQL Engine

Spork’ 3.0: SOL Performance

Okay, that’s just one of the changes that affects both SQL usability and performance, there’s quite a bit more, so on the performance side, we have dynamic partition pruning, quake compile time speedups, and optimizer hints, and as I said earlier, these have led to our two x reduction in execution time, for TPC-DS, and a lot of improvements on real workloads. And finally, there’s been a big effort in 3.0, around SQL compatibility, in particular, an ANSI SQL dialect, that follows the standard SQL convention, in many areas of the language, and that makes it very easy to pull it in workloads from other SQL systems, and this is one of the areas that we’re continuing to invest in, but we’ve made significant strides in 3.0.

3.0: 3.0 Compatibility and Python Usability+Performance

Okay, so that’s a bit about SQL, the other thing that I wanted to highlight is around Python, so we’ve done quite a bit of work, on both Python usability and performance. For usability, we’ve made it much easier to define Pandas user-defined functions, using type hints in Python, which will actually specify the format of data you expect, so, in Spark, it’s easy to make a function that takes in batches of data, as pandas.Series, or DataFrames, for example, or even as an iterator of series, and you can just specify these with type hints. For comparison, the previous API requires you to write a lot of boilerplate code, in order to say what kind of input your function expects, and that was very error prone, so now I can go away, and it’s super straightforward, to define and use these functions. All right, there’s also been a lot of work on performance, using Apache Arrow, we’re seeing about 20 to 25% speedups, in Python UDF performance, using the latest enhancements in Apache Arrow, and also up to a 40 times speedup in Spark R, by using Arrow to exchange data within R and Spark, and these are all transparent to the end user, you’re just upgrade it and you get these speedups, and we have quite a few new APIs, for combining Pandas with Spark as well.

3.0: Other Features

And finally, there are features throughout the engine, including our new Structured Streaming UI, for monitoring computations, a way to define custom observable metrics, about your data, that you wanna check in streaming jobs, SQL Reference Guide, and a powerful new API for data sources. So, if you wanna learn more about Apache Spark 3.0, I invite you to check out Xiao Li’s talk at this summit, and many of the talks on these individual features.

Other Apache Spark Ecosystem Projects

And the Spark project itself, is not the only place where things are happening, there’s actually a bold community around it, and I wanted to highlight some of the changes. So, at Databricks, last year, we released Koalas, which is a Pandas API, that can run directly over Spark, to make it very easy to port workloads, and that’s evolved a lot in this year, I’ll talk about it. Of course there’s Delta Lake for reliable table storage, and we’ve also done quite a bit of work, to add Spark as a scale-out back end, in popular libraries including; Scikit Learn, Hyperopt, and Joblib, so if you use these for machine learning, you can just scale out your jobs on a Spark cluster. We’re also collaborating with Regeneron, to develop Glow, a widely-used library for large-scale genomics, and Nvidia has been developing Rapids, which provides a wide range of data science, and machine-learning algorithms, that you can call from Spark, that use GPU acceleration, and really speeds up these workloads. And finally, we’ve also done a lot of work on Databricks, on improving connectivity with both, commercial and open-source visualization tools, so that users can build interactive dashboards, using Spark as the back end.

. What is Koalas?

So, I’ll just dive into some of the changes in Koalas, specifically, if you’re not familiar with Koalas, it’s an implementation of the Pandas API over Spark, to make it very easy to port data science code, in this really popular library. We launched it actually a year ago, at Spark and AI Summit, and it’s already up to 850,000 downloads per month, which is about a fifth of the total downloads of PySpark, so we’re really excited, with how the community has adopted this library.

Announcing Koalas 1.0!

We’re investing quite a bit more into Koalas, and at this event, we’re actually excited to announce Koalas version 1.0. This new release has close to 80% of the API coverage of Pandas, it’s also quite a bit faster, thanks to the improvements in Spark 3.0, and it supports a lot of features that were missing before, including missing values, NAs, and in-place updates, and it’s also got faster distributed index type. It’s very easy to install Koalas on PyPI and get started, and if you’re a Pandas user, we believe this is the easiest way, to migrate your workloads to Spark.

So, that’s enough of me talking, we’d also like to show you demos of all these new things, and I’m really excited to invite Brooke Wenig, the machine learning practice lead at Databricks, to give you our demos of the new features, in Koalas and Spark 3.0.

Demo: Koalas 1.0 and Spark 3.0 Brooke Wenig, Machine Learning Practice Lead @ Databricks

– Hi, everyone! These are some crazy times, and while we’re all still social distancing, many of us are staying at home, trying to figure out, “what are we going to cook for dinner tonight?” I know my whole SF office, has gone through the banana bread and sourdough craze, but now we need some new recipes, to impress our co-workers with, on our food slack channel, in particular, we need to find a recipe for Matei, Matei is a very busy person, who loves to optimize everything.

Given the increase of people contributing recipes, we now have millions of recipes, that we need to analyze in a scalable manner, so that might take can make a data-driven decision, on what he’s going to prepare tonight. Along the way, we’ll explore some new features, in Spark 3.0 and Koalas, let’s get going!

Koalas Demo: Exploratory Data Analysis (EDA)

Like any data scientist, I’m gonna start off with some exploratory data analysis with Pandas. Pandas is a great way to get started with exploring your data, because it’s simple to use, has a great documentation, and a fantastic community. Let’s go ahead and visualize the subset of our data.

You can see here, that our recipes data contains the name of the recipe, when it was contributed, nutrition, ingredients, etc. However, we don’t just have one file, we have a whole directory of Parquet files, we need to read in, so let’s go ahead and load in our entire data set with Pandas.

Loading Data with Pandas

Unfortunately, though, Pandas wasn’t prepared to handle the rising recipe count, so if we allow this query to continue on, it’ll crash, trying to load in 30 gigabytes of data, onto a single machine, let’s instead cancel it.

Loading Data with Koalas

Now, let’s use Koalas to load in our data set instead, Koalas provides the Pandas-like syntax and features, that you love, combined with the scalability of Apache Spark. To load in our data set with Koalas, we simply replace the pd logic with ks, for Koalas, and now you can see just how quickly we can load in our entire data set with Koalas, but let’s say Matei’s pretty busy tonight, and he wants to find a recipe, that takes less than 30 minutes to prepare, you can see I’ve already written the Pandas code, to filter out for recipes that take less than 30 minutes, and then visualize the distribution, of the number of steps, these recipes take, to convert it to Koalas, I simply replace my panda’s DataFrame, with my Koalas DataFrame.

Visualizing Big Data at Scale

And voila! I no longer need to down-sample my data, in order to visualize it, I can finally visualize big data at scale.

We can see here the distribution of the number of steps, while most recipes take less than 10 steps, you can see the x axis extends all the way up to 100, there’s a universal muffin mix, that takes 97 steps to prepare, Matei won’t be cooking that one tonight.

In addition to our recipes data, we’re also interested in the ratings for those recipes. Because Koalas runs on Apache Spark under the hood, we can take advantage of the Spark SQL engine, and issue SQL queries against our Koalas DataFrame. You’ll see here that we have our ratings table, now we want to join our ratings table, with our recipes table, and you’ll notice, I can simply pass in the Koalas DataFrame, with string substitution, let’s go ahead and run this query.

Wow, we can see that this query took over a minute to run, let’s see how we can speed it up! I’m going to copy this query and move it down below, after I enable Adaptive Query Execution.

With Adaptive Query Execution, it can optimize our query plan, based on runtime statistics, let’s take a look at the query plan that’s generated, once it’s enabled.

I’m going to dig into the Spark UI, and take a look at the associated SQL query.

Here we can see that it passed on skew handling hint, to the sort-merge join, based off of these runtime statistics, and as a result, this query now takes only 16 seconds, for about a four x speed up with no code changes.

In addition to using Koalas to scale your Pandas code, you can also leverage Spark in other ways, to scale your Pandas code, such as using the Pandas function APIs. We’re going to use the Pandas function APIs, to apply arbitrary Python code to our DataFrame, and in this example, we’re going to apply a machine-learning model. After seeing that universal muffin mix take 97 steps, and 30 minutes, I’m a little bit skeptical, of the time estimates for these recipes, and I want to better understand the relationship, between the number of steps and ingredients, with the number of minutes. As a data scientist, I built a linear regression model, in our data set, using Scikit Learn, to predict the time that a recipe will take, now I want to apply that model in parallel, to all records of our DataFrame. You’ll notice here, just how easily I can convert our Koalas DataFrame, to a Spark DataFrame. To apply our model, I’m going to use the function called mapInPandas, mapInPandas accepts a function you want to apply, with the return schema of that DataFrame. The function I want to apply is called predict, predict accepts an iterator of Pandas DataFrames, and returns an iterator of Pandas DataFrames, I’m then going to load in our model. If your model is very large, then there’s high overhead, to repeatedly load in the model over and over again, for every batch in the same Python worker process. By using iterators, we can load the model only once, I just apply that to batches of our data, so, now let’s go ahead and see how well our model performs.

We can see here that the predicted minutes, is pretty close to the number of minutes for some recipes, this case, the true number minutes is 45, we predicted 46, or the true value is 35, we predicted 39, but in some cases, our time estimates are a bit off, and we can actually see in the description, “Don’t be dissuaded by the long cooking time.” So, Matei’s decided, he doesn’t want to cook anything tonight, he just wants to make a smoothie, with the shortest preparation time, so Matei, which recipe will you be preparing?

Looks like Matei will be preparing a berry berry smoothie, with only four ingredients, and a grand total of zero minutes. Wow, he’s really optimizing for speed! Stay tuned for a very delicious surprise. – Wow, thanks, Brooke! That was an amazing demo and the smoothie turned out, (gulps) delicious, it’s great!

O’Reilly “Learning Spark” 2nd Edition

One other announcement I wanted to make, that also ties in the book, is that we’ve been working to publish a new edition of “Learning Spark,” the popular book from O’Reilly Book, he’s one of the co-authors actually, and we’re giving away a free copy of the eBook to every attendee of the summit. So, we invite you to check this out, if you wanna learn about the new features in Spark 3.0.

What’s Next for the Apache Spark Ecosystem

Okay, and then the final thing I wanna end on today, is what’s happening next, in the Apache Spark Ecosystem.

If we step back, and look at the state of data and AI software today, it’s obviously made great strides over the past 10 years, but we still think that data and AI applications, are just more complex to develop than they should be, and we think that we can make that quite a bit easier, and we have a lot of ongoing efforts, in open-source Apache Spark, to do this, building on these two lessons, ease of use in exploration and production, and APIs that connect with the standard-board software ecosystem.

OSS Spark Development Initiatives at Databricks

So, I’m just going to briefly talk about three big initiatives that we are working on, at Databricks, for the next few versions of Spark., so the first one is what we’re calling project Zen.

Project Zen

The goal is to greatly improve Python usability, and Apache Spark, because it is the most widely used language, now, we wanna make sure that Python developers have a fantastic experience, that’s familiar with everything else that they do in Python. and there are quite a lot of areas, where we think we can improve the experience. We’ve called this project Zen, by the way, after the Zen of Python, the set of principles for designing Python, that have led to it being such an amazing environment today. So, some of the things we’re working on, are better error reporting, porting some of the API changes from Koalas, we got to experiment with them there, and we think they’re useful enough, that we want them to just be part of Spark, improve performance, and Pythonic API design for new APIs, I’ll give a few examples of what we’re doing with this next.

Adaptive Query Execution (AQE)

We also have a continued initiative, around Adaptive Query Execution, we’ve been really pleased with what it’s doing so far, and we think we can cover more and more, of the SQL optimizers decisions adaptively, and really dramatically reduce the amount of configuration, or preparation of the data needed, to get great performance with Spark. And finally, we’re also continuing to push on ANSI SQL, the goal is to run unmodified queries, from all the major SQL engines, by having dialects that match these, and we think we’ve been working a lot with the community, to build these, and we think it will provide a lot of benefits in Spark.

OSS Spark Development Initiatives at Databricks

Python Error Messages

Just to give you a couple of examples of the projects, and features, one of them is around error messages. So, if you run a Python computation today, and you have an error on your worker process, such as dividing by zero, you get this very scary-looking error trace in your terminal, lots of stuff going on, and if you look at it closely, there’s a lot of Java stuff, and maybe you can see part of the error trace, is actually about the problem that happened in your Python worker, which in this case, was division by zero, but it’s pretty unfriendly, especially if you’re just a Python developer, it’s hard to see exactly what was going on. So, as part of project Zen, we’re simplifying the behavior in Python, of a lot of common error type, so that if it’s a Python-only error, or SQL-planning error, you’ll see a small message, that lets you directly fix the problem. So you can see, this has just the Python relevant bits of that error, and you can see that there is a division by zero. Of course, this might be a bad example, because if you really accept project Zen, you will come to a much better understanding of emptiness, and you will probably never get divisions by zero again, but it’s just one example.

New Python Docs

And then, the second change I wanna show, is a new documentation site for PySpark, that’s designed to make it really easy to browse through, find Python examples, and it’s built on kind of the latest best practices for Python documentation, and we think this will make it easier to get started, for many of our users as well. So these are some of the changes, we’re really excited about what’s next for Apache Spark, and we look forward to seeing what you’ll do with it.

Back to you, Ali!

Thanks, Matei! (sighs) So, with Spark 3.0, we’ve now added broad support for SQL, we’ve also integrated it with Python, and in some sense, it’s really unifying now SQL analytics, or data warehousing, and data science, and that’s really exciting, because that’s where we think the future is going. In fact, our customers were unifying these two things for a while, and actually, a pattern has emerged, in which they call this the lakehouse paradigm, so I wanna talk a lot about that today. But before I dive into the details of the lakehouse, I wanna provide you some context of why we need it.

Data Warehouses were purpose-built

So, data warehouses were built 40 years ago, and they’re purpose built for BI reporting, and they worked really great for that, however, they don’t have any support for video data, audio data and text data, which is a lot of the kind of data that we have today. And as a result of that, it’s very hard to do data science or machine learning, on these datasets in the data warehouses. Also, they have limited support for real-time streaming, which is becoming increasingly important, and finally, they’re closed and proprietary formats. So, the data is either locked in, in a data warehouse, or you have to take it out, if you wanna operate on it with other tools. Therefore, today, most of the data, is actually first stored in a data lake or a blob store, and a subset of it is moved into a data warehouse, so let’s look at these data lakes.

Data Lakes

So, the data lakes, they’re great, in that they can handle all your data now, and you can do data science and machine learning on it, however, they’re bad at exactly the things, that the data warehouses were good at, so they have really poor BI support, they’re very complex to set up and configure, you need the Advanced Data engineers to do that, and then, once you start using them, if you try to do BI ever, on a data lake, you’re gonna find out that the performance is really, really, bad. Hence, as a result, what we’re seeing, is that a lot of these data lakes, have turned into unreliable data swamps, and actually, most organizations, unfortunately, have to have both of these things, side by side.

So, that is why we believe in the “Lakehouse Paradigm”. The lakehouse paradigm brings the best of both these worlds, you get the structure, you get the BI, and reporting advantages of data warehouses, and you also get the data science, machine learning, AI, and real-time capabilities of data lakes on all of your data.


In fact, the lakehouse is distinct from data lakes, it actually starts out like a data lake, so it looks at the bottom layer, like a data lake, but the crucial difference is that, it has something that we call a structured transactional layer. This layer brings quality, structure, governance, and performance, to data lakes. And at Databricks, we believe that open-source Delta Lake project, is a great way to get structural, transactional layer on top of your data lake, so I’m gonna talk a little bit about this next.

So, before I go into the details, of the Delta Lake open-source project, I wanna tell you a little bit about its history, and how it came about.

The Emergence of Data Lakes

So, Delta Lakes are built on top of data lakes, and as we know, data lakes, they’re great, they’re really cheap, in fact, they have 10 nines of durability, they can scale infinitely, they also can store now, all kinds of data, video, audio text, it can also store structured, semi-structured, and unstructured data in these, and finally, they’re based on open, standardized formats. In recent years, we’ve seen in particular, the Parquet format, becoming the standard in this ecosystem, and lots of different tools, can now directly operate on these Parquet files.

Enterprises spent millions building data lakes using Spark with the aspiration to do data science and ML!

So that’s awesome, and as a result of this, organizations have now been promised, that if they just build a data lake, they can get all these benefits, they can get real-time analytics, they can get data science, they can get machine learning, the can get data warehousing, reporting, BI, everything. So they’ve been very excited, they’ve build up these data lakes

The majority of these projects are failing!

Unfortunately though, most of these projects are actually failing, and at Databricks, we started to actually see lots of lots of demand for professional services, and we had to put our own solutions architect on these customers, to look at what’s the problem that they’re facing. And what we did is, we rank-ordered the problems that we have to address, with these customers, and I’m gonna walk you through each of these problems, so there’s nine of them, in the order of sort of severity that we saw, and then I’m gonna talk about how we address them, with the Delta Lake project.

Challenges with Data Lakes

Okay, so the first problem that we actually noticed, that is happening over and over again, is actually a very trivial, simple problem, but it’s actually that we were getting pulled in, and customers were saying, “We have a hard time appending new data into the data lake, “while at the same time correctly reading it.” Turned out some of the new data comes in, if you try to read it at the same time, because of eventual consistency, you might not see the results, so our solutions architects, were creating copies of the data in different directories, and switching it over when it was ready, needless to say, this was really, really complex and costly. Second, modification of existing data is really difficult with data lakes, and this is exacerbated, when companies are trying to satisfy GDPR or CCPA. So in GDPR and CCPA, if the customer says, “I don’t want you to store any records, “in your data sets, about me,” you have to go and scrub those and delete those. Well, that’s really hard with traditional data lakes, because they’re basically batch-oriented systems, so you have to run a big Spark job all over your data set, once a week, to scrub it to make it compliant, this was again, extremely costly. Third, some of our customers were having problems that were hard to diagnose, after we looked into the logs, and did diagnosis, it turns out that many years back, maybe some Spark job failed halfway through, some of the data that it was ingesting had made it in, the rest was missing, and as a result, the whole project was failing.

Four, real-time operations are really, really, hard, on data lakes, this is a special case of problem number one, but mixing real-time streaming with batch, lead to inconsistencies, in particular, if you’re appending new data, in batch operation, but at the same time in real time, you’re trying to read it, you’re gonna have problems. And this one is actually hard to solve, it’s not just to copy files or copy directories. Five, organizations wanna keep historical versions of their data, especially regulated environments, you have to reproduce your datasets, ’cause there’s auditing and governance that happens. So, here again, the poor man’s version was, “keep copies of all your data sets, “in different directories with different date names.” Very, very costly! Six, difficult to handle the metadata itself, now, as these data lakes were growing, the data itself was growing into petabytes, the metadata for these systems, was itself now becoming terabytes in size. And as a result of this, the catalogs, they’re storing the metadata, were falling over, they were not scalable, so we were pulled into the professional services, again, to fix that.

Seven, the “too many files” problems, because data lakes were file-oriented, you could end up in a situation, where you had millions of tiny files, or a few gigantic files, again, leading to lots of problems, when we’re trying to consume it. Eight, it was hard to get great performance, because you’re just dumping your data into the data lake, without thinking about how it’s gonna be used later, how do you know if the data layout, is sort of optimized, for the performance that you need later? And finally, the most important problem, in my opinion, but a hard one to actually know, was data quality issues. It was constant headaches, where customers would pull us in, and it turns out that the data was changing in a subtle way, or the semantics was changing over time, or all of it wasn’t there, so this was leading to all kinds of problems with these projects.

A new standard for building data lakes

So, at Databricks, we wanted to take an opinionated approach. We said, “We wanna build a system, that from the get go, “tries to get all these things right, “so that we don’t have to ever configure them, “or ever have to address these issues ever again.” So, the Delta Lake open-source project addresses this, by adding massive reliability, quality, and performance, to your data lake, and the way it does it, is by bringing the best of data warehousing, and data lakes to one place together.

And finally, it’s based on open source Parquet format, so you can keep all your data on your data lake, in this open-source format, with all these benefits. So how does it do this? So these were the nine problems that I mentioned earlier, and I’m gonna go through each of these, and mention, how we actually address this with the Delta project.

Nine Problems with Traditional Data Lakes

ACID Transactions

Well, luckily it actually turns out the first five problems, can all be addressed with the same technique, by having a transaction log, that stores all the operations that you’re doing, on your data lake, you can actually now make it fully atomic. That means that every operation that you do, on your data lake, either fully completes, or it aborts, and cleans up any residue, you can try it later. The way that actually happens, is that there is a Delta log, and it stores every operation, so, for instance, here we have an example, where one operation is adding two Parquet files, they will atomically be added, then after that, maybe that first file is deleted, and a third file is added. The key thing here, is that delta always makes sure, that each of these are happening, as if it was sequentially executed.

Okay, so that’s awesome, now we can make sure that one, if we’re appending new data into the system, it’s always either going to be there, or it’s going to be canceled, and you can read it at the same time, you won’t have any issues. Two, modifying existing data, if you want to get GDPR or CCPA compliance, actually is really easy now, because transactions enable you to do fine-grained up certs. So, you can go in and modify small data in your data set, in a transactional manner. Jobs failing midway, that also goes away, because the job either fully succeeds or fully is aborted, and then real-time operations is the same thing. All operations, whether real time or batch, are consistent with the transaction log. Now, finally, because we have all the deltas, of all the operations that happened on your data set, we can actually implement something called time travel. Time travel, lets you submit queries, and then say, “I want the answer to this as of the time two years ago,” and it gives you the response, as if the query was submitted two years ago, so this is great now for compliance and governance. Okay, so that’s awesome, now we solved those first five problems.

Spark Under the Hood

How do we deal with the metadata problem, the fact that metadata is getting large, and the catalogs are falling over? Well, luckily, it turns out that there is a project called Apache Spark, and it’s really, really good, at large-scale parallel operations. So, under the hood, we use Spark, and all our metadata is stored in Parquet format, right next to the rest of your Parquet files, that way, you can actually move your data from one place to another, by just copying the data. You don’t have to worry about catalogs that are out of sync with the data that you have, so that’s awesome. So, okay, now we can scale the system, with the Delta Lake project, to really large metadata. What about the “too many files” problems?


Well, the too “many file problems,” and the performance problems that we’re seeing, with data lakes, we now automatically fix those, because we automatically optimize the layout of your data. How do we do that? Well, a bunch of indexing techniques, one, we partition, of course, your data, but two, we also do data skipping. Data skipping enables you to prune the files, based on statistics that you have on numerals, so you know exactly what’s the max and the minimum, of every data that you have in the files, so that you can actually avoid reading a lot of the files, when you get a query. And then, finally, we have something called Z-ordering. Z-ordering actually enables you to optimize multiple columns at the same time, so if you have a year, month, date index, you now can actually search on just date, and it can be as fast as if you searched on year, which is hard with partitioning. Okay, so this is awesome, now we get all these advantages, that we had in data warehouses as it comes to performance, what about the last problem, the most important and the hardest problem, the quality issues?

Schema Validation and Expectations

Well, it turns out that we can actually solve that, with schema validation and evolution.

So, all data in Delta Tables have to adhere to a strict schema, can be a star schema, snowflake schema, but as soon as you have the schema, we make sure that all your data 100% adheres to this schema.

If any data appears, that’s not satisfying the schema that you have, we actually put it on the side, in a quarantine, that you can then look at, you can clean it up, and when you cleaned up the quarantine, the data makes it back into Delta, but, now you know, that all your data in a Delta Table is always pristine, and finally, we’ve actually added something called Delta Expectations. Delta Expectations is a novel way, in which you can actually express quality expressions, using SQL or user-defined functions, UDFs, where you can specify pretty much any business logic, for quality that you like, and now, we guarantee that your table satisfies all those quality metrics always.

Curated Data Lake

And with expectations and schema validations, the pattern that we’re seeing our customers use, in data lakes, is that they’re actually, now creating a curated data lake, the curated data lake now has bronze tables, where you might have the raw data coming in, and then that gets refined, into filtered, cleaned, and augmented tables that we call silver tables, and then, finally, the most optimized, business-level aggregated data sets, are in so called gold tables, and those are the ones that you’re really sort of serving into your BI needs.

Review of Solutions

So, zooming out, the Delta Lake project, solves the problems we had with the data lakes, using four techniques; first, it uses a ACID transactions, second, use a Spark under the hood, to get the scale for the metadata, third, it uses indexing, and lots of different indexing techniques, to get speed ups, that we traditionally used to have in data warehouses, and then, finally, it uses schema validation and expectations, to get the quality that we expect to have, if we’re using a data warehouse. We really want this to become the standard for how people create their lakehouses, so as a result of this, we’ve made this project work well, with all these other systems, so you can now actually read and write both from Delta, and Hive, and Spark, Presto, Amazon Redshift, Amazon Athena, and Snowflake. And creating data with Delta is really easy, in SQL, instead of saying using Parquet, you just now in SQL say, using Delta, and you’re just using Delta.

So, with Delta Lake, we’re able to fill in critical layer, of the modern lakehouse architecture, bringing structure and reliability to your data lakes, in support of any downstream-data use case, but from a performance perspective, indexing alone isn’t enough, for many analytic workloads that involves small queries, that need to be executed very fast, you need more.

High Performance Query Engine

You want a high-performance query engine, that can deliver the performance required for traditional analytic workloads. I’m pleased to announce that at Databricks, we’ve been working on a solution, that we’re calling Delta Engine. So, I wanna welcome on stage, Reynold Xin, co-founder and chief architect of Databricks, and he’s gonna tell us what Delta Engine looks like, under the hood.

Delta Engine

Thank you, Ali! I’m really excited to be talking to you today, about Delta engine, a high-performance query engine for data lakes.

Delta Engine builds on Apache Spark 3.0., and it’s fully compatible with Spark’s APIs. This includes Spark SQL APIs, Sparks DataFrame APIs, and by extension, the Koalas DataFrames API. It delivers massive performance, for SQL and DataFrame workloads, with three important components. The first component is improved query optimizer, the second component is a native vectorized execution engine that’s written from scratch in C++, for maximum performance, and the third component is a caching layer, that sits between the execution engine, and cloud Object Storage, for faster IO throughput.

Delta Engine’s Improved Query Optimizer

Delta engines improve query optimizer, extends Spark’s cost-based optimizer, as well as Spark’s adaptive query execution optimizer, with more advanced statistics, and this technique can actually deliver up to 18 times, performance increase, for star schema workloads.

Delta Engine’s Caching

Delta engine’s caching layer, again sits between the execution engine, and Cloud Object Storage, where we build our data lakes, and it can automatically chooses, what input data to cache for the user, and it’s not just a dump bytes cache, that caches the raw data, it transcodes data into a more CPU-efficient format, that is faster to decode and for query processing later, and with this, we can at fully leverage the throughput, of the local NVMe SSDs, in a lot of the cloud virtual machines these days. And this delivers up to five times scan performance increase for virtually all workloads.

Spark Native Execution Engine

For the rest of the talk, I want to focus on the Native Execution Engine, and walk you through some of our thought exercises, in designing and implementing this engine. Every time when we design a piece of software, for performance, it’s important to look at two aspects, the first aspect is while the hardware trends, how are hardwares changing, and where will hardware go in the future, and that’s important, because ultimately, the piece of software we write, will run on some piece of hardware. The second aspect are the workloads, what are the characteristics of the workloads, that we need to run for/on? And this is extremely important, because without knowing that, we don’t know what we’re optimizing for.


So let’s first take a look at hardware, for some of you that were here five years ago, in Spark Summit, 2015, you might remember a keynote slide, in which I compared the hardware spec from 2010s, compared with the 2015, and I did that along three dimensions; storage, network and CPU. And we found that, along the dimension of storage and network, which are basically IO throughput, the IO throughput has gone up by an order of magnitude, in those five years, so it’s gone up by 10x, where CPU clock frequency had largely remained the same, at around three gigahertz. So, as IOs are becoming faster and faster, more and more bottlenecks are happening on the CPUs.

So, in response to that, we launched project Tungsten in Apache Spark, which is aimed at substantially speeding up execution, by optimizing for CPU efficiency.

Hardware Changes since 2015

It’s been five years since 2015, let’s take a look at what has changed with hardware, in the last five years.

Hardware Changes since 2015

Interestingly, IO throughput has continued to go up, as a matter of fact, they went up by another 10x. These days it’s really not difficult for any of us, with a swipe of a credit card, to launch a new virtual machine, in any of their public cloud provider’s infrastructure, that can give you 16 gigabytes of NVMe SSD, and with 100 gigabit network, but CPU clock frequency remained the same, at around three gigahertz.

So, as every other dimension’s becoming faster and faster, the CPU become more of the bottleneck, even after five more years. And the first important question we asked ourselves, in designing the execution engine for Delta Engine, is how do we achieve the next-level performance, given CPU has continued to be the bottleneck?

Workload Trends

Now, the second aspect, workloads, in the past few years, it’s becoming more and more obvious, that businesses are moving faster and faster. As a result, data teams are given less and less time, in property modeling and preparing their data, if your business context is changing every six months, or every year, and you’re given sort of new business problems to solve, there’s really no point spending six months to a year, modeling your data, like how you would do it, back in the 90s, or 2000-era data warehouses. As a matter of fact, these days, most columns don’t have the constraints defined, strings are used everywhere, because strings are so easy to manipulate, and many of the columns, for example, such as date, or timestamp, are just stored as strings these days. And, unfortunately, a lot of this characteristics, of the new workloads and lack of data modeling, and not really benefiting the performance of the query execution, because many of the query engines are designed in an era, in which data are very well modeled. And we have all learned, “Hey, in order to actually achieve better performance, “this model that did it better” So, the second important question we asked ourselves, in designing the new execution engine is, “Can we actually get both agility and performance?” “Can we get great performance for very well modeled data, “but also see a pretty good, “or great performance for a not a so-well-modeled data?”


And, Photon is actually our answer to those two questions, it is a new execution engine for Delta Engine, and designed to accelerate Spark SQL workflows, it is built from scratch in C++, so we can get maximum control of how the underlying hardware would behave, in order to extract the maximum performance, and we really leveragely choose, the very important principles. The first is called vectorization, and the idea here is to exploit data-level parallelism, and instruction-level parallelism for performance. And the second is, we really wanted to design something that really worked very well for model data, and we spent a lot of time thinking about it, and so, making the habit to optimize, for not so well modeled data.

CPU 101

Before I explain to you how Photon works, it’s important for me to give you a refresher, of how modern CPUs work, so it consider it a CPU 101. While we found out from some earlier slides, the CPU clock frequencies are not getting higher, there are some dimensions of the CPUs they’re improving, and most of these are around the degree of parallelism, and the first is data-level parallelism, and the second, instruction-level parallelism.

For data-level parallelism, you might have heard of the term SIMD, and what SIMD does is, through a single instruction, now, a CPU is capable of processing multiple data points at once, and that degree of parallelism is measured, by what we call a SIMD register width.

When the concept SIMD first came out, with MMX, SSE instructions, in the 90s, a SIMD register width was 128 bits, and 128 bits register width mean, is that a single instruction is capable, for example, of processing 432 bit integers.

And when AVX2 came out, the SIMD register width, was actually doubled, and AVX-512 came out, which is the most recent SIMD instructions set, the register width, doubled again. So, a degree of parallelism from data-level parallelism, through SIMD instructions, has been doubling.

And the second part is instruction-level parallelism. What that means is, when the CPU receives the instructions to execute, it doesn’t actually execute them in the order, when they come in… or in any sequential order. As a matter of fact, the CPU will look ahead, in this what we call out-of-order window, which is the number of instructions, the CPU can look ahead for, and you’ll will figure out, for example, can the CPU automatically reorder the instructions, and sometimes run multiple instructions in parallel, if they don’t have dependency, and sometimes even running in it parallel, with dependencies speculatively, in order to actually increase the throughput, and performance of the saved software.

And as you can see on the slides, in the past few years, with sort of newer generation of Intel CPUs, like Sandy Bridge, Haswell, Skylake, the out-of-order window has also been increasing. And about 15 years ago, there was actually a seminal paper published, inside the CIDR conference, which one of the most prestigious database conferences, on their DB/X100, it’s written by Professor Peter Bonez, out of this institution called CWI, in the Netherlands. And this paper really, is the classic paper that detailed, how we can exploit the maximum performance, with the technical vectorization that focuses on getting; data-level parallelism and instruction-level parallelism. So, for the past two and half years, the Databricks engineering team have been working very closely with Professor Peter Bonez, in designing what a new vectorized execution engine will look like, in the 2020s. With all this new workload characteristics.

So, first I want to walk you through some of the techniques we have used in Photon, in order to exploit performance, and I’ll start with data-level parallelism.

Vectorization: Columnar In-memory Format

The first step in designing a vectorized execution engine, to exploit performance, is to think about the in-memory data format. Most naturally, humans think about data in a row-oriented format, that there’s the first row, and then there’s stated values in them, but in order to vectorization to work really well, we will transpose a row-oriented format, into a column-oriented format, in which case all the values for the same column, are laid out together, consecutively, and as shown on the screen here.

And one of the nice benefit of the this sort of column-oriented memory format, is when we actually do any type of compute, for example, in this case, I’m just showing you how to sum up the value of two columns, we could write a very simple tight loop here, just loop over the data and regenerate a new column.

And if you look at this very simple snippet of our code, which actually mirrors what the real code would look like, in a query engine, it’s very compact, and it accesses memory in exactly a sequential order, because it’s a column-oriented format. If the data is laid out in-memory, using a row-oriented format, then, every time when we go from one row to another, and what I assume increments, we’ll have to skip column two, which is the length of string, and that will actually lead to worse cache behavior and worse prefetching.

And the other benefit is, with such a simple code snippet, what we call sort of kernel, it’s very easier for the modern compilers, to be optimizing such simple loops, and you apply techniques such as loop unrolling. It’s also in most cases, that compilers can recognize this pattern, and generally SIMD instructions, automatically for us. To demonstrate to you what that means, for the SIMD instructions, I compiled this snippet of code, using sort of two different compilers; one without SIMD, and one with SIMD. The version without SIMD, you can actually see it’s pretty short, and that kind of the body of the loop, I’ve color-coded it, to show you the body of the loop.

And you don’t have to understand all the actual instructions being generated by the compiler, but you can probably tell in the SIMD instruction, well it’s significantly longer, mostly because of loop unrolling, the instructions started mostly with a V, and V here, basically means vectorized. It means for each one of those instructions, they are effectively processing four, or sometimes eight, or even 16 values at once, rather than compare with the left side there, just processing one value at once.

Exploiting Data Level Parallelism

And the performance difference of the SIMD, to leverage this kind of data-level parallelism, is actually massive, with a fully SIMD core, manages even end-to-end, so with a lot of the overhead of sort of the system set up, and all that, this sort of a simple add arithmetics, could generate four times speed up, so now we can process almost eight billion rolls per second, just on the single core, whereas the non-SIMD version could only process, even though there’s also native C++ code, can only process 2 billion.

Hashofable: Most Important Data Structure

Now data-level parallelism, SIMD, I want to move on now, to instruction-level parallelism. In order to explain that, I need to sort of bring up a slightly more complicated example. Hash tables. Hash tables are one of the most important data structures, in-memory, for data processing, it is used very frequently in aggregates, it is used in very frequently in joins. Here I’m just showing you a very simple example, as a matter of fact, overly simplified, that computes the sum for some values, grouped by a certain key.

And the way this would work, so very naturally, if I would ask you to just write this code, you probably write something similar to what I have on the screen, what you would do is you would create a hash table. And then you would loop over the set of your input data, and for each of the row, you would compute a hash bucket, and you would do a probe, to look at whether the hash bucket key, is the same as the current laws key, and they are the same, you would sort of aggregate the data value, into the hash table itself, so the hash table storing a bunch of partial sums.

And to simplify the cost template, I don’t even deal with key collision here. Now, if we look at this code, it turned out, most of the time are actually spent in the highlighted green part ht[Bucket]. And the reason is the nature of the hash table means, whenever you scan through some data, you’re gonna trigger a lot of random memory loads, that is the buckets are just random, in different random locations, and random memory loads, are actually very expensive for modern CPUs to do, because modern CPUs are really optimized for sequential fetches, even for in-memory data, because CPUs run so fast, if CPUs have to wait for some data to arrive in a random location in-memory, that wave is significantly longer, than just doing, for example, arithmetics.

When we profiled this, we actually found that, about two-thirds, so two-thirds of the time, for this code snippet, the CPUs are just waiting for data derived from memory, and the CPU is really spending only a third of the time, doing useful work.

How do we optimize the following?

So the question now is, how do we optimize this code? In order to optimize it, you kind of need to know, what’s wrong with it from a performance point of view, and I’ve color coded it, so the different parts of the inner loop for you, and then, we’ll soon realize it’s very obvious that, there’s a lot of different concerns done in a single loop. We are computing hash code, we’re doing key comparison, we’re doing some memory look-ups, to load data from memory, and we’re also doing additions or aggregates.

And when there’s such a very large loop body, this actually becomes a very unfriendly behavior, for the modern CPUs, and the reason is, if you remember why brought up earlier, that CPU says instruction-level parallelism has a constant out-of-order window, which is roughly around 200 instructions, the CPU can look ahead, to reorder or run in parallel. If the body of the loop is very, very long, the CPU could probably only look at either one or two loop iterations. Which means, at the maximum, the CPU can actually only reorder one or two iterations, and so, run two of the most expensive parts, of the computation, which is do a memory load at once, so only one there true.

So then, understanding that, now we might have a pretty simple solution to this problem, how do we make this snippet of code fast?

Even though this was little bit unintuitive, the way to make this loop fast, is to break one single fat loop, into multiple small loops. It’s unintuitive because we are now with multiple smaller loops, we actually running through the loop multiple times, it seems like more overhead, for example, just increment in I, but the reason this is interesting, if any better is the most extensive part right now, the green part, is its own loop, and the body of the loop is extremely simple. And with such a simple loop, the CPU can actually, in this out-of-order window, can predict, and loop that many, many loop bodies. And as a result, the CPU can see, “hey, I will need to fetch from,” for example, “these 12 different memory locations,” and it can launch all 12 memory fetches in parallel, instead of waiting for one to complete, and launch another one, it can actually launch all of them in parallel and do all the fetches at once, and this actually substantially speed up, the actual computation time of this code snippet. As a matter of fact, with these techniques, plus a number of other things, such as minimizing TLB misses with huge pages, we managed to significantly speed up, this core aggregation functions performance, in the Photon engine, and the amount of actual wasted memory stores have also significantly reduced.

Exploiting Instruction Level Parallelism

TPC-DS 30TB Queries/Hour (Higher is better)

So, we’ve talked about data-level parallelism, and instruction-level parallelism, but how this all goes, if we put it all together, to run through some traditional SQL queries, and we run the experiment just with Photon on and Photon off and the rest is just the Delta Engine. So it already includes a lot of the great query optimization improvements, that provides often some milder manual speed ups, and we can see we actually achieved a 3.3 times speed up, just at a physical execution, in the engine layer with the Photon engine.

So, so far we have talked about the techniques of vectorization, it’s not actually a new technique, as a matter of fact, a lot of other databases leverage it. Now one thing that’s very different, and we spend probably more time optimizing, for the modern hardware, but the other aspect is, we have spent a lot of time thinking about, how we can actually optimize for the more modern workflows, in which people want to move faster, and don’t have to always operate well modeled data.

Faster String Processing

I won’t have time to actually go into all of it, all of the tricks and techniques, that we have leverage to make string go faster, but I want to give you one example, so you can understand the (mumbles).

When we first re-rolled setup string functions, from sort of some code, that runs on the JVM, through native C++ code, we observed quite a bit of speed, for example, the upper function, we achieved almost 50% speed up, and the substring function was almost a three times speed up. But we’re gonna stop there, we asked ourselves, “Hey, can we do better?”

UTF-8 String Encoding

Now I have to introduce you to how strings are encoded, in order to understand, so one important technique we’ve used, and UTF-8 is actually, the most ubiquitous string encoding, different from a lot of other encodings, UTF-8 is a variable-length encoding. What it means, is a single character could take anywhere, from one byte all the way to four bytes, and I’m showing you some examples here, for example, character A is one byte, the copyright symbol is two bytes, the Chinese character Sheng is three bytes, and the pile of poo, which is a legitimate UTF-8 character, is four bytes.

The reason UTF-8 became popular, and the reason it’s designed with variable length encoding, is it’s great for memory, a lot of the data, as a matter of fact, probing most of the data, especially in the Western world, are just ASCII characters, they would all fit in one byte. So when using UTF-8, you would only be wasting one byte per character, whereas with a lot of other fixed length schema, you might have to use two bytes or three bytes always. But unfortunately, this variable-length encoding while it’s great for memory saving, because it’s very compact, it’s actually computationally very expensive. It just takes sort of the substring functions example, in the fixed-length encoding, the substring function simply returns the number of bytes. If we know for example, that every character is three bytes, we’ll just return the three times however many characters you want to return, but in a variable-length encoding, we have to look at each of the character, or each of the bytes one by one, to determine what the character boundaries are.

Fixed Length vs Variable Length Performance

Sometimes, some databases actually give the users, the option of defining why your character said this, because often, some users will only have ASCII character, but we feel that’s not very realistic for two reasons; one is, users don’t often actually do that, because they sometimes don’t even know there’s optimization, they can do by themselves manually, the second is, often data is actually, maybe say, a vast majority of your data is ASCII, but then, every once in a while, say, one row come in, it has a special symbol, and as a result, now you’re gonna corrupt your data, if you don’t declare a property of the character set. So, the real question is, given the fact that, most data, especially in the Western world, are just ASCII character set, but with the occasions of UTF-8 beyond ASCII, can we get both the savings of UTF-8 in-memory compactness, and the performance of ASCII?

Faster String Processing

Or I put it differently, can we only pay for the performance penalty of UTF-8 when necessary, when we actually have that data?

And we actually found a pretty interesting trick, that we’d leverage in this case, and the idea is actually, that we’ll separate any string processing into two steps, the first step is a fully SIMD, ASCII detection algorithm, that’s run on a string, and this sort of detection algorithm kernel, is optimized so well, they leveraged it. So if it’s handwritten with AVX, so the instructions, it can run at 60 gigabyte per second per core, so this is almost running at memory bandwidth, this is so fast as negligible overhead, in virtually all cases. And the second step is, depending on the result of the ASCII detection, if we realize a string can fit in, or the entirety of the string can fit in ASCII characters-set, we can run the special fixed-length version, of our string function, otherwise you run the variable-length version. And with this technique, we achieve an order of magnitude speed up, for most of our string functions, compared with the original JVM variant, and in many cases, even compared with a SIMD version, the C++ variant, when the data is for example, does fit in ASCII character set. And this is done without the users having to manually do anything, it just an automatic algorithm, for example, opera and substrings are significantly faster, but this is just one technique, I’ll just show you this one more example, which they’re regular expressions. It is used pretty frequently, when people are dealing with or massaging data, it’s very rare you will see regular expression showing up, in the actual benchmark, but we have found the use of it, in the wild so much, they have spent a lot of time optimizing it, as a matter of fact, regular expression in Photon, is more than four times faster, than regular expressions elsewhere.

Faster String Processing – Simpler Functions

Faster String Processing – Regex

Just to wrap up, Photon is a purpose-built, execution engine for maximum performance, it’s built in C++, it leverages vectorization, so we can take advantage of the data-level parallelism, and structure-level parallelism presented by modern CPUs. And it’s also really designed, to optimize for modern workloads, in which there is a lot of strings everywhere, and I’ve shown you some examples of that.

Delta Engine

And to conclude, Delta Engine delivers best in-class performance, with all the expressiveness of Spark. It has three important components; a query optimizer, an execution engine, and a state-of-the-art caching layer, both the query optimizer and the caching layer, is available now on Databricks, and you can actually contact us, to get access to the new execution engine, it’s called Photon.

And thank you, I’d like to pass the talk back to Ali.

– Thank you, Reynold, so this is great, this assumes you have your data in Delta Lake, how do you actually get it into Delta Lake, in the first place? I’d like to spend some time to talk about that. So at Databricks, we have something called the Data Ingestion Network. Data ingestion network, is a tightly integrated partnership, with things like ADF, and companies like Fivetran and Stitch. When you go inside Databricks, there’s something called Data Ingestion Network Gallery, when you click on that, you can now see an icon of all the different operational databases, that an organization might have their data in, you click on it, and it can now automatically transfer that data into Delta Lake, but we also wanna make it easier for you to get data that is in a blob store into Delta Lake.

Databricks Ingest: Data Ingestion Network

Databricks Ingest: Auto Loader

Before, all of this was manual, you would have to manually connect to a notification system, register that whenever files appear, please kick off a job scheduler, that then runs maybe a Spark job, that takes that data and converts it into Delta Lake. But what we’ve done, is that we have now something called the auto loader. The auto loader automatically connects, with the underlying notification systems, and figures out when new files appear, and automatically convert them into Delta Lake. So that’s awesome, now you have all your data in Delta Lake, but how do we actually keep this secure? So in Databricks, we provide fine-grained access control on top of Delta Lake, this is standard SQL based ACLs, you can now submit revoke and grant queries in SQL, you can create groups and add individuals to them, and it tightly integrates, with the cloud vendor’s Active Directory. That means you don’t need any passwords, or tokens, or certificates, the identity goes from the Active Directory, straight to Databricks, and then you access the files seamlessly.

Robust security controls

So let’s put all this together, so we talked a lot about how Delta Lake, solves a lot of this complexity around managing data, and how Delta Engine improves the performances of all your analytics workloads, but once you have all the data, there’s only a handful of analysts, who have the tools to make sense of it in an organization, zooming out, data teams are a small part of the equation. In most organizations, we have organizations like marketing, HR, finance, they don’t want to use Python, or SQL, or Scala. Today to consume data, the rest of the organization has to really jump through lots of hoops, if we want data teams to truly unite, we have to make it easier to consume that data, for the rest of the folks in these organizations. Recently, a really fascinating open-source project has emerged, that’s starting to solve these problems, Redash is an easy-to-use open-source dashboarding and visualization service, for data scientists and SQL analysts. With it, data teams are better able to democratize sharing data across them, within organizations. The community behind this project is absolutely amazing, this project started in 2013, just like Databricks, they have over 300 open-source contributors, and they’re really passionate, about democratizing access to data across organizations, and it’s working in a really, really big way.

Companies aspire for every team to be data driven Data analysts reach into many parts of an organization

Global adoption in production workloads

Every day, millions of users and thousands of organizations around the world, are using Redash to get insights, and make data-driven decisions, they’ve seen a ton of success, over 7000 deployments around the world, and the engineering team standing behind this great project, is absolutely fantastic, and they’ve built something great, that aligns completely with the Databricks values. So it’s my great pleasure to announce today, that Redash has joined Databricks, and we’re super excited to align these two open-source communities, the Spark community, with the Redash community. To tell us more about how this will work, it is my great pleasure to introduce Arik Fraimovich, the founder of Redash, Arik.

Arik Fraimovich Founder of Redash

– Thank you, Ali. I’m really excited to be joining both Databricks, and the larger open-source community of Spark. Seven years ago, I started Redash as a hackathon project, and I didn’t expect to be here today, sharing the story with such a large community. Back then, we just moved our data into data lake, and we needed somehow to use the data, so we tried some SQL workbench tool, and this was really missing collaboration. We realized that what we need, is some web application, where we can type in the query, get the results, visualize, and share them, and I decided to give it a try, that hackathon, and that’s what started Redash. Working with Databricks earlier this year, I realized how much alignment we have, on a lot of these core ideas that I had, when I started the project. Make it easy to collaborate with others around the data, and democratize data for all the teams, but beyond the product and vision alignment, it’s the amazing culture in Databricks, that really makes me and the team excited, about joining Databricks, but now let’s talk a bit about what Redash actually is.

Redash helps you make sense of your data

So, Redash gives you a SQL interface, to query your database in its natural syntax, but we give you some tools to make it easier, like the schema browser auto-complete and query snippets. Now once you have the data that you queried, you can visualize it with a wide variety of visualizations, and then group these visualizations into dashboards. You can also set up refresh schedule, so that you get the most fresh data without waiting, and when it comes to databases, Redash has you covered.

Query all of your SQL, NoSOL, big data, and API! data sources

So, Redash supports over 40 types of databases, data warehouses, data lakes and different types of APIs. So, it’s likely that whatever data source, your organization uses, you can connect it to Redash. Now, let’s check out how you can use Redash with Databricks, let’s say that I’m a data analyst in a SaaS company, and I’m looking to help the business understand; revenues usage, and how to drive conversions better, and reduce churn. We have at least three types of data; account data, payments data and usage data, but this data is coming from three different sources, accounts data is in the operational database, payments is with the payments processor, and only usage data is in our data lake. For many meaningful analysis of the data, we will need to join the data from all three sources., we can do this join in Redash, but this has limits and performance issues, what we really want is a single source with all the data, and this is exactly what we can achieve with a Delta Lake. Now offer data in a single location, and we can easily use the data together, to create more meaningful insights, and have the most fresh data, and now, with that intro, let’s check out the actual demo.

Redash Demo with Delta Lake and Databricks

Right, so here we have the Redash query screen, and let me type in a query, to just show you a bit how it works. So a handout a complete helping me be here, although it doesn’t save me from typos, now, here on the left, we have the schema browser, that shows us the tables and their columns. Now, we got the data back, we can look at it, but it’s much nicer to look at a visualization, it’s easier to understand this way, so let’s create one real quick.

So, here we see the trend, of our monthly recurring revenue over time, and we can see that it’s growing, but we don’t know why or how it’s growing, so it probably makes sense, to see a breakdown of what’s contributing for it.

So, let’s create a bit more complex query this time, so this one is a bit long, so I’m using a query snippet for it.

Now, a query snippet is a very simple construct, you have this keyword that triggers it, and you have a definition of what the snippet will be, and then when you trigger it, it gets inserted into your query. I got the query back, and this time I have much more information, but again, a visualization can really help understand it. So, here’s a breakdown of our monthly recurring revenue, and we can see that most of the time, it’s growing due to new monthly recurring revenue, but we had some successful campaigns around expansion, in a few months, and we have some regular expansion over time, and unfortunately, we also have some churn over time.

Now, from the same set of data, I can create some other visualizations like, again, the growth of monthly recurring revenue over time, some counters, and the average revenue per user over time. Next, sometimes would like to look at these data, at a daily level, so here we have another visualization, that’s showing us that a rolling window of 30 days, of the month in recurring revenue at that day, and that’s useful, when we want to track it in more real time fashion. For example, when we’re running some campaign, or we are aware of some issue, that we want to track more closely.

Once we have a few of these visualizations, we would like to create a dashboard, in these dashboards, we see different facets of our monthly recurring revenue, like this breakdown that we saw before, the growth over time, breakdown by different plan types, and breakdown by geography. Now, looking at the revenue is great, but we also want to help our customer success team, to help drive this revenue forward, so here we have a dashboard for our customer success team, that shows them two types of accounts. One type is promising trial accounts, that’s basically, accounts that are currently in their trial period, didn’t convert yet, but showing good indication of usage, so you might want to reach out to them, and help them switch over to a paid account. On the other hand, we have the paid accounts, that showing a decline in usage, where we would like to reach out to them, and see if we can help them get back on track, to help the customer-success agents, each row here is a link, to this other customer insights dashboard, and this dashboard is using a parameter, to filter out and show us only these accounts.

Here we can see some details about the account, like; their usage over time, type of account, the number of widgets they created, and we can definitely see the decline, and we can drill deeper to understand why it might be, or just reach out to the customer. Now, this was a really quick taste of Redash, we’ll have a more in-depth session, later today, that I recommend you checking out, thank you very much, and back to you Ali!

Thanks, Arik, I’m super excited about this integration.

Summary: Delta Lake, Delta Engine, Redash, MLflow and Unified Data Analytics

So, to zoom out in summary, we have Delta Lake, as a structured transaction layer on top of your data lake, it allows you to overcome a lot of the challenges, that data lakes have historically presented, by using ACID transactions, indexing, fine-grained control, and schema validation and expectations, so that’s Delta Lake project. Then, to ensure that you have all your analytic workloads, and that they can be executed really fast at low latency, we have on top of this, Delta Engine. We think Delta Engine with it’s vectorized query engine, query optimizations, custom connectors, combined with all the enhancements, that substantially improve performance and price, for analytic workloads, will make Databricks, the ideal platform to deploy Delta Lakes. And finally, all of this comes together in the ways that data is actually made useful, Redash is going to simplify dashboarding, for many analysts and data scientists, and tomorrow, we’re going to deep dive into how enhancements tto MLflow, the open-source end-to-end machine learning platform, will make Databricks workspace even better, and I’m really looking forward, to what we have to share there, the data workloads and the data platform, working together end-to-end, is the heart of Databricks Unified Data Analytics, and our vision for data teams.

Unified Data Analytics at Starbucks

To share how this vision is coming to life within Starbucks, to drive collaboration and innovation, it is my great pleasure to introduce Vish Subramanian, Director of data analytics and engineering. Vish, what’s brewing at Starbucks?

Thank you, Ali. Hello, all welcome to Spark and AI Summit, 2020. My name is Vishwanath Subramanian, I’m the director of Data and Analytics Engineering, at Starbucks. As Ali mentioned earlier, as we’re all navigating through these challenging times, we strive to make fact-based decisions with data, I’m here to talk to you about our data journey. At Starbucks, our mission is to inspire and nurture the human spirit, one person, one cup, one neighborhood at a time, you may be familiar with some of our 30,000 stores worldwide, that may be a part of your experience. We have over 30,000 employees who we call as partners, who ensure our standing, as one of the world’s most admired companies.

BRAND STRENGTHS: Exceptional Unmatched Coffee, Craft & Industry-Leading Customer Affinity, Consistency at Scale

In order to adhere to that quality of service and product, the key focus areas that we focused on, are making sure we have elevated partner and customer connections, breakthrough in beverage innovation, and also accelerating our customer engagement across channels, including digital, however, this story is driven by our ability to harness data, to power these customer experiences. Amazing teams across Starbucks, continue working on the next breakthrough innovation, to ensure we build the right customer experience, and an enduring brand appeal.

Data is the Fuel for this Success

These extend across our multiple domains, partners, products, store portfolios, and digital, there are tons of awesome teams working on big ideas, and enabling them to be brewed at scale. As much as it’s about the product, it’s about elevating customer connections and experience, that is driven by the convergence of data.

For example, on mobile, we use rapid A/B testing, to improve recommendations using reinforcement learning, models or the BrewKit platform. On drive thru, we cross sell modifications, using engines like (mumbles), on-store applications, we serve near real-time transaction data, with billions of data points, to internal reporting applications, so that actual store personnel, can monitor transactions near real time.

More importantly, data is extremely crucial, as we extend channels for delivery and new norms, in today’s new era. The ability to serve these use cases, is only possible through a sound data strategy, our data strategy and guiding principles, are built on three pillars; a single version of the truth, providing trusted store of information, across customer, product and stores, data analytics enablement, as well as trusted data, to ensure data quality, privacy, security, and the right data definitions of access to this data. How do we execute our strategy? We had to deal with several modernization challenges, on the architecture side, we were dealing with petabyte scale of data, and fragmentation across systems, or the ingestion and processing, the ability to process near-real-time data products, and a huge variety of sources.

Guiding Principles

Challenges Architecture for massive scale data/Al

The inability to implement updates, merges, on fast staging data in an optimal fashion, also non-optimal engineering experiences, with services that did not perform at scale, for example, taking a long time to provision or scale.

Challenges a

Also, on the consumption side, there was no real single source of truth, a lack of a unified user experience, and an impedance mismatch, between data and model development and operations, blocking, experimentations and reproducibility. To illustrate how we dealt with all these challenges, BrewKit was our zero-friction analytics framework, this is built on us strong foundation of Azure, Databricks, and Delta.

Zero Friction Analytics

BrewKit would be a unified analytics platform, to reduce our impedance mismatch between access to data, the ability to perform data science and operationalization. We wanted to make sure the smallest of teams, at Starbucks, had the ability to do at-scale, data science and data engineering with this framework. BrewKit, our massive scale Data Framework, has all the necessary capabilities in a box, to perform at scale, data science and data engineering, if we leverage multiple Azure services, and homegrown Starbucks engineering, to light up a functioning environment, multiple skews translate into data engineering, and data science templates, all this generated declaratively in a matter of minutes.

Brewkit: Massive Scale Data Framework

We have services like MLflow and Azure ML, that help our model management life cycles, all this back on storage services, such as Delta Lake, we use Azure service such as Azure keyboard, to help secure key management across different environments. From an infrastructure-essential standpoint, we have Databricks as the core engine for on demand compute, and the Notebook collaborative experience, helps teams collaborate and experiment at a rapid fashion.

The AI-serving layer is powered by services such as Cosmos DB, that help us enable internet scale, low latency survey. We also provide automation, and start the Notebooks, and templates across a variety of use cases, so that users can bootstrap quickly onto the environment. All activities are audited for governance reason, as well as return on investment.

Ingestion & Processing

Moving on to our second challenge, data and ingestion, and huge focus areas for teams has been to ingest data at scale, using custom Spark utilities, with massive parallelism, with inherent benefits of delta, such as ACID transactions, metadata handling, and schema enforcement, we could use services now like Azure Event Hubs, and Spark Structured Streaming, that helps us process millions of transaction at scale, per second, for long running pipelines, fault-tolerance is trivial now with structured streaming.

More importantly, Delta has now helped us build out, our historical data and live aggregations together, to make sure we are now giving our store partners, real-time insights on data, based on history and on current time. Consider pipelines that hold transaction data, that are cleansed, audited and processed, to be sent right back to the stores within minutes, to ensure that our partners have this latest data, to make decisions with, the ability to perform stateful and stateless aggregations, helps us tailor trade offs based on use cases.

Features like Compaction, Auto Optimization, help us now have the right partition keys, in our partition Delta Lake, to get us 50 to 100x performance gains, as well as storage optimization.

It’s crucial for us to support time travel, and versioned data, for data and model provenance.

Operations on data such as mergers, updates and deletes, help us build out our own anonymization frameworks.

We can also have logical separation on the Delta layer, with discovery and integration zone, where we can apply schema enforcement, based on the source of the data, as well as apply retention policies. This translates to a highly industrialized process, where we aggregate data across different zones, and load data from a variety of sources. Using the workspaces, clusters, and job APIs, and orchestration tools like Airflow.

We have predefined patterns for these ingestion processes, that help us define retention policies, as well as partition according to the use case.

Delta helps us with features, like either potency with merge operations, and with fast change data-capture use cases, time travel provides snapshot isolation, for fast changing tables, we also go on, we make sure we now go running an enterprise data lake, with the right permissioning and access control. Over all, the strategic view has been here, to commoditize data ingestion to such an extent, so that the teams can focus on business problems up the value chain, rather than focusing on moving data from point A to point B.


On the consumption side, which was the third challenge we had, we wanted to make sure your experience is as good, as you walk into one of our premier experiences, we sell over 50 plus workspaces on the enterprise, with different business units, with logical separations between them, this gives us flexibility to monitor, secure, and segregate these environments, based on the use case. We also run our own version of our metadata sync process, across these 50 plus environments, to sync across these workspaces at scale. So all users are looking at one governed version, of published data, with a single source of truth.

AI / ML Foundation

The successful data enablement, and BrewKit journey, has now lead us to focus on our AI/ML strategy, we have enabled convergence of data and enabled BrewKit, for the creation of models. In the past, our goals were to deploy environments, in a matter of minutes, to deploy data pipelines in a matter of minutes, which we have largely achieved, now our focus is now to democratize machine learning, and how do we enable our data science tests, to enable machine learning models in a matter of minutes? We have our framework AI Reserve, that we are in the process of deploying, with a goal to achieve exactly that.

Powered by MLflow and Azure services, AI Reserve will be guiding larger constructs in our enterprise, including a model marketplace.

This will be used in many use cases across the board, such as store operations, quality of service analysis, personalization experience and much more. The ML stack comes together with analytics out of the box, with our broken environments, users have a plethora of environments, that they can develop their code in. We can use version control, to deploy builds, based on integration with Azure services like Azure DevOps, based on use cases like on schedule or drift scenarios, users can use Azure ML or Mlflow, for their model lifecycle management, and all this, backed on a persistent data store on the Delta Lake.

We also make sure AI Reserve integration, allows our data scientists to containerize their solutions, and use Azure cognitive services, to essentially sell our REST APIs, which leads to the larger construct of the AI Reserve of model marketplace. Today’s real-time data along with third party data, is driving our decisions across the board, including store reopening strategies.

Unified Data + AI Solution

All this is being supported, by a unified data and AI solution, we now have petabytes of data located on Delta Lake, at massive scale, hundreds of data products built on Spark, an average deployment time pipelines and environments in minutes, over thousand plus pipelines, across engineering and user workspaces, and more importantly, these are all highly reliable and fault tolerant.

From a data team collaboration and productivity standpoint, this has been huge, the tooling is collaborative, we also now foster a culture of experimentation, and self-service, and maintain shared responsibility across our environments, and the focus is now on outcomes, and not technology or infrastructure.

Additional Talks

I would encourage everyone to check out additional talks, by Starbucks, at the summit, including how we operationalize our big data pipelines, as well as our machine learning.

Please be sure to visit, for latest news and initiatives. I hope you have a great Spark and AI Summit 2020, thank you very much, back to you Ali.

– Thanks, Vish, and I also want to thank all the speakers, and the demos today, and before you go, a few things I want to remind you of.


Please attend as many of today’s breakout sessions as you can today, tons of interesting content, and speakers can be found, visit the Dev Hub, and the Expo for live demos, and engage with our technology partners, network! This is a great opportunity to meet and learn from thousands of people like you. you can head to the advisory lounge, “Birds of a Feather Networking Experiences,” theater sessions, and my personal favorite, “Ask Me Anything” sessions, with Matei, and other industry leaders, share your thoughts, lessons learned, and selfies on social media.

Be sure to use our hashtag #SparkAISummit and #Datateams, and finally, please join us to support the Center for Policing Equity, and the NAACP Legal Defense and Education Fund, click donate now, in your dashboard, and Databricks will match all donations upto $100,000.

You can hear directly, from the founder of CPE, Dr. Philip Atiba Goff, along with Professor Jennifer Chayes, from UC Berkeley, and Nate Silver from 538, in this afternoon’s keynote session, you don’t wanna miss this, so be sure to join us at 1 p.m. pacific time.

I hope you’ve found today’s session informative, and I hope you enjoy the rest of Spark and AI summit 2020, thank you! (upbeat music)

Watch more Spark + AI sessions here
Try Databricks for free
« back
Ali Ghodsi
About Ali Ghodsi


Ali Ghodsi is the CEO and co-founder of Databricks, responsible for the growth and international expansion of the company. He previously served as the VP of Engineering and Product Management before taking the role of CEO in January 2016. In addition to his work at Databricks, Ali serves as an adjunct professor at UC Berkeley and is on the board at UC Berkeley’s RiseLab. Ali was one of the original creators of open source project, Apache Spark, and ideas from his academic research in the areas of resource management and scheduling and data caching have been applied to Apache Mesos and Apache Hadoop. Ali received his MBA from Mid-Sweden University in 2003 and PhD from KTH/Royal Institute of Technology in Sweden in 2006 in the area of Distributed Computing. [daisna21-speakers]

Matei Zaharia
About Matei Zaharia


Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow development effort at Databricks in addition to other aspects of the platform. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE). [daisna21-speakers]

Brooke Wenig
About Brooke Wenig


Brooke Wenig is Director, Machine Learning Practice. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, as well as teach courses on distributed machine learning best practices. She is a co-author of Learning Spark, 2nd Edition, co-instructor of the Distributed Computing with Spark SQL Coursera course, and co-host of the Data Brew podcast. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling. [daisna21-speakers]

Reynold Xin
About Reynold Xin


Reynold is an Apache Spark PMC member and the top contributor to the project. He initiated and led efforts such as DataFrames and Project Tungsten. He is also a co-founder and Chief Architect at Databricks. [daisna21-speakers]

About Vishwanath Subramanian


Vishwanath Subramanian is Director of Data and Analytics Engineering at Starbucks. He has over 15 years of experience with a background in applied analytics, distributed systems, data warehouses, product management and software development. At Starbucks, his key focus is providing Next Generation Analytics for the enterprise, enabling large scale data processing across various platforms and powering Machine Learning workflows for amazing customer experiences.