The Rise of the Lakehouse Paradigm
In this keynote, Databricks co-founder and CEO Ali Ghodsi discusses why data warehouses and data lakes weren’t designed for today’s use cases, and how the Lakehouse builds on these technologies to better unlock the potential of your data.
In this keynote summary, learn why data warehouses are not well suited to modern data management needs such as GDPR and CCPA compliance, audio and video data sets, and real-time operations, and get insights on how to build curated data lakes optimized for reliability, quality, and performance, on data sets of any size.
Want to watch instead of read? Access the Rise of the Lakehouse Paradigm Keynote video here.
Hello, I’m Ali Ghodsi, co-founder and CEO of Databricks. I’m going to talk about the lakehouse today. And I know it’s a little cliche, but I’m going to start with this famous quote because I think in the data management industry, we keep building faster and faster horses, but there’s actually an engine of the future to be built here. There’s an unfulfilled promise to have one platform to do all your data analytics, all your data science, and all your machine learning. And that’s what I’m going to talk about here today.
History of Data Management
So let’s start with the history of this. It all started in the eighties. Business leaders were going blind; they didn’t know how their business was doing. And we came up with the paradigm of data warehousing. The way it worked was: let’s take all of the data we have in our operational data stores, the Oracles and the MySQLs of the world, ETL it all into a central location, and put it in a clean, strict schema and format. Then we can run business intelligence and reporting on top of it, and business leaders will know how their organization is doing.
So this was awesome, and it’s fantastic technology that has been around for many decades. But over time, new requirements and new data sets have emerged that challenge the data warehouse.
One, we see more and more video and audio datasets, and organizations are collecting these, but data warehouses cannot store them.
Also, most organizations now want to use machine learning, data science, and AI to make predictions, often on those very datasets: the video, audio, and text data they have. Data warehouses do not have predictive capabilities built into them. They’re also really difficult to use if you want to do something in real time. If you want real-time streaming support, that’s not what they were built for, because they require you to ETL the data into a location first.
And then finally, the data warehouses are closed proprietary systems. You move your data into them and it kind of gets locked in there. Therefore, most organizations started storing all of their data in giant data lakes stored on top of blob stores. So with the data lake, you could now handle all kinds of data. You could store data to do data science, machine learning. You could have video, audio, all of your data could be stored there. And in fact, every organization we know of has a data lake where they’ve stored their data.
But these data lakes have lots of challenges of their own. On a data lake, you cannot do BI; it’s not possible to run business intelligence tools efficiently and easily on top of it. They’re complex to set up, and oftentimes you get really poor performance because you’ve just dumped the data in there. So you end up with an unreliable data swamp: you have all this data, but it’s hard to make sense of it.
So as a result of this, a lot of organizations end up with a coexistence of a data lake, where they have all the data for data science, and a data warehouse, into which subsets of that data get moved so that they have a schema and can be consumed by BI and reporting. But this coexistence is not a desirable strategy, because now you have two copies of your data. If you modify the data in the data warehouse or in the data lake, it’s hard to keep the other one consistent. The BI tools producing the dashboards for business leaders often have outdated data, because the most recent data is actually in the data lake. And finally, you have a really complicated, costly system where you’re first ETLing the data into the data lake and then ETLing it again into the data warehouse. So it’s actually quite messy, and in some ways it’s many steps backward.
At Databricks, we’re firm believers that it’s possible to combine all these use cases in one place. We call this the Lakehouse paradigm, and I’ll talk about that more today. So how does this work? The Lakehouse paradigm builds on the data lake. It starts at the bottom, storing all the data in a data lake. And data lakes are awesome because they’re cheap, durable storage, often offering ten or more nines of durability (99.99999999% or better), and they scale out. They’re also able to store all kinds of data: raw data, video data, audio data, structured and unstructured. And finally, they’re based on open, standardized formats, often Parquet or ORC, and there’s a big ecosystem of tools that operate on these formats on the data lakes. That’s why data lakes have taken off.
Challenges with Data Lakes
But at Databricks, over the last decade, we’ve seen that there are also lots of problems with data lakes, and that they’re simply not sufficient. What I’m going to do next is walk you through the nine most common problems that we saw people have with their data lakes, and explain some of the tricks they were using to work around them.
So let’s start with the most common problem that we saw.
ONE. The most common problem is that it’s just hard to append new data to the data lake. In particular, if you add new data to the data lake, it’s very hard to read it at the same time and get consistent results. That’s because the underlying blob storage systems were not built to be consistent; they’re not file systems. The way organizations often try to fix this is by making lots of copies of the data: they’ll have a copy in a directory called staging and another copy, when it’s ready, called production, and promote data between them. But this is not a great way of doing data management.
TWO, we see that organizations have a very difficult time modifying existing data on data lakes, because data lakes are built using batch systems like Spark. This is becoming particularly painful with the advent of GDPR and CCPA, which require fine-grained operations on that data, such as deleting the records of a particular user who no longer wants any trace of themselves in the system. The way many organizations attack this is to run a batch job once a week that rewrites all of the data in the data lake and cleans it up to be compliant. This is very costly, and the latency is really, really bad.
THREE, oftentimes jobs fail and nothing gets noticed: part of the data makes it into the data lake, other parts are missing, and the worst part is that you don’t know about it. Years later, when an application running on top of your data lake fails, after lots of debugging you’ll find out that some job failed years ago and only half of the data ever arrived.
FOUR, it’s really hard to do real time operations. This is really a special case of the first one, but basically adding data and appending it, and then in real time trying to read it, is very hard to do in a consistent fashion. And the old trick of using two directories doesn’t really work here because you’re reading it in real time.
FIVE, it’s really costly to keep historical versions of the data, especially in regulated industries where you need reproducibility, auditing, and governance. This is really hard to do with data lakes and batch-based systems. And again, what people do is make lots of copies of all this data, put different dates on the directories, and hope that they can keep track of all the versions and that no one edits a previous directory. But this is very, very costly and time consuming.
SIX. As these data lakes have grown and become pretty large, their metadata has become pretty large as well. Handling that metadata is really difficult; it often gets slow, or the systems fall over.
SEVEN, data lakes are really a file abstraction, so we often get into problems of having millions and millions of tiny files, or some very gigantic files, and you have to optimize for that.
EIGHT. And as a result of all this, we’re seeing more performance problems. It’s hard to tune data lakes so that they deliver great downstream performance.
NINE. And then finally, last but not least, the most important problem: the data quality issues you have with data lakes. It’s a constant headache to ensure that all the data is correct, has high quality, has the right schema, and that downstream consumers can actually rely on it.
Delta Lake: Foundation of a Lakehouse
So those are the nine problems. And at Databricks, we believe that there is a way to solve these, and we believe that the open source technology that we developed, called Delta Lake, addresses exactly these issues on data lakes.
With Delta Lake, you can add reliability, quality, and performance to data lakes. It brings the best of data warehousing and data lakes together. And it’s based on an open source format and an open source system, so you don’t need to worry about locking your data into a proprietary system.
In short, we believe this is the new standard for building Lakehouses. So let’s look at these nine problems and let’s see how Delta attacks them.
It turns out that the first five problems can actually be solved by using a technique called ACID transactions, which has been around for many decades in data management systems. The way ACID transactions work is that they make sure every operation either fully succeeds or gets aborted and cleaned up, leaving no residue. The way we implement that is by putting a transaction log right next to your open Parquet files; the log itself is stored in open formats (JSON actions, with Parquet checkpoints). And now you can ensure that every operation you’re doing, whether it’s streaming, batch, or append, either fully succeeds or gets aborted and cleaned up.
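To make the all-or-nothing behavior concrete, here is a minimal sketch in plain Python, a toy rather than Delta Lake’s actual protocol: each commit is staged in a temporary file and made visible with a single atomic rename, and readers reconstruct the table by replaying the log in version order.

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Append one transaction to the log atomically.

    The entry is written to a temp file first and then renamed into
    place; os.replace is atomic on POSIX, so a reader never observes
    a half-written commit (a toy version of a _delta_log directory).
    """
    final = os.path.join(log_dir, f"{version:020d}.json")
    if os.path.exists(final):
        # Someone else already committed this version: abort cleanly.
        raise RuntimeError(f"version {version} already committed")
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    os.replace(tmp, final)  # all-or-nothing: the commit appears in one step
    return final

def read_table(log_dir):
    """Replay the log in version order to get the current set of files."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    files.add(action["path"])
                elif action["op"] == "remove":
                    files.discard(action["path"])
    return files
```

In the toy, `os.replace` plus the existence check provide the atomicity; the real protocol gets the same effect by atomically creating the next numbered log file, which is also what resolves concurrent writers.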
With this, since we’re now storing every delta of the operations in the transaction log, we can implement something called time travel, which means we can query past versions of the table. You can take a SQL query, add a TIMESTAMP AS OF clause, and it returns the results as if you had submitted the query at the specified time.
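Time travel follows directly from the log idea. Here is a toy in-memory version (plain Python, not Delta’s implementation): given a log of committed operations in commit order, replaying only the entries at or before a timestamp reconstructs the table as it looked then.

```python
def state_as_of(log, ts):
    """Replay an in-memory transaction log up to a timestamp.

    `log` is a list of (timestamp, op, key, value) entries in commit
    order; replaying only the entries at or before `ts` reconstructs
    the table exactly as it looked then (a toy TIMESTAMP AS OF).
    """
    table = {}
    for entry_ts, op, key, value in log:
        if entry_ts > ts:
            break  # later commits are invisible to a time-travel read
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table
```

Because old log entries are never rewritten, every historical version stays queryable without keeping dated directory copies around.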
So this is great; this solves all of those problems: consistent reads and appends, streaming, failed jobs, and time travel. And most importantly, we can now do UPSERTS, which means we can insert, delete, and update records in a fine-grained manner. We can go in and delete one record, it gets recorded in the transaction log, and you don’t need to run a big batch job to make it happen.
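The fine-grained update path can be illustrated with a toy merge, again plain Python rather than Delta’s actual MERGE implementation: rows from an updates batch either overwrite the matching target row or are inserted as new rows, with no hand-written rewrite of the whole table.

```python
def merge_upsert(target, updates, key="id"):
    """Toy MERGE INTO: update matching rows, insert the rest.

    `target` and `updates` are lists of dicts; rows are matched on
    `key`. This mirrors the effect of an upsert without rewriting
    the whole table by hand.
    """
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return sorted(by_key.values(), key=lambda r: r[key])
```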
So that’s great. That’s ACID transactions. How do we deal with the rest of the problems?
Well, for metadata, it turns out that we can actually reuse Apache Spark. Apache Spark already is an extremely scalable system that can handle petabytes of data. So for all metadata operations under the hood, we use Apache Spark. If the metadata ends up being really small, we actually have a single node implementation that does it really, really fast. If it’s really large, we can scale out infinitely.
How do we deal with the performance problems? Well, there we took all of the indexing techniques we could find in the literature and implemented them specifically for data lakes. We implemented partitioning that happens automatically on the data lake. We implemented a technique called data skipping, which stores statistics and prunes the data before the query runs, so that you don’t have to read a whole data set if your query only touches certain parts of it. We added Z-ordering, which is a way to index multiple columns at the same time, but unlike a traditional composite index, access on any of the columns is equally fast. You can see from the example how easy it is to add Z-ordering to your dataset. So, awesome, that leaves us with the last problem.
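Data skipping in particular is easy to sketch. Assuming each data file carries per-column min/max statistics, a query with a range predicate only needs the files whose range overlaps it; this toy Python version shows just the pruning step, not the real engine.

```python
def files_to_read(file_stats, col, lo, hi):
    """Data skipping: prune files using per-file min/max statistics.

    `file_stats` maps a file name to per-column (min, max) pairs; a
    file is read only if its range for `col` overlaps the query range
    [lo, hi], so a selective query touches a fraction of the table.
    """
    return [
        path
        for path, stats in sorted(file_stats.items())
        if stats[col][0] <= hi and stats[col][1] >= lo  # ranges overlap
    ]
```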
How do we get quality out of our data? Here we’ve added strict schema validation and evolution to all of Delta. It means all data in Delta tables has to adhere to a strict schema: a star schema, a snowflake schema, whatever you want. It also includes schema evolution and merge operations. This means that whenever data comes into Delta, it always satisfies that schema. If it doesn’t, we move it to a quarantine where you can look at it and clean it up so it can make it back in. So when you’re using that table, you can be sure it’s always clean.
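The enforce-or-quarantine flow can be sketched as a toy in plain Python, with the schema reduced to a mapping from column name to expected type (a big simplification of real schema enforcement):

```python
def validate_batch(rows, schema):
    """Schema enforcement with a quarantine, as a toy sketch.

    `schema` maps column name to expected Python type; rows whose
    columns and types match exactly go to the table, everything else
    lands in a quarantine list for inspection and repair.
    """
    good, quarantine = [], []
    for row in rows:
        ok = set(row) == set(schema) and all(
            isinstance(row[c], t) for c, t in schema.items()
        )
        (good if ok else quarantine).append(row)
    return good, quarantine
```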
On top of schema validation and evolution, we also added something called Delta Expectations. This is a really powerful mechanism where you can express, in SQL, any quality metric you like. You can combine columns, specify whatever you want, and declare that a particular table must satisfy all of those qualities. That ensures that at any given time your table is pristine and meets the expectations you need. And with this, our customers are building what we call curated data lakes. The way it works is that, by convention, they first store all the raw data, which might be unclean, in the data lake; we call those the bronze tables. Then they evolve it, clean it up, and put more schema in place to create silver tables that are filtered, cleaned, and augmented. And then there’s the last level, the gold tables, where we might have added business-level aggregates and additional expectations to make sure that downstream consumption is really great. That’s how we build a curated data lake.
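In the same spirit, here is a toy Python version of an expectations-style quality gate, with the SQL predicates reduced to Python functions: rows that satisfy every rule are promoted (say, bronze to silver), and failures are counted per rule so quality can be monitored.

```python
def apply_expectations(rows, expectations):
    """Toy quality gate in the spirit of table expectations.

    `expectations` maps a rule name to a predicate over a row; a row
    is promoted only if every predicate holds, and failures are
    tallied per rule.
    """
    passed = []
    failures = {name: 0 for name in expectations}
    for row in rows:
        bad = [name for name, pred in expectations.items() if not pred(row)]
        for name in bad:
            failures[name] += 1
        if not bad:
            passed.append(row)
    return passed, failures
```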
And to zoom out and summarize: we’ve addressed the nine problems we saw with data lakes, with ACID transactions, Spark to scale, indexing to make it fast, and schema validation and expectations to really bring quality to your data lake.
At the bottom, we have the data lake. On top of it, we’ve layered this transactional layer, and now we can actually get quality and reliability out of our data. But how do we support all the use cases we want? Toward this, we’ve built something in Databricks called Delta Engine, a high-performance query engine that I’m going to talk a little bit about. This engine is fully API-compatible with Spark 3.0, so it supports all of Spark’s APIs, but it is built from scratch in native C++, with vectorization, and custom-built for Delta, to be really, really fast for the data you have on your data lake in Delta format. It comes with a highly improved optimizer that can do cost-based optimization. We also built in caching, on SSDs and in memory, so we can speed things up and mask the latency of data lakes.
That’s the Delta Engine, but what’s the performance of this? When you put it all together, what are we seeing? Toward this, we ran the industry-standard benchmark called TPC-DS at a pretty large scale factor, 30 terabytes. We ran Delta without Delta Engine, then with Delta Engine, and we see a 3.3x speed-up in performance. So this is awesome: you can get state-of-the-art performance with this. And now, when you put it all together, you have a data lake.
You have Delta Lake, which is the structured transactional layer, you have Delta Engine for the high performance query engine, and now you can support all these different use cases inside the Lakehouse.
You can do BI, reporting, data science, and machine learning all in one place. Today, Databricks has over 6,000 customers, and the majority of them have built Lakehouses using Delta Engine and Delta Lake. Some of my favorites are in the healthcare industry: a company like Regeneron was able to identify the gene associated with chronic liver disease by building a Lakehouse with genomic data in it and using machine learning to find the gene responsible for the disease. A customer like Comcast, in mass media, was able to build a voice-controlled remote that takes all of your data, puts it in a Lakehouse, and uses machine learning to interpret your commands and let you operate it in real time. There are lots and lots of other examples like this, and we’re very excited about them.
We believe that the unfulfilled promise finally can be fulfilled with one platform for data analytics, data science, and machine learning with the Lakehouse on Delta Lake and Databricks.