Season 1, Episode 2
Welcome to Lakehouse
Legacy approaches have failed to deliver on the promise of a single data architecture that can support every downstream use case from BI to AI. Lakehouse aspires to address this by combining the best of data warehouses and data lakes. Ali Ghodsi, Co-Founder and CEO of Databricks, and David Meyer, SVP of Product at Databricks, explain how.
Ali is the CEO and co-founder of Databricks, responsible for the growth and international expansion of the company. He previously served as the VP of Engineering and Product Management before taking the role of CEO in January 2016. In addition to his work at Databricks, Ali serves as an adjunct professor at UC Berkeley and is on the board at UC Berkeley’s RISELab. Ali was one of the creators of the open source project Apache Spark, and ideas from his academic research in the areas of resource management and scheduling and data caching have been applied to Apache Mesos and Apache Hadoop. Ali received his MBA from Mid-Sweden University in 2003 and PhD from KTH/Royal Institute of Technology in Sweden in 2006 in the area of Distributed Computing.
David Meyer joined Databricks in 2017 as SVP Products. He leads the company’s product organization, spanning product management, user experience and documentation. He previously served as VP of Engineering, then VP of Product Management at OneLogin, where he grew the company through Series B and Series C funding growth to the thousands of customers and market leadership it enjoys today. Before his work at OneLogin, he cofounded UniversityNow, an accredited open university system, running Product, Engineering and UX. Prior to that, David managed a $1 billion portfolio of business intelligence products at SAP and co-led cloud strategy. His first software journey was at Plumtree, where he ran engineering, CS, and product management, going public before being acquired by BEA in 2005. He holds an MSE from UW and a BSE from Penn.
Welcome to Data Brew by Databricks with Denny and Brooke. This series allows us to explore various topics in the data and AI community. Whether we’re talking about data science or data engineering, we’ll interview subject matter experts to dive deeper on these topics. While we’re at it, we’ll be enjoying our morning brew. My name is Brooke Wenig, Machine Learning Practice Lead at Databricks.
And my name is Denny Lee. I’m a developer advocate at Databricks. For this episode, we’d like to introduce Ali Ghodsi, CEO and co-founder of Databricks and adjunct professor in computer science at UC Berkeley. Ali, why don’t you tell a little bit about yourself and how you got into the field of computer science and big data?
Sure. Happy to do that. I mean, I grew up in Sweden from a sort of age of four. And I’m thinking third or fourth grade, my parents bought me a Commodore 64 and it had one of those tape recorders on it. That’s how you sort of got the stuff working. And that tape recorder was broken so you couldn’t really play any games on it. So the only thing you could do is to use the built-in BASIC interpreter to start doing programming. So that’s how I got started with the programming. So I started writing BASIC code on that Commodore 64 in fourth grade or something like that.
I love the reference to the C64. By the way, we’d also like to introduce David Meyer, SVP of Product at Databricks. David, why don’t you tell a little bit about yourself and how you got involved in the field of computer science and/or big data?
It’s funny. Ali was having me chuckling. I had this thing called the Sinclair ZX81. It had 1K of memory and it had no way to input anything. So you had to type every program and when you turned it off, the program was gone. And each key was a different command. So it was BASIC, but there was a FOR key and stuff like that so it made it slightly quicker to type everything in every time. But I’ve been in enterprise software for a while now and was very excited to join Databricks a few years ago when Ali pulled back the curtain and showed what was possible with big data and how we could help all these customers.
So going from one kilobyte to terabytes must be insane. How was your transition into the field of big data? What were some of the projects that you were involved with along the way? Ali, perhaps you could start with this question.
So I came to UC Berkeley in 2009 and the timing couldn’t have been better because around that time, pretty much we hit this thing called Moore’s wall, which meant Moore’s law no longer was applying. And there was a lab upstairs with now Turing Award winner, Dave Patterson. And he was saying, “They’re not going to be able to make the computers any faster so that’s it.” Which meant that the new computer was a data center. So you’re never going to be able to go to IBM and buy another supercomputer that can do all the stuff you need. You’ll have to do it in a data center. And just to give you a perspective, at that time Twitter was running pretty much all of Twitter on one giant machine with lots of memory on it. So this was around the time when computers, essentially, moved into a data center and the new computer was the data center and we had to figure out how to do everything in there.
So that was sort of the beginnings of how you would do cluster management at scale, the beginning of how you do data processing with things like Spark at scale. So it was perfect timing to figure out how would we do computations on thousands of machines in clusters, in the cloud, if we had to do it again from scratch. So for us, it was awesome. And it was kind of like, we always wanted to be born in the ’50s. So when computer science took off in the ’60s and ’70s, we could have been there and invented whatever, the first operating systems. Now we got to do that again, because around 2010, there was this new computer and everything had to be kind of rebuilt for that new computer, which was a data center.
So you mentioned your involvement with the UC Berkeley lab. How did you get involved with the Apache Spark project?
I mean, all of these projects were all part of that. There were all these companies in Silicon Valley who were funding the UC Berkeley lab and they had these problems they were trying to figure out. Yahoo was one of the major ones. They were trying to figure out how to manage these thousands of machines. And we saw that more and more of them wanted to do machine learning. So they had machine learning use cases and not just one or two machine learning use cases, they had hundreds. And they were enabling the whole organization to use data and do machine learning on it. So we were going into these companies and peeking and seeing what they’re doing. We’re at Facebook, we’re at Twitter, Airbnb and so on. And we just wanted to sort of democratize and bring this out to the masses. We were hippies. We said, “Look, let’s open source this, give it to everyone. We’ll change the world. It’ll be awesome and hopefully some companies go and make lots of money on it.” But that was the kind of state of mind, 2009, 2010 for us.
Super interesting. David, would you mind sharing a bit more about how you got into the field of big data and what are some of the exciting new things coming out in the field of big data?
So a long, long time ago in the late 2000 [inaudible 00:04:59] I ran the BusinessObjects portfolio for SAP, the stuff that they acquired, and ran the analytics business there. And it was a fascinating space, but that was, I’d say, a couple of generations before what we’re doing here at Databricks. There, the philosophy was just make the single machine bigger and bigger and bigger and bigger like Ali was saying. And then that gave birth to HANA and trying to do all of that in memory on one massive machine. And then after that there was Hadoop and then Hadoop kind of gave way to the in-memory capabilities of Spark and then all the things we’ve layered on top.
And so to see companies go from trying to do everything on one massive machine to scaling out at web scale and elastic scale to now we’re seeing customers who are running their entire on these enhanced data lakes, oftentimes even looking at a multi-cloud strategy for all of their data. So from one machine to thousands of machines, to many regions in different clouds, to multiple clouds with many regions to manage their entire company’s footprint on data is quite an evolution.
So Ali, back to you. What are some of the surprises and challenges you saw as the Spark project evolved during your involvement with it?
I mean, I think one of the biggest challenges was seeing that the enterprises were very, very different from say Airbnb or Uber or Facebook or Twitter. They had lots of legacy software and they had their data in all these different systems, whether it was data warehouses or other sort of systems where they had locked in the data. And just getting access to that data was hard. I mean, just getting to the bits themselves was not something simple. You have to go through lots of sort of hassle with IT and security. And they were just not set up for getting the same kind of innovations as the rest of these companies that we were seeing.
So it kind of became clear soon after we had started Databricks that something different would be needed if we wanted to really help them innovate the way that Silicon Valley tech companies were doing. You see, I mean, the Silicon Valley tech companies weren’t just doing sort of a couple of use cases around AI or had one group doing it. They were completely data driven. They had every use case on the platform going off using machine learning, hundreds and hundreds. And they were enabling all these different groups to leverage the data. They were data driven. And the state at the enterprise was very, very different. At best it was Excel mostly.
Oh, then actually I’m pretty sure you want to add to that, David, especially coming from your BusinessObjects background with analytics. What was the transition like for you now that you actually had to shift from the old school SAP HANA style thinking to that distributed thinking here?
Well, I said earlier that when Ali pulled back the curtain to what customers were doing today with data, it was really eye opening for me. So I had to kind of rethink of things from the basic principles. Instead of carefully cultivating the data you were going to use for your operational systems, you had the opportunity to look at all of your data and it just opened up completely new business opportunities, completely new revenue opportunities. And instead of having a security system look at a slice of your data, it could look at all of the data on all of your machines. Instead of sampling transactions to look for fraud, you could look at all of the outliers and all of your data. So it really revolutionized the way I thought of how to approach data-driven systems and data-driven [inaudible 00:08:46]
So transitioning from the Spark and distributed nature side of things, how does Delta Lake solve some of these challenges and limitations that Spark could not solve? David, do you want to start with this one?
So technology and using technology in enterprise, especially in production, is kind of littered with very tough trade-offs. You can get low latency and high fidelity from a warehouse, or you can get the machine learning algorithms working on all the data in the lake, which is the only way to get the highest signal insights from machine learning. But what we… Delta Lake emerged from collaboration with customers, the top customers that we work with, and it allows you to get that ACID transactionality or the correctness in a data lake. Oftentimes great data lakes are kind of a mess because you put everything into them, but to get correctness in a data lake lets you look at all of the data but know the quality of the data you’re looking at, and then to get the performance on top of that as well. Again, it unlocks those business use cases that you just couldn’t imagine.
Actually, in our previous session, one of the panelists had described exactly what you were referring to as a data salad. There’s some good chunks in the salad. To him, it was the meatier chunks. To me, it’s the chunks of avocado. But a lot of the salad is just there to fill space and you have to derive meaning from it. And so I definitely understand those challenges that customers are facing. Ali, do you have anything else you want to add on top of that? What are some of the things that Delta was able to solve but Spark could not?
Well, I mean, really what it was about was that when we saw what was going on in industry, they were storing massive amounts of data in data lakes, which was awesome. But there were just so many configuration parameters and so many ways to lay it out and so many ways to format your data, that it was really hard to get any value out of it downstream. So the team at Databricks, the same people that had built Spark, went back to the drawing board and said, what if we were to do it again and this time do it right? This time, be opinionated about it. Basically have an opinion that this is the right way. Because enterprises would like to have some guidance. We shouldn’t just let them run loose and get the thousand parameters and configure it all differently. And that’s when we came up with Delta.
So Delta, the whole idea was let’s look at the top 10 problems that enterprises are facing with data lakes and with Spark and let’s automate them away and let’s have a pre-configured way of solving them. So that was really the essence of it, zooming out. I mean, we can get into the details of how it did that, but the real idea was what if we want to really make sure that they get value out of these data lakes that they have? Which to be honest with you, most of them were failing. Many projects on top of data lakes were failing. They were pulling us in to do professional services to fix the problems they had with the data lakes. So we just wanted to sort of fix it, automate it with software once and for all instead of having people manually go in and fix these.
So going with what you just said, Ali, in one of our previous sessions also, our panelists of data warehousing luminaries, they described their journey from data warehouses to data lakes. And then we brought up the topic of lakehouses. So from your perspective, can you describe that lakehouse paradigm in your own words?
It’s pretty simple actually. It’s how do you enable large enterprises to become completely data-driven and get their data on the data lake in an open format so that it’s not locked in, in some proprietary format. It’s sitting there and then it’s enabling two downstream use cases mainly. One set of use cases has to do with machine learning, data science and AI. Can you do that downstream directly on these open data sets? And those machine learning and data science tools, they don’t work well with data warehouses or other technologies. They like to work directly on the files, oftentimes on something like Parquet. That’s what they’re built for. If you look at data science and machine learning tools, they’re not built on top of SQL. So enable that downstream use case. And the other downstream use case you want to enable is BI, business intelligence and reporting.
And those use cases definitely are building on SQL. So how can you do SQL really, really well and really fast directly on your data lake? So if you have those three elements in one, that’s a lakehouse. So an open data lake where you have all of your files stored, downstream data science that directly works on top of it and then BI and reporting workloads downstream that get really good performance. And the whole system is sort of manageable. So you have governance so that you can securely do all these things that I said, because that’s a big, important topic for most enterprises. They have to make sure that their data is locked down.
Well, going off exactly what you just said, Ali, I mean, some of those panelists in the previous session had noted they sort of did like the concept of lakehouse just as you’ve called out. But they were concerned that it could not be solved by technology alone. So would you like to elaborate on that from your perspective?
I mean, first of all, the problem… I mean, what I’m saying sounds great, right? If you could make it work. The problem actually has been technology. So the technological breakthrough hasn’t been there to be able to do that. In other words, to get really, really fast SQL access directly on data lakes has actually not been possible until just recently. Getting transactionality directly on data lakes hasn’t been possible until very, very recently. Connecting BI tools with really fast performance hasn’t actually been possible until very recently. So there’s actually a few technological breakthroughs that are needed. The other, I would, say major thing that you have to solve is these data lakes are really sort of big oceans of files whereas all the SQL downstream use cases, they’re actually working off of structured data; tables with columns, and you can say who has access to which parts of it.
So the whole governance and the whole sort of access is on a much higher level. It’s at the level of tables and you access it with SQL. How can you marry these two models in a seamless way? That has been another technological breakthrough that wasn’t here until just in the last couple of years. So I would actually say the technology has been a barrier. Otherwise everyone would love it if they can have one platform for nine things and it just works out of the box and it’s awesome and it’s fast then you pay for the price of one. Why not? But there’s been things lacking.
So now that we’re able to solve for these technological issues, what do you foresee the next issue that we need to solve for being?
Well, for me, it’s actually now the awareness. Getting people to actually understand what the lakehouse paradigm is and also showing them the successful examples of large enterprises that have done this and making sure that they build it this way. The issue is there’s a big tectonic plate shift that’s happening in the market right now, which is people are moving from on-prem to the cloud. And a lot of them are tempted to rebuild the same architectural pattern they had on-prem in the cloud. So I had a Hadoop data lake on-prem, I’ll have one in the cloud.
Oh, I had a data warehouse on-prem, I’ll have a data warehouse in the cloud. Oh, I used to have [inaudible 00:15:55] based security on-prem. I’ll do the same thing there. I used to have clusters, I’ll have clusters in the cloud. It’s about reeducating them on the lakehouse paradigm, what that would look like, so that they can architect for the future, because just recreating each thing they had on-prem to have a cloud version of it really doesn’t buy us that much. So that’s, I think, the biggest thing. So it’s the education and getting those sort of paradigms widely spread. So I think that’s the most important thing.
To add to that, there’s a mindset aspect. It’s the extension of what Ali was saying. Data has gravity and all companies are short on the kind of resources that can do these things. So the temptation to replicate your on-premise warehouse into a cloud warehouse or your on-premise lake into a cloud lake is dangerous because it can send you down a path that you live with for years. Now, there’s so many urgencies in the enterprise that it’s hard for people to take a step back and say, well, three years from now, what will I have wished I did? Will I have wished that I put something in a proprietary format that I have to pay tax on every year to get the data out, or will I have wished I would have had the forethought to put this all in an open framework and an open format in a way I control, in an enhanced data lake that allows me to do all these use cases on it?
And I can always shuttle something off to a warehouse if I feel like I need to. But the key is I own my data. It’s in an open format so that I have flexibility for the long term. And all the practitioners deep in this world, when you survey them on what pattern they want, they want the lakehouse pattern. They’re not sure when all those pieces Ali was talking about will come together. And it has to be coupled with internal training at their company to make sure people are thinking of things in a more future-oriented, multiple-years-down-the-road mindset.
So, David, I know that you’re very enthusiastic about the lakehouse paradigm. Back at our CKO in February, you actually got up on stage and did a lakehouse dance. I’m not sure if you’re willing to do it for this recording, but would you be willing to show everybody the lakehouse dance to welcome them to lakehouse?
So it is a mindset as Ali said. And you have to welcome it into your hearts. So this is a modified song and it’s a three simple step dance. And you’re going to say welcome to the lakehouse to a tune. It’s very simple. You might remember this. This might be edited out. Three simple moves. (singing)
Well, outside of the welcoming to the lakehouse paradigm, what are some of the other interesting problems that lakehouses could solve that you actually did not originally foresee? And Ali I’d like to start with you please.
I think there’s a lot of exciting things you can do along real-time streaming. So real-time use cases that it’s hard in the previous architecture where you have lots of different things that data has to flow through. It’s hard to get that real-time streaming working. But if you’re operating directly on open data lakes, you can actually now enable real-time streaming use cases where data is flowing through. And as it’s coming in, you’re operating directly off of it. It’s triggering sort of chain reactions downstream, updating apps. And then now you can build actually data applications directly on top of that data lake. So I think that’s really exciting.
So I’m hearing about all these different applications as you described for lakehouses, from genomics to business reporting. Are there any scenarios where lakehouse is not the right paradigm or are there any drawbacks to using lakehouses? Ali, how about we start with you?
Good question. Scenarios where the lakehouse is actually not a good paradigm. Well, I mean, first of all, I think if you are doing real sort of OLTP applications, transaction processing directly on data lakes in a lakehouse, I don’t think the capabilities of the platform are there yet. I mean, for any of these sort of lakehouses that people are building. So direct transaction processing, it’s something that’s still… People have separate systems for those. They want them to be isolated and those are the ones that are powering sort of the web systems and the front ends of the ecosystem out there. So that’s something that I think that’s the… Once you can actually enable people to even do online transaction processing directly on the lakehouse, I think that’s sort of going to be a major technological breakthrough. But I think we’re just not there yet.
How long do you think until we’ll get there?
I would be shocked if we’re not there in five years.
Okay. So hopefully there before self-driving cars.
Self-driving cars are just two years away always.
It’ll be there before fusion energy.
Fusion energy is always 30 years away.
So in that case, do you have any advice for folks who are planning to build their production lakehouses? And I’d like to, again, start with you Ali.
I mean, first of all, I think most enterprises have their data in data lakes, so that’s awesome. That’s very valuable. If they don’t, consider actually doing that, putting it there so that it’s actually stored. Then pick an open format. We prefer Parquet at Databricks, but there are other ones too, like ORC. So pick a standardized open format and store it there and then figure out one of the technologies that are sort of enabling building blocks for building lakehouses. At Databricks, we’re big fans of the open source project Delta Lake. So that’s the one. But there are other alternatives out there as well. So that’s the first sort of building blocks. If you get just your data ready there, I think you’re in a pretty good position to start using something like the Databricks platform or other ways of actually building lakehouses for downstream consumption, for machine learning and AI and for BI and reporting.
But I think the most important thing downstream five, 10 years from now is going to be the kind of data science and machine learning projects you can do. If you can enable your whole organization to start leveraging this data and building data apps that are intelligent, then I think you can actually change the sort of trajectory of your company completely. We’ve seen a lot of companies do that up to now. Sort of Silicon Valley forward tech companies. I’m excited to see sort of the rest of the 99% of enterprises on the planet be able to do that as well.
Perfect. And David, anything you’d like to add as well?
Well, it just dawned on me that sometimes people throw terminology around just to ground it. A data lake… If you feel like, I can’t figure out how to have a data lake, a data lake is just a storage bucket in the cloud. You can use any storage bucket in any cloud and you put files in the storage bucket. I know I’m over simplifying it. But so data lakes are easy to get started with. There’s a variety of ways you can format the data in data lakes, but there’s nothing mysterious about it. You can start on this today, start on this tomorrow. And it provides you a way to control all of your data, but have all of these capabilities you expect in the enterprise layered on top. So we’d love to partner with you on that journey.
So both David and Ali have mentioned a lot about education and educating people on the lakehouse paradigm. What are your recommendations to better enable and better equip people to get started with lakehouses?
Well, I would say just start with the open-source Delta Lake project. Just download it and then try it out. Or if you’re not sort of the person to get your hands dirty yourself, get someone in the organization that can do that, to install it and try it out. Start small. No reason to make it sort of a black and white, major zero to one project. Start approaching it with some small use case and get some quick wins and quick successes. And then you can sort of explore it from then on.
As a data scientist even when I’m working with small data, I actually really love using Delta for reproducibility because oftentimes I’m getting a data set from GitHub or from Kaggle and it’s being updated daily. And so if I build a model today and then I want to get the updated data and I want to reproduce that model, I actually have to version back to an earlier state of that data to be able to reproduce same model, same data, same hyperparameters. So I think that’s a great suggestion for all folks is to get started with some small data, get a quick win and then expand it out to your larger use cases.
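The reproducibility pattern described here can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Delta Lake's API: `VersionedTable`, `append`, `read`, and `train_mean_model` are hypothetical names invented for illustration, and the "model" is just a mean. The idea it illustrates is that every commit gets a version number, so an earlier state of the data can be read back after later updates arrive.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy append-only table: each commit produces a new numbered version."""

    commits: list = field(default_factory=list)  # one list of rows per commit

    def append(self, rows):
        """Commit a batch of rows and return the new version number."""
        self.commits.append(list(rows))
        return len(self.commits) - 1

    def read(self, version=None):
        """Return all rows as of `version` (the latest version if None)."""
        if version is None:
            version = len(self.commits) - 1
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        return [row for commit in self.commits[: version + 1] for row in commit]


def train_mean_model(rows):
    """Stand-in for model training: the 'model' is just the mean of the data."""
    return sum(rows) / len(rows)


table = VersionedTable()
v0 = table.append([1.0, 2.0, 3.0])   # the data set as downloaded today
model_today = train_mean_model(table.read())

v1 = table.append([10.0])            # tomorrow's daily update arrives

model_latest = train_mean_model(table.read())        # trained on the updated data
model_reproduced = train_mean_model(table.read(v0))  # same data as the original run
```

Reading at `v0` gives back exactly the rows the first model saw, so the first result can be reproduced even though the table has moved on. In Delta Lake itself, the corresponding feature is usually referred to as time travel.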
Everything great started as small seeds at some point, that kind of grew stronger and stronger and bigger and bigger. And if you’re using MLflow on top of it, it can also help you with the tracking and the reproducibility for machine learning projects.
Exactly. I think the two of them harmonize really well together. You can also track your Delta versions with MLflow. So this all fits into the much bigger, big data ecosystem and how do you reproduce your data both from a governance perspective and also being able to share your results with others and make sure, yes, this did in fact run and I can reproduce your run.
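As a rough illustration of recording a data version alongside a run, here is a minimal sketch in plain Python. It does not use MLflow's or Delta Lake's APIs: `snapshots`, `log_run`, `load_run`, and `fit_and_score` are hypothetical stand-ins, and the run log is just a list of JSON strings. The point is only that storing the data version next to the hyperparameters is what makes a run replayable later.

```python
import json

# Hypothetical snapshots of one data set, keyed by version number.
snapshots = {0: [1.0, 2.0, 3.0], 1: [1.0, 2.0, 3.0, 10.0]}


def fit_and_score(rows, shrink):
    """Stand-in for training: a shrunken mean, returned as the run's metric."""
    return shrink * sum(rows) / len(rows)


def log_run(run_log, params, data_version, metric):
    """Append one run record (params, data version, metric) and return its id."""
    record = {"params": params, "data_version": data_version, "metric": metric}
    run_log.append(json.dumps(record, sort_keys=True))
    return len(run_log) - 1


def load_run(run_log, run_id):
    """Read a run record back from the log."""
    return json.loads(run_log[run_id])


run_log = []
params = {"shrink": 0.5}
run_id = log_run(run_log, params, 0, fit_and_score(snapshots[0], **params))

# Later, possibly by someone else: replay the run from what was recorded.
record = load_run(run_log, run_id)
replayed = fit_and_score(snapshots[record["data_version"]], **record["params"])
reproducible = replayed == record["metric"]
```

Without the recorded `data_version`, the replay would silently pick up the newer snapshot and produce a different metric, which is the failure mode being described in the conversation.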
I didn’t actually even know that you could track the Delta versions with MLflow. Today I learned.
Well, the whole show has been about education. So educating folks that are listening in, and then also about educating people on Delta Lake. So it fits in very nicely. All right. Anything else that either of you would like to add about lakehouses, data science? Anything that you’re excited about in these upcoming weeks?
I’m just hoping soon I could actually go to a lakehouse. That the world makes it so we can go to places like that. But now you can only go to lakehouse technology because it’s hard to go to a lakehouse.
Well, I’m excited about the Data and AI Summit that we’re going to have in Europe. Going to have, right? It’s during the pandemic. So it’s all virtual anyway. So it’s everywhere. But happening November 14th. And I look forward to some big announcements from Databricks then too. So excited about that. So there’ll be more about the lakehouse. You’ll hear more about it there.
Well, thank you both so much for taking the time out of your very busy schedules, Ali and David, to come and talk about lakehouses and the importance of education, not just from the academic perspective, but also in technology and helping to break down some of these barriers.
Thank you. Thank you so much.
Thanks Brooke and Denny.