On Demand
The best data warehouse is a lakehouse

Available on demand
Many enterprises today are running a hybrid architecture — data warehouses for business analytics and data lakes for machine learning. But with the advent of the data lakehouse, you can now unify both on one platform.
Join us to learn why the best data warehouse is a lakehouse. You’ll see how Databricks SQL and Unity Catalog provide data warehousing capabilities, fine-grained governance and first-class support for SQL — delivering the best of data lakes and data warehouses.
Hear from Databricks Chief Technologist Matei Zaharia and our team of experts on how to:
- Ingest, store and govern business-critical data at scale to build a curated data lake for data warehousing, SQL and BI
- Use automated and real-time lineage to monitor end-to-end data flow
- Reduce costs and get started in seconds with on-demand, elastic SQL serverless compute
- Help every analyst and analytics engineer to ingest, transform and query using their favorite tools, like Fivetran, dbt, Tableau and Power BI
This presentation will come to life with demos, success stories and best practices learned from the field, while interactive Q&A will help you get all your questions answered by the experts.
Speakers

Matei Zaharia
Co-founder and Chief Technologist
Databricks

Shant Hovsepian
Principal Software Engineer
Databricks

Miranda Luna
Staff Product Manager
Databricks

Franco Patano
Lead Product Specialist
Databricks
– [Shant Hovsepian] Hello, everyone. Thank you so much for taking time out of your busy days to join us today. We're going to talk about why the best data warehouse is a lakehouse. I'm Shant Hovsepian, a software engineer at Databricks. We'll be joined by Haley Creech, who's a product director for Enterprise Data and Analytics Platforms at GSK, as well as a few of my colleagues from Databricks: Miranda Luna, senior staff product manager, Franco Patano, lead product specialist, and Matei Zaharia, CTO and one of the co-founders of Databricks.
So for the agenda today, we're going to do an introduction to the Lakehouse, Unity Catalog, and Databricks SQL. We have an amazing demo that will show you how to get started, along with tips and tricks. We'll talk a little bit about how GSK is leveraging Databricks SQL and the Lakehouse. And we'll walk through some of the lessons we've learned from our users out there.
A few housekeeping items. We will share the webinar recording after the event. Don't worry if you need to step away for something and come back. You'll have the recording handy later. You can use the Q&A box either down here or here to ask questions while I'm talking. We have a bunch of subject matter experts on standby, waiting to answer your questions. And at the end, we'll also take some time to answer any questions anyone may have.
So let's get started by talking a little bit about how data and AI are disrupting entire industries. If you look at the companies here, the tech companies, Amazon, Google, Apple, the FAANGs, as we call them, they're all using data and AI in strategic ways to run their businesses.
Netflix, for example, started as a video rental mail order system, and went into streaming. They used data and AI to make recommendations of what other shows or movies people are interested in. They even took it one step further and they're a movie studio today, and they use the data and insights from which parts of films and shows their customers like to help produce better movies. So if it weren't for data and AI, Netflix probably wouldn't exist today. We'd all still be getting videos from Blockbuster.
And it's not just the tech companies. There are traditional larger enterprises, like AT&T, that are leveraging data and AI. For example, AT&T took a lot of their customer data that used to live in a data warehouse and put it together with real-time streaming events from transactions and phone subscribers to build over a hundred machine learning models that help 182 million subscribers stay safe from fraud as they use the platform.
So how is it that these companies are able to leverage data and AI? Well, we use what we call the data maturity curve to understand that. The x-axis shows where a company is in its data maturity, and the y-axis shows the value, the competitive advantage, it's getting from its data.
Most companies start all the way down here from the left, where they just get some data into the system, they clean it, they build some basic reports, ad hoc queries. Then as you move further to the right, you get into more predictive modeling or prescriptive and automated decision-making systems within the platform. So the left side kind of asks questions about the past, and the right, you sort of use a crystal ball to try to predict the future.
But most companies today are still stuck on the left of that curve. They struggle to really find success at scale. And they all want to go to the right. They all want to use predictions in AI to make their businesses more efficient.
The main reason is that when most companies start on the left, the easiest, most traditional way to get started is to get a data warehouse, pull your data into it, put a BI tool on top, and ask basic questions. However, when you need to do AI and machine learning, you need a data lake. You need to take your various datasets, some of which may be images, video, or unstructured content, and put them on a platform that can scale and let you ask different types of questions with different types of processing. That work always runs on a data lake. These are two separate systems, and there's a giant gap between them.
Really, it manifests itself in a weird way. For example, here we have the data lake on the right-hand side, and this is where data typically lands first in just about every organization. Raw data, raw files come into the data lake, maybe some basic processing happens, and then they have to get summarized and copied over into the data warehouse. Once they're in the data warehouse, they're modeled into structured tables with governance rules, and your traditional tools sit on top.
These two systems, though, are painful to manage. They're very separated, and it creates a disjointed, duplicative set of data silos. What's worse is that on top of the copies of the data, you end up having incompatible security models. Data in the data lake might be encrypted, but once it goes into the data warehouse and someone has built a table out of it, you may not have kept that encryption, you may have changed the format, and someone may have access to it when they shouldn't, or vice versa. It always gets complicated when you have to maintain multiple different security models.
And then of course, there are the use cases. You can't just take your high-performance, high-concurrency BI workloads, stick them on a data lake, and expect them to perform just as well. You need to change something, you need to do things differently. Just like you can't do data streaming in a data warehouse: the data lake is where that data lands, and by the time you get it into the data warehouse, you've already lost a lot of the opportunity to get real-time insights.
But does it have to be two disparate platforms? Let's say we've got the best of the data lake: the open, reliable data storage that can efficiently handle data of various types. On top of it, we've layered a new way to do governance, an approach that works across all your data assets and all your clouds. And an integrated engine built for today that can run both machine learning and BI workloads efficiently and scale as needed.
This is what we call the lakehouse paradigm. And at Databricks, we use Delta Lake as the core secret sauce technology as that first layer on the system, where you can bring reliability and consistency to what was once an unreliable set of files living in cloud storage. Then we have Unity Catalog, which is the central place to do all of your governance, whether they're tables, files, blobs, machine learning models. And Databricks SQL is that engine that's built for the modern processing needs, high performance, and scalability.
So this is the Databricks Lakehouse Platform. When we talk to our customers and ask them why they like it so much, one of the first things they say is that it's simple. When there's only one platform to worry about, one set of technology and tools to train everyone on, and one place to put all of your use cases, it really simplifies the whole system.
And multicloud. About 80% of our customers use more than one cloud today. Having one consistent interface across the various clouds, not having to learn the idiosyncrasies of the various different cloud platforms is a tremendous benefit.
The last is open. All of the technology in the lakehouse and Databricks is built on open source and open standards. This is hugely valuable to a lot of our customers who have traditionally used data platforms that are proprietary, locked-in systems controlled by single vendors.
And when moving to the lakehouse, everyone's always a little apprehensive. Do I have to retool everything? Do I have to change everything? That's where the openness helps even more, because the entire lakehouse architecture is built on open standards. So all of the tools that you're used to for your data governance, your data pipelines, even data ingestion, the whole modern data stack actually already works in the lakehouse paradigm because it's so easy for these ecosystem partners to integrate using the open standards and open APIs. So you get all of the tools and functionality, even your traditional BI tools, machine learning, and user-facing data science platforms working on the system and free from integration hassles.
So it's not just Databricks. The rest of the industry has really taken notice of the lakehouse. You can see the various other cloud vendors have talked about their new lakehouses. Google announced a lakehouse-like offering, and so have AWS and even Oracle. And your traditional data warehouse companies are talking about lakehouse too.
But how do these lakehouse offerings really stand up? Well, we wanted to test that, so we created what's called the lakehouse benchmark. This is essentially a modification of TPC-DS, the traditional data warehousing benchmark, but we ran it on data living in an open format, Parquet. So this is data that exists in the cloud in an open format that all of these other vendors said they now support because they're a lakehouse.
When we ran this workload across these different systems, you can see that Databricks SQL came in at just $8 to run the entire three-terabyte workload. Some of the other vendors were, in some cases, up to 30 times more expensive to run on.
So while they say that they're a lakehouse and they support these open formats, were they really optimized? Do they expect their users to use it? It seems unclear if there's such a huge gap.
And you might say, “Well, it's not fair. These systems, while they claim to be lakehouses, weren't designed to be lakehouses from the start. So you're benchmarking them on something that they were never really built to do.” That's true, so we also went one step further and took the traditional TPC-DS 10 terabyte benchmark. This time we ran it, but we didn't put it in an open format. We loaded the data directly into the proprietary data warehouse format that those vendors use, rather than keeping it in an open storage format with open access. And even in that case, while the difference may not be 30X, you can see that Databricks came in at about $76, and this includes the time to load the data, do any optimizations, and run the queries. The enterprise offerings, or even the standard offerings, from other vendors still came in at over 3X the cost to run the same workload.
What's more, even if the data is going into the proprietary database format, it's probably sitting in a data lake today. So the fact that you don't even have to load it with Databricks is a huge potential savings as well, whereas in these other systems, you're always going to have to load that data in.
But let's talk a little bit beyond the benchmarks, right? Performance is really important in the cloud because it's usage-based, it's a consumption model. So the faster something runs, at the end of the day, you save money. So that performance and efficiency does deliver a lot of business impact.
For example, there's a global media company we all know about. When they moved to the lakehouse, they were able to unify their subscriber data and their streaming data and build better machine learning models for personalization and recommendations. What's even nicer is that they saw a $30 million reduction in costs, and on top of that, almost $40 million in increased revenue from accelerated offerings. So this helped both the bottom line of the business and top-line growth.
Another Fortune 50 retailer, one of the largest in the world, was able to combine their supply chain data with streaming IoT sensors from all of their stores to detect when food was spoiling or when there was too much surplus, and optimize their whole supply chain end to end. Switching to the lakehouse and integrating the real-time data gave them 10X faster time to insight and saved around $100 million annually through reduced food waste.
And one of my favorite examples is Atlassian. It's a product I use every day as a developer. They completely migrated all of their internal data systems over to lakehouse and stopped using data warehouses. For them, it was really about democratizing access to their entire enterprise and lowering the operational costs so that every single Atlassian employee can use the data to make informed and better decisions. This really reduced their analytics infrastructure costs by 60%, and led to a 30% better delivery time on their analysis.
So at Databricks, our mission is to democratize data and AI, and our destination is the lakehouse.
Now I want to introduce Matei Zaharia to talk about how to do fine-grained access control, as well as data sharing in the lakehouse. Thank you, everyone.
– [Matei Zaharia] Thanks, Shant. I'm excited to talk about data governance and sharing on the lakehouse and what we're doing for those.
First of all, let's start with governance. I think as anyone who's tried to work on this knows, governance for data and AI workloads today is very complex. It's mostly because you've got different kinds of systems that you have to manage. You've got data lakes, you might have a metadata system like Apache Hive, and you probably also have your data warehouse and different systems for machine learning. And then you have to figure out how to give all your users the right permissions in each system and how to audit all of it.
The problem is that each system has different ways of managing governance. For example, on your data lake, you can set permissions on the level of files and directories, but it actually has some limitations. You can't set row and column level permissions, for instance.
In contrast, once you start defining tables in systems like Hive, you can set permissions on tables and views. But just layering that over a data lake doesn't guarantee that the permissions on the underlying files match. So the permissions on the metadata can be out of sync with the data.
And then when you put data in the data warehouse, that has its own permission model involving tables, columns, and rows and so on, which is great, but it's just different from what you have elsewhere in your data systems. And of course, for machine learning, all the different systems in the machine learning space have their own way of doing governance.
So at Databricks, to make governance significantly simpler in the lakehouse, we've designed Unity Catalog, a single catalog and a single governance layer over all the data and AI assets in the lakehouse. The design is very simple. All the access to your storage system, whether it's data lake files, tables, machine learning models and so on, is policed by Unity Catalog. And there's a single permission model you can set up once that determines who has access to what.
And to make it very easy to configure, the permissions are all set using SQL GRANT statements. So anyone who knows how to administer a data warehouse or just a SQL database can go and set permissions and actually administer the whole lakehouse, including files and machine learning.
And everything is also centrally audited across these workloads. We've instrumented all of them so you can see who is doing what. And you also get fine-grained lineage across all of them to see what items are derived from other ones. That's a very powerful feature for understanding how your data is being used.
So how do you use Unity Catalog? The simplest part of it is just setting up the access controls. You can just do that using SQL. So for example, you can set permissions on tables using a GRANT statement for select, but you can also set permissions on files, which is something unique with Unity Catalog. And you can also see all the activity that's happening in the system as a system table across all these things, and then see how people are using the data.
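For a sense of what that looks like, here is a minimal sketch in Databricks SQL; the table, group, and external location names are illustrative, and the audit system table name is an assumption rather than something shown in the session:

```sql
-- Table-level access for a group
GRANT SELECT ON TABLE main.sales.transactions TO `data_analysts`;

-- File-level access on an external location, which is unique to Unity Catalog
GRANT READ FILES ON EXTERNAL LOCATION landing_zone TO `data_engineers`;

-- Review recent activity centrally (system table name assumed here)
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE event_date >= current_date() - INTERVAL 7 DAYS;
```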
Beyond that, you've got a powerful search interface where people can search across all the datasets in the whole company and can discover them and can set up comments and additional metadata about each one.
And finally, you've got a very powerful lineage feature that tracks column-level lineage through all the workloads that run on the Databricks Lakehouse Platform. So you can see, for example, for every table, which upstream tables it's derived from. The lineage even extends to notebooks and dashboards, so you can see, for example, which dashboards are using this table, or which scheduled jobs, which data science notebooks, and so on.
And because we are doing all this in the Databricks Runtime engine, we can understand the lineage at the level of individual fields. Not just datasets, but you can see which columns are derived from which ones, which is very useful for both managing just the data in your organization and also tracking any possible security problems if someone creates a derived dataset based on a sensitive column.
So Unity Catalog is also designed to integrate with the best of breed products in many areas of data governance. This includes products to set advanced policies, like Immuta or Privacera, the best data ingestion products, BI and dashboarding tools, and tools for data pipelines. So you can use that with everything in the modern data stack together.
And we're very excited to announce that just earlier this summer, Unity Catalog became GA. So now anyone with a Databricks account can try it out. And we've been seeing a lot of customers greatly simplify the way they manage their data with this.
So Unity Catalog is one piece of the equation in making the lakehouse very easy to govern, but the other piece is collaborating across organizations or across just different groups in the same organization. For this purpose, we've also built some very powerful features for data sharing.
Data sharing, if you haven't looked into it, is often considered one of the most important aspects of a modern data platform. Many organizations say it's becoming increasingly important. For example, Gartner says it may improve the economic performance of organizations by 3X, and they expect data ecosystems to grow significantly.
Now, traditionally, it's been possible to do data sharing in some data warehouse platforms, but only within the context of a single platform. So for example, if you have lots of users on, say, BigQuery, these users can share data with each other, but only with other instances of BigQuery. And if someone else has lots of users on, say, Amazon Redshift, they can share data with each other across instances of Redshift, but not with other platforms.
And for many organizations, this is very limiting because you do end up with different systems deployed in the organization or with partner organizations, and they aren't able to share data with each other. So this leads to vendor lock-in. And it also means that in practice, if you need to share a dataset broadly, you probably need to replicate it into many systems, many different cloud regions, and get it into all of them, which is just expensive to do.
So at Databricks, we took a very different approach, sticking to the lakehouse philosophy of openness, and we designed an open protocol for data sharing that's called Delta Sharing, which means that it can work across computing platforms. You don't have to be sharing with someone on Databricks. Anyone who can run open source systems like Spark, Pandas, Power BI or just applications in Java can actually consume the data.
The way it works is very straightforward. The data provider can simply set up tables to be shared using SQL commands, and then the data consumer can connect to them from many different platforms, not just from Databricks, and read the data there. It's all based on the open source Delta Lake file format, so basically any tool that includes that library can start receiving data. It enables cross-platform sharing, and you can do it on your existing tables.
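As a rough sketch of the provider side in Databricks SQL (share, table, and recipient names are made up for illustration):

```sql
-- Create a share and add an existing Delta table to it
CREATE SHARE retail_share COMMENT 'Curated retail tables for partners';
ALTER SHARE retail_share ADD TABLE main.sales.transactions;

-- Register the consuming organization and grant it access to the share
CREATE RECIPIENT acme_partner COMMENT 'External analytics partner';
GRANT SELECT ON SHARE retail_share TO RECIPIENT acme_partner;
```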
Delta Sharing is rapidly being embraced throughout the industry. We're already seeing petabytes of data shared per day, even though the product entered preview less than a year ago.
Just as an example, Nasdaq has been using it to streamline delivery of some very large datasets they have in the cloud that they just couldn't share through other means. It was too expensive to replicate them into other platforms or share them through FTP or things like that.
And Shell has also been using it to share datasets with other companies it collaborates with to streamline production and make it more efficient. Again, massive datasets that now any platform that can run Spark or one of these other tools can immediately access.
Delta Sharing has also been evolving rapidly. A whole bunch of new connectors launched this year. There's support for tables that change over time and for sharing a change data feed, where you can see just which rows changed. And within Databricks, it's very easy to set up sharing with someone else on Databricks through just a couple of clicks.
So Delta Sharing is also a growing piece of the lakehouse. We're working on some powerful features that we're also incorporating in the open source project, including sharing views that can filter down the data, and also sharing streams so that a streaming client can consume the changes in real-time and react to data that's being shared.
Delta Sharing is also GA as of this summer, so we invite you to check it out. And together with Delta Sharing and with Unity Catalog, you end up having very powerful control of governance, both within an organization and across organizations that collaborate.
So that's it for the governance and sharing features. Now back to Shant to talk about some of the powerful warehousing and BI features in the lakehouse.
– [Shant Hovsepian] Thanks, Matei. Hi, everyone. It's me, Shant, again. I'm going to spend some time now talking to you about how we can do data warehousing and analytics on the Lakehouse.
Of course, all of that's possible with Databricks SQL. It's our SQL warehouse offering on top of the lakehouse. You can think of it as the house in lakehouse. I'm going to spend some time talking about how you can connect your data stack up, how we combine the best of data lakes and data warehouses, and of course, how we obsess over bringing the most value possible through performance.
So let's talk about connecting your data stack. When we set out to build Databricks SQL, we wanted a first-class SQL experience. So all of your existing data warehouse tools that are built around the whole SQL ecosystem can just work out of the box in the lakehouse. So whether it's ingestion from business critical applications, transformation of that data using standard transformation toolkits, or just consumption and querying of that in your business applications, we've got you covered with Databricks SQL.
First, let's talk a little bit about data ingestion, basically the ability to work with your data no matter where it is. We've done tons of work with Fivetran to make sure that their traditional data warehouse integrations work just as well on the lakehouse, making it easy to ingest business-critical data from Marketo, Salesforce, Google Analytics, whatever your source may be. It works seamlessly out of the box.
When it comes to data transformation, we've done similar integrations with dbt. There's an open source connector you can use with Databricks to run all of your dbt pipelines and transformations, and to collaboratively explore and transform data in your dbt projects on the lakehouse.
And then data consumption, of course. What good is the data that you just load into the system if you can't give it to your users to get insights and build new business value from it? So we've worked with a ton of the existing BI vendors out there to make sure that all of their connectors are lakehouse certified, so you can get the best out of the box experience possible.
But of course, BI tools aren't the whole story. When it comes to business applications, sometimes you need to write code and integrate the data right where it's going to be consumed, which is why we're excited to now support running SQL from anywhere. Build custom data applications directly using a REST API to run SQL queries, or use one of our new open source connectors from your favorite programming language. Everything from Go, Node.js, and of course Python, to basic command-line connectivity and the traditional Java JDBC stack is available to connect to directly today.
So how are we combining the best of data warehousing and data lakes together? Well, we've come out with Python user-defined functions. Databricks has always made Python a first-class citizen when it comes to your data science, AI, and engineering workloads. Now you have the power and flexibility of Python when you need it, called directly from your SQL. It's available in preview today. You can check out the link here and use it for your machine learning models and your data transformation pipelines, directly from your favorite SQL tools.
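As a rough illustration of the feature, a Python UDF registered in Unity Catalog and called from plain SQL might look like the sketch below; the function, catalog, and table names are hypothetical:

```sql
CREATE OR REPLACE FUNCTION main.default.redact_ssn(s STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
import re
# Replace anything shaped like a US SSN with a masked placeholder
return re.sub(r'\d{3}-\d{2}-\d{4}', '***-**-****', s or '')
$$;

-- The Python function is called directly from SQL
SELECT ticket_id, main.default.redact_ssn(notes) AS notes
FROM main.default.support_tickets;
```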
Query federation. Now, the lakehouse is home to all your data sources, no matter where they are. And with query federation functionality, now you can directly connect to multiple different data sources no matter where they live. You can connect to remote databases without having to worry about ingesting and moving data. You can combine multiple data sources across these things, or even treat some data sources as a reference dataset or a dimension table while your full dataset is in the data lake. And all of this is seamless, and our query optimization engine automatically pushes down the parts of the query that are relevant to each of the different data sources.
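Here is a hedged sketch of what federating a remote PostgreSQL table could look like; the connection details are placeholders, and the exact options and syntax depend on your Databricks release:

```sql
CREATE TABLE IF NOT EXISTS main.federated.pg_customers
USING postgresql
OPTIONS (
  dbtable  'customers',
  host     'pg.example.internal',
  port     '5432',
  database 'crm',
  user     secret('jdbc_scope', 'pg_user'),
  password secret('jdbc_scope', 'pg_password')
);

-- Join the remote dimension table against a Delta fact table; eligible
-- predicates are pushed down to the remote source by the optimizer.
SELECT c.segment, SUM(o.amount) AS revenue
FROM main.sales.orders AS o
JOIN main.federated.pg_customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.segment;
```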
Materialized views, of course. Materialized views are essential for speeding up queries when you want to use pre-computed results. And while materialized views have been around data warehouses for a while, what's new and interesting is that we've taken some of the best technology Databricks has for streaming workloads and built our materialized views on top of it. So you don't have to worry about picking refresh intervals or weighing the trade-off between data latency and correctness. With our materialized views, you can choose: you can have the data in the view always up to date as needed, or dial that back and simplify your ETL and data transformation pipelines.
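A minimal sketch of defining and querying one, with illustrative schema and column names:

```sql
CREATE MATERIALIZED VIEW main.sales.daily_revenue AS
SELECT order_date, region, SUM(amount) AS revenue
FROM main.sales.orders
GROUP BY order_date, region;

-- Downstream queries read the pre-computed, incrementally maintained result
SELECT *
FROM main.sales.daily_revenue
WHERE region = 'EMEA'
ORDER BY order_date DESC;
```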
We've added new support for data modeling with constraints. This will be very familiar from traditional data warehouse techniques. You can define primary and foreign key constraints to understand and visualize the relationships between your tables, use IDENTITY columns, which automatically generate unique integer values for new rows added to your tables, and most importantly, enforce CHECK constraints so you never have to worry about data correctness or quality issues. As the data is loaded, we'll check those constraints and raise an error if the data doesn't meet the criteria you've defined in your constraint. This is very unique: no other cloud data warehouse provides this functionality today.
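Here is a hedged sketch of that data modeling support; the table and constraint names are made up, and the exact grammar can vary by runtime version. Primary and foreign keys describe relationships between tables, while the CHECK constraint is enforced as data is written:

```sql
CREATE TABLE main.sales.orders (
  order_id    BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,  -- auto-generated unique values
  customer_id BIGINT NOT NULL,
  amount      DECIMAL(10, 2),
  CONSTRAINT orders_pk PRIMARY KEY (order_id),
  CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id)
    REFERENCES main.sales.customers (customer_id)
);

-- CHECK constraints are validated on every write; violating rows raise an error
ALTER TABLE main.sales.orders
  ADD CONSTRAINT positive_amount CHECK (amount > 0);
```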
We recently added geospatial support, so now you can supercharge your geospatial processing. We're using a unique engine based on H3, so you have very efficient storage of your spatial data, whether it's large or small. You get high performance spatial joins. And by having a direct SQL interface now to the geospatial engine, you can combine your AI, your advanced GPU data visualization and your simple SQL analytics together in the same platform, without having to switch between tools anymore.
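As a rough example, point data can be bucketed into H3 cells straight from SQL using the built-in H3 function family; the table and columns below are hypothetical:

```sql
-- Aggregate GPS points into resolution-9 H3 cells and count events per cell
SELECT
  h3_longlatash3(pickup_lon, pickup_lat, 9) AS h3_cell,
  COUNT(*)                                  AS trips
FROM main.mobility.taxi_trips
GROUP BY 1
ORDER BY trips DESC
LIMIT 20;
```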
We also added information schema support, so now all of your metadata is simply a query away. And what's even more interesting, with Unity Catalog you get an information schema that covers your catalogs, your schemas, your columns, even your unstructured data sources, like locations for files in cloud storage, as well as machine learning models.
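For example, listing tables and their owners in a catalog is just a query against its information schema (catalog name illustrative; available columns may vary slightly):

```sql
SELECT table_schema, table_name, table_owner, created
FROM acme_avo.information_schema.tables
WHERE table_schema = 'default'
ORDER BY created DESC;
```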
So, let's talk a little bit about some of the world-class performance features we've incorporated in Databricks SQL to make it a competitive, high-performing engine on the lakehouse.
So, Photon is our next-generation query execution engine that we built from the ground up. You can see in this chart that with every release of the Databricks Runtime, we were constantly getting better and better performance. But eventually, things start to flatten out; we sort of hit a plateau. It was at that point that we knew we needed a whole new paradigm shift to get to the next level of performance. And that's Photon. Here it is: a huge performance improvement from the release of this technology.
So, what is Photon? It's essentially an MPP execution engine, written in native code, leveraging vectorized CPU primitives. That's a lot of technical language for saying it was built from scratch to be really fast. So far, it's processed exabytes of data and billions of queries. And it's not just about performance: our customers are getting faster interactive workloads, they've seen their ETL jobs run at one fifth the cost, and they're seeing on average about a 30% savings in TCO, sometimes as high as 80% on some of their workloads.
And if you're interested in learning more about Photon, we recently published a paper at the SIGMOD conference. We were very humbled and fortunate to win the Best Industry Paper Award, and the paper details all the inner workings of how Photon works. Feel free to check it out for more information.
And so, when we talk about performance and we talk about Photon and all of the things we've done in the system, we really wanted to go back to build things from the ground up, so we can have amazing performance on both the data warehousing workloads and a lot of the AI/ML workloads that you would normally get on the data lake.
And so over here on the left-hand side, we have the 100-terabyte TPC-DS price/performance benchmark. This was where Databricks set the official data warehousing performance record; you can read more about it over here. It was a fully audited benchmark, where Databricks ran the 100-terabyte scale, a large, intensive benchmark designed for data warehousing workloads. And we essentially looked at the price/performance: how much did it cost to run this benchmark? With Databricks SQL, it was done at about one twelfth the cost of some of the best competitors. So a lot of work has gone into making these stressful, difficult data warehouse workloads fast.
But I also want to draw your attention to this chart over here on the right side, where it's a much smaller data set: 10 gigabytes, not 100 terabytes. Traditionally, you wouldn't think of data lakes as a platform that made sense for small data sets, but when it comes to data warehousing and BI, you often have a mix of small and large data sets, and we wanted to make sure our performance was awesome regardless of data set size. So here, you can see that when we started running this concurrent query benchmark, running a bunch of queries concurrently and measuring how many of them can complete in an hour on a small data set, it wasn't too great, around 4,000 queries an hour. But since then, we've seen a 3x improvement. Now we're not just meeting the other data warehouse competitors' results, but exceeding them in many cases.
And one of the ways that we've been able to get such great performance for concurrency is having an elastic infrastructure that can scale with query and workload demand. And that's possible with Databricks Serverless. So, Databricks SQL runs best with Serverless. It gives improved agility at a lower cost for our users. It's instant, elastic compute. You get fast query execution, you can scale nearly instantly. For the admins, it's zero management. You don't have to worry about pools and reserving instances and dealing with a lot of the cloud compute complexities that can happen when there are demand and supply issues. And overall, for IT budget, it means lower TCO. You don't have to over-provision anymore, and you can automatically shut everything down when things are idle.
But of course, we had to spend a lot of time and effort making sure that we're optimizing away that idle time. Because when you have a serverless environment, if there's nothing running, no queries, no user interacting with it, we want to shut those machines down so you don't have to get charged for it. So we've done a lot of work then to make sure once they're shut down, the next query that comes in doesn't have to pay any warmup costs, to read data from cloud storage or to recompute certain results. So a lot of the concurrency benefits in the elasticity of scaling with Serverless were made possible because as soon as you need more resources, as soon as you get a lot of concurrency, we can spin them up and they can execute queries immediately.
When we first started, that first query that you would bring in on a new instance would still take about 40 to 50 seconds because of warmup and caching effects. Today with Serverless SQL, now that first query can get up and running within 10 seconds of a new instance starting up. And pretty soon, we're working on a new persistent caching feature where you can have your new compute resources added to your warehouse immediately, within three seconds, to handle the new workloads.
And so with that, I want to conclude that the best data warehouse is a lakehouse. I'm going to stop talking about the why, the benchmark results and all of these things, and let Miranda show you exactly why we think the best data warehouse is a lakehouse. Thank you.
– [Miranda] Hi, everybody. My name's Miranda and I'm a product manager on the Databricks SQL team. As you've heard today, the best data warehouse is in fact a lakehouse, because it marries the speed of the data warehouse with the scale and flexibility of the data lake. So today, I'm going to walk you through DB SQL top to bottom. We'll start by creating a serverless SQL warehouse and then explore some of our data, its lineage, and its governance. We'll then analyze that data in DB SQL's built-in SQL Editor and take a look at new features like query history, materialized views and Python UDFs. We'll wrap by switching over to our favorite BI tools and interacting with our lakehouse data live from those environments, all powered by serverless warehouses. We've got a lot of fun ahead of us, so let's get started.
I'm going to start by creating a serverless warehouse. Serverless warehouses come online in seconds and are optimized for high concurrency BI workloads. That's exactly what we're doing today, so I'm going to create one. We'll go ahead and make a new one called Acme. And as soon as I get that confirmation, I'm going to go ahead and switch over to the data tab. Now, why don't we pick Acme from here? Let's see what's going on. As I... Oh, already started. Great. And as you'll see here, I have a number of different catalogs, each with different schemas and tables within. This is all governed by Unity Catalog in a central manner, and today we're going to show exactly how, by creating a new one ourselves. So I'm going to go ahead and start by creating a new catalog. Let's call it, oh, acme_avo, since we've got some avocado data. We'll go ahead and hit create.
And you'll notice a couple things have happened here. First, default schema has been created, there's no data there. Why don't we go ahead and start by giving Shant and Matei permissions on this. Let's see, Shant. And I forget if I have a second user on here. Okay, great. Let's just give it all on this particular schema. You'll also notice that I have an information schema that contains a number of different views on all of the changes that have happened in my catalog. So for example, if I go to schema privileges, I'll be able to see those changes I just made to grant permission on the default schema to Matei and Shant. And while we're at it, why don't we just make sure that they have at least usage on the entire catalog? Just thinking ahead, I want to make sure that they've got everything they need and they're not going to need me to intervene anymore. So let's go ahead and give them usage.
Great. Now again, if I go and look at my catalog privileges information schema view, I should be able to see that that was just executed successfully. All the information I need to audit changes to permissions and data is going to be captured right here. Now, it's not just easy to see the changes in permissions, it's actually easy to bring data into Databricks as well. Certainly, one option is I can always ingest from a partner. Another option would be to use a technology like Delta Live Tables. But today, let's do something simple. Let's create a table from a local CSV that I have here on my machine, with avocado prices. I'm going to go ahead and select Acme and make sure we're in the default schema. Great.
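For reference, the catalog creation, grants, and privilege check just shown in the UI map onto plain SQL roughly as follows; the user names are placeholders, and the exact privilege keywords can differ across Unity Catalog versions:

```sql
CREATE CATALOG IF NOT EXISTS acme_avo;

-- Usage on the catalog plus full privileges on the default schema for both users
GRANT USAGE ON CATALOG acme_avo TO `shant@example.com`;
GRANT USAGE ON CATALOG acme_avo TO `matei@example.com`;
GRANT ALL PRIVILEGES ON SCHEMA acme_avo.default TO `shant@example.com`;
GRANT ALL PRIVILEGES ON SCHEMA acme_avo.default TO `matei@example.com`;

-- Audit the grants through the information schema
SELECT grantee, privilege_type
FROM acme_avo.information_schema.schema_privileges
WHERE schema_name = 'default';
```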
Excellent. And now you can see that the table UI gives me a few helpers. One, I can see the type that's auto-assigned to each of the columns coming in from the CSV. Everything looks good to me, so I'm going to hit create, and we'll have a new table in our default schema. As soon as that's done creating, it drops me right back into that new table in the data explorer. Now I'm going to do a couple of things so we can take a look at lineage.
Great, so this opens it in a new query and I can go ahead and hit run. What that's going to do is it's going to go ahead and just give me a preview of that table if I want to. I'm going to just save this and say, avocado test. I'm going to let that continue and then I'm going to pop back to the data explorer.
And the other action I can take is I can create a quick dashboard. So I can pick a couple different fields of interest, and I'll go ahead and hit create there. By default, that's going to create a new dashboard for me that shows me a couple interesting stats about my data. But the reason I wanted to do all that is so I could show you a little bit more about the lineage of this exact table. So, one, if I come down to dashboards, you'll see downstream lineage, the quick dashboard, I can go ahead and open it directly from in here. That's the same dashboard we were just on. The other thing that I can do is I can come and click on this table insights button, to understand a little bit more about who the frequent users are. No surprise, there I am, we just created this table. As well as some of the frequent queries, and I can open any one of those in the SQL editor with just a click of a button.
Now that we've got a good handle on the compute, data and governance side of the lakehouse, why don't we spend a little time here in the SQL Editor and learn a bit more about what we can do in terms of analyzing our data? If I want to use a third-party SQL workbench, I can do that via an ODBC or JDBC connection. But today, let's take a closer look at the experience right within Databricks, side by side with your data. One thing that I really like about the editor here is that it's super trivial for me to just pick a few columns of interest and expand those. If I want to format the query, I can do that with a click of a button. I also definitely realize that not everybody loves buttons, so we have a number of different keyboard shortcuts available, and it's super easy to review the entire list of them.
And then I can go ahead, and just like most other SQL editors, I can go ahead and add a visualization if I'm interested in that. And when I have a few columns, we actually suggest one out of the box for you. I can toggle through a few different views, if I'm interested in that.
Now, all the executions that we've been running today, I can review in past executions, anything in the current query or in all queries across all the work I've been doing, which is a really nice way to go back and take a look at anything. If I do something silly like add a few extra commas, it's going to give me a little bit of a warning there. And if I try to run it, I'm going to get the syntax error highlighting that I'm used to. I'll see that row two is the row with the syntax error, and I can even see it's this first duplicate comma that's the issue. So if I didn't want to just modify this, I could click here and open up a past execution. If I wanted to resume editing, I would go back to my original one. But it's pretty trivial to go back in time and see which versions were working.
Now, I'm going to open up another query that I have, and this one is for national revenue trends. Let's pull that up. This is on TPC-H data. And let's add another country, just so we know this isn't coming from cache or anything like that. I'm going to remove the LIMIT 1000 I had when I was working earlier, and let's just run that. Now, you'll notice as this starts to execute, I can pull up a little more information about what's actually happening with my query. And as it completes, I'll be able to access a much more detailed breakdown.
Now, I pulled this up while the query was executing, but as it completes, I can always get it back that way. And I can go ahead and I can pull up the query profile, and very quickly, I can see where time is being spent, whether I prefer the graph view, the tree view. And I can kind of understand exactly what the different elements of my query performance are. Obviously, this is a pretty quick query, we're not too concerned with what we see here. But if I did have a longer running one or somebody else in my organization was running into some trouble, I could zoom in and kind of really understand exactly what was going on, and help them correct any sort of inefficiency that they might have in their query.
Now, this is kind of the profile for one single query, but certainly as an administrator, it's helpful to have an understanding of what's going on across your entire workspace. So we also offer this query history area, where at a glance, I can see everything that's happening. And I'm an admin on this workspace, but I can see everything that's happening across all the different warehouses. Or I could drill down into just one. I can look at myself or all users. I've been the most active recently, so I would expect to... Yep, there I am. See myself.
But really helpfully, the other thing that we can do is also filter by status. So if I want to see all the queries that failed, I can really easily drill down to those. You can imagine too, that if I wanted to understand any sort of queue or queries that haven't finished running, I could go ahead and pull up all the queries that are running and sort by the duration that they've been out and open. That's going to help me, as an administrator, really identify anywhere where I might want to engage a particular user and make sure that they're not blocked.
Now let's go back to the SQL Editor, because we're going to dig into a couple of really exciting new features: Python UDFs and materialized views. Starting with Python UDFs, I'm going to pull up a quick example that I've already created, and let's walk through it together. Now, UDFs, or user-defined functions, are a way to extend vanilla Spark with custom business logic. To date, we've supported SQL UDFs, so I'm going to run through a quick example of that first, and then we'll get to the fun stuff in Python. The idea here, again, is that in the lakehouse, we want you to be able to use whatever makes the most sense for your use case: whether that's logic that's super simple to express and reuse in SQL, or something a little more complex that requires Python, or code that already exists and would be great to use instead of rewriting it in SQL. We want to give you the tools you need to do whatever you need to do.
So let's start by just taking a look at this table for the SQL UDF example. You can see here I just have country, email and name fields. For this SQL UDF, let's do a simple example first: a little bit of masking, inserting some ellipses. You could imagine this might be a source table in the bronze layer that I have access to as a data admin, but I want to do a little bit of transformation at the silver layer before I expose it to the broader organization. I certainly want analysts in my organization to be able to understand, say, the unique number of users by email across each country, but they probably don't need the specific emails themselves in order to do that.
So the first thing I'm going to do is run this and create the function. All this is going to do is look at where the @ in the email is and then insert some ellipses on either side. This is going to be a SQL UDF, so you can see that our language here is SQL. And let's make sure it's doing exactly what we expect: I'm going to select the country and then mask the email field from the emails demo table. You can envision this being a super easy function to call anytime you have data in the bronze layer where you want to apply some simple masking.
So we can see here, yep, I've been able to mask the emails, and it would be pretty good for getting an idea of distinct users per country with this sort of masking. Again, it's not going to be as unique as an ID field, but it's helpful to at least be able to do an eyeball check. Now, this is where the fun starts, where the new and exciting pieces come in. Here, what we're going to do is create a Python UDF. And all it's going to do is, if you remember that name field, just say hi. Very, very simple. Certainly, much more common use cases are redacting PII from nested JSON fields or calling forecasting models.
But today, again, we want to keep everything very, very simple here in the SQL Editor. So let's go ahead and run that. We've now created that function in the email catalog's default schema, and here's where I'm going to bring it all together. You remember the mask email SQL UDF and the greet Python UDF; we're going to call both of those UDFs in the exact same SQL query. We'll stick to the country, then mask the email, then add a little greeting in front of the name. So let's go ahead and run that.
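A sketch of the two functions and the combined query described here; the masking logic is simplified, and the table name follows the demo's naming:

```sql
-- SQL UDF: keep the first characters of the address and elide the rest of the local part
CREATE OR REPLACE FUNCTION mask_email(email STRING)
RETURNS STRING
RETURN concat(left(email, 2), '...', substring(email, instr(email, '@')));

-- Python UDF: prepend a greeting to the name field
CREATE OR REPLACE FUNCTION greet(name STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return "Hi, " + name
$$;

-- Both UDFs called from the same SQL query
SELECT country, mask_email(email) AS email, greet(name) AS greeting
FROM emails_demo;
```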
And just like that, we're able to get the country, a masked email address via the SQL UDF, and a little greeting added to the name field via the Python UDF. That's the power of the lakehouse coming together. Now, let's talk about materialized views, another really critical feature in a data warehouse.
To take a look at materialized views, let's switch over into another environment where I've loaded up an example. Materialized views reduce costs, and improve query latency by pre-computing slow queries and frequently used computations. They also enable easy to use transformations by cleaning, enriching and [inaudible 00:51:12] normalization of base tables. Overall, MVs reduce costs while providing a simplified end user experience, because they're incrementally computing changes from the base tables.
And today, we're going to look at a very simple example. First, if you recall, even though we're in a new environment, it's the same data we're used to: the acme_avo catalog, the default schema and the avocado table. We'll go ahead and run that. Great, and we see the confirmation that it executed successfully. We'll do a quick check, but all I'm doing here is picking a subset of the columns from that original avocado table. We'll do a quick validation that we're really just seeing the date, total bags and region. Then we're going to take this as our base table for the materialized view, and we're just going to set the table properties to enable the change data feed that we need. We're going to run that.
Perfect, successfully executed. And now, this is where we actually create the materialized view. You see here, I'm going to create or replace the materialized view; we're going to call it avocado_mv. We're keeping avocado_mv_base as a separate table, and that's what we're going to build off of. We're just going to group by region, so we'll end up with bags sold by region. Let's double check: we'll first create the materialized view, then spot check one of our regions.
Awesome, so we got the confirmation, but let's just take a look and see, for example, what the value is for Boston, all right? Boston's a big Celtics town, we're looking at avocados, avocados are green. Celtics going far in the playoffs this year, I'm sure. So let's see what comes back for that particular region. Awesome. So it looks like we are at about 21.49 million avocados, love to see it. Now, again, if you remember our data, it was over a few years. And what we're going to do now is, we have a materialized view that is computing the total bag sold per region. We took a look at what it was for Boston, but now we're going to actually show the power of the materialized view itself, and whether it's incrementally updating based on just the inserted new values.
So here, all I'm going to do is insert one more row of data. What we're going to do is we're going to... And this is, again, into the base table. We're going to insert one more row, we're going to say that there was a crazy sale in September of 2022, there were an extra 5 million bags sold. So we should see this go up by about 5 million. So let's go ahead and first insert this. And then the next thing we're going to do is refresh the view. Now that view's only going to compute on this one inserted value. Cool. One inserted row, one affected row, fantastic.
And now if we refresh the materialized view and run that same query against it where the region is Boston, we should see that number go up by 5 million. Excellent, so we've refreshed our view. And now, let's take a look. Moment of truth: do we have 5 million more bags in Boston? Yes, we do. We were at 21.495 and now we're at 26.495. Just like that, our materialized view automatically refreshed and took into account the additional row of data inserted into the base table, and we were able to quickly see the recomputed result.
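Put together, the flow narrated above looks roughly like this in SQL, with the base table columns from the demo (date, total bags, region):

```sql
-- Materialized view aggregating bags sold by region
CREATE OR REPLACE MATERIALIZED VIEW avocado_mv AS
SELECT region, SUM(total_bags) AS total_bags
FROM avocado_mv_base
GROUP BY region;

-- Land one new row in the base table: a big September 2022 sale in Boston
INSERT INTO avocado_mv_base VALUES (DATE '2022-09-01', 5000000, 'Boston');

-- The refresh processes only the change; the Boston total then reflects it
REFRESH MATERIALIZED VIEW avocado_mv;
SELECT * FROM avocado_mv WHERE region = 'Boston';
```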
Now, we've done a lot of analysis and exploration of our data in the lakehouse, right from within the Databricks SQL UI. But of course, we know that it's just as important to work within our favorite third-party BI tools, and we understand that's where a lot of analysts spend most of their day. So, why don't we go ahead and switch over to Tableau?
Now that I'm over here in Tableau and have established a connection using my personal access token and some details from that serverless warehouse, you can see that I'm able to view that exact same lakehouse data we've been interacting with. I can take a look at the acme_avo catalog and the default database, and if I want to see what tables come back, I can drag avocado over here, one we're super familiar with. Immediately, I'll be able to confirm the fields are what I expect, as well as get a quick preview of some of the data. You'll notice that this is a live connection, but of course, an extract is also always an option.
What I'm now going to do is ask and answer a question of our data, like we would in Tableau any day. I might want to know the total number of bags sold each year, broken out by region. With a few clicks and drags, I can do that. And I can see that, of course, the US total is going to be very large, but I'm also seeing quite a bit of sales in California, the Great Lakes and the Mid-South. If I wanted to understand whether that held true overall, it's trivial to remove the date metric from the columns field. And you can see that yes, California, the Great Lakes, Los Angeles and the Mid-South (more of a region) all stay very popular in terms of total number of bags sold overall, and the same holds if I bring back the breakdown by year.
Well, thank you all for joining me today. I really appreciate the opportunity to show you a little bit of what's new in Databricks SQL, and why the best data warehouse is in fact a lakehouse. Now, I'm going to hand the reins over to one of our customers so we can hear a real-life story: how GSK uses Databricks SQL and Unity Catalog to scale data analysis through a harmonized mesh ecosystem. Take it away.
– [Haley] Thanks so much. I'm so happy to be with all of you here today. At GSK, we treat and prevent disease with our vaccines, specialty medicines and general medicines. We focus on the science of the immune system, human genetics and advanced technologies in four core therapeutic areas, to have a major impact on health at scale. We're ambitious for our patients, we're accountable for impact, and we do the right thing, because 2.5 billion people around the world count on us to get it right.
So it's important to note that the advanced analytics journey at GSK did not start with the enterprise data and analytics platform. There are some amazing data experts all around the company. We have folks in vaccines developing end-to-end machine learning algorithms in the vaccine space. In commercial, they're developing insights for the commercial line of business. In the supply chain, they're optimizing the manufacturing and supply chains to decrease cost.
And each one of these groups, if you dive into them, is even more siloed. It's great that they're creating insights, but as I mentioned, they're silos. There's opportunity for data redundancy. And most importantly, from my perspective, there's this amazing story to tell about a treatment going from the research and development phase, through the supply chain, actually getting manufactured, all the way to commercial and being sold to customers. And with these different groups operating in silos, that story gets lost.
Additionally, there are inconsistent governance patterns. Chargeback is different between groups. They may not be using the right tools for the problem. And all of this together inspired the enterprise data and analytics platform at GSK, which is called Code Orange. We have a fit for purpose ingestion tool catalog. We have an enterprise data lake that is a harmonized mesh architecture, which we'll be getting into shortly, that has a complex and audit-ready security model. We have core and common data tools that are easy for different customers to use for things like data modeling, data cataloging, and data preparation. And we also have analytics suites for advanced and traditional data and analytics, with things like, in the future, data science workbenches and analytics workbenches as well.
So how did we do this? We built Code Orange, as I mentioned, the brand name for our data platform, on a harmonized mesh ecosystem, where we see the respective business units from before as different nodes.
– [Haley] I really scaled down this image so I wasn't overloading you with each component on the node, but it is much more robust than this, and I'm totally willing to discuss it with you after the session if you have more questions. So, you can see we leverage Databricks on the mesh core, which is where our common services are, and then on each node as well, as well as our ADLS Gen2 storage, which provides a lake on each business unit's or business function's node. Note that we have connections, both networking and virtualization, between each of the nodes to enable that data sharing, which I'm going to get into a bit more as we get to the end of this session. During this talk though, because of the scope of this session, I really want to focus on how we're leveraging Databricks SQL.
Additionally, it's important to note that we're also using Databricks as a key component in our Fit for Purpose Ingestion catalog, and as part of our data and analytics workbench in all of those tools. Focusing on Databricks SQL, I really want to focus on three key problems that it solved for us as we released it to the supply chain business unit last quarter. The first one is ease of use. I know this may seem silly to some of you, especially in startups or smaller organizations, but in a 90,000-person company, it is so difficult to download software. Some people don't have the right permissions to download any query-type tools that they need from the internet, and they have to go through a complex governance process just to be able to hit download. But, since this is browser-based, it really shortens time to insight by making it accessible to anyone who already has the right access.
Additionally, our super users who are using Databricks as part of that Fit for Purpose catalog, or as part of our analytics suite, have everything they need in one place. They can easily go back and forth between their Python notebooks and then query whatever they just transformed and make sure it looks accurate using these query capabilities, which is really easy for those power users. Next, I want to talk about cost. This screenshot is not Databricks; it's actually from our cloud cost tool, which is how we do FinOps at GSK. We have saved a lot of money by using Databricks. In the first iteration of the platform, one of our major spends was actually on traditional data warehouses because, as you know, they're always running, always incurring cost. And one thing that Databricks SQL did for my customers is it put financial power into their hands as users of the platform. They're able to optimize when the queries run. They can make it basically an on-demand system that they can actually control, rather than something that's just always incurring cost.
And finally, connectivity. I know this has been mentioned a lot today and will be a lot more, but there are many, many different ways to connect to downstream analytics tools. Just by going to Power BI, for instance, you can click a couple of buttons and find 20 different ways to connect from ADLS to a visualization tool. But whenever you get into complex security models, some of these ways just don't work, and in a big organization that leads to lots of support tickets, customer complaints, and similar problems. Databricks provides that easy connection to the downstream applications for traditional reporting insights, and most importantly to me, as someone who ends up writing a lot of documentation for the customers, it's self-documenting. People can read and understand how to do it without jumping through hoops.
Before this was an option, we used personal access tokens from the data science and engineering workspace, and we had some pretty complex documentation in place to allow people to connect to Power BI using personal access tokens, so this is a game changer, as was shown in a supply chain demo at the end of Q2 last year. My customers are demoing this to their customers because it's so powerful. I want to take a second to look at the future. As we all heard, Unity Catalog is GA, which is so exciting to me, and let me tell you why. Looking at the architecture diagram from before, we see that each of these nodes is connected. Well, also notice that we have workspaces in each environment as well. Each of these different business functions has an analytics and ETL workspace per environment. But what happens whenever you're trying to access data in each one?
Today we have two options. One, the customer, so the person in the business, has to open up, let's say, the supply chain's workspace, but then also the enterprise's workspace, and they don't actually see their data in a unified view. We also have another approach where we use service principals, but what that does is trigger a secondary governance process. So even when someone has already been granted access to the data, because this isn't user-based, it's still another process where they have to jump through hoops to get access to that data. What Unity Catalog is going to do for us is allow the customers, in a single workspace, to have access to all of the data they have access to, and it will actually allow that sharing experience to greatly improve across the company.
Whenever I started talking to you all today, I mentioned the story of the treatment all throughout the process and how it touches every single business function to get that global view of what it means to bring a treatment to market, and with Unity Catalog, I think that we're really going to be able to accomplish this. Thank you all for the time. Next up, I would like to introduce Franco who's going to be going through some best practices with you all.
– [Franco Patano] Thanks, Haley. It's always great hearing from our customers. Hey, everyone, my name is Franco Patano and I am a Product Specialist at Databricks. I focus on Databricks SQL and I'm here to give you some tips from the field. First off, let's talk about cloud data warehouses. We often see in the field how organizations are using cloud data warehouses, and unfortunately, they're trying to use it as a modern data platform, but they're kind of stuck in this situation where they're realizing it's not a modern data platform and they're trying to look for an alternate solution. Often what we find is they're doing ELT on a cloud data warehouse because we used to do that on-prem. You had the choice, but there were these tools, ETL tools like Informatica, Talend, and Ab Initio where you would build your ETL outside of the warehouse because the warehouse was considered premium compute.
You didn't want to waste your premium compute on ETL tasks, but you could still do it because you bought the servers, you bought the software, you already licensed it, and you owned the network and you had your data center. Why couldn't you just do it? You were able to, so some people did. What we find is that on-prem this was fine, but when you lift and shift that into the cloud, all of a sudden costs get out of control, because in the cloud you're metered on everything. Those workloads, while inefficient, did what they were supposed to do on-prem, but they end up costing a lot of money in the cloud because they're built up of really complex multi-stage processes to get that data from a file into a table. And most data warehouses, by the way, might have things called streams, but they don't have real streaming, because they measure streaming in minutes. Real streaming is measured in seconds, and that's a big difference.
And then you also can't really load crucial data for data science into a data warehouse. Things like images or video or audio just don't fit. Often they might have some support for semi-structured data with something called a variant type where you can just shove anything in there. Often this is not very efficient, and it's very complex because it involves making numerous copies of the data before you can actually do something with it. We often find that these things are not optimized for data engineering. The one thing they're great at is BI and reporting through typical tools like Power BI, Tableau, Looker, and dbt, and they have a really good system there. But the thing they aren't good for is data science. And normally what we find with organizations that try to do this is that they have to copy data back out of the warehouse, back onto a data lake, in order to use these ML or data science tools, and that is very expensive to do. And often this tooling is all disparate and disconnected from the main data stack, and this is fraught with friction.
Often customers come to us and they're like, "Can't we just have one system to do everything from BI to AI?" And that's what we think Databricks is, and I'm going to tell you, you can do modern data warehousing on Databricks and then you can enable all the other use cases for data science and machine learning. There is a real stream processing engine in Databricks called Spark Structured Streaming that can read directly from these event buses and process streams at the speed of thought. We also have a large number of partners that have real-time CDC tools that connect to on-prem systems or cloud systems to transport or ETL that data out onto your data lake. And essentially, the foundation of the Lakehouse is Delta Lake, because Delta is really just an open protocol. It's an open-source protocol to deal with your tables on a data lake, and we'll talk a little bit more about the benefits of that. But essentially, Delta is what enables you to have a table on a data lake, and that's how you can do modern data warehousing on Lakehouse.
And then once you have one construct that can be read as a table or as files, you can serve your BI or analytics needs with Databricks SQL using those tables, or your data science and machine learning can leverage the files, because it's all the same construct on your data lake in the open-source format, Delta Lake. Now, often people come to us and they're like, "Well, how do you actually process the data?" In data warehousing, you had this concept of raw, staging, and presentation layers, and it's really the same concept in Delta or Lakehouse. It's bronze, silver, and gold. These are very similar terms. Essentially, bronze or raw is where you land the data that you got from the source, and you want this for lineage purposes; often this can be considered write-optimized, just get the data and land it there. If you're getting data from a vendor, this is where it lands on a cloud object store. If you're getting data from streams, this is where you would land it from your streaming source onto the object store.
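To make the "one construct, read as a table or as files" point concrete, here is a minimal Python sketch; the catalog, table name, and path are hypothetical, and in a Databricks notebook the `spark` session already exists.

```python
# Minimal sketch, assuming a table named main.gold.sales whose underlying
# Delta files live at /mnt/lake/gold/sales (both names are made up).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# BI and SQL users read the governed table.
sales_as_table = spark.read.table("main.gold.sales")

# Data science jobs can read the very same Delta files directly.
sales_as_files = spark.read.format("delta").load("/mnt/lake/gold/sales")

# Same data either way, because both point at one copy in Delta Lake.
print(sales_as_table.count() == sales_as_files.count())
```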
And then, just like what was made popular by Bill Inmon in data warehousing, you want an integration layer, a layer where all of your data can be integrated together. But the data needs to be cleaned up and there needs to be data enrichment; you need to take care of bad dates or nulls or maybe common business logic. And you organize this data by key domains. This is the silver, or staging, layer. You can employ normal forms here, and we'll talk about data modeling a little bit, but this could also be considered something like an ODS, an operational data store. And then you need to actually deliver solutions to your business, and this is where you need the gold layer, or the presentation layer. This is where solutions are built, and here you can build your data models. We recommend Star Schema, but this is your read-optimized layer. This is where your BI tools and your analytics are going to connect in order to get that data and serve it up to your users.
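For readers who want to see the bronze/silver/gold flow in code, here is a rough Delta Live Tables sketch; the table names, landing path, and cleanup rules are invented for illustration, not taken from the webinar.

```python
# Hypothetical DLT pipeline sketching the bronze -> silver -> gold pattern.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw, write-optimized landing of source files")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/orders/")                  # made-up landing zone
    )

@dlt.table(comment="Silver: cleaned, enriched, organized by domain")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .where(F.col("order_id").isNotNull())          # handle bad rows
        .withColumn("order_date", F.to_date("order_date"))
    )

@dlt.table(comment="Gold: read-optimized presentation layer for BI")
def daily_revenue_gold():
    return (
        dlt.read("orders_silver")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )
```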
And you can build all types of things here, like sandboxes or data meshes and even feature stores or data science areas. So, let's talk about dimensional modeling. People often ask, "Can you do dimensional modeling on Lakehouse?" Absolutely. Delta Lake is a table format on a data lake, so essentially you can do all those same things. But we often get asked, "What do you recommend?" Typically, all of the common benchmarks that exist today are something like a Star Schema, like the data warehousing benchmarks. That's what we know works really, really well, because it's what we benchmark. That doesn't mean that other modeling techniques don't work or aren't performant; this is just the one we know is. And some best practices: don't over-normalize. Use conformed dimensions. Definitely use surrogate keys for your joins and your optimizations. You can do slowly changing dimensions for time tracking; Delta Live Tables has a great API called Apply Changes Into which makes this really simple. And then you can use materialized views to scale out your analytics and BI so that you have great performance at runtime.
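As a rough illustration of the Apply Changes Into API Franco mentions, a type 2 slowly changing dimension in the DLT Python API might look like the sketch below; the source feed, keys, and sequencing column are assumptions, so check the current DLT docs for exact signatures.

```python
# Hedged sketch of an SCD type 2 dimension maintained by Delta Live Tables.
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("dim_customer")          # target table DLT will maintain

dlt.apply_changes(
    target="dim_customer",
    source="customer_cdc_feed",                      # hypothetical CDC stream defined elsewhere
    keys=["customer_id"],                            # business key used to match rows
    sequence_by=col("change_timestamp"),             # ordering column from the CDC source
    stored_as_scd_type=2,                            # keep history rows for time tracking
)
```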
What are some common things, some steps to success, for dimensional modeling on Databricks or Lakehouse? Essentially, you should leverage Delta Live Tables, if you don't already have another tool of choice, for ETL on Databricks, together with Databricks SQL. Essentially you get efficient ETL and ELT with query serving. We'll talk a little bit about how we benchmark this in a couple of slides, but essentially you have three things to remember: Optimize, Z-order, and Analyze. Delta Live Tables currently handles Optimize and Z-order for you. It schedules jobs for them so that the tables are kept in sync. And then Analyze collects statistics for the cost-based optimizer to pick the right plan and to do other optimizations at runtime. We're trying to make these tools as simple as possible for our users, but those are just some things you can do to stay ahead of the game.
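For reference, those three maintenance steps can be run as plain SQL from a notebook; the table and column names here are placeholders.

```python
# Compact small files and colocate data on a common filter column.
spark.sql("OPTIMIZE main.gold.daily_revenue ZORDER BY (order_date)")

# Collect statistics so the cost-based optimizer can pick good plans.
spark.sql("ANALYZE TABLE main.gold.daily_revenue COMPUTE STATISTICS FOR ALL COLUMNS")
```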
Sometimes people ask, "What about the data mesh architecture? Can you do data mesh on Lakehouse?" Absolutely. What we find is that data mesh is more of a practice and less of a technology or a tool set. You can essentially use the bronze layer for your source systems, where you get your sources from. In your integration layer, your silver layer, you can build your product domains. And then in your gold layer you can build all of your derived data products, and this is where you can enable your self-service layer, so that you have all of your data workers leveraging your derived data products to provide value to your business.
And then let's talk about a Lakehouse mesh governance model. If you have a self-service layer, you need a way to govern it, and like we learned with Unity, this is how we can enable that on the Lakehouse. When you have enterprise data sources where maybe only a few people have modify access but everyone can have read, you can enable access controls to do that. Or if you want to give business units or departments the ability to manage their own data, you can set different permissions on that database itself. And then if you want to enable users to bring their own data sets to enrich the data even further, you can apply different access controls to that layer as well, and this keeps everyone in their walled gardens to maintain proper data governance on the Lakehouse.
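A sketch of what that layered model can look like as Unity Catalog grants, run here through `spark.sql` from a notebook; the principals and schema names are made up.

```python
# Enterprise sources: broad read access, narrow modify access.
spark.sql("GRANT SELECT ON SCHEMA main.enterprise_sources TO `all-analysts`")
spark.sql("GRANT MODIFY ON SCHEMA main.enterprise_sources TO `central-data-eng`")

# A business unit manages its own database end to end.
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA main.supply_chain TO `supply-chain-admins`")

# A sandbox where users can bring their own enrichment data sets.
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.sandbox TO `all-analysts`")
```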
Next, people often ask, "How can I get good workload management out of Databricks SQL? How do I know I'm getting the best cost optimization over the day?" Databricks SQL warehouses on the backend just kind of manage this for you, and I'll explain how that works. Essentially, the first thing you have to decide is the size. If you don't know, I often recommend Medium and it works out great. This is the first level where, with most cloud instances, you get full network bandwidth, and you really want that in order to have fast performance. And then what you want to do is set your scaling. You can set the scaling to a minimum of one and a maximum of 10. Don't get concerned with looking at the price tag on 10.
Basically, as you see here, over the course of the business day, as users come in in the morning, the Databricks SQL warehouse scales up to handle all of those users' demands, and then as they leave for lunch, it scales down to make sure you're not paying for idle compute, and then when those users come back in the afternoon, it slowly scales up to make sure it's handling their demand appropriately. Behind the scenes, Databricks SQL is looking out for your costs to make sure you're getting the best price-performance in the cloud. And that's not all, because I'm here to tell you the best warehouse is a Lakehouse, and we run a ton of benchmarks to prove it. But I want to talk to you today about two. You've probably already heard of TPC-DS.
We made a big splash in the news about this last year, but to me, that's only one half of the story, because TPC-DS is the industry-standard benchmark for data warehouses for query serving: how fast can you run these 99 queries, and what's the price-performance of that? But how does the data get into a warehouse? It has to be ETL'd or ELT'd, and that's the other half of the equation. So there's another industry-standard benchmark for warehouses for ETL, which is called TPC-DI, or data integration. This benchmark is very old, and we haven't made an official submission yet because its price-performance metric accounts for buying hardware and software and the depreciation over three years, and we just don't do that stuff in the cloud anymore. But generally, every data researcher agrees on the two things that matter most in benchmarking: your throughput and your price-performance.
We're here to tell you that we ran those benchmarks too, and what we found is that with Lakehouse on Delta Lake and Delta Live Tables, we can process 23 million rows per second. That's amazing performance. Not only that, but we were able to process a billion rows for less than a dollar. That's just unheard of in the cloud today. And this is the value that we're trying to provide to you: a great data platform that does everything, data warehousing, ETL, and query serving, and everything machine learning and AI, and you get great price-performance on all of that. And you can see here that our weekly active users are growing very, very fast, because our users and their administrators and the people who pay the bill all agree that Lakehouse is the best warehouse.
You might be thinking, this sounds great, but why do people typically choose Lakehouse? Essentially, it breaks down to the economics of the Lakehouse architecture, because you don't have to make multiple copies of the data. It's one copy in Delta Lake. There's no need to have those copies, so your costs for data storage go down, and this is great. Most people agree we don't want redundant copies of our data sitting around, for data security reasons and cost reasons. And then, because you have a single platform for all of your data and AI needs, you can reduce some of your redundancy in the cloud, because with migrations we often find customers have on-prem tools and cloud tools, and there is a whole lot of overlap between these things. And then, since all of your data is in one place and all of your tooling is on one platform, it simplifies your path to production, and it has never been clearer than with the Lakehouse.
And we generally take a technical approach to migrations, where we do an architecture and infrastructure assessment. Definitely reach out to Databricks; we have a field team that is excellent at this. Then we help you plan your migration, moving your data from wherever it sits to Delta Lake. Then we migrate your pipelines, whether you want to use DLT or one of our partners like Fivetran, Rivery, or any others. And once we've migrated the data and the pipelines, we can point all of your analytics and BI tools to the new Lakehouse in the cloud, and now you're free to explore all those new use cases with data science and machine learning on Lakehouse. If you need help accelerating your migration, whether you need tools like Arcion, Fivetran, or Qlik, or you need system integrators to help you accelerate, there are partners like Accenture, Deloitte, and Cognizant ready to help you out across all three clouds, and we even have Databricks migration specialists to help you package up the different solutions that you need.
When you meet with Databricks, we can help you with our solution accelerators, which are pre-packaged solutions that you can copy and deploy in your organization. We also have professional services and personalized workshops to help you accelerate your success. And let's take a look at Databricks Academy. This is a great resource for all of the data workers across your organization to get the training they need to be effective in their roles. We have an array of Academy learning plans to help all of the different types of workers working with your data in your organization, from a Lakehouse fundamentals learning plan to data analysis or even advanced SQL for Databricks SQL. There are all different types of learning plans for everyone to learn how to deliver value to their business.
And there's even different types of courses. There's free self-paced courses. We have workshops and instructor-led, even certifications if you want to get certified. And here's a look at some of our tracks we have for the data analyst, the data engineer, machine learning practitioner, or the Apache Spark developer, platform administrator, and the architect. Thank you for joining us today. Have a great day.
– [Jillian] All right, Franco, fantastic. Thank you very much to all of our presenters and all of our speakers and attendees for joining us today. It was an awesome session. We had so many questions coming through, so it's now time to kick off our live Q and A with all of our presenters. My name is Jillian [inaudible 01:23:39]. I am going to be your moderator for today. First off, we had lots of questions about how to qualify for the voucher. Just to recap, you will need to answer the feedback survey and take the fundamental course, which is about two hours. The links are in the resources section of this webinar platform. We will also send them to you via email after this webinar, and you will get links to the presentation, the recording as well, of course. We hope you enjoyed it so far. Let's just start with a few questions, starting with some intro basic questions about that. Just to recap, what can Databricks provide that others cannot?
– [Franco Patano] Is it open to anyone to take that?
– [Jillian] Go ahead, Franco.
– [Franco Patano] Hello, everyone. That's a great question. Most customers, they have a lot of choice in the cloud today, and you have cloud providers providing the solutions that are built for their cloud, and you have some vendors that are providing solutions that are built among the clouds. And typically, when I think of this type of question, it's about the confusion of choice. You have all of these different choices that are thrown at you as a consumer of these cloud products, and most of the time, you just need to get your work done. You need to build something, you need to build a platform, and you need to enable all of your users. What I think is something that Databricks provides that no one else does is essentially a unified analytics platform. Where can I build one solution that can service all of my users in my organization? And that's what I think Databricks provides that no one else does.
– [Jillian] Absolutely. We're a unified platform. Does that mean we provide all components required for batch processing, training, data warehousing, all that data science, machine learning?
– [Franco Patano] Yes. Yeah, that means, when we were looking at the architecture earlier that Shant was showing, all of those use cases are supported.
– [Jillian] Fantastic.
– [Franco Patano] You can have one platform to do your ETL, data engineering, machine learning, business intelligence with warehouse-serving all on one copy of your data at Delta Lake.
– [Jillian] Right. And that's how the Lakehouse is better than either the traditional data lake on one side or the traditional data warehouse on the other. Is Databricks available for on-prem deployment?
– [Franco Patano] That is a great question. Databricks is only available on the public clouds, so AWS, Azure, and GCP. But, that doesn't mean that we don't work in a hybrid type of architecture. We have a lot of customers that still have pockets of things on-prem, or they're going through their cloud migration journey. And with some of the features that were announced today, especially Unity and Query Federation, you essentially can connect to your on-prem networks to catalog all your data, and if you want to federate, you have that option. Or if you want to ETL that data into the cloud, you have that option as well.
– [Jillian] That's awesome. Databricks is more like a SaaS model, but do customers have to manage the infrastructure? Do we do it? How much work is needed?
– [Franco Patano] That is another great question. Databricks offers options in this category. Traditionally, how Databricks works is we have a SaaS product that can be considered the UI, the notebooks, the cluster managers, the warehouse managers, but you can actually deploy the infrastructure in your account. It's similar to IaaS, but not really; it's really more of a SaaS product. And then we have our serverless offering, which is completely managed for you.
– [Jillian] Awesome. We have some questions also about how to get started. Some of the attendees ask, do we have any advice about skills that somebody needs to get started on their journey with Databricks? Is it mandatory to know Python or any other language specifically for Databricks? What's your advice on that?
– [Franco Patano] Yeah, I have organizations asking this all the time, and it usually comes out as, "We don't know Spark, how can we use Databricks?" or "I don't know Scala, how can we use Databricks?" or something to that effect. And actually, you don't have to know Spark to use Databricks; you don't even have to know Scala. Actually, the most popular languages on our platform are SQL and Python. SQL is considered the lingua franca of data professionals all over the world. It's the most common language out there. Python is a close second, but we even offer low-code tools on Databricks now. We just announced Bamboolib, which is basically kind of like an Excel macro recorder for Databricks, or essentially for Python. Even if you don't know coding, you don't have to in Databricks; you can use Bamboolib and learn on your own, and you can even go to academy.databricks.com to help you on your journey.
– [Jillian] Yeah. Awesome. Thanks, Franco. I see we have Matei and Shant on the line as well. I hope the audio is functioning now. I just want to go back a little bit to the origins of the Lakehouse and perhaps talk a little bit about Delta as well. We had lots of questions around Delta. Can somebody just summarize how a data lake is different from Delta Lake?
– [Matei Zaharia] Yeah. I can take that.
– [Jillian] Yeah.
– [Matei Zaharia] Oh.
– [Jillian] Go for it, Matei.
– [Matei Zaharia] Okay. Yeah. Traditionally a data lake was basically just a low-cost file-based storage where you could store large amounts of data as they're being produced. You don't have to throw anything away, and then you can query them and organize them later, and then extract more valuable curated data sets...
– [Matei Zaharia] ...extract more valuable, curated data sets from them. Delta Lake and Lakehouse are trying to take the same type of infrastructure, but make it also good for actually working with the curated data for data warehousing. Basically, you get the same kind of low cost, large scalability and easy ingest that you have with the data lake, but then you can have tables that get really good performance. You can have different sorts of statistics and indexes on them, and you can also have transactions and version control and other management features right on there, basically eliminating the need for a second system to do those advanced data management things.
– [Jillian] Very cool. So is it fair to say that, essentially, we can think about a lakehouse as your existing data lake with Delta Lake on top to help manage the data, plus all of the tools that we provide for ETL, data warehousing, streaming, DS/ML, et cetera, and cataloging?
– [Matei Zaharia] Yeah. It's basically just additional functionality on top of the data lake. A key part of that is Delta as the data management and storage layer that supports all those management features.
– [Jillian] Awesome. So we're hearing about other formats available. Could you compare Delta with other open source table formats like Iceberg and Hudi?
– [Matei Zaharia] Yeah, so definitely this Lakehouse model is becoming quite popular and there are other projects that do similar things, that update your data lake to make it easier to manage.
Out of the ones today, I think Delta is the one that's been used the most widely. It's used at basically thousands of companies already, partly because Databricks and Azure and other vendors have built products based on it.
These formats are quickly evolving. But at least today, I think in Delta you'll find a lot more emphasis on scale. It works very well with very large tables and on performance for queries of any size basically. We've tried to really optimize performance because we wanted to take on the data warehousing use case head on. Some of the other projects started more in just the data lake world and these kind of batch workloads. But the formats are all evolving over time.
– [Jillian] Yeah. Awesome. Okay, so that's for Delta.
By the way, as we go through questions, I'll ask attendees to feel free to keep asking questions. We're watching the Q&A panel; they're coming in very fast, but we'll do our best to see what's coming and maybe take some additional ones as we go.
So about Unity Catalog: we've heard from many customers that it truly is a game changer. So could you just recap why customers should adopt Unity Catalog for their data governance needs on the Databricks platform, and what they love about it the most?
– [Matei Zaharia] Yeah, I mean, I think it just makes it a lot easier to manage your data, to manage who has access to it, to audit it, and to set up advanced governance around it. We've seen a lot of people adopt it just to simplify that process. Once you do that, the interface is very similar to managing permissions in a SQL database, so anyone who has the knowledge of how to do that from a system like, say, BigQuery or Redshift or anything like that can now manage the entire lakehouse and all the data sets they already have in there, and give people fine-grained access to pieces of it.
But it is also something that we're building a lot of great new functionality into. For example, the lineage feature is one of the favorite features among early users. It just simplifies a lot of operations when working with data, just to know who's using this data set, when they last used it, and to make sure you don't break anything as you change stuff in there. And we have quite a few other great features coming on top of that as well.
– [Jillian] Anything you'd like to share?
– [Matei Zaharia] I'm excited about the tags and attribute-based access control based on those. Basically, instead of having to set permissions on individual resources, like individual tables, you can set a tag. You can create a tag, for example, this is, say, credit card data and apply that to lots of columns and then just create a rule that applies to all of them that's based on the tag. Then you can just have your data stewards who just figure out which data to tag, or you can do it automatically using background scanning. Then everything has those tags and all the rules apply to them.
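As a loose sketch of the tagging idea Matei describes (the tag and policy features were still rolling out at the time, so treat the syntax as an assumption to verify against current Unity Catalog docs; the table and tag names are invented):

```python
# Mark a column as credit card data so tag-based rules can target it later.
spark.sql("""
    ALTER TABLE main.finance.payments
    ALTER COLUMN card_number
    SET TAGS ('sensitivity' = 'credit_card')
""")
```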
– [Jillian] Yeah, that's very cool.
We had some questions about integration with Unity Catalog. I know you touched on this in your presentation, but again, just to recap, there was so much content today. Unity Catalog works across multiple workspaces. It works with your Active Directory, for example, and works with cataloging solutions like Collibra, Informatica, correct?
– [Matei Zaharia] Yep. Yeah, we've been working with all the major product vendors in the data space to make it work well. So all the BI tools, all the ETL products, and also catalogs like Collibra and Alation that let you catalog everything in your enterprise.
– [Jillian] Very cool.
You also talked about data sharing. So could you talk about some of the key benefits that customers would realize from using Delta sharing on Databricks?
– [Matei Zaharia] Yeah. I mean, it basically makes it easy to share data in your organization with other groups, either in your company or in others. One of the key things with it is the other groups don't need to be running the same platform as you. For example, they don't need to be on Databricks. You can share data with someone who is just doing analysis, say, on a VM, doing data science in Python; they can just directly connect from Python to it. You can share it with someone who's using Power BI; they can just connect in the Power BI user interface. They don't need to set up a data warehouse or anything like that to put the data into. You can share it with anyone that's running Spark in any form, or Databricks SQL, or other tools.
When we talk to users who need to exchange data a lot, the top problem was how hard it is to share to different platforms than what you have, because every platform is trying to use it as a lock-in mechanism to encourage more people to use the same platform because of the convenient sharing. We think in most organizations, you're never going to just replace all that you have in every corner of it and have a single computing engine deployed everywhere. So we'd rather work with everything that's already out there and just make it very convenient for people.
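For example, a recipient who is not on Databricks could read a share from plain Python with the open source delta-sharing client, roughly as sketched below; the profile file and the share/schema/table names are placeholders a data provider would supply.

```python
# pip install delta-sharing
import delta_sharing

profile = "/path/to/config.share"                    # credentials file from the provider
table_url = profile + "#retail_share.sales.orders"   # <share>.<schema>.<table>

# Pull the shared table straight into pandas, no Databricks account required.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```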
– [Jillian] Right. Plus, now, based on this Delta Sharing technology, we also provide solutions like Marketplace and Clean Rooms. Correct?
– [Matei Zaharia] Yeah, and these will have the same benefits, basically, that you can connect to them from any computing platform and you can actually exchange and collaborate on data that way.
– [Jillian] Yeah, yeah. Very cool.
Is there anything else you would like to add on Unity Catalog and Delta Sharing? I have more questions on ingest, DLT, DB SQL, things like that.
– [Matei Zaharia] Yeah, they're both generally available, as I said in the webinar. So yeah, we're very excited to see people try them out and hear feedback.
– [Jillian] Absolutely.
So switching here a little bit. So we've talked about Delta, we've talked about Unity Catalog and data governance and data sharing.
We had lots of questions around ingest and ETL as we went through this webinar as well. One of the questions was, "For data ingestion, do I have to use other ETL tools such as Fivetran, Data Factory, et cetera, or do we have any components provided in Databricks itself?"
– [Shant Hovsepian] I can take that. Hey, everyone, it's Shant.
– [Jillian] Hey, Shant.
– [Shant Hovsepian] Yeah. So Fivetran and Azure Data Factory are very frequently used ingestion tools with Databricks. Fivetran works great when you have various different data sources that you want to bring in. If you're in the Azure stack, ADF is also connected to just about anything. So we have tons of users that use those together with the rest of the Databricks platform.
Out of the box, directly in Databricks with the Workflows product and DLT, there are various different types of ingestion pipelines you can build and data sources you can choose from. Really, that's the beauty of the Lakehouse: you have the flexibility to choose the tools and systems that you're familiar with and that work for you. It's compatible with just about all the leading methods out there.
– [Jillian] So we had some specific questions like, "Is ingestion from block storage supported, for example?"
– [Shant Hovsepian] Yeah, of course. If that's Azure blob storage, yes. Then generally talking about blob storage, wherever the data is stored, yeah, it's always available in the Lakehouse.
– [Jillian] Yep. "Can I import data from Kafka, or other streaming services?"
– [Shant Hovsepian] Yes. So you can use Auto Loader or the general streaming infrastructure to bring in data directly from Kafka topics. That's one of the nice things, you can have the realtime data in the Lakehouse directly with all of your data warehousing type workloads. So you don't ever have to worry about data being too old or stale. You can basically get that realtime feed right in.
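As a hedged sketch of what that looks like with Structured Streaming (the broker address, topic, checkpoint path, and target table are placeholders, and `spark` is the notebook-provided session):

```python
from pyspark.sql import functions as F

# Read a Kafka topic as a stream.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(F.col("value").cast("string").alias("payload"), "timestamp")
)

# Land it continuously in a Delta table that Databricks SQL can query right away.
(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/clickstream")
    .toTable("main.bronze.clickstream"))
```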
– [Jillian] And can we implement CDC using Auto Loader?
– [Shant Hovsepian] Technically, you can. Auto Loader out of the box doesn't have CDC integration with various data sources, unless it was added recently. Franco, maybe you would know better than I would on that one?
Yeah, I think it's compatible. Essentially, for the source that you're reading from, whether it's Postgres or some other database, if you can get the stream to come in with a WAL-log CDC format, then you can have Auto Loader pick that up.
– [Jillian] Okay. So as we have this data coming in and all of these data pipelines, how do we handle the quality in Databricks?
– [Shant Hovsepian] Oh, yeah. So if you've heard about Delta Live Tables, DLT, which is something we made generally available recently during the Data + AI Summit this summer, it's got an amazing feature called expectations, where you can essentially define what you expect your data to look like. When those expectations aren't met as the data's going through the pipeline and transformations, you can redirect those results to an error table or an exceptions table, get an alert, and get a notification from the system. So it's automatically maintained; you just need to define your constraints.
– [Jillian] Very cool. So DLT is our solution for managing your data pipelines' quality, streaming data, et cetera?
– [Shant Hovsepian] Yeah, specifically it's expectations. It's a cool feature; if you look it up in the docs online, it makes it very easy to constantly monitor your data quality and then deal with anything that doesn't meet your requirements.
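A minimal sketch of what an expectation looks like in the DLT Python API; the rule and table names are invented.

```python
import dlt

@dlt.table(comment="Silver orders with data quality rules applied")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows that violate the rule
@dlt.expect("positive_amount", "amount > 0")                   # record violations but keep the rows
def orders_clean():
    return dlt.read("orders_bronze")                           # hypothetical upstream table
```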
– [Jillian] Yeah. We had a question around, I guess, extensibility. So all this data is coming into Delta Live Tables, Delta tables created in Databricks. Can they be consumed through APIs?
– [Shant Hovsepian] Oh, yeah. So especially with the Lakehouse, the value many times is that you can get this real-time, ingested data and make it accessible to everybody in your organization. We've done a lot of work, I talked about some of the new connectors that are available, but there are essentially REST APIs that you can use to get to your data and tables and query them, as well as native SDKs, everything from Java, C#, traditional data sources, Python, Go, and Node, where you can actually get to the data pretty...
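One concrete route from outside the platform is the databricks-sql-connector for Python; a rough sketch follows, with the hostname, HTTP path, token, and table name as placeholders.

```python
# pip install databricks-sql-connector
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT order_date, revenue FROM main.gold.daily_revenue LIMIT 10")
        for row in cur.fetchall():
            print(row)
```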
– [Jillian] Very cool. Okay, so let's talk about Databricks SQL a little more. We touched on serverless compute in the presentation and in the demo for Databricks SQL. So can you clarify, is it using AWS or Azure resources, or Databricks' own compute? How does the infrastructure work?
– [Shant Hovsepian] Oh, yeah. That question came up a couple of times. So Databricks SQL works in two modes.
There's essentially a managed serverless version where all of the infrastructure is managed by Databricks on your behalf, so nothing spins up in your VPC in your accounts and the compute is instantly available and can scale nearly elastically based on your demand.
Then there's what we call DB SQL Classic, which is essentially a set of specially built VMs that Databricks deploys into your VPC, your network, and your cloud account. So it's not fully managed infrastructure; there are some systems and resources that you'll see show up in your cloud console when using Databricks. But you get the flexibility to choose between both, depending on what your use case requirements are.
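Tying this back to Franco's earlier sizing advice, creating a warehouse with a Medium size and 1-to-10 scaling might look roughly like the call below against the SQL Warehouses REST API; the host, token, and exact field names should be checked against the current API docs.

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
token = "<personal-access-token>"                         # placeholder credential

payload = {
    "name": "analytics-wh",
    "cluster_size": "Medium",       # the default size Franco recommends
    "min_num_clusters": 1,          # scale floor
    "max_num_clusters": 10,         # scale ceiling; you only pay for what actually runs
    "auto_stop_mins": 10,           # shut down when idle
}

resp = requests.post(
    f"{host}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json().get("id"))
```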
– [Jillian] Right. Awesome.
So in one of the examples we saw in the demo with Tableau, with the live connection, if the data doesn't change, does it still query the data every time you change the visualization, or will there be a persistent cache on the serverless connection?
– [Shant Hovsepian] Oh, yeah. No, for sure. There's a lot of intelligence there, and this is the beauty of Delta: it adds a logical, reliable ability to reason about the data in the Delta Lake, so we have what is traditionally the equivalent of a transaction snapshot ID. And when someone queries the same data, and it's very 80/20, 80% of users tend to run the same queries over and over again, caching is a huge benefit.
Behind the scenes we use Delta's snapshot transaction isolation to see, "Oh, the last time this query ran, we have those results, they're saved, we can just reuse them and not run this query all over again," as long as we know for a fact that Unity Catalog tells us nothing has changed with the security permissions on the data and the user, and Delta tells us the underlying data hasn't been updated, so the transaction snapshot hasn't changed. In those cases, it'll immediately serve the data from the cache. It'll be super snappy.
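You can see the snapshot idea directly on any Delta table: every write produces a new version, which is what lets cached results be reused safely when nothing has changed. The table name below is a placeholder.

```python
history = spark.sql("DESCRIBE HISTORY main.gold.daily_revenue")
history.select("version", "timestamp", "operation").show(5)
```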
– [Jillian] Yeah, yeah. Thanks, Shant.
So Miranda, it was an amazing demo. Thank you so much for putting all of this together. You showed us some very cool existing and upcoming capabilities in Databricks SQL. Python UDFs and materialized views specifically are still in private preview. Can you comment on when they'll be publicly available, maybe, or what's coming next?
– [Miranda] Sure. Yep, you are correct that both Python UDFs and Materialized Views are in private preview. The exact timing of a public preview availability will depend on just how we hit some of our exit criteria and the feedback we get during that private preview. So highly encourage anyone interested to sign up so that we can go ahead and get you early access and you can kind of help shape exactly what that experience looks like.
Next on deck is going to be Query Federation. So that private preview has not kicked off yet, but that's coming. If there's any interest there, again expressing, letting us know now so we can get you on the list and reach out when we're closer.
– [Jillian] Very cool. Thanks. Can you clarify how DB SQL, Databricks SQL, differs from Spark SQL?
– [Shant Hovsepian] Oh, I can take that. First and foremost, Databricks SQL is a whole service. It's a platform for your data warehousing needs. Databricks SQL is ANSI-standard SQL compliant, and it's built on a modern MPP architecture.
Spark SQL is a part of the Apache Spark project, and it's essentially the expression layer that you use to define SQL transformations and SQL expressions while you're working with Spark. You can embed it in your Python, you can use it with the rest of the Spark API. It's not fully ANSI SQL compliant, and it's not a fully hosted, managed service to run all of your workloads. It's really just one of the parts of the general Spark project.
– [Jillian] Yeah, they're very, very different. But we had a few questions about this, so wanted to clarify.
I loved one of the comments we had throughout the presentation. So Nathalie said, "Okay, so I can see you have parity with other cloud data warehouses, right? Data masking, materialized views, [inaudible 01:49:18], time travel. So how is Databricks SQL different from a data warehouse? How's Databricks different from a data warehouse?" Could you clarify that again?
– [Shant Hovsepian] Oh, this is a good one. Who wants to take this? I could do it, too.
– [Jillian] Well, Franco, I know that this is a question a lot of your customers are asking you, as well. He was put on mute.
– [Shant Hovsepian] Well, okay. So I'll take a first pass at it. The most important thing, and one of the biggest ahas we've seen with Databricks SQL: yes, you can do your ETL, you can do your warehousing, you can do your BI. It's very flexible. It's extremely cost-effective. That's the beauty of being built on the open standards and open technology of the Lakehouse. There's huge value there that you get from a TCO perspective.
But the aha moments we've really seen are when people can do simple AI and real-time stream processing trivially with that data in the data warehouse. With DB SQL, you don't have to move the data to another system, you don't need to copy it. You can get those predictive insights almost instantly from the exact same data. So if we go back to that data and AI maturity curve I was talking about: because you don't need to switch systems, you don't need to reload the data, you don't need to change your tooling set, you don't need to get retrained on it, you can go from basic BI to an AI aha with just a couple of commands. So it's very simple to go along that whole journey.
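A hedged PySpark sketch of that "same data, BI then AI" idea; the catalog, table and column names are hypothetical, and the model is deliberately trivial:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.getOrCreate()

    # BI-style question: revenue by region (the table name is illustrative).
    orders = spark.table("main.sales.orders")
    orders.groupBy("region").agg(F.sum("revenue").alias("total_revenue")).show()

    # Simple AI on the exact same data: predict revenue from order size,
    # no export, no copy, no second system.
    features = VectorAssembler(inputCols=["order_size"], outputCol="features")
    train = features.transform(orders.select("order_size", "revenue"))
    model = LinearRegression(featuresCol="features", labelCol="revenue").fit(train)
    print(model.coefficients, model.intercept)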
– [Jillian] Yeah, yeah. So I think to summarize, Databricks SQL is a serverless data warehouse on the Lakehouse, but it's part of a broader platform that does ETL, streaming, data science, machine learning, all in one. That's, I guess, one of the key differentiators.
Actually, I see one question coming through on the screen that's a great segue into the next section, because they wanted to talk about Photon a little bit. So can you elaborate on the difference between Photon and Databricks SQL?
– [Shant Hovsepian] Oh, yeah. So Photon is a general purpose compute engine. Essentially, it's a technology, the engine behind the scenes that crunches numbers. So if you take a car, let's say a pickup truck, and you look at it from the outside, it's a pickup truck, but it may have a V8 engine or it may have a V4 engine. In the case of Photon, it's like a V12 superpowered engine. DB SQL is the truck, right? DB SQL is a data warehouse product for your Lakehouse architecture. It's got SQL, it integrates with Unity Catalog, [inaudible 01:52:06] governance and all those things. Photon is really just that new MPP engine built for modern CPUs. Most data warehouses out there were designed 30, 40 years ago, probably before I was born, and they were built for the type of hardware that existed back in the day. Photon is really optimized for the new set of modern CPU and data center technologies that exist.
– [Jillian] Okay, thanks, Shant. So can you elaborate on how it compares with Spark? What was the journey from Spark to Photon, and what's the difference between the two?
– [Shant Hovsepian] Yeah. First of all, if people are interested in this topic, we published a great academic paper a few months ago at the SIGMOD conference. If you just Google for the Photon SIGMOD paper, you'll find it, and we can provide some links. It's got way more details and information.
But yeah, the beauty of Photon is, it's essentially 100% Spark API compatible. Spark is a much bigger distributed compute system. It's not just an execution engine; it also has different APIs, like the RDD API, whereas Photon is really specifically focused on the DataFrame API. Spark has task schedulers, job management, a lot of the things you need for distributed processing: broadcast variables, scheduling. So Spark is a bigger system, and Photon is more just the expression evaluation engine. They work hand in hand with each other.
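A rough illustration of that split, under the assumption that DataFrame and SQL expressions are what the Photon path can pick up, while Python RDD lambdas stay on the regular Spark path:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # DataFrame expressions like these compile to a query plan an engine
    # such as Photon can execute.
    df.filter(F.col("id") % 2 == 0).agg(F.sum("id")).show()

    # RDD-style code with arbitrary Python lambdas bypasses that plan entirely
    # and runs on the general Spark machinery.
    total = df.rdd.map(lambda row: row.id).filter(lambda x: x % 2 == 0).sum()
    print(total)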
– [Jillian] Absolutely. Down the line, Photon is Databricks proprietary, right? So it's compatible with the Spark APIs like you mentioned, but it's becoming our default engine over time on the Lakehouse platform. Whereas Spark will remain open source, and we're still committed to Spark. Photon is accelerating Spark in our platform, is that right?
– [Shant Hovsepian] Yes. Apache Spark on Databricks is called the Databricks Runtime. That's essentially our version of Spark. It has a bunch of enhancements and features. Mind you, Databricks loves open source and we're 100% committed to it. We make the most contributions and enhancements to Apache Spark, so we do a lot of work there. Our version of Apache Spark is the Databricks Runtime, called DBR for short. And Photon is an acceleration piece for DBR, the Databricks Runtime.
– [Jillian] Yeah. Is there anything special that customers need to do to use Photon on the platform?
– [Shant Hovsepian] Absolutely not. They just need to make sure they're using... I tend to recommend a very recent version of the Databricks Runtime. So when you use a Data Science and Engineering workspace cluster, you want to go with a recent DBR version; the most recent one you can find is 11.2 right now. Once you pick that DBR version, there'll be an option that says, "Enable Photon." You just check that box and you'll get the awesome features and functionality.
– [Jillian] Yeah. And on Databricks SQL, it's just on by default, so there's really nothing to worry about.
– [Shant Hovsepian] With Databricks SQL, Photon is one of all sorts of things buried inside that make it awesome. It's very much a self-contained data warehouse product, whereas in Spark, Photon is more like a feature you enable.
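For anyone scripting cluster creation rather than using the checkbox, here is a hedged sketch of a Clusters API payload. The field names reflect the public Databricks Clusters API as best recalled here, so verify them against the current docs; the cluster name and node type are placeholders:

    import json

    cluster_spec = {
        "cluster_name": "photon-demo",          # illustrative name
        "spark_version": "11.2.x-scala2.12",    # pick a recent DBR, as recommended above
        "node_type_id": "i3.xlarge",            # example AWS node type
        "num_workers": 2,
        "runtime_engine": "PHOTON",             # API equivalent of the "Enable Photon" checkbox
    }
    print(json.dumps(cluster_spec, indent=2))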
– [Jillian] Right. We're almost at the top of the hour, so this is my last question for you.
This is a pricing question, because we had a few and I think it's important to address. "Is Unity Catalog available on standard Databricks or premium? How about Databricks SQL? Are there any extra costs for enabling Photon acceleration on the cluster, as well?"
If you could talk to that, it would be fantastic and then we'll be ready to wrap up.
– [Shant Hovsepian] Cool. So right now Unity Catalog is available in the premium and enterprise SKUs, not in the standard SKU. The Databricks standard SKU in general doesn't have table access controls or various features like that, so a lot of the governance features that are important for many types of data warehousing workloads aren't available in the standard SKU. Databricks SQL also isn't available in the standard SKU; it's available in the premium and enterprise SKUs. Nobody else on the call is correcting me, so I believe that is still true and it hasn't changed. But suffice it to say, we want data to be secure and governed everywhere, so I think we'll keep finding better ways of unlocking Unity Catalog for every workload and every... I think the second question you asked was about-
– [Jillian] Yeah, and for Photon, as well. Do you want to speak about that?
– [Shant Hovsepian] Yeah, yeah. When you enable Photon, and I showed some of this in the slides, we've seen customers on average get something like a 30% savings in overall TCO. So not only do things run faster, they cost less money, because you don't need to keep your compute resources up as long. Overall, we've seen tremendous TCO savings. So at the end of the day, it won't cost you any more money to use Photon; it ends up costing less money in total. That said, in Databricks, when you do enable Photon, it's a different DBU billing rate. It's charged at a different rate, but you won't keep your compute resources up as long because everything's so much faster, so you always end up saving money in the long run.
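A back-of-the-envelope illustration of that trade-off; the rates and runtimes below are invented for the example, not Databricks list prices:

    # Hypothetical numbers only: a higher DBU rate can still mean a lower bill
    # if the job finishes faster and the compute shuts down sooner.
    standard_rate, photon_rate = 0.40, 0.55      # $ per DBU-hour (assumed)
    standard_hours, photon_hours = 3.0, 1.5      # same job, Photon assumed ~2x faster

    standard_cost = standard_rate * standard_hours
    photon_cost = photon_rate * photon_hours
    print(f"standard: ${standard_cost:.2f}, photon: ${photon_cost:.2f}")
    # With these assumptions, Photon comes out cheaper overall despite the higher rate.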
– [Jillian] Yes, absolutely. Thank you, Shant.
We're now at the top of the hour, so that concludes our event for today. I just want to say thank you again to all of our attendees for your time sitting with us today, and to our presenters for creating all of this amazing content. We hope it was helpful and we-