We recently held a virtual event, featuring CEO Ali Ghodsi, that showcased the vision of Lakehouse architecture and how Databricks helps customers make it a reality. Lakehouse is a data platform architecture that implements similar data structures and data management features to those in a data warehouse directly on the low-cost, flexible storage used for cloud data lakes. This new, simplified architecture allows traditional analytics, data science, and machine learning to co-exist on the same platform, removes data silos, and enables a single source of truth for organizations.
In the event, we shared our ideas around the Lakehouse and how to implement it, highlighted examples of customers who have transformed their data landscape, and demoed our new SQL Analytics service that completes the vision of the Lakehouse. Most exciting, though, was the incredible engagement we had from the audience. Today, we want to share the most popular of the hundreds of interesting and valuable questions we received from the audience. For those who were unable to attend, feel free to take a look at the event on demand here.
Q&A from the virtual event
What is Delta Lake and what does it have to do with a Lakehouse?
Delta Lake is an open format storage layer that delivers reliability, security and performance on your data lake — for both streaming and batch operations. Delta Lake eliminates data silos by providing a single home for structured, semi-structured, and unstructured data to make analytics simple and accessible across the enterprise. Ultimately, Delta Lake is the foundation and enabler of a cost-effective, highly scalable Lakehouse architecture.
Why is it called Delta Lake?
There are two main reasons for the name Delta Lake. The first reason is that Delta Lake keeps track of the changes or “deltas” to your data. The second is that Delta Lake acts like a “delta” (where a river splits and spreads out into several branches before entering the sea) to filter data from your data lake.
Can I create a Delta Lake table on Databricks and query it with open-source Spark?
Yes. To do this, install Apache Spark and Delta Lake, both of which are open source. Delta Engine, which is available only on Databricks, makes Delta Lake faster than it is on open-source Spark and comes with full support. Read this blog for more information.
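As a rough illustration of the open-source route, here is a minimal PySpark sketch. It assumes `pyspark` is installed and that you pass in Maven coordinates for a Delta Lake release that matches your Spark version. The function names, the app name, and the table path in the usage note are hypothetical; the two `spark.sql.*` settings are the ones the open-source Delta Lake project documents for enabling Delta on Spark.

```python
from typing import Dict


def delta_spark_conf(delta_package: str) -> Dict[str, str]:
    """Return the Spark settings that enable Delta Lake on open-source Spark.

    `delta_package` is the Maven coordinate of the Delta Lake release to load.
    It is supplied by the caller (an assumption of this sketch) because the
    right version depends on the Spark version in use.
    """
    return {
        "spark.jars.packages": delta_package,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def read_delta_table(path: str, delta_package: str):
    """Open the Delta table stored at `path` and return it as a Spark DataFrame."""
    # Imported lazily so the configuration helper above works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("read-delta-table")
    for key, value in delta_spark_conf(delta_package).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # A table written on Databricks is read the same way as any other Delta table.
    return spark.read.format("delta").load(path)
```

For example, `read_delta_table("/tmp/events", "io.delta:delta-spark_2.12:3.1.0").show()` would print the rows of a table stored at the hypothetical path `/tmp/events`, provided that package version is compatible with your Spark installation.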
What file format is used for Delta Lake?
The file format used for Delta Lake is called delta, which is a combination of Parquet and JSON: the table data is stored in Parquet files, and the transaction log is stored as JSON files.
A Delta Lake table is a directory on a cloud object store or file system that holds data objects with the table contents and a log of transaction operations (with occasional checkpoints). Learn more here.
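To make that layout concrete, the sketch below walks a table directory on a local file system and sorts what it finds into data files, transaction log commits, and checkpoints. It relies on the naming used by the open-source Delta Lake format (a `_delta_log` subdirectory with zero-padded, 20-digit version numbers); the function name and the returned dictionary keys are hypothetical, and this is an inspection aid rather than a Delta reader.

```python
import re
from pathlib import Path
from typing import Dict, List

# Commit files are named by a zero-padded 20-digit table version, e.g. 000...000.json
COMMIT = re.compile(r"^\d{20}\.json$")
# Checkpoints are Parquet files, either single-part or split into numbered parts
CHECKPOINT = re.compile(r"^\d{20}\.checkpoint(\.\d+\.\d+)?\.parquet$")


def describe_delta_table(table_dir: str) -> Dict[str, List[str]]:
    """Classify the files of a Delta table directory by their role."""
    root = Path(table_dir)
    log_dir = root / "_delta_log"
    layout: Dict[str, List[str]] = {"data_files": [], "commits": [], "checkpoints": []}

    # Data objects: Parquet files anywhere under the table, except inside the log
    for path in sorted(root.rglob("*.parquet")):
        if log_dir not in path.parents:
            layout["data_files"].append(path.relative_to(root).as_posix())

    # Transaction log: JSON commits plus occasional Parquet checkpoints
    if log_dir.is_dir():
        for path in sorted(log_dir.iterdir()):
            if COMMIT.match(path.name):
                layout["commits"].append(path.name)
            elif CHECKPOINT.match(path.name):
                layout["checkpoints"].append(path.name)

    return layout
```

Calling `describe_delta_table` on a local copy of a table shows how many commits have accumulated since the last checkpoint. Note that it lists every Parquet file on disk, including files that later commits may have logically removed from the table.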
What is SQL Analytics?
SQL Analytics provides a new, dedicated workspace for data analysts that uses a familiar SQL-based environment to query Delta Lake tables on data lakes. Because SQL Analytics is a completely separate workspace, data analysts can work directly within the Databricks platform without the distraction of notebook-based data science tools (although we find data scientists really like working with the SQL editor too). However, since the data analysts and data scientists are both working from the same data source, the overall infrastructure is greatly simplified and a single source of truth is maintained.
SQL Analytics enables you to:
- integrate with the BI tools you use today, like Tableau and Microsoft Power BI, to query the most complete and recent data in your data lake
- complement existing BI tools with a SQL-native interface that allows data analysts and data scientists to query data lake data directly within Databricks
- share query insights through rich visualizations and drag-and-drop dashboards with automatic alerting for important changes in your data
- bring reliability, quality, scale, security, and performance to your data lake to support traditional analytics workloads using your most recent and complete data
Where can I learn more about SQL performance on Delta Lake?
To learn more about SQL Analytics, Delta Lake, and the Lakehouse architecture (including performance), check out this free two-part training. In the training, we explore the evolution of data management and the Lakehouse. We explain how this model enables teams to work in a unified system that provides highly performant streaming, data science, machine learning, and BI capabilities powered by a greatly simplified single source of truth.
In the hands-on portion of these sessions, you’ll learn how to use SQL Analytics, an integrated SQL editing and dashboarding tool. Explore how to easily query your data to build dashboards and share them across your organization. And find out how SQL Analytics enables granular visibility into how data is being used and accessed at any time across an entire Lakehouse infrastructure.
Is SQL Analytics available?
SQL Analytics is available in preview today. Existing customers can reach out to their account team to gain access. Additionally, you can request access via the SQL Analytics product page.
This is just a small sample of the amazing engagement we received from all of you during this event. Thank you for joining us and helping us move the Lakehouse architecture from vision to reality. If you haven’t had a chance to check out the event, you can view it here.