Databricks SQL demo overview
Hello everyone, as you heard, the lakehouse is the best data warehouse because it combines a data warehouse’s speed with a data lake’s scale and flexibility. In this demo, I’ll show you how data teams can use Databricks SQL, a serverless data warehouse on the Databricks Lakehouse Platform, to analyze revenue trends for a large multinational company.
First, let’s look for revenue data for analysis. Because the lakehouse may contain thousands of data objects, such as tables or views, you can use Unity Catalog’s built-in search capability to find relevant data across tables, queries, and dashboards. You will only see results for data that you have access to because Unity Catalog governs the lakehouse. Let’s select revenue trends by country, which opens the data explorer.
The data explorer shows all catalogs, schemas, tables, and views you can access. Within the data explorer, you will see the structure of this object, such as column names, data types, and comments. You can preview sample data to visually inspect the object’s contents, or view metadata information — in this case, the view definition. You can grant or restrict access for users or groups, authorizing them to read or modify this data object, or manage access controls via familiar SQL syntax.
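As a sketch of what that SQL looks like — with illustrative catalog, schema, object, and group names — these Unity Catalog statements grant, revoke, and inspect privileges:

```sql
-- Grant read access on the view to an analyst group (names are illustrative)
GRANT SELECT ON VIEW main.sales.revenue_trends_by_country TO `data-analysts`;

-- Revoke modify rights from another group
REVOKE MODIFY ON VIEW main.sales.revenue_trends_by_country FROM `contractors`;

-- Inspect the current grants on the object
SHOW GRANTS ON VIEW main.sales.revenue_trends_by_country;
```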
Lastly, lineage shows you how data flows upstream and downstream in the lakehouse. Here, you can see how revenue trends is derived from, or consumed by, other tables, notebooks, workflows, or dashboards. Alternatively, a high-fidelity lineage diagram displays the data flow graphically. You can traverse this diagram and explore other data assets that can help with the analysis. For example, you could consider using the revenue forecast by priority table that’s derived from orders.
Typically, you may need to augment data from multiple sources. To ingest data into the lakehouse, Databricks SQL allows you to add data from various natively integrated sources or connect your data with your preferred tools such as Fivetran.
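One way to ingest files from cloud storage is the `COPY INTO` command, which incrementally loads only new files on each run. A minimal sketch, assuming an illustrative bucket path and target table:

```sql
-- Incrementally load new CSV files from cloud storage into a lakehouse table
-- (path and table name are illustrative)
COPY INTO main.sales.orders
FROM 's3://example-bucket/landing/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```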
Let’s go back to the data explorer.
Now that you understand the context around this dataset, you can create a query or a quick dashboard. But let’s look at table insights to see if there are existing queries using this same table that can speed up the analysis.
Databricks SQL offers a full-featured SQL editor that allows you to author, run, and schedule queries. As you can see, it’s a familiar ANSI SQL statement — reading from multiple tables with specific criteria to identify total revenue by region — and you can modify this query to suit your needs. This query will be executed on a serverless SQL warehouse, which comes online in seconds and is optimized for high-concurrency workloads, with minimal management required and capacity optimizations that lower overall cost.
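A query of the shape described might look like the following — the table and column names here are illustrative, not the demo’s actual schema:

```sql
-- Total revenue by region (table and column names are illustrative)
SELECT r.region,
       SUM(o.amount) AS total_revenue
FROM   main.sales.orders o
JOIN   main.sales.regions r
  ON   o.region_id = r.region_id
WHERE  o.order_date >= '2023-01-01'
GROUP BY r.region
ORDER BY total_revenue DESC;
```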
As the query executes, you can use a query profile to view the query execution details which helps you troubleshoot performance bottlenecks during the query’s execution by detailing each task’s operations.
When the query results return, and within the same editor, Databricks SQL gives you the ability to create a visualization. For this query, Databricks SQL automatically suggests a visualization type but you have the option to customize with out-of-the-box chart types.
SQL is great, but you may require functionality beyond what’s expressible in SQL. With the introduction of Python UDFs, you can define custom functions in Databricks SQL using Python, which can then be called as part of SQL statements and queries — for example, masking PII such as email addresses in the forecasting dataset.
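A sketch of such a masking UDF, using Databricks SQL’s `LANGUAGE PYTHON` syntax (the function, table, and column names are illustrative):

```sql
-- Define a Python UDF that masks the local part of an email address
-- (function and column names are illustrative)
CREATE OR REPLACE FUNCTION main.sales.mask_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
  if email is None:
      return None
  local, _, domain = email.partition('@')
  return '***@' + domain if domain else '***'
$$;

-- Call it like any other SQL function
SELECT main.sales.mask_email(contact_email) AS masked_email
FROM   main.sales.forecast_contacts;
```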
In another scenario, your data may reside in various enterprise production systems outside the lakehouse. In this case, you can directly connect to multiple data sources using query federation, where you configure a read-only connection to remote databases with automatic, intelligent pushdown for better performance — in this example, we’re going to forecast future revenue for a specific region.
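Setting up federation involves creating a connection and then exposing the remote database as a foreign catalog. A sketch with illustrative connection details — the host, secret scope, and names here are assumptions:

```sql
-- Create a read-only connection to an external PostgreSQL database
-- (connection details are illustrative; credentials come from a secret scope)
CREATE CONNECTION IF NOT EXISTS postgres_sales TYPE postgresql
OPTIONS (
  host 'db.example.com',
  port '5432',
  user secret('sales_scope', 'pg_user'),
  password secret('sales_scope', 'pg_password')
);

-- Expose the remote database as a foreign catalog in Unity Catalog
CREATE FOREIGN CATALOG IF NOT EXISTS external_sales
USING CONNECTION postgres_sales
OPTIONS (database 'sales');
```

Once created, tables in the remote database can be queried like any other catalog, with filters and aggregations pushed down to the source where possible.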
Lastly, Databricks SQL supports materialized views to accelerate end-user queries and reduce infrastructure costs through efficient incremental computation. Here we’re going to create an up-to-date, aggregated view of order priority for use in BI and analytics — without reprocessing the full underlying tables, updating only where changes have come through.
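A minimal sketch of such a materialized view, again with illustrative table and column names:

```sql
-- Materialized view aggregating revenue by order priority
-- (table and column names are illustrative)
CREATE MATERIALIZED VIEW main.sales.revenue_by_priority
AS
SELECT order_priority,
       SUM(amount) AS total_revenue
FROM   main.sales.orders
GROUP BY order_priority;

-- A refresh picks up only changes in the underlying table when
-- incremental computation is possible, rather than recomputing from scratch
REFRESH MATERIALIZED VIEW main.sales.revenue_by_priority;
```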
Now, many organizations use BI tools like Tableau or Power BI. With Databricks SQL, you can seamlessly integrate your existing BI tools. As we switch to Tableau, a data analyst gets instant access to lakehouse data — through the same Databricks serverless SQL warehouse we’ve been using in this demo — to gain further actionable insights for driving revenue.
To summarize, you’ve seen how Databricks SQL, a serverless data warehouse, allows data teams to securely find data, author and run ad hoc queries on the lakehouse platform, create rich out-of-the-box visualizations, and connect to your favorite BI tools like Tableau to analyze revenue trends for a large multinational organization.