The Azure Databricks Lakehouse Platform gives you the best of data lakes and data warehouses, on a simple, open and collaborative platform that securely integrates with your existing Azure services. In this demo, we cover several of the most common Azure Databricks integrations, including Azure Data Lake Storage (ADLS), Azure Data Factory (ADF), Azure IoT Hub, Azure Synapse Analytics, Power BI and more.
Dive deeper into Azure Databricks
Intro to Lakehouse on Azure Databricks
Azure Databricks is the jointly developed Data + AI service from Databricks and Microsoft for data engineering, data science, analytics and machine learning. The Azure Databricks Lakehouse Platform gives you the best of data lakes and data warehouses, on a simple, open and collaborative platform that securely integrates with your existing Azure services.
Batch ETL pipelines with Azure Data Factory and Azure Databricks
Orchestrating data pipelines for extract, transform and load (ETL) operations is easier than ever thanks to the integration between Azure Databricks and Azure Data Factory. In this demo, we’ll be working with power output and maintenance system records from a fleet management system used to maintain wind turbines across the United States. We’ve created an Azure Data Factory pipeline that uses COPY commands to incrementally ingest data from the fleet management system in SQL server to our landing zone in Azure Data Lake Storage Gen2. Data Factory then executes a Databricks notebook to process the data using Databricks Auto Loader, and saves it into Delta Lake tables.
Click to expand the transcript →
Click to collapse the transcript →
You create the Azure Data Factory to Azure Databricks integration by adding an Azure Databricks Notebook activity to the pipeline. In the activity, I add a new Azure Databricks Linked Service pointing to an Azure Databricks workspace and make the proper configuration to use an existing Interactive Cluster for my compute. Finally, I select a notebook path in the Databricks workspace that will be used to transform the data and load it into our Delta Lake.
When I run the pipeline, the output window shows the different activities that have been run, and in my Databricks activity, I can actually browse to the notebook that was executed when the pipeline ran.
We can see from the metadata this notebook executed successfully, and it took 23 seconds.
The real work in the code initiates when I create a DataFrame called “Maintenance Header DF” that reads the Parquet file that ADF drops to the landing zone. Even though I am reading Parquet data, my format is “cloud files” because I am using Auto Loader to automatically detect new files as they land in my landing zone. I also create a DataFrame for my power output data. I write my DataFrame to my Delta Lake tables using the writeStream command, specifying the format as Delta, and providing the table write location in Azure Data Lake storage. I’m also using an option called “triggerOnce” to program my code as a streaming pipeline, but run it as a batch-oriented execution.
Streaming data pipelines with IoT Event Hubs and Spark Streaming
Next, we’re using an IoT simulator provided by Azure to send live streaming data payloads similar to those created by real wind turbines into Azure IoT Event Hubs. Once we press the “Run” button, the simulator starts sending telemetry and weather data to the IoT Hub every second.
Within Event Hubs metrics, we can see that the simulator is running and sending new messages to Event Hubs as we speak.
In Azure Databricks, we created a notebook that reads this streaming data from the IoT Event Hub in real time. The first part of the code parses raw JSON data from the IoT devices into DataFrame columns. As you can see, the connection is really simple using the Spark connector to Event Hubs. We then split the weather data and turbine data into different DataFrames, and store the data in Delta format. Delta format is the foundation of the Delta Lakehouse architecture, and it provides streaming transactions and incremental data loads on top of your data lake.
In this visualization of the streaming DataFrame, you can see data being refreshed live, with new data points arriving every second.
We can also create a rolling aggregation for each one-hour rolling window, and store it in Delta format in the Silver layer, for downstream data analysts and applications to use.
Serving Delta Lake Gold tables with Azure Synapse Dedicated SQL Pools
In the third part of the demo, we’re enhancing data by combining information from the maintenance system (loaded by batch processing) with IoT and telemetry data (loaded by streaming). Then, we’re storing the result in a Delta Lake Gold table on Azure Data Lake Storage, and also publishing it to Azure Synapse Analytics Dedicated SQL pool.
The Delta Lake Gold table will be used for data science and machine learning, as well as data exploration and data analytics performed directly in Databricks Notebooks and Databricks SQL Analytics. The version loaded into a Synapse Dedicated SQL Pool is typically a smaller table with higher-level aggregation used for high-concurrency, low-latency workloads.
Azure Databricks provides a high-performance connector to Synapse SQL Pool. This connector supports both batch and streaming refreshes. In this case, the turbine Gold table is being continuously refreshed in streaming updates.
Browsing and visualizing data in SQL Analytics
To streamline processes and get data into the hands of the business faster, many organizations are enabling reporting directly off the Delta Lake using Databricks SQL Analytics and Power BI. As a business analyst, I can see what data is available to me in SQL Analytics. I create a new query by clicking the “New Query” button in the UI. In the data browser, I can connect to my IoT demo database and see the tables available to me and their metadata. I can then query my turbine Gold table, browse the data and even make simple data visualizations.
Connecting to Power BI with SQL Analytics endpoints
But if I really need to make my data shine, Azure Databricks integrates with Power BI through an optimized connector that provides a secure, more interactive data visualization experience for data stored in your data lake. To connect Power BI to Databricks SQL Analytics, I browse to Endpoints, and review the connection information of my endpoint on the connection details tab. From within Power BI, I click on “Get Data” > “more,” > “Azure,” > “Azure Databricks,” and then “connect.” I copy the server hostname and HTTP path information from my endpoint and put it into Power BI. Optionally, I can enter my database name, select “Import” or “DirectQuery,” and then click “OK”.
When the table and view list comes up in the navigator UI, I can select the turbine Gold table and click “Load.” Once Power BI has loaded my data into the data model, I can begin dragging and dropping the field list onto my canvas to dissect the data interactively, or even create visually appealing, enterprise-ready reports and dashboards that can be published to the Power BI service and shared with my teams.