Company Blog

Connect 90+ Data Sources to Your Data Lake with Azure Databricks and Azure Data Factory

Share this post

Data lakes enable organizations to consistently deliver value and insight through secure and timely access to a wide variety of data sources. The first step on that journey is to orchestrate and automate ingestion with robust data pipelines. As data volume, variety, and velocity rapidly increase, there is a greater need for reliable and secure pipelines to extract, transform, and load (ETL) data.

Databricks customers process over two exabytes (2 billion gigabytes) of data each month and Azure Databricks is the fastest-growing Data & AI service on Microsoft Azure today. The tight integration between Azure Databricks and other Azure services is enabling customers to simplify and scale their data ingestion pipelines. For example, integration with Azure Active Directory (Azure AD) enables consistent cloud-based identity and access management. Also, integration with Azure Data Lake Storage (ADLS) provides highly scalable and secure storage for big data analytics, and Azure Data Factory (ADF) enables hybrid data integration to simplify ETL at scale.

Batch ETL with Microsoft Azure Data Factory and Azure Databricks
Diagram: Batch ETL with Azure Data Factory and Azure Databricks

Connect, Ingest, and Transform Data with a Single Workflow

ADF includes 90+ built-in data source connectors and seamlessly runs Azure Databricks Notebooks to connect and ingest all of your data sources into a single data lake. ADF also provides built-in workflow control, data transformation, pipeline scheduling, data integration, and many more capabilities to help you create reliable data pipelines. ADF enables customers to ingest data in raw format, then refine and transform their data into Bronze, Silver, and Gold tables with Azure Databricks and Delta Lake. For example, customers often use ADF with Azure Databricks Delta Lake to enable SQL queries on their data lakes and to build data pipelines for machine learning.

Bronze, Silver, and Gold tables with Azure Databricks, Azure Data Factory, and Delta Lake

Get Started with Azure Databricks and Azure Data Factory

To run an Azure Databricks notebook using Azure Data Factory, navigate to the Azure portal and search for “Data factories”, then click “create” to define a new data factory.

Create a data factory from the Azure portal

Next, provide a unique name for the data factory, select a subscription, then choose a resource group and region. Click "Create".

Define a new data factory

Once created, click the "Go to resource" button to view the new data factory.

Click Go to resource once data factory deployment is complete

Now open the Data Factory user interface by clicking the "Author & Monitor" tile.

Ready to Author & Monitor the data factory

From the Azure Data Factory “Let’s get started” page, click the "Author" button from the left panel.

Azure Data Factory Let's get started

Next, click "Connections" at the bottom of the screen, then click "New".

Data factory connections

From the "New linked service" pane, click the "Compute" tab, select "Azure Databricks", then click "Continue".

Azure Databricks linked compute service

Enter a name for the Azure Databricks linked service and select a workspace.

Name the Azure Databricks linked service

Create an access token from the Azure Databricks workspace by clicking the user icon in the upper right corner of the screen, then select "User settings".

User settings

Click "Generate New Token".

Generate New Token

Copy and paste the token into the linked service form, then select a cluster version, size, and Python version. Review all of the settings and click "Create".

Choose cluster version, node type, and Python version

With the linked service in place, it is time to create a pipeline. From the Azure Data Factory UI, click the plus (+) button and select "Pipeline".

Add an ADF pipeline

Add a parameter by clicking on the "Parameters" tab and then click the plus (+) button.

Add a pipeline parameter

Next, add a Databricks notebook to the pipeline by expanding the "Databricks" activity, then dragging and dropping a Databricks notebook onto the pipeline design canvas.

Connect to the Azure Databricks workspace by selecting the "Azure Databricks" tab and selecting the linked service created above. Next, click on the "Settings" tab to specify the notebook path. Now click the "Validate" button and then "Publish All" to publish to the ADF service.

Validate the ADF data pipeline

Publishing changes to the factory

Once published, trigger a pipeline run by clicking "Add Trigger | Trigger now".

Trigger a pipeline run

Review parameters and then click "Finish" to trigger a pipeline run.

Set parameters and trigger a pipeline run

Now switch to the "Monitor" tab on the left-hand panel to see the progress of the pipeline run.

Monitor the pipeline run

Integrating Azure Databricks notebooks into your Azure Data Factory pipelines provides a flexible and scalable way to parameterize and operationalize your custom ETL code. To learn more about how Azure Databricks integrates with Azure Data Factory (ADF), see this ADF blog post and this ADF tutorial. To learn more about how to explore and query data in your data lake, see this webinar, Using SQL to Query Your Data Lake with Delta Lake.

Try Databricks for free

Related posts

Company blog

Connect 90+ Data Sources to Your Data Lake with Azure Databricks and Azure Data Factory

March 6, 2020 by Clinton Ford and Mike Cornell in Company Blog
Data lakes enable organizations to consistently deliver value and insight through secure and timely access to a wide variety of data sources. The...
See all Company Blog posts