Data Exfiltration Protection with Azure Databricks

Learn how to set up a secure Azure Databricks architecture that protects against data exfiltration

High-level view of the architecture recommended to prevent data exfiltration and secure sensitive information.

Published: March 20, 2024

by Ganesh Rajagopal, Bruce Nelson and Bhavin Kukadia

Last updated on: August 23, 2024

In the previous blog, we discussed how to securely access Azure Data Services from Azure Databricks using Virtual Network Service Endpoints or Private Link.

Here’s a quick recap:

Service Principals: Use Azure AD service principals for secure authentication.
Managed Identity: Leverage managed identities for secure access without handling credentials.
Azure Key Vault: Store and manage secrets securely using Azure Key Vault.
VNet and Private Endpoints: Ensure secure networking with VNet injection and private links.

In this article, we walk through the detailed steps to harden your Azure Databricks deployment from a network security perspective in order to prevent data exfiltration.

According to Wikipedia: Data exfiltration occurs when malware and/or a malicious actor carries out an unauthorized data transfer from a computer. It is also commonly called data extrusion or data exportation. Data exfiltration is also considered a form of data theft. Since the year 2000, a number of data exfiltration efforts severely damaged the consumer confidence, corporate valuation, and intellectual property of businesses and national security of governments across the world. The problem assumes even more significance as enterprises start storing and processing sensitive data (PII, PHI or Strategic Confidential) with public cloud services.

Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or processes the data in the service provider's network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of one of the fastest growing Data & AI services on Azure. We've come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it's time that we share it broadly.

Databricks Deployment Options

There are three distinct flavors of Databricks workspace deployments from a network perspective.

Deploy workspace in a Microsoft managed virtual network (VNet)
Deploy workspace in a Customer managed virtual network (VNet injection)
Deploy workspace in a Customer managed virtual network with Private Link

Please note that no matter what options you choose, the virtual network used by Databricks will reside in your Azure subscription. The rest of this article is built around option 3 i.e. Deploy workspace in a Customer managed virtual network with secure cluster connectivity and Private Link.

Choosing standard or simplified deployment

From the Azure Databricks documentation: There are two types of Private Link deployment that Azure Databricks supports, and you must choose one:

Standard deployment (recommended): For improved security, Databricks recommends you use a separate private endpoint for your front-end connection from a separate transit VNet. You can implement both front-end and back-end Private Link connections or just the back-end connection. Use a separate VNet to encapsulate user access, separate from the VNet that you use for your compute resources in the Classic data plane. Create separate Private Link endpoints for back-end and front-end access. Follow the instructions in Enable Azure Private Link as a standard deployment.

Simplified deployment: Some organizations cannot use the standard deployment for various network policy reasons, such as disallowing more than one private endpoint or discouraging separate transit VNets. You can alternatively use the Private Link simplified deployment. No separate VNet separates user access from the VNet that you use for your compute resources in the Classic data plane. Instead, a transit subnet in the data plane VNet is used for user access. There is only a single Private Link endpoint. Typically both front-end and back-end connectivity are configured. You can optionally only configure the front-end connection. You cannot choose to use only the back-end connections in this deployment type. Follow the instructions in Enable Azure Private Link as a simplified deployment.
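The allowed combinations described above can be written down as a small check. This is a minimal sketch, not an official API: the PrivateLinkPlan class and validate_plan function are hypothetical names, and treating "front-end only" as unavailable for the standard deployment is a reading of the text above (which lists only "both" or "back-end only"), not a confirmed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateLinkPlan:
    """Hypothetical description of a planned Private Link setup."""
    deployment: str  # "standard" or "simplified"
    front_end: bool  # user-to-workspace connection
    back_end: bool   # data-plane-to-control-plane connection


def validate_plan(plan: PrivateLinkPlan) -> list:
    """Return a list of problems; an empty list means the combination is allowed."""
    if plan.deployment not in ("standard", "simplified"):
        return [f"unknown deployment type: {plan.deployment!r}"]
    problems = []
    if not (plan.front_end or plan.back_end):
        problems.append("no Private Link connection selected")
    # Assumption: the text lists only "both" or "back-end only" for standard.
    if plan.deployment == "standard" and plan.front_end and not plan.back_end:
        problems.append("standard: use both connections or back-end only")
    # Stated in the text: simplified cannot use only the back-end connection.
    if plan.deployment == "simplified" and plan.back_end and not plan.front_end:
        problems.append("simplified: back-end only is not supported")
    return problems


print(validate_plan(PrivateLinkPlan("simplified", front_end=False, back_end=True)))
```

Running a check like this before writing any templates makes the choice of deployment type explicit early, when it is still cheap to change.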

High-level Data Exfiltration Protection Architecture

We recommend a hub and spoke topology styled reference architecture. The hub virtual network houses the shared infrastructure required to connect to validated sources and optionally to an on-premises environment. And the spoke virtual networks peer with the hub, while housing isolated Azure Databricks workspaces for different business units or segregated teams.

Such a hub-and-spoke architecture allows creating multiple-spoke VNETs for different purposes and teams. It is also possible to implement isolation by creating separate subnets for different teams within a large contiguous virtual network. In such instances, it's totally possible to set up multiple isolated Azure Databricks workspaces in their own subnet pairs, and deploy Azure Firewall in another sister subnet within the same virtual network.

High-level view:

Steps to deploy a secure Azure Databricks deployment:

Deploy Azure Databricks with secure cluster connectivity (SCC) enabled in a spoke virtual network using VNet injection and Private link.
Set up Private Link endpoints for your Azure Data Services (Storage accounts, Event Hubs, SQL databases, etc.) in a separate subnet within the Azure Databricks spoke virtual network. This ensures that all workload data is accessed securely over the Azure network backbone with default data exfiltration protection in place (refer to this blog for more details). It is also fine to deploy these endpoints in another virtual network that is peered to the one hosting the Azure Databricks workspace. Note that Private Endpoints incur additional cost; depending on your organization's security policies, you can use Service Endpoints instead of Private Endpoints to access Azure Data Services.
Leverage Azure Databricks Unity Catalog as a unified governance solution.
Deploy Azure Firewall (or another Network Virtual Appliance) in a hub virtual network. With Azure Firewall, you can configure:
- Application rules that define the fully qualified domain names (FQDNs) that are accessible through the firewall. Some of the traffic required by Azure Databricks can be whitelisted using application rules.
- Network rules that define the IP address, port and protocol for endpoints that can't be configured using FQDNs. Some of the traffic required by Azure Databricks needs to be whitelisted using network rules.
If you use a third-party firewall appliance instead of Azure Firewall, that works as well. Please note, though, that each product has its own nuances, and it is better to engage the relevant product support and network security teams to troubleshoot any pertinent issues.
Create a user-defined route (UDR) table that forwards all traffic originating from the subnets used by Databricks to an egress appliance such as Azure Firewall. Alternatively, egress traffic to Control Plane assets can be routed directly via the user-defined route table (by adding Service Tag rules), which can avoid the throttling and additional data transfer cost associated with network virtual appliances. Please note that this allows egress to storage accounts and services across the region, not just the ones you intend to reach, so consider it carefully when designing your security architecture.
Configure virtual network peering between the Azure Databricks spoke and Azure Firewall hub virtual networks.
Deploy private endpoints for the front end and browser authentication (for SSO) in the hub VNet (private endpoint subnet).

Secure Azure Databricks Deployment Details

Before you begin:

Why do we need two subnets per workspace?
A workspace requires two subnets, popularly known as the "host" (a.k.a. "public") and "container" (a.k.a. "private") subnets. The host subnet provides an IP address to the host (Azure VM), and the container subnet provides one to the container (Databricks Runtime, a.k.a. DBR) that runs inside the VM.

Does the public or host subnet have public IPs?
No. When you create a workspace using secure cluster connectivity (SCC), none of the Databricks subnets have public IP addresses. It is just that the default name of the host subnet is public-subnet. SCC ensures that no network traffic from outside your network (for example, SSH) can reach the Databricks workspace compute instances.

Is it possible to resize/change the subnet sizes after the deployment?
Yes, it is possible to resize or change the subnet sizes after the deployment. It is not possible to change the virtual network or the subnet names. To resize the subnets, submit a support case with Azure support.

Pre-requisites

Item	Details
Virtual Network	Virtual network to deploy Azure Databricks Dataplane (a.k.a VNet Injection). Make sure to choose the right CIDR blocks.
Subnets	Three subnets: Host (Public), Container (Private) and a Private Endpoint subnet (to hold private endpoints for the storage, DBFS and other Azure services that you may use)
Route Tables	Channel Egress traffic from the Databricks Subnets to network appliance, Internet or On-prem data sources
Azure Firewall	Inspect any egress traffic and take actions according to allow / deny policies
Private DNS Zones	Provide reliable, secure DNS service to manage and resolve domain names in a virtual network (can be automatically created as part of the deployment if not available)
Azure Key Vault	Stores the customer-managed keys (CMK) for encrypting DBFS, Managed Disk and Managed Services.
Azure Databricks Access Connector	Required if enabling Unity Catalog. Connects managed identities to an Azure Databricks account for the purpose of accessing data registered in Unity Catalog
List of Azure Databricks services to allow list on Firewall	Please follow this public doc and make a list of all the IPs and domain names relevant to your Databricks deployment

Deploying Azure Databricks in your Virtual network

Step 1: Deploy Azure Databricks Workspace in your virtual network

The default deployment of Azure Databricks creates a new virtual network (with two subnets) in a resource group managed by Databricks. To make the customizations necessary for a secure deployment, the workspace data plane should be deployed in your own virtual network (a VNet-injected workspace) with No Public IP (NPIP). This deployment can be done using the Azure Portal, the All-in-one ARM templates, or the Azure Databricks Terraform provider.

Create a virtual network in a resource group with three subnets (host/public, container/private and pe). The pe subnet is used for private endpoints, to ensure that all application data is accessed securely over the Azure network backbone. Size the host (public) and container (private) subnets for your use cases before deploying the workspace: once the workspace is deployed, the subnets can only be resized through an Azure support case, and the virtual network and subnet names cannot be changed.
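Because subnet sizing has to be settled up front, it can help to sketch the address plan in code before touching Azure. The following is a minimal sketch under stated assumptions: the 10.10.0.0/16 address space, the /22 and /27 prefixes, and the plan_subnets function are all illustrative, not values from this article. It relies on the /18 to /26 range this article gives for workspace subnets and on the five addresses Azure reserves in every subnet.

```python
import ipaddress
from itertools import islice

AZURE_RESERVED_IPS = 5  # Azure reserves five addresses in every subnet


def plan_subnets(vnet_cidr: str, workspace_prefix: int, pe_prefix: int) -> dict:
    """Carve host, container and private endpoint subnets out of one VNet."""
    if not 18 <= workspace_prefix <= 26:
        raise ValueError("workspace subnets must be between /18 and /26")
    vnet = ipaddress.ip_network(vnet_cidr)

    # Take the first two workspace-sized blocks for the host and container subnets.
    blocks = list(islice(vnet.subnets(new_prefix=workspace_prefix), 2))
    if len(blocks) < 2:
        raise ValueError("VNet is too small for two workspace subnets")
    host, container = blocks

    # Take the first private-endpoint-sized block that overlaps neither of them.
    pe = next(
        (s for s in vnet.subnets(new_prefix=pe_prefix)
         if not s.overlaps(host) and not s.overlaps(container)),
        None,
    )
    if pe is None:
        raise ValueError("no room left in the VNet for the private endpoint subnet")

    return {
        "host": str(host),
        "container": str(container),
        "pe": str(pe),
        # Each node takes one IP in each workspace subnet, so one subnet bounds the count.
        "max_cluster_nodes": host.num_addresses - AZURE_RESERVED_IPS,
    }


# Hypothetical address space, for illustration only.
plan = plan_subnets("10.10.0.0/16", workspace_prefix=22, pe_prefix=27)
print(plan)
```

The max_cluster_nodes figure is an upper bound across all clusters running concurrently in the workspace, which is the number to compare against your workload forecast.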

Deploy Azure Databricks from Azure Portal

Create Azure Databricks workspace with Vnet Injection and No Public IP (SCC) from the Azure Portal

Click Review and Create. A few things to note:

Select the SCC / NPIP and VNet Injection options.
Select the Virtual network to deploy Azure Databricks Workspace.
The virtual network must include two subnets dedicated to each Azure Databricks workspace: a private subnet and public subnet (feel free to use a different nomenclature).
- The public subnet is the source of a private IP for each cluster node's host VM. The private subnet is the source of a private IP for the Databricks Runtime container deployed on each cluster node. This means that each cluster node has two private IP addresses today.
- Each workspace subnet size is allowed to be anywhere from /18 to /26, and the actual sizing will be based on forecasting for the overall workloads per workspace. The address space could be arbitrary (including non RFC 1918 ones), but it must align with the enterprise on-premises plus cloud network strategy.
- Azure Databricks will create these subnets for you when you deploy the workspace using the Azure portal, and will perform subnet delegation to the Microsoft.Databricks/workspaces service. That allows Azure Databricks to create the required Network Security Group (NSG) rules. Azure Databricks will always provide advance notice if we need to add or update the scope of an Azure Databricks-managed NSG rule. Please note that if these subnets already exist, the service will use them as they are. A detailed explanation of these NSG rules is provided below.
- There is a one-to-one relationship between these subnets and an Azure Databricks workspace. You cannot share multiple workspaces across the same subnet pair, and must use a new subnet pair for each different workspace.
- Once the workspace is deployed, you cannot resize the public and private subnets yourself; as noted earlier, resizing requires an Azure support case.
- Note that the Azure Databricks deployment creates a managed resource group, which is shown on the Azure Databricks resource overview page on the Azure portal. You cannot create any resources in the managed resource group, nor can you edit any existing ones.
Azure Databricks supports Private Link for both front-end (user to workspace, i.e. Allow public network access set to Disabled) and back-end (data plane to control plane, i.e. No Azure Databricks Rules) connections, enabling private connectivity without exposing Azure Databricks management traffic to the public internet.
Create private endpoints by following the documentation to deploy Azure Databricks with Private Link, using either the simplified or the standard deployment pattern.
Enable customer managed keys to encrypt and protect DBFS, managed services and managed disks.
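The portal choices listed above map onto properties of the Microsoft.Databricks/workspaces ARM resource, which is useful once you move from the portal to templates. The sketch below builds such a resource as a plain Python dictionary. The workspace_resource function, every name and resource ID passed to it, and the API version string are placeholders chosen for illustration; check the current ARM reference before using any of them.

```python
import json


def workspace_resource(name, location, managed_rg_id, vnet_id,
                       host_subnet, container_subnet, api_version):
    """Build an ARM resource for a VNet-injected, SCC, Private Link workspace."""
    return {
        "type": "Microsoft.Databricks/workspaces",
        "apiVersion": api_version,
        "name": name,
        "location": location,
        "sku": {"name": "premium"},
        "properties": {
            "managedResourceGroupId": managed_rg_id,
            # Front-end: "Allow public network access" set to Disabled.
            "publicNetworkAccess": "Disabled",
            # Back-end: the "No Azure Databricks Rules" option.
            "requiredNsgRules": "NoAzureDatabricksRules",
            "parameters": {
                # Secure cluster connectivity (No Public IP).
                "enableNoPublicIp": {"value": True},
                # VNet injection: your VNet and the two workspace subnets.
                "customVirtualNetworkId": {"value": vnet_id},
                "customPublicSubnetName": {"value": host_subnet},
                "customPrivateSubnetName": {"value": container_subnet},
            },
        },
    }


# All values below are hypothetical placeholders.
resource = workspace_resource(
    name="adb-secure-ws",
    location="eastus2",
    managed_rg_id="/subscriptions/<sub-id>/resourceGroups/adb-secure-ws-managed",
    vnet_id="/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
            "Microsoft.Network/virtualNetworks/azuredatabricks-spoke-vnet",
    host_subnet="public-subnet",
    container_subnet="private-subnet",
    api_version="2023-02-01",  # assumption: verify against the ARM reference
)
print(json.dumps(resource, indent=2))
```

Customer-managed key settings and the private endpoints themselves are separate resources and are deliberately left out of this sketch.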

Network Security Rules: What do they mean?

Inbound Rules

Worker to Worker Inbound rule allows traffic between cluster instances.

Outbound Rules

Worker to Worker rule allows traffic between cluster instances so that drivers and workers can communicate with each other.
Metastore (Sql Service Tag) allows outbound traffic to the default HMS from the public subnet.
Control Plane (AzureDatabricks Service Tag) allows outbound traffic to the Azure Databricks Control Plane (i.e. SCC, Webapp) from the public subnet.
Note: AzureDatabricks service tag will not be added to the NSG rules if back-end Private Link is enabled.
Storage (Storage Service Tag) allows outbound traffic to Control Plane assets such as log storage, artifacts and DBFS from the public subnet.
Event Hub (EventHub Service Tag) allows outbound traffic to the Event Hub endpoint (for observability) from the public subnet.
For a Private Link enabled workspace, two additional ports (443, 6666) need to be added for outbound communication with the private endpoint subnet. The same ports need to be opened for inbound communication in the private endpoint subnet's NSG rules.
The outbound traffic rule 65001 allows egress to the internet; this is an automatic rule that gets added upon NSG creation. Later in this article we'll override this behavior by forwarding all of the egress traffic originating from Azure Databricks subnets to a firewall instead of sending it directly to the public internet.
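The outbound rules described above can be captured as data, which makes it easy to compare what an NSG actually contains against what you expect. This is a minimal sketch: the rule names and the expected_outbound_rules function are hypothetical, only the ports stated in this article (443 and 6666) are included, and using a single flag for "Private Link enabled" is a simplification.

```python
from typing import Optional


def expected_outbound_rules(back_end_private_link: bool,
                            pe_subnet_cidr: Optional[str] = None) -> list:
    """List the outbound NSG destinations this article describes."""
    rules = [
        {"name": "worker-to-worker", "destination": "VirtualNetwork"},
        {"name": "metastore", "destination": "Sql"},
        {"name": "storage", "destination": "Storage"},
        {"name": "event-hub", "destination": "EventHub"},
    ]
    if back_end_private_link:
        if pe_subnet_cidr is None:
            raise ValueError("pe_subnet_cidr is required when Private Link is enabled")
        # With back-end Private Link the AzureDatabricks tag is not added;
        # traffic goes to the private endpoint subnet on 443 and 6666 instead.
        rules.append({"name": "private-endpoints",
                      "destination": pe_subnet_cidr,
                      "ports": [443, 6666]})
    else:
        rules.append({"name": "control-plane", "destination": "AzureDatabricks"})
    return rules


# Hypothetical private endpoint subnet, for illustration only.
for rule in expected_outbound_rules(True, pe_subnet_cidr="10.10.8.0/27"):
    print(rule)
```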

Step 2: Set up private endpoints for default blob storage (DBFS) (Optional)

Azure Databricks creates a default blob storage account (a.k.a. root storage) during the deployment process, which is used for storing logs and telemetry. Even though public access is enabled on this storage, the Deny Assignment created on it prohibits any direct external access; it can be accessed only via the Databricks workspace. Azure Databricks deployments now support a secure connection to the root blob storage (DBFS) through Private Endpoints (both dfs and blob), but enabling a private endpoint for DBFS does not turn off public access. Note that Private Endpoints for storage incur additional cost.

As a best practice, it is NOT recommended to store any application data in the root blob (DBFS) storage. Instead, use a separate ADLS Gen2 storage account, accessed over Private Link, to store any application-specific data (see Securely Accessing Azure Data Services).

We do not recommend setting up access to such data services through a network virtual appliance / firewall, as that has a potential to adversely impact the performance of big data workloads and the intermediate infrastructure.

NOTE: It is highly recommended to store application data in external ADLS Gen2 storage. Follow a similar setup to create Private Link endpoints for the external ADLS storage accounts so that data is accessed and stored securely.

To configure such private endpoints for additional services, please refer to the relevant Azure documentation.

Step 3: Deploy Azure Firewall

Azure Firewall is a scalable cloud native firewall that can act as the filtering device for any allowed public endpoints to be accessible from your Azure Databricks workspace.

Typically, firewalls are placed in a centralized hub VNet that is peered with multiple spoke VNets. The spoke VNets send all egress traffic via the firewall.

Azure Firewall policies are the recommended approach to create rules for the Azure Firewall. The firewall policies are global resources that can be used across multiple Azure Firewall instances.

Create a network rule collection (IP address based) and an application rule collection (FQDN based). Any example set of rules is only representative; for exact details, please refer to the complete list of control plane assets relevant to your deployment region.

Note:

AzureDatabricks Service Tag is not required if private endpoints are enabled for the workspace.
Azure Databricks also makes additional calls to NTP services, CDNs, Cloudflare, GPU drivers and external storage for demo datasets, which need to be whitelisted appropriately.

Attach the firewall policy to the firewall.
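To make the difference between the two rule types concrete, here is a minimal sketch of how an FQDN-based application rule and an IP-based network rule each decide whether traffic is allowed, with everything else denied. It is a simplified model, not Azure Firewall's actual evaluation engine. The FQDN patterns, the 203.0.113.0/24 range (a documentation-only address block) and port 3306 are placeholders; take the real values from the list of control plane assets for your region.

```python
import ipaddress
from fnmatch import fnmatchcase

# Placeholder rules, for illustration only.
APPLICATION_RULES = ["*.databricks.example.com", "ntp.example.com"]
NETWORK_RULES = [
    {"destination": "203.0.113.0/24", "port": 3306, "protocol": "TCP"},
]


def allowed_by_application_rule(fqdn: str) -> bool:
    """FQDN-based check: the name must match one of the allowed patterns."""
    name = fqdn.lower().rstrip(".")
    return any(fnmatchcase(name, pattern) for pattern in APPLICATION_RULES)


def allowed_by_network_rule(ip: str, port: int, protocol: str) -> bool:
    """IP-based check: address, port and protocol must all match one rule."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(rule["destination"])
        and port == rule["port"]
        and protocol.upper() == rule["protocol"]
        for rule in NETWORK_RULES
    )


# Anything that matches no rule is denied.
print(allowed_by_application_rule("adb-1.databricks.example.com"))
print(allowed_by_application_rule("files.example.org"))
print(allowed_by_network_rule("203.0.113.7", 3306, "tcp"))
```

Keeping the allow lists this narrow is the point of the exercise: a broad wildcard such as an entire storage domain would reopen the exfiltration path the firewall is there to close.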

Step 4: Create User Defined Routes (UDRs)

At this point, the majority of the infrastructure setup is completed. Next we need to route traffic from Azure Databricks workspace subnets to Azure Firewall.

Create a route table and forward all traffic to the virtual appliance (Azure Firewall) by adding a 0.0.0.0/0 route.
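The effect of the 0.0.0.0/0 route can be illustrated with a small model of next-hop selection, where the most specific matching prefix wins. This is a sketch under stated assumptions: the address space 10.10.0.0/16 and the firewall private IP 10.0.1.4 are hypothetical, the VnetLocal entry stands in for the system route Azure adds for the VNet's own address space, and the function names are illustrative. The property names mirror those of an ARM route definition.

```python
import ipaddress


def build_routes(vnet_cidr: str, firewall_private_ip: str) -> list:
    return [
        # Stand-in for the system route that keeps intra-VNet traffic local.
        {"name": "vnet-local", "addressPrefix": vnet_cidr,
         "nextHopType": "VnetLocal"},
        # The user-defined route from this step: everything else goes to the firewall.
        {"name": "default-via-firewall", "addressPrefix": "0.0.0.0/0",
         "nextHopType": "VirtualAppliance",
         "nextHopIpAddress": firewall_private_ip},
    ]


def next_hop(destination_ip: str, routes: list) -> dict:
    """Pick the matching route with the longest prefix."""
    address = ipaddress.ip_address(destination_ip)
    matches = [r for r in routes
               if address in ipaddress.ip_network(r["addressPrefix"])]
    return max(matches,
               key=lambda r: ipaddress.ip_network(r["addressPrefix"]).prefixlen)


# Hypothetical values, for illustration only.
routes = build_routes("10.10.0.0/16", firewall_private_ip="10.0.1.4")
print(next_hop("10.10.0.4", routes)["nextHopType"])  # traffic inside the VNet
print(next_hop("20.1.2.3", routes)["nextHopType"])   # traffic leaving the VNet
```

Remember that the route table only takes effect once it is associated with both the host and container subnets.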

Step 5: Configure VNET Peering

Next, the virtual networks azuredatabricks-spoke-vnet and hub-vnet need to be peered so that the route table configured earlier works properly. Follow the documentation to set up VNet peering between the hub and spoke networks.
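VNet peering requires that the address spaces of the two networks do not overlap, so it is worth checking this before creating the peering. The sketch below does that check for any number of networks; the overlapping_pairs function and both CIDR ranges are hypothetical, and only the VNet names come from this article.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(address_spaces: dict) -> list:
    """Return the pairs of VNets whose address spaces overlap."""
    nets = {name: ipaddress.ip_network(cidr)
            for name, cidr in address_spaces.items()}
    return [(a, b) for a, b in combinations(nets, 2)
            if nets[a].overlaps(nets[b])]


# Hypothetical address spaces, for illustration only.
vnets = {
    "hub-vnet": "10.0.0.0/16",
    "azuredatabricks-spoke-vnet": "10.10.0.0/16",
}
print(overlapping_pairs(vnets))
```

An empty list means the networks can be peered as far as addressing is concerned. The same check extends naturally to additional spokes as more workspaces are added.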

The network setup is now complete.

Step 6: Assign workspace Unity Catalog Metastore

Now assign the workspace to a Unity Catalog metastore.

Step 7: Validate Deployment

It's time to put everything to the test:

Deploy a virtual machine in the VNet if front-end public access is disabled, and use it to reach the workspace.
Go to the Azure Databricks workspace that you created in Step 1, then create and launch a cluster.
Create a notebook and attach it to the cluster.
Try and access the storage account that you created in Step 2 earlier.
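One additional check that can be run from a notebook is to confirm that the storage account's hostname resolves to an address inside your own network, which suggests that traffic is going through the private endpoint rather than a public endpoint. This is a minimal sketch: the resolves_inside function, the storage account name and the CIDR range are hypothetical, and the example uses a stub resolver so that it runs without network access. In a notebook you would drop the resolver argument to use real DNS.

```python
import ipaddress
import socket


def resolves_inside(hostname: str, allowed_cidrs: list,
                    resolver=socket.gethostbyname) -> bool:
    """Return True if the hostname resolves to an IP within the given ranges."""
    address = ipaddress.ip_address(resolver(hostname))
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Stub resolver standing in for DNS; the name and addresses are placeholders.
def fake_resolver(hostname: str) -> str:
    return {"mystorageacct.dfs.core.windows.net": "10.10.8.5"}[hostname]


print(resolves_inside("mystorageacct.dfs.core.windows.net",
                      ["10.10.8.0/27"], resolver=fake_resolver))
```

Comparing against your own address ranges, rather than against the RFC 1918 private ranges, keeps the check valid for networks that use other address space.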

If the data access worked without any issues, you have accomplished the optimal secure deployment for Azure Databricks in your subscription. This was quite a bit of manual work, but that was more for a one-time showcase. In practical terms, you would want to automate such a setup using a combination of ARM Templates, Azure CLI, Azure SDK, etc.

Common Questions with Data Exfiltration Protection Architecture

Can I use service endpoint to secure data egress to Azure Data Services?

Yes. Service Endpoints provide secure and direct connectivity to Azure services owned and managed by customers (e.g. ADLS Gen2, Azure Key Vault or Event Hubs) over an optimized route on the Azure backbone network. Service Endpoints can be used to restrict connectivity to external Azure resources to only your virtual network.

Can I use service endpoint policies with Databricks managed storage services?

No. Subnets used by Databricks are locked using a network intent policy, which prevents service endpoint policy enforcement on the Databricks-managed storage services used by our artifacts and logs service, and on the event hub used by the health monitoring service. Azure network intent policies are an internal network construct that prevents customers from accidentally modifying the subnets used by Databricks.

Can I use Network Virtual Appliance (NVA) other than Azure Firewall?

Yes, you can use a third-party NVA as long as the network traffic rules are configured as discussed in this article. Please note that we have tested this setup with Azure Firewall only, though some of our customers use other third-party appliances. Ideally, deploy the appliance in the cloud rather than on-premises.

Can I have a firewall subnet in the same virtual network as Azure Databricks?

Yes, you can. As per the Azure reference architecture, it is advisable to use a hub-spoke virtual network topology to plan better for the future. Should you choose to create the Azure Firewall subnet in the same virtual network as the Azure Databricks workspace subnets, you wouldn't need to configure virtual network peering as discussed in Step 5 above.

Can I filter Azure Databricks control plane SCC Relay IP traffic through Azure Firewall?

Yes you can but we would like you to keep these points in mind:

The traffic between Azure Databricks clusters(data plane) and the SCC Relay service stays over Azure Network and does not flow over the public internet. This is primarily management traffic to make sure Azure Databricks workspace is functioning properly.
The SCC Relay service and the data plane need stable and reliable network communication. Placing a firewall or a virtual appliance between them introduces a single point of failure: a firewall rule misconfiguration or scheduled downtime may result in excessive delays in cluster bootstrap (transient firewall issue), prevent new clusters from being created, or affect the scheduling and running of jobs.

Can I analyze accepted or blocked traffic by Azure Firewall?

We recommend using Azure Firewall Logs and Metrics for that requirement.

Can I upgrade an existing non-NPIP workspace (managed Databricks deployment) to an NPIP or Private Link enabled workspace?

No, a managed Databricks deployment cannot be upgraded to a VNet-injected workspace. Databricks recommends creating a new VNet-injected workspace and migrating the workspace artifacts.

Getting Started with Data Exfiltration Protection with Azure Databricks

We discussed utilizing cloud-native security controls to implement data exfiltration protection for your Azure Databricks deployments, all of which could be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project:

Enable meta controls to unlock true potential of your data lake
Manage access to notebook features
Audit everything with Diagnostic Logs, Storage Access Logs and NSG Flow Logs (requires VNET Injection)

Please reach out to your Microsoft or Databricks account team for any questions.
