In the previous blog, we discussed how to securely access Azure Data Services from Azure Databricks using Virtual Network Service Endpoints or Private Link. Building on those best practices, this article walks through detailed steps to harden your Azure Databricks deployment from a network security perspective in order to prevent data exfiltration.
As per wikipedia: Data exfiltration occurs when malware and/or a malicious actor carries out an unauthorized data transfer from a computer. It is also commonly called data extrusion or data exportation. Data exfiltration is also considered a form of data theft. Since the year 2000, a number of data exfiltration efforts severely damaged the consumer confidence, corporate valuation, and intellectual property of businesses and national security of governments across the world. The problem assumes even more significance as enterprises start storing and processing sensitive data (PII, PHI or Strategic Confidential) with public cloud services.
Solving for data exfiltration can become an unmanageable problem if the PaaS service requires you to store your data with them or processes the data in the service provider's network. But with Azure Databricks, our customers get to keep all data in their Azure subscription and process it in their own managed private virtual network(s), all while preserving the PaaS nature of one of the fastest growing Data & AI services on Azure. We've come up with a secure deployment architecture for the platform while working with some of our most security-conscious customers, and it's time that we share it broadly.
Databricks Deployment Options
There are three distinct flavors of Databricks workspace deployments from a network perspective.
- Deploy workspace in a Microsoft-managed virtual network (VNet)
- Deploy workspace in a customer-managed virtual network (VNet injection)
- Deploy workspace in a customer-managed virtual network with Private Link
Please note that whichever option you choose, the virtual network used by Databricks will reside in your Azure subscription. The rest of this article is built around option 3, i.e., deploying the workspace in a customer-managed virtual network with secure cluster connectivity and Private Link.
Choosing standard or simplified deployment
According to the Azure Databricks documentation, there are two types of Private Link deployment that Azure Databricks supports, and you must choose one:
Standard deployment (recommended): For improved security, Databricks recommends you use a separate private endpoint for your front-end connection from a separate transit VNet. You can implement both front-end and back-end Private Link connections or just the back-end connection. Use a separate VNet to encapsulate user access, separate from the VNet that you use for your compute resources in the Classic data plane. Create separate Private Link endpoints for back-end and front-end access. Follow the instructions in Enable Azure Private Link as a standard deployment.
Simplified deployment: Some organizations cannot use the standard deployment for various network policy reasons, such as disallowing more than one private endpoint or discouraging separate transit VNets. You can alternatively use the Private Link simplified deployment. No separate VNet separates user access from the VNet that you use for your compute resources in the Classic data plane. Instead, a transit subnet in the data plane VNet is used for user access. There is only a single Private Link endpoint. Typically both front-end and back-end connectivity are configured. You can optionally only configure the front-end connection. You cannot choose to use only the back-end connections in this deployment type. Follow the instructions in Enable Azure Private Link as a simplified deployment.
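The constraints quoted above can be encoded as a small pre-deployment check. The following is a minimal sketch, not an official API: the `PrivateLinkPlan` class and the `validate` and `user_access_location` functions are hypothetical names introduced for illustration, and the checks cover only the restrictions stated explicitly in the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateLinkPlan:
    """Hypothetical description of a planned Private Link setup."""
    deployment: str  # "standard" or "simplified"
    front_end: bool  # users reach the workspace over Private Link
    back_end: bool   # data plane reaches the control plane over Private Link


def validate(plan: PrivateLinkPlan) -> list[str]:
    """Return the problems found; an empty list means the plan is acceptable."""
    if plan.deployment not in ("standard", "simplified"):
        return [f"unknown deployment type: {plan.deployment!r}"]
    problems = []
    if not (plan.front_end or plan.back_end):
        problems.append("select at least one Private Link connection")
    # The simplified deployment does not allow a back-end-only configuration.
    if plan.deployment == "simplified" and plan.back_end and not plan.front_end:
        problems.append("simplified deployment cannot use only the back-end connection")
    return problems


def user_access_location(plan: PrivateLinkPlan) -> str:
    """Where user access is encapsulated for each deployment type."""
    locations = {
        "standard": "separate transit VNet",
        "simplified": "transit subnet in the data plane VNet",
    }
    return locations[plan.deployment]


print(validate(PrivateLinkPlan("simplified", front_end=False, back_end=True)))
print(user_access_location(PrivateLinkPlan("standard", front_end=True, back_end=True)))
```

A check like this could run in a deployment pipeline before any resources are created, so that an unsupported combination is rejected early.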
High-level Data Exfiltration Protection Architecture
We recommend a hub and spoke topology styled reference architecture. The hub virtual network houses the shared infrastructure required to connect to validated sources and optionally to an on-premises environment. And the spoke virtual networks peer with the hub, while housing isolated Azure Databricks workspaces for different business units or segregated teams.
Such a hub-and-spoke architecture allows creating multiple spoke VNets for different purposes and teams. It is also possible to implement isolation by creating separate subnets for different teams within a large contiguous virtual network. In such instances, you can set up multiple isolated Azure Databricks workspaces in their own subnet pairs, and deploy Azure Firewall in another sister subnet within the same virtual network.
Steps to deploy a secure Azure Databricks deployment:
- Deploy Azure Databricks with secure cluster connectivity (SCC) enabled in a spoke virtual network using VNet injection and Private Link.
- Set up Private Link endpoints for your Azure Data Services in a separate subnet within the Azure Databricks spoke virtual network. This ensures that all workload data is accessed securely over the Azure network backbone with default data exfiltration protection in place (refer to this blog for more details). In general, it is also fine to deploy these endpoints in another virtual network that is peered to the one hosting the Azure Databricks workspace. Note that Private Endpoints incur additional cost, and it is fine to use Service Endpoints instead of Private Endpoints to access the Azure Data Services, depending on your organization's security policies.
- Leverage Azure Databricks Unity Catalog as a unified governance solution.
- Deploy Azure Firewall (or another Network Virtual Appliance) in a hub virtual network. With Azure Firewall, you could configure:
  - Application rules that define fully qualified domain names (FQDNs) that are accessible through the firewall. Some of the required Azure Databricks traffic can be whitelisted using the application rules.
  - Network rules that define the IP address, port, and protocol for endpoints that can't be configured using FQDNs. Some of the required Azure Databricks traffic needs to be whitelisted using the network rules.
- Create a user-defined route (UDR) table that routes all egress traffic to Azure Firewall, and attach it to the Azure Databricks subnets. To keep the Azure Firewall rules manageable and prevent service outages when IP addresses change, we recommend using Azure Service Tags. Alternatively, egress traffic to control plane assets can be routed directly through the user-defined route table (by adding Service Tag routes), which avoids the throttling and additional data transfer cost associated with network virtual appliances.
- Configure virtual network peering between the Azure Databricks spoke and Azure Firewall hub virtual networks.
- Deploy private endpoints for the front end and browser authentication (for SSO) in the hub VNet (private endpoint subnet).
Secure Azure Databricks Deployment Details
Before you begin:
Why do we need two subnets per workspace?
A workspace requires two subnets, popularly known as the "host" (a.k.a. "public") and "container" (a.k.a. "private") subnets. The host subnet provides an IP address to each host (Azure VM), and the container subnet provides an IP address to the container (Databricks Runtime, a.k.a. DBR) that runs inside the VM.
Does the public or host subnet have public IPs?
No. When you create a workspace with secure cluster connectivity (SCC), none of the Databricks subnets have public IP addresses; public-subnet is merely the default name of the host subnet. SCC ensures that no network traffic from outside your network, such as SSH, can reach the Databricks workspace compute instances.
Is it possible to resize/change the subnet sizes after the deployment?
No. Once the Databricks workspace is deployed, it is not possible to resize or change the Databricks network subnets.
|Item|Details
|Virtual network|Virtual network in which to deploy the Azure Databricks data plane (a.k.a. VNet injection). Make sure to choose the right CIDR blocks.
|Subnets|Three subnets: host (public), container (private), and a private endpoint subnet (to hold private endpoints for storage, DBFS, and other Azure services).
|Route tables|Channel egress traffic from the Databricks subnets to a network appliance, the Internet, or on-premises data sources.
|Azure Firewall|Inspects any egress traffic and takes actions according to allow/deny policies.
|Private DNS Zones|Provide reliable, secure DNS service to manage and resolve domain names in a virtual network (can be created automatically as part of the deployment if not available).
|Azure Key Vault|Stores the customer-managed keys (CMK) for encrypting DBFS, managed disks, and managed services.
|Azure Databricks Access Connector|Required if enabling Unity Catalog. Connects managed identities to an Azure Databricks account for the purpose of accessing data registered in Unity Catalog.
Deploying Azure Databricks in your Virtual network
Step 1: Deploy Azure Databricks Workspace in your virtual network
The default deployment of Azure Databricks creates a new virtual network (with two subnets) in a resource group managed by Databricks. To make the customizations needed for a secure deployment, deploy the workspace data plane in your own virtual network with NPIP (No Public IP, the earlier name for SCC). You can do this using the Azure Portal, the all-in-one ARM templates, or the Azure Databricks Terraform provider.
Create a virtual network in a resource group with three subnets (host/public, container/private, and pe). The pe subnet is used for private endpoints, to ensure that all application data is accessed securely over the Azure network backbone. The sizes of the host (public) and container (private) subnets need to be determined from your use cases before the workspace deployment, because once the Databricks workspace is deployed, it is not possible to resize or change the Databricks network subnets.
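Because the subnet layout is fixed at deployment time, it can help to plan the address ranges up front. Below is a minimal sketch using Python's standard `ipaddress` module. The function name `plan_subnets`, the VNet range `10.28.0.0/20`, and the chosen prefix lengths are assumptions for illustration only; substitute the ranges from your own network plan.

```python
import ipaddress
from itertools import islice


def plan_subnets(vnet_cidr: str, workspace_prefix: int, pe_prefix: int) -> dict:
    """Carve host, container, and private endpoint subnets out of a VNet range."""
    vnet = ipaddress.ip_network(vnet_cidr)
    # Take the first three equally sized blocks of the VNet address space.
    blocks = list(islice(vnet.subnets(new_prefix=workspace_prefix), 3))
    if len(blocks) < 3:
        raise ValueError("VNet range is too small for three blocks of this size")
    host, container, spare = blocks
    # The private endpoint subnet is usually much smaller, so take a slice
    # from the start of the third block and leave the rest unallocated.
    pe = next(spare.subnets(new_prefix=pe_prefix))
    return {"host": host, "container": container, "pe": pe}


# Hypothetical example values, not a recommendation.
plan = plan_subnets("10.28.0.0/20", workspace_prefix=22, pe_prefix=27)
for name, subnet in plan.items():
    print(f"{name:<10} {subnet}  ({subnet.num_addresses} addresses)")
```

Keeping the host and container subnets the same size is a deliberate choice here, since each cluster node draws one address from each of them.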
Deploy Azure Databricks from Azure Portal
Click Review and Create. A few things to note:
- Select the SCC / NPIP and VNet Injection options.
- Select the Virtual network to deploy Azure Databricks Workspace.
- The virtual network must include two subnets dedicated to each Azure Databricks workspace: a private subnet and public subnet (feel free to use a different nomenclature).
- The public subnet is the source of a private IP for each cluster node's host VM. The private subnet is the source of a private IP for the Databricks Runtime container deployed on each cluster node. This means that each cluster node currently has two private IP addresses.
- Each workspace subnet can be anywhere from /18 to /26 in size, and the actual sizing should be based on a forecast of the overall workloads per workspace. The address space can be arbitrary (including non-RFC 1918 ranges), but it must align with your enterprise on-premises and cloud network strategy.
- Azure Databricks will create these subnets for you when you deploy the workspace using the Azure portal and will perform subnet delegation to the Microsoft.Databricks/workspaces service. That allows Azure Databricks to create the required Network Security Group (NSG) rules. Azure Databricks will always give advance notice if we need to add or update the scope of an Azure Databricks-managed NSG rule. Please note that if these subnets already exist, the service will use them as they are. A detailed explanation of these NSG rules is provided below.
- There is a one-to-one relationship between these subnets and an Azure Databricks workspace. You cannot share multiple workspaces across the same subnet pair, and must use a new subnet pair for each different workspace.
- Once the workspace is deployed, the public and private subnets cannot be re-sized.
- Note that the Azure Databricks deployment creates a managed resource group, which is shown on the Azure Databricks resource overview page in the Azure portal. You cannot create any resources in the managed resource group, nor can you edit any existing ones.
- Azure Databricks supports Private Link for both front-end (user to workspace, i.e., Allow public network access set to Disabled) and back-end (data plane to control plane, i.e., No Azure Databricks Rules) connections, enabling private connectivity without exposing traffic to the public internet.
- Create private endpoints by following the documentation to deploy Azure Databricks with Private Link, using either the simplified or the standard deployment pattern.
- Enable customer managed keys to protect and control access to the encrypted data.
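The sizing notes above translate into a quick capacity estimate. The sketch below assumes that Azure reserves five addresses in every subnet and that each cluster node consumes one address from each of the two workspace subnets; the function name `max_cluster_nodes` and the example ranges are hypothetical.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves five addresses in every subnet


def max_cluster_nodes(host_cidr: str, container_cidr: str) -> int:
    """Estimate how many cluster nodes a workspace subnet pair can hold."""
    host = ipaddress.ip_network(host_cidr)
    container = ipaddress.ip_network(container_cidr)
    for label, subnet in (("host", host), ("container", container)):
        # Workspace subnets must be sized between /18 and /26.
        if not 18 <= subnet.prefixlen <= 26:
            raise ValueError(f"{label} subnet {subnet} must be between /18 and /26")
    if host.overlaps(container):
        raise ValueError("host and container subnets must not overlap")
    # Each node needs one address from each subnet, so the smaller subnet
    # is the limiting factor.
    return min(host.num_addresses, container.num_addresses) - AZURE_RESERVED_IPS


print(max_cluster_nodes("10.28.0.0/26", "10.28.0.64/26"))
print(max_cluster_nodes("10.28.0.0/22", "10.28.4.0/22"))
```

Since the subnets cannot be resized later, it is worth running this kind of estimate against the peak number of concurrently running nodes you expect across all clusters in the workspace.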
Network Security Rules: What do they mean?
Worker to Worker Inbound rule allows traffic between cluster instances.
- Worker to Worker rule allows traffic between cluster instances so that drivers and workers can communicate with each other.
- Metastore (Sql Service Tag) allows outbound traffic to the default Hive metastore (HMS) from the public subnet.
- Control Plane (AzureDatabricks Service Tag) allows outbound traffic to the Azure Databricks control plane (i.e., SCC, Webapp) from the public subnet.
  - Note: The AzureDatabricks service tag will not be added to the NSG rules if back-end Private Link is enabled.
- Storage (Storage Service Tag) allows outbound traffic to control plane assets such as log storage, artifacts, and DBFS from the public subnet.
- Event Hub (EventHub Service Tag) allows outbound traffic to the Event Hub endpoint (for observability) from the public subnet.
- For a Private Link enabled workspace, two additional ports (443 and 6666) need to be opened for outbound communication with the private endpoint subnet. The same ports need to be opened for inbound communication in the private endpoint subnet's NSG rules.
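The way the outbound rule set changes with back-end Private Link can be summarized in a few lines of code. This is only an illustrative sketch: the function `outbound_rules` and the rule names are hypothetical, the `VirtualNetwork` destination for worker-to-worker traffic is an assumption not stated in the list above, and ports are given only where this article states them.

```python
from __future__ import annotations


def outbound_rules(back_end_private_link: bool) -> list[dict]:
    """Summarize the outbound NSG rules for the public (host) subnet."""
    rules = [
        # Destination assumed to be the VirtualNetwork service tag.
        {"name": "worker-to-worker", "destination": "VirtualNetwork"},
        {"name": "metastore", "destination": "Sql"},
        {"name": "storage", "destination": "Storage"},
        {"name": "event-hub", "destination": "EventHub"},
    ]
    if back_end_private_link:
        # The control plane is reached through the private endpoint subnet,
        # so the AzureDatabricks service tag rule is not added.
        rules.append({
            "name": "private-endpoint",
            "destination": "private endpoint subnet",
            "ports": [443, 6666],
        })
    else:
        rules.append({"name": "control-plane", "destination": "AzureDatabricks"})
    return rules


for rule in outbound_rules(back_end_private_link=True):
    print(rule)
```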
Step 2: Set up private endpoints for default blob storage (DBFS) (Optional)
Azure Databricks creates a default blob storage account (a.k.a. root storage) during the deployment process, which is used for storing logs and telemetry. Even though public access is enabled on this storage, the Deny Assignment created on it prohibits any direct external access; it can be accessed only via the Databricks workspace. Azure Databricks deployments now support a secure connection to the root blob storage (DBFS) through Private Endpoints (both dfs and blob), but enabling a private endpoint for DBFS does not turn off public access. Note that Private Endpoints for storage incur additional cost.
As a best practice, we do NOT recommend storing any application data in the root blob (DBFS) storage. Instead, use separate ADLS Gen2 storage for application-specific data, accessed over Private Link (see Securely Accessing Azure Data Services).
We do not recommend setting up access to such data services through a network virtual appliance / firewall, as that has a potential to adversely impact the performance of big data workloads and the intermediate infrastructure.
NOTE: It is highly recommended to store application data in external ADLS Gen2 storage. Follow a similar setup to create Private Link endpoints for the external ADLS storage accounts so that data is accessed and stored securely.
To configure such private endpoints for additional services, please refer to the relevant Azure documentation.
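As a planning aid, the sketch below lists the private endpoints a storage account would need when both the dfs and blob sub-resources are covered, together with the matching private DNS zone for each. It assumes the Azure public cloud endpoint suffix `core.windows.net`; the function name and the storage account name `examplestorageacct` are hypothetical.

```python
def storage_private_endpoints(account: str, sub_resources=("dfs", "blob")) -> list:
    """List the private endpoints and DNS zones needed for a storage account."""
    return [
        {
            "sub_resource": sub_resource,
            # Hostname that clients use; it should resolve to a private IP
            # once the endpoint and DNS zone are in place.
            "fqdn": f"{account}.{sub_resource}.core.windows.net",
            "private_dns_zone": f"privatelink.{sub_resource}.core.windows.net",
        }
        for sub_resource in sub_resources
    ]


# Hypothetical storage account name, for illustration only.
for endpoint in storage_private_endpoints("examplestorageacct"):
    print(endpoint)
```

The same helper can be reused for external ADLS Gen2 accounts, since they expose the same two sub-resources.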
Step 3: Deploy Azure Firewall
Azure Firewall is a scalable cloud native firewall that can act as the filtering device for any allowed public endpoints to be accessible from your Azure Databricks workspace.
Typically, firewalls are placed in the centralized hub VNet, which is peered with multiple spoke VNets. The spoke VNets send all egress traffic through the firewall.
Azure Firewall policies are the recommended approach to create rules for the Azure Firewall. The firewall policies are global resources that can be used across multiple Azure Firewall instances.
Create a network rule collection and an application rule collection as follows. Note that the application rules are optional if the egress traffic is routed via UDR (discussed in the next section).
- The AzureDatabricks Service Tag is not required if private endpoints are enabled for the workspace.
- Azure Databricks also makes additional calls to NTP services, CDNs, Cloudflare, GPU driver sources, and external storage for demo datasets, which need to be whitelisted appropriately.
Attach the firewall policy to the firewall.
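To make the difference between application rules and network rules concrete, here is a toy model of a default-deny egress filter. It is a sketch of the concept only and does not reflect how Azure Firewall is implemented or configured. All rule entries are hypothetical placeholders (reserved example domains and an arbitrary private range), not the actual Azure Databricks allow list.

```python
import ipaddress
from fnmatch import fnmatchcase

# Hypothetical rules for illustration only.
APPLICATION_RULES = [("*.example.net", 443), ("packages.example.org", 443)]
NETWORK_RULES = [("10.50.0.0/24", 3306, "TCP")]


def is_allowed(destination: str, port: int, protocol: str = "TCP") -> bool:
    """Return True if egress to the destination matches an allow rule."""
    try:
        address = ipaddress.ip_address(destination)
    except ValueError:
        address = None  # not an IP literal, so treat it as an FQDN

    if address is None:
        # Application rules match on the fully qualified domain name.
        name = destination.lower().rstrip(".")
        return any(
            fnmatchcase(name, pattern) and port == allowed_port
            for pattern, allowed_port in APPLICATION_RULES
        )

    # Network rules match on IP range, port, and protocol.
    return any(
        address in ipaddress.ip_network(cidr)
        and port == allowed_port
        and protocol.upper() == allowed_protocol
        for cidr, allowed_port, allowed_protocol in NETWORK_RULES
    )


print(is_allowed("files.example.net", 443))
print(is_allowed("unknown.example.com", 443))
print(is_allowed("10.50.0.7", 3306))
```

Anything that matches no rule is denied, which is the property that makes the firewall useful for exfiltration protection.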
Step 4: Create User Defined Routes (UDRs)
At this point, the majority of the infrastructure setup for a secure, locked-down deployment has been completed. We now need to route appropriate traffic from Azure Databricks workspace subnets to Control plane and Azure Firewall.
Create a route table and forward all traffic to the virtual appliance (Azure Firewall).
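Route selection follows longest-prefix matching, so a more specific route overrides the default route to the firewall. The sketch below illustrates that behavior with a hypothetical route table: the firewall private IP `10.0.1.4` and the `203.0.113.0/24` range (standing in for a Service Tag prefix) are assumptions for illustration, and system routes are ignored for simplicity.

```python
import ipaddress

# (address prefix, next hop type, next hop IP); all values are hypothetical.
ROUTES = [
    ("0.0.0.0/0", "VirtualAppliance", "10.0.1.4"),  # default route to the firewall
    ("203.0.113.0/24", "Internet", None),           # stand-in for a Service Tag route
]


def next_hop(destination_ip: str):
    """Pick the route with the longest matching prefix for a destination."""
    address = ipaddress.ip_address(destination_ip)
    candidates = [
        route for route in ROUTES
        if address in ipaddress.ip_network(route[0])
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda route: ipaddress.ip_network(route[0]).prefixlen)
    return best[1], best[2]


print(next_hop("198.51.100.20"))  # falls through to the default route
print(next_hop("203.0.113.10"))   # matches the more specific route
```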
Step 6: Configure VNET Peering
Finally, the virtual networks azuredatabricks-spoke-vnet and hub-vnet need to be peered so that the route table configured earlier works properly. Follow the documentation to set up VNet peering between the hub and spoke networks.
The network setup is now complete.
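Peered virtual networks must have non-overlapping address spaces, so it is worth checking the address plan before creating the peering. A minimal sketch follows; the function name `overlapping_vnets` and the CIDR ranges are hypothetical, while the VNet names are the ones used in this article.

```python
import ipaddress
from itertools import combinations


def overlapping_vnets(vnets: dict) -> list:
    """Return pairs of VNet names whose address spaces overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vnets.items()}
    return [
        (first, second)
        for first, second in combinations(networks, 2)
        if networks[first].overlaps(networks[second])
    ]


# Hypothetical address plan, for illustration only.
address_plan = {
    "hub-vnet": "10.0.0.0/16",
    "azuredatabricks-spoke-vnet": "10.28.0.0/16",
}
print(overlapping_vnets(address_plan))
```

An empty result means the hub and spoke ranges can be peered; any pair returned needs to be re-addressed first.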
Step 7: Assign workspace Unity Catalog Metastore
Next, assign the workspace to a Unity Catalog metastore.
Step 8: Validate Deployment
It's time to put everything to the test:
- Deploy a virtual machine within the VNet if front-end public access is disabled.
- Go to the Azure Databricks workspace that you created in Step 1, launch it, and create a cluster.
- Create a notebook and attach it to the cluster.
- Create a notebook and attach it to the cluster.
- Try and access the storage account that you created in Step 2 earlier.
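One quick check you could run from a notebook is whether the storage hostname resolves to a private address, which suggests that traffic is going through the private endpoint. The helpers below are a minimal sketch using only the Python standard library; the function names are hypothetical. Note that `is_private` relies on well-known private ranges, so this check is not meaningful if your VNet uses a non-RFC 1918 address space.

```python
import ipaddress
import socket


def resolves_privately(fqdn: str, resolver=socket.gethostbyname) -> bool:
    """Return True when the hostname resolves to a private IP address."""
    address = resolver(fqdn)
    return ipaddress.ip_address(address).is_private


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `resolves_privately` with the dfs hostname of your storage account should return `True` from a cluster in the spoke VNet once the private endpoint and private DNS zone are in place. The `resolver` parameter exists so the function can be exercised without network access.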
If the data access worked without any issues, you have achieved a secure deployment of Azure Databricks in your subscription. This was quite a bit of manual work, but it was meant as a one-time showcase. In practice, you would want to automate such a setup using a combination of ARM Templates, Azure CLI, Azure SDK, etc.:
- Deploy Azure Databricks in your own managed VNET using ARM Template
- Create Private Endpoint using Azure CLI (or ARM Template)
- Deploy Azure Firewall using ARM Template (or Azure CLI)
- Deploy Route Table and Custom Routes using ARM Template
- Peer Virtual Networks using ARM Template
- Create Private Link for Azure Databricks with Terraform
Common Questions with Data Exfiltration Protection Architecture
Can I use service endpoint to secure data egress to Azure Data Services?
Yes, but only with VNet injection. Service Endpoints provide secure and direct connectivity to Azure services over an optimized route on the Azure backbone network. They can be used to restrict connectivity to external Azure resources to only your virtual network. Service Endpoints are secure only if used in conjunction with properly defined network firewall rules on the Azure service that uses the Service Endpoint.
Can I use service endpoint policies?
No. Subnets used by Databricks are locked using a network intent policy, which prevents service endpoint policy enforcement. Azure network intent policies are an internal network construct that prevents customers from accidentally modifying the subnets used by Databricks.
Can I use Network Virtual Appliance (NVA) other than Azure Firewall?
Yes, you could use a third-party NVA as long as the network traffic rules are configured as discussed in this article. Please note that we have tested this setup with Azure Firewall only, though some of our customers use other third-party appliances. Ideally, deploy the appliance in the cloud rather than on-premises.
Can I have a firewall subnet in the same virtual network as Azure Databricks?
Yes, you can. As per Azure reference architecture, it is advisable to use a hub-spoke virtual network topology to plan better for the future. Should you choose to create the Azure Firewall subnet in the same virtual network as Azure Databricks workspace subnets, you wouldn't need to configure virtual network peering as discussed in Step 6 above.
Can I filter Azure Databricks control plane SCC Relay IP traffic through Azure Firewall?
Yes, you can, but we do not recommend it because:
- The traffic between Azure Databricks clusters (data plane) and the SCC Relay service stays on the Azure network and does not flow over the public internet.
- The SCC Relay service and the data plane need stable and reliable communication. Placing a firewall or virtual appliance between them introduces a single point of failure: a misconfigured firewall rule or scheduled downtime could cause excessive delays in cluster bootstrap (for transient firewall issues), prevent new clusters from being created, or affect the scheduling and running of jobs.
Can I analyze accepted or blocked traffic by Azure Firewall?
We recommend using Azure Firewall Logs and Metrics for that requirement.
Can I upgrade an existing non-NPIP workspace to an NPIP or Private Link enabled workspace?
Getting Started with Data Exfiltration Protection with Azure Databricks
We discussed using cloud-native security controls to implement data exfiltration protection for your Azure Databricks deployments, all of which can be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project:
- Enable meta controls to unlock true potential of your data lake
- Manage access to notebook features
- Audit everything with Diagnostic Logs, Storage Access Logs and NSG Flow Logs (requires VNET Injection)
Please reach out to your Microsoft or Databricks account team for any questions.