Enterprise readiness and security are top of mind for most organizations as they plan and deploy large-scale analytics and AI solutions. From RBAC through to network isolation, securing all your information is crucial, and this important aspect of large-scale deployments is the focus of this session. Join this technical deep dive to learn security best practices that help you deploy, manage, and operate a secure analytics and AI environment. We will describe how multiple Azure-specific features fit into the Azure Databricks model for data security and illustrate these capabilities and best practices.
Premal Shah: Welcome to the session. Thank you for being here. I hope you’re having a great Summit. For this talk, I’m going to take you on a journey of how to secure your Azure Databricks deployments and share some best practices in that area.
My name is Premal Shah. I’m a program manager on the Azure Databricks team at Microsoft. I hope you’re excited to hear about our security capabilities. I know I’m excited to share, so let’s dive right in.
Let’s go over the agenda we have for you today. For those of you who are new to Azure Databricks, I’ll start with a brief overview of the service. We will then get into the challenges that enterprises face today in securing their big data environments, and how Azure Databricks security features and best practices can help you deploy, manage, and operate a secure data and AI environment.
We will then do a simple, quick demo showcasing how you can enable and use our audit logging capability in Azure Databricks, and of course we look forward to some Q&A from you towards the end.
If you’re new here and wondering, “What is all this buzz about one of the fastest growing services on Azure?” Let me give you a quick introduction. If you have data and lots of it, Azure Databricks is going to be your best friend. It is a high-performance, limitless scaling, big data processing and machine learning platform.
Whether your goal is to extend untapped sources, gain insights in real time or perform advanced analytics and machine learning, Azure Databricks has you covered. Azure Databricks is collaborative. You can create workspaces with advanced role-based access control, create collaborative notebooks and integrate deeply using CI/CD pipelines with your entire team.
Azure Databricks is connected. This is a first-party Microsoft product that we’ve engineered native connections with almost all other services in the Azure portfolio to ensure that you have a seamless, integrated experience. Azure Databricks is fast. Really fast.
Being able to process petabytes of data at scale efficiently is in fact one of our key goals and most importantly, Azure Databricks is secure and we’re going to spend most of our time here talking through our security capabilities and how you as an enterprise can use them to secure your Azure Databricks deployments.
Many organizations face a common challenge of how best to manipulate large volumes of sensitive data to gain meaningful insights in a secure way. Data engineers and security teams struggle to give their data scientists and analysts the speed and access to the data they need to drive AI initiatives while ensuring consistent policy management, data governance, and security compliance.
From our customers, we see the following three security challenges over and over again. First, fragmented security with teams acting in silos. For many organizations, technology, people, and AI workflows exist in silos. Data engineers and data scientists work with their own toolsets.
Oftentimes, these tools are rapidly evolving open source applications that are poorly integrated across data workflows. This not only slows innovation but also creates large security gaps. Second, poor reliability and an inability to deploy securely at scale.
Building a secure, reliable, and scalable architecture is not easy. You have to manage configuration, monitoring, patching, authentication, and security scanning. Enterprises with compliance requirements like SOC 2, HIPAA, or GDPR face an even more difficult challenge. And third, disjointed governance.
With security often as an afterthought, AI projects are started with a focus on speed and innovation. Security is often a bolt-on, thought of only later when a major compliance audit is due. At that point, it might be too late to solve some of the fundamental problems with the deployment. So what do enterprises really want, and what have we learned by talking to hundreds of customers?
What they’re looking for is simple workflows where they can use a single identity to not only control role-based access control privileges, but also use that same identity to access their data. They want data protection using their own customer managed keys.
Given the complexity of the network setup, they want to be able to customize their Azure Databricks network deployments to fit best with the rest of their infrastructure. Last but certainly not least, all enterprises today have strict compliance requirements.
Before we get into the security best practices, let’s first step back and understand the overall architecture of Azure Databricks, which will help you better understand our security features and best practices as we talk about them later on.
Azure Databricks has a control plane, which runs in a Microsoft subscription and consists of a set of common services, like the cluster manager and job manager, that manage the data plane in customer subscriptions. For example, your compute clusters run in the data plane in managed VNets, and they are managed by the cluster manager in the control plane.
All the communication between the control plane and the data plane is encrypted using TLS 1.2. The nice thing is that the data itself is stored in your enterprise data sources and no transfer of data is necessary, so there is a separation between the compute engine, that is Azure Databricks, and your downstream storage sources.
Now, for securing your Azure Databricks deployments, we have something called E2 which is a set of features that cover these four critical areas. Network security, identity and access, data protection and compliance. I’m going to cover each area and the features we have in each of these areas for implementing best practices on security.
As a first party service, we integrate tightly with Azure Active Directory to offer single sign-on for AD users and service principals. The nice thing is you can then use this same identity to configure role-based access control for your Azure Databricks workspace objects like clusters and notebooks.
Also, you can use this identity to authenticate automatically to your Azure Data Lake Storage Gen1 and Gen2 using a feature called credential passthrough. For now, credential passthrough is offered on interactive clusters, but we do plan to support it for job clusters as well.
You can further sync the users and groups that you’ve configured in AAD to an Azure Databricks workspace in real time using a feature we call SCIM. Furthermore, our REST APIs support both AAD tokens and service principals, and all these tokens can be secured with standard AAD conditional access.
You can then leverage features like multi-factor authentication and other conditional access features for your workspace. As you can see, by the nature of being a first-party service, we integrate tightly with the Azure infrastructure, back to the point I made earlier.
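To make the AAD integration concrete, here is a minimal sketch of the OAuth2 client-credentials request an automation job could send to Azure AD to obtain a token for the Azure Databricks REST API. The tenant, client ID, and secret values are placeholders you would supply for a real service principal; no network call is made here, the sketch only builds the request.

```python
# Sketch: building the AAD v2 client-credentials request for an Azure
# Databricks API token. TENANT/CLIENT values below are hypothetical.

DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"  # well-known Azure Databricks app ID

def aad_token_request(tenant_id: str, client_id: str, client_secret: str) -> dict:
    """Return the endpoint and form body for an AAD client-credentials grant."""
    return {
        "url": f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
        "data": {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # .default requests the app's consented Databricks permissions
            "scope": f"{DATABRICKS_RESOURCE_ID}/.default",
        },
    }

req = aad_token_request("my-tenant-id", "my-client-id", "my-secret")
# The returned bearer token would then go in the Authorization header of
# Databricks REST API calls, e.g. headers={"Authorization": f"Bearer {token}"}
print(req["url"])
```

The same service principal identity can then be granted workspace permissions, so automation and interactive access are governed by one AAD identity, which is the point made above.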
A best practice we highly recommend to enterprise customers is to deploy their Azure Databricks workspaces to their own virtual networks using our VNet injection feature instead of using the default managed VNet option that I talked about a couple of slides ago.
Even though the default deployment model works for many, a number of enterprise customers really want more control over the service’s network configuration to comply with internal data governance policies, adhere to external regulations and, of course, do network customizations.
Some of those network customizations are connecting Azure Databricks clusters to other Azure data services using Azure private endpoints or service endpoints, and using services already deployed in existing VNets instead of needing to peer from managed VNets.
In this picture, you’ll see an Azure Databricks workspace deployed in an existing VNet which already has Kafka running, and then you create a private endpoint to ADLS from that same VNet.
You can also connect Azure Databricks clusters in a VNet-injected workspace to on-prem data sources, which is a requirement for many customers as they migrate from on-prem to the cloud. You can restrict outbound traffic and route it through your firewalls.
You can also configure custom CIDR ranges for Azure Databricks clusters to better manage your IP space. Furthermore, you can consolidate costs by launching multiple clusters in the same VNet, so all the costs are in one place.
An important aspect I want you to keep in mind here is that new features like Private Link will be offered only for VNet-injected workspaces going forward. Moving to a VNet-injected model is thus an added benefit and positions you to take advantage of future features as we roll them out.
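When planning those custom CIDR ranges, a quick sanity check helps: a VNet-injected workspace needs two dedicated subnets (often called the host/public and container/private subnets) that sit inside the VNet’s address space and do not overlap. Here is a small sketch using Python’s standard `ipaddress` module; the ranges themselves are illustrative, not a recommendation.

```python
# Sketch: validating a VNet injection subnet plan. The two Databricks
# subnets must be inside the VNet CIDR and must not overlap each other.
import ipaddress

def validate_vnet_plan(vnet_cidr: str, host_subnet: str, container_subnet: str) -> bool:
    vnet = ipaddress.ip_network(vnet_cidr)
    host = ipaddress.ip_network(host_subnet)
    container = ipaddress.ip_network(container_subnet)
    inside = host.subnet_of(vnet) and container.subnet_of(vnet)
    disjoint = not host.overlaps(container)
    return inside and disjoint

print(validate_vnet_plan("10.28.0.0/16", "10.28.4.0/22", "10.28.8.0/22"))  # valid plan
print(validate_vnet_plan("10.28.0.0/16", "10.28.4.0/22", "10.28.4.0/23"))  # subnets overlap
```

The subnet sizes also cap the number of cluster nodes the workspace can launch, so it is worth sizing them against your expected peak cluster count before deploying.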
With the secure cluster connectivity feature, Azure Databricks cluster nodes no longer have any public IPs, and no inbound rules are required from the control plane to the data plane. Before, nodes of the cluster had public IPs, and we got a lot of customer feedback to support private IPs. The good news is this is now generally available.
In this model, all connections from the data plane to the control plane are outbound only, using a scalable relay hosted in the control plane. This leads to easier network configuration, easier approvals by InfoSec teams and, overall, a more secure architecture. This feature is available for both standard and premium tiers, for both VNet-injected and managed VNet workspaces.
Enterprise customers require that users can only access Azure Databricks from within secured corporate network perimeters, like the corporate VPN. This is because they can inspect the traffic and also apply security policies on the network communication.
Now, with IP access lists, which is generally available, workspace administrators can dynamically and programmatically configure an allow list of IPs and subnets that can access the Azure Databricks UI and APIs. You can also specify a block list to block a subset of IPs from the allow list. You might use this if an allowed IP address range includes a smaller range of infrastructure IP addresses that in practice are outside the actual secure network perimeter.
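As a sketch of the "programmatically configure" part, here is roughly what an admin could POST to the IP access list REST endpoint (`/api/2.0/ip-access-lists`) to allow a corporate VPN range while blocking a smaller infrastructure range inside it. The CIDRs and workspace URL are made up for illustration; only the payloads are built, no request is sent.

```python
# Sketch: IP access list payloads for the Databricks REST API. The
# workspace URL and CIDR ranges below are hypothetical examples.
import json

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical

def ip_access_list(label: str, list_type: str, cidrs: list) -> dict:
    assert list_type in ("ALLOW", "BLOCK")
    return {"label": label, "list_type": list_type, "ip_addresses": cidrs}

allow = ip_access_list("corp-vpn", "ALLOW", ["203.0.113.0/24"])       # the VPN egress range
block = ip_access_list("infra-hosts", "BLOCK", ["203.0.113.128/28"])  # infra IPs inside it

# With a real bearer token you would send these with e.g.:
#   requests.post(f"{WORKSPACE_URL}/api/2.0/ip-access-lists",
#                 headers={"Authorization": f"Bearer {token}"}, json=allow)
print(json.dumps(allow))
```

The block list is evaluated against the allow list, which is exactly the "smaller range inside an allowed range" scenario described above.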
A feature we are really excited about, and which is in private preview, is Private Link support for Azure Databricks. This, again, has been one of the most highly requested features by our enterprise customers. As mentioned before, the product comprises two high-level components: the control plane and the data plane.
The former is set up in the Microsoft subscription and hosts our web app, cluster manager, job service, and other management services. The latter is set up in the customer’s subscription and hosts the clusters that process the data. Any access from the customer’s on-prem network to the web app, be it for notebooks or the REST API, goes over the public network before it enters the control plane.
Even though the traffic is encrypted with TLS 1.2 or 1.3, some customers don’t want their sensitive data flowing over the public pipe. Similarly, any traffic from the customer’s clusters in the data plane to the control plane, though it goes over the cloud provider’s network backbone, is public for all customers on that cloud. That’s also a security concern that some customers have raised.
Now, with private workspaces using Private Link private endpoints, one can set up their network infrastructure and workspaces in such a way that all traffic from the customer’s on-prem network to the web app in the control plane transits over Azure ExpressRoute and the cloud provider backbone, all through a private tunnel.
This traffic is then completely isolated from related traffic to other workspaces and to other customers. All traffic from the customer’s clusters in the data plane to the relay and web app in the control plane will now also go through a private tunnel.
This traffic will also be completely isolated from related traffic for other workspaces and other customers. All traffic from the customer’s clusters in the data plane to their own cloud-native data sources can already be made to transit over the cloud provider backbone through a private tunnel. What this means is that Private Link to data sources in the customer’s environment already works at scale today.
This is the interaction I’m talking about, and it is one of the key best practices we recommend, along with VNet injection. We have a detailed blog that talks about data exfiltration protection; I’ve put a link in the resources slide as well for you to look at offline and derive some benefit from.
Our Token Management API provides Azure Databricks workspace administrators with visibility and control over the tokens in their workspaces. It provides tighter access controls on which users can create tokens, and therefore on who can access the workspace for automation purposes.
It also gives admins the ability to control the lifetime of these tokens in Azure Databricks workspaces and to monitor them through a list of existing tokens showing who created each one, its expiration date, and a note on what the token is used for.
Admins can then choose to revoke any of these tokens from the workspace, and thereby have much greater control over what users can do with a particular workspace. With customer managed keys, customers can bring their own enterprise-managed keys to encrypt the notebooks and queries stored in the Databricks control plane and the data in DBFS storage.
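To illustrate the token monitoring workflow, here is a small sketch of how an admin script might scan the token list for tokens that never expire and flag them for revocation. The sample records approximate the shape of the Token Management API’s list response (epoch-millisecond timestamps, with an expiry of -1 meaning "never expires"); the usernames and IDs are fabricated.

```python
# Sketch: flagging non-expiring workspace tokens for revocation. The
# records mimic GET /api/2.0/token-management/tokens output (approximate).
sample_tokens = [
    {"token_id": "t1", "created_by_username": "alice@contoso.com",
     "comment": "CI pipeline", "expiry_time": 1767225600000},
    {"token_id": "t2", "created_by_username": "bob@contoso.com",
     "comment": "ad-hoc testing", "expiry_time": -1},  # never expires
]

def tokens_to_revoke(tokens: list) -> list:
    """Return IDs of tokens with no expiry; a real script would then
    DELETE /api/2.0/token-management/tokens/{token_id} for each one."""
    return [t["token_id"] for t in tokens if t["expiry_time"] == -1]

print(tokens_to_revoke(sample_tokens))  # ['t2']
```

A periodic job like this, combined with a workspace-wide maximum token lifetime, keeps long-lived automation credentials from accumulating unnoticed.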
They can now also use customer managed keys to encrypt data stored in DBFS in the data plane, and this particular feature is generally available. The way we do this is with envelope encryption: we wrap the underlying data encryption key with your customer managed key.
These keys are stored in Azure Key Vault. If the customer deletes the Azure Key Vault key, the notebooks and queries stored in the control plane, as well as the data stored in DBFS, become inaccessible to our services and to the customer. This gives InfoSec teams the control they need over any risks associated with Azure Databricks storing notebooks, queries, or data in DBFS storage.
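The envelope encryption structure can be shown with a toy sketch: a data encryption key (DEK) is wrapped by a key encryption key (KEK, playing the role of your customer managed key in Key Vault), and only the wrapped DEK is stored alongside the data. The XOR "cipher" below is a deliberately insecure stand-in for real AES key wrapping; it only illustrates why deleting the KEK makes the data unrecoverable.

```python
# Toy envelope-encryption illustration. NOT real cryptography: the XOR
# keystream stands in for an AES key-wrap so the structure is visible.
import hashlib, os

def xor_wrap(key_material: bytes, kek: bytes) -> bytes:
    """Wrap (or unwrap, XOR is its own inverse) 32 bytes under the KEK."""
    stream = hashlib.sha256(kek).digest()
    return bytes(a ^ b for a, b in zip(key_material, stream))

dek = os.urandom(32)               # per-store data encryption key
kek = os.urandom(32)               # customer managed key, held in Key Vault
wrapped_dek = xor_wrap(dek, kek)   # only this wrapped form is persisted

# Unwrapping requires the KEK; if the Key Vault key is deleted, the DEK,
# and therefore the notebooks/DBFS data it encrypts, is unrecoverable.
assert xor_wrap(wrapped_dek, kek) == dek
```

The important property is that the service never needs to persist the plaintext DEK: access to the data is always gated on the customer-controlled key in Key Vault.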
In this diagram, you can see that you have created a couple of keys in Azure Key Vault, and these keys can be different: you could have one key for the control plane and a different key to encrypt your data in DBFS storage.

The audit logging, or diagnostic logging, feature in Azure Databricks allows enterprise security teams and admins to monitor all access to data and other cloud resources, which helps establish an increased level of trust with users.
This is the demo that I’m going to show you shortly. Through audit logging, security teams gain insights into a host of activities occurring within or from an Azure Databricks workspace. One example is cluster administration: if a user starts or stops clusters, you can see those audit activities.
You can see things like permission management. You can figure out who is accessing your workspaces and how: is it through the web application or the API? And there are many more such categories. The nice thing is that the audit log feature is integrated with Azure Monitor, so you can deliver your auditing activities to a Log Analytics workspace, a storage account, or an Event Hub, and all of this is delivered in less than five minutes.
We highly recommend that you turn on diagnostic logging for all your workspaces. Just to note that this feature is available in premium tier workspaces only. We have the required security and privacy certifications like HITRUST, HIPAA and SOC 2 to ensure that we are meeting your compliance requirements as enterprise customers.
We also now have Azure Gov Cloud. We are production ready in Azure Gov Cloud and we also now have a FedRAMP High certification. Let’s now jump into a demo. I want to give you a glimpse of our audit logging feature on Azure Databricks and how that helps you gain insights on the activities done by users within an Azure Databricks workspace.
Let me show you how to configure diagnostic settings for an Azure Databricks workspace. Here, I already have a premium Azure Databricks workspace created. Remember that you can configure diagnostic settings only on premium workspaces. Now, I have picked the diagnostic settings option in the blade, and that’s what brings me to this window.
Now, I can create a new diagnostic setting, so let me go there for a minute and show you. You can give it any name, so you can set multiple settings on a workspace. You can see that there is a comprehensive list of auditing categories that you can enable.
You can figure out who’s starting and stopping clusters, who’s logging in. There’s information about jobs, notebooks, secrets. Once you pick what you want to log, you can then send it to either Log Analytics workspace, a storage account or an Event Hub as I’ve mentioned and you can in fact pick all three if you wish.
I’ve already configured a setting here. Let me just show you quickly. I’ve already configured this particular setting and I’ve said that I want all the audit activities to go to this Log Analytics workspace. Let’s for a moment go into the workspace itself and show you what I did.
As a user, I’ve logged into this workspace, and what I’ve done is create a cluster, then terminate it, then start it again. If I go to clusters here, pick a particular cluster, and open its event log, you can see that I created the cluster, terminated it, and then restarted it as this particular user. That’s the user I’m logged in as.
Now, all this information is collected in the Log Analytics workspace that I showed you. Let me go to that Log Analytics workspace. When I go to the Log Analytics workspace, I go to the logs option in the blade. That’s what brings me here and I can basically do Kusto queries here.
I have two Kusto queries here. One query tells me which users started which clusters, and the other is about logins. I’ve already run both queries to show you; let me just run them again.
You can see here that it captured that this particular user, that is me, started this particular cluster. If you go to this window, you can see that this is the cluster I started, which I was trying to show you. If I go to the other query and run it again, you can see that it shows I logged in to this workspace through the browser.
I came in through the browser and not an API. Again, it gives you a good idea about who’s logging in and from where and this way you can also prevent unauthorized access. I hope this gives you a very simple and very basic feel of how audit logging works in Azure Databricks.
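The same questions the Kusto queries answer can be asked of the raw exported records, say from the storage-account sink. Here is a sketch in Python over fabricated JSON lines; the field names (category, operationName, identity) approximate the Azure Databricks diagnostic log schema, and the emails and operations are made up for illustration.

```python
# Sketch: filtering exported audit log records for cluster-administration
# events, mirroring the "who started which cluster" Kusto query. The
# records and field names below are illustrative approximations.
import json

raw_logs = """
{"category": "clusters", "operationName": "Microsoft.Databricks/clusters/create", "identity": "{\\"email\\": \\"premal@contoso.com\\"}"}
{"category": "accounts", "operationName": "Microsoft.Databricks/accounts/login", "identity": "{\\"email\\": \\"premal@contoso.com\\"}"}
{"category": "clusters", "operationName": "Microsoft.Databricks/clusters/start", "identity": "{\\"email\\": \\"analyst@contoso.com\\"}"}
""".strip()

def cluster_events(lines: str) -> list:
    """Return (user, operation) pairs for records in the clusters category."""
    out = []
    for line in lines.splitlines():
        rec = json.loads(line)
        if rec["category"] == "clusters":
            user = json.loads(rec["identity"])["email"]  # identity is nested JSON
            out.append((user, rec["operationName"].rsplit("/", 1)[-1]))
    return out

print(cluster_events(raw_logs))
```

In practice you would run this over the JSON blobs the diagnostic setting writes, or simply keep using Log Analytics and Kusto as in the demo; the point is that the records are plain structured JSON you can process with any tooling.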
In summary, we saw that Azure Databricks is secure by design; security is not an afterthought in the product. With Azure Databricks, you’re covered in the key areas of network security, identity and access, compliance, and data protection. Again, I want to state that security is baked in throughout the product and will of course continue to be a core focus.
You can check out these additional useful materials we have put together. Out of these, I would highly recommend the Azure Databricks security best practices blog, which goes into detail on all the practices we covered in this presentation and gives you a chance to dive deeper so you can implement the right Azure Databricks security architecture for your enterprise.
Thank you so much for your time. Let’s do keep the conversation going. Visit us at the Microsoft booth where you can chat with our experts, download content and take a survey for a chance to win a pair of Surface Earbuds. Have a great rest of the Summit. Thank you.
Premal Shah is a Principal Program Manager on the Azure Databricks team. His areas of expertise include big data and large-scale machine learning on Spark for both cloud and on-premise workloads.