Security Best Practices for AWS on Databricks
May 24, 2021 in Platform Blog
The Databricks Lakehouse Platform is the world’s first lakehouse architecture -- an open, unified platform to enable all of your analytics workloads. A lakehouse enables true cross-functional collaboration across data teams of data engineers, data scientists, ML engineers, analysts and more.
In this article, we share a list of cloud security features and capabilities that an enterprise data team can use to harden its Databricks environment on AWS in line with its risk profile and governance policy. For more information about how Databricks runs on Amazon Web Services (AWS), see the AWS web page; for specifics on security and compliance, see the Databricks security on AWS page.
Databricks security features
Allow your cloud infrastructure and security teams to customize and control the AWS network used by Databricks for simplified deployment and centralized governance.
Secure cluster connectivity for private clusters
Deploy Databricks clusters in your private subnets. With secure cluster connectivity, VPCs require no inbound ports to be open, and cluster infrastructure does not require public IPs to interact with the Control Plane.
Private Databricks workspaces with AWS PrivateLink
You can create private Databricks workspaces on AWS, powered by AWS PrivateLink. With this capability, you can ensure that all user and cluster traffic to the front-end and back-end interfaces of a workspace remains on the AWS network backbone. You can create private-only or hybrid workspaces, depending on your risk profile. If you configure custom DNS for your VPC, read Custom DNS With AWS Privatelink for Databricks Workspaces.
Data exfiltration protection with Databricks on AWS
Give your security team full visibility and control over egress by routing traffic through the cloud-native firewall service provided by AWS. Because the architecture is pluggable, you can substitute any next-generation transparent firewall, and you can combine this setup with PrivateLink.
Restrict access to your S3 buckets
Use S3 Bucket Policies to restrict access to trusted IPs and VPCs. Ensure that you meet the requirements for bucket policies, and then conditionally allow or deny access from any other source.
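As a rough sketch of the pattern described above, the Python below builds a bucket policy that denies requests unless they come from a trusted IP range or a trusted VPC. The bucket name, CIDR range and VPC ID are placeholders, and the helper name is hypothetical. The statement deliberately leaves out the allowances that Databricks itself needs, so treat it as an illustration and check the bucket policy requirements before applying anything like it.

```python
import json


def build_restrictive_bucket_policy(bucket, trusted_cidrs, trusted_vpc_ids):
    """Build a policy that denies S3 access from untrusted IPs and VPCs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUntrustedSources",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Condition operators are ANDed: the Deny applies only when the
                # request is neither from a trusted IP nor from a trusted VPC.
                "Condition": {
                    "NotIpAddress": {"aws:SourceIp": list(trusted_cidrs)},
                    "StringNotEquals": {"aws:SourceVpc": list(trusted_vpc_ids)},
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_restrictive_bucket_policy(
    bucket="example-data-bucket",
    trusted_cidrs=["203.0.113.0/24"],
    trusted_vpc_ids=["vpc-0123456789abcdef0"],
)
policy_json = json.dumps(policy, indent=2)
```

A broad Deny like this can also lock out administrators who connect from other networks, so it is usually worth testing on a non-critical bucket first.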
Securely accessing AWS data sources from Databricks
Leverage AWS PrivateLink or Gateway VPC Endpoints to ensure private connectivity between your Databricks clusters and AWS cloud-native data sources. Use VPC Endpoint Policies to strictly enforce which S3 buckets can be accessed from your Customer-managed VPC, ensuring that you also allow read-only access to the S3 buckets that are required by Databricks.
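One way to picture such an endpoint policy is the sketch below, which allows read-write access to your own buckets and read-only access to a second set of buckets. All bucket names are placeholders (the real list of buckets that Databricks requires is not reproduced here), and the helper names and the choice of actions are assumptions made for illustration.

```python
import json

# Illustrative action sets; adjust to what your workloads actually need.
READ_WRITE_ACTIONS = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
READ_ONLY_ACTIONS = ["s3:GetObject", "s3:ListBucket"]


def _bucket_arns(buckets):
    """Return the bucket ARN and the object-level ARN for each bucket name."""
    arns = []
    for name in buckets:
        arns.append(f"arn:aws:s3:::{name}")
        arns.append(f"arn:aws:s3:::{name}/*")
    return arns


def build_s3_endpoint_policy(read_write_buckets, read_only_buckets):
    """Build a VPC endpoint policy; anything not listed is implicitly denied."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadWriteOwnBuckets",
                "Effect": "Allow",
                "Principal": "*",
                "Action": READ_WRITE_ACTIONS,
                "Resource": _bucket_arns(read_write_buckets),
            },
            {
                "Sid": "ReadOnlyRequiredBuckets",
                "Effect": "Allow",
                "Principal": "*",
                "Action": READ_ONLY_ACTIONS,
                "Resource": _bucket_arns(read_only_buckets),
            },
        ],
    }


# Placeholder bucket names for illustration only.
endpoint_policy = build_s3_endpoint_policy(
    read_write_buckets=["example-data-bucket"],
    read_only_buckets=["example-required-artifacts-bucket"],
)
endpoint_policy_json = json.dumps(endpoint_policy, indent=2)
```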
Customer-managed keys for managed services
Encrypt your notebooks, secrets, queries and query history stored in the Databricks control plane with your own managed keys from AWS Key Management Service (KMS). Databricks maintains an access key hierarchy similar to the envelope encryption technique used by cloud service providers, and thus encrypts the data with a Data Encryption Key (DEK) that’s wrapped by your own key (CMK).
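To make the key hierarchy concrete, here is a toy sketch of envelope encryption using only the Python standard library. It is not how Databricks or KMS implement it: the cipher is a simple HMAC-based keystream that stands in for a real algorithm and must not be used to protect data, and in a real deployment the CMK never leaves KMS. The function names are hypothetical. The point is only the structure: data is encrypted with a fresh DEK, and the DEK is stored wrapped by the CMK.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (XOR with an HMAC-SHA256 keystream). NOT for real use."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def envelope_encrypt(cmk: bytes, plaintext: bytes) -> dict:
    """Encrypt data with a fresh DEK, then wrap the DEK with the CMK."""
    dek = secrets.token_bytes(32)
    data_nonce = secrets.token_bytes(16)
    key_nonce = secrets.token_bytes(16)
    return {
        "ciphertext": _keystream_xor(dek, data_nonce, plaintext),
        "data_nonce": data_nonce,
        "wrapped_dek": _keystream_xor(cmk, key_nonce, dek),
        "key_nonce": key_nonce,
    }


def envelope_decrypt(cmk: bytes, envelope: dict) -> bytes:
    """Unwrap the DEK with the CMK, then decrypt the data with the DEK."""
    dek = _keystream_xor(cmk, envelope["key_nonce"], envelope["wrapped_dek"])
    return _keystream_xor(dek, envelope["data_nonce"], envelope["ciphertext"])
```

Because only the wrapped DEK is stored alongside the data, revoking access to the CMK makes the DEK, and therefore the data, unreadable.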
Customer-managed keys for workspace storage
Encrypt the data on your workspace’s root S3 bucket and, optionally, your cluster EBS volumes created in your AWS account using your own managed keys from AWS Key Management Service (KMS). You can use the same or different CMKs for managed services and workspace storage and across multiple workspaces.
Restrict access to corporate IPs with IP access lists
If you’re not using private-only workspaces, you can restrict Databricks workspace access to only trusted IP addresses (e.g., the public IP of your on-premises egress gateway). Using IP Access Lists, configure both Allow and Block lists, ensuring that users can only access their Databricks workspaces from known and trusted IPs.
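The evaluation logic can be modeled locally as in the sketch below. It assumes that block lists are checked first and that, once at least one allow entry exists, only matching addresses are admitted; the function name is hypothetical and the addresses come from documentation-reserved ranges. Confirm the exact evaluation order against the IP Access Lists documentation before relying on it.

```python
import ipaddress


def is_ip_allowed(client_ip, allow_list, block_list):
    """Model allow/block evaluation for a single client IP address.

    Entries may be single addresses ("203.0.113.66") or CIDR ranges
    ("203.0.113.0/24").
    """
    ip = ipaddress.ip_address(client_ip)

    def matches(entries):
        return any(ip in ipaddress.ip_network(entry, strict=False) for entry in entries)

    # Assumption: a block-list match always wins.
    if matches(block_list):
        return False
    # Assumption: with a non-empty allow list, only listed addresses pass.
    if allow_list:
        return matches(allow_list)
    # No lists configured: nothing is restricted.
    return True


# Placeholder corporate egress range and one blocked address within it.
allow = ["203.0.113.0/24"]
block = ["203.0.113.66"]
print(is_ip_allowed("203.0.113.10", allow, block))
print(is_ip_allowed("203.0.113.66", allow, block))
```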
Implement fine-grained data security and masking with dynamic views
Most cloud-native security controls are based on identity and access management at the file level, but what happens if you want to provide different representations of the underlying datasets to different users, or mask and redact specific columns? On Databricks, data owners can build dynamic views and manage access to the tables they’ve built using SQL-based Data Object Privileges. These permissions are strictly enforced on Table Access Control clusters and SQL Analytics endpoints.
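As an illustration of the dynamic view pattern, the sketch below generates the SQL for a view that shows selected columns only to members of a privileged group and redacts them for everyone else, using the `is_member()` function. The table, view, column and group names are hypothetical, and the Python helper is just a convenient way to assemble the statement; it accepts only simple identifiers because the names are interpolated directly into SQL.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def build_masked_view_sql(view_name, source_table, columns, masked_columns, privileged_group):
    """Return a CREATE VIEW statement that redacts masked columns for non-members."""
    for name in [view_name, source_table, privileged_group, *columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")

    select_items = []
    for col in columns:
        if col in masked_columns:
            # Members of the privileged group see the real value.
            select_items.append(
                f"CASE WHEN is_member('{privileged_group}') THEN {col} "
                f"ELSE 'REDACTED' END AS {col}"
            )
        else:
            select_items.append(col)

    select_list = ",\n  ".join(select_items)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical names for illustration only.
view_sql = build_masked_view_sql(
    view_name="sales_redacted",
    source_table="sales_raw",
    columns=["user_id", "email", "total"],
    masked_columns={"email"},
    privileged_group="auditors",
)
print(view_sql)
```

Users would then be granted access to the view rather than to the underlying table, so the masking logic cannot be bypassed.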
Use cluster policies to enforce data access patterns & manage costs
Enforcing certain data access patterns like Credential Passthrough or Table Access Controls requires Databricks administrators to be able to strictly enforce the types of clusters that users are able to create. Cluster Policies allow Databricks administrators to create templates of approved cluster configurations, and then enforce the use of those policies. This helps from a cost perspective too -- project-based tags could be enforced on cluster resources for chargeback purposes, or users could be made to request expensive resources like GPU clusters on an exception rather than self-serve basis.
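A policy definition along these lines might look like the sketch below, which fixes a project tag for chargeback, limits the instance types users can choose and caps the auto-termination window. The tag value and instance types are placeholders. The `find_violations` helper is hypothetical and only mimics enforcement locally on a flattened set of cluster attributes; the real checks happen in Databricks when a cluster is created under the policy.

```python
import json

# Illustrative policy definition using fixed, allowlist and range rules.
policy_definition = {
    "custom_tags.project": {"type": "fixed", "value": "example-project"},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autotermination_minutes": {"type": "range", "maxValue": 120},
}


def find_violations(cluster_attrs, policy):
    """Return the attribute paths in cluster_attrs that break the policy."""
    problems = []
    for path, rule in policy.items():
        value = cluster_attrs.get(path)
        kind = rule["type"]
        if kind == "fixed":
            ok = value == rule["value"]
        elif kind == "allowlist":
            ok = value in rule["values"]
        elif kind == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            ok = value is not None and low <= value <= high
        else:
            ok = True  # rule types not modeled in this sketch are ignored
        if not ok:
            problems.append(path)
    return problems


# A request for a GPU instance type that is not on the allowlist.
requested = {
    "custom_tags.project": "example-project",
    "node_type_id": "p3.2xlarge",
    "autotermination_minutes": 60,
}
print(find_violations(requested, policy_definition))
policy_json = json.dumps(policy_definition, indent=2)
```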
Automatically provision & sync users & groups from your identity provider using SCIM
Creating silos of users and groups to manage access to your data and analytics tools is an anti-pattern that introduces complexity and risk. With Databricks, configure users and groups to be synced automatically from your Identity Provider using System for Cross-domain Identity Management (SCIM).
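In practice your Identity Provider's SCIM connector issues these calls for you, but it can help to see roughly what one looks like. The sketch below builds, without sending, a request that creates a user through the workspace SCIM endpoint. The workspace URL, token, user name and group ID are placeholders, the helper name is hypothetical, and the endpoint path and payload shape reflect the SCIM API as we understand it, so verify them against the current API reference.

```python
import json
import urllib.request

SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"


def build_scim_create_user_request(workspace_url, token, user_name, group_ids=()):
    """Build (but do not send) a SCIM request that creates a workspace user."""
    payload = {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": user_name,
        # Assumption: group membership is passed as a list of group IDs.
        "groups": [{"value": group_id} for group_id in group_ids],
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/preview/scim/v2/Users",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/scim+json",
        },
        method="POST",
    )


# Placeholder values; nothing is sent over the network here.
request = build_scim_create_user_request(
    workspace_url="https://example-workspace.cloud.databricks.com/",
    token="<admin-token>",
    user_name="new.user@example.com",
    group_ids=["123456"],
)
```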
Manage personal access tokens using the token management API
Personal Access Tokens (PATs) allow your users to access non-UI interfaces of a Databricks workspace, whether via the CLI, API or third-party tools. However, PATs are prone to misuse if not controlled and audited centrally. The Token Management API allows administrators to create, list and revoke PATs and to manage their lifetimes, providing a central point of control to reduce risk and improve governance.
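A periodic review job could use logic like the sketch below to decide which tokens to revoke. It assumes you have already fetched the token list (for example from `GET /api/2.0/token-management/tokens`) and that each entry carries `token_id`, `creation_time` and `expiry_time` in epoch milliseconds, with `-1` meaning the token never expires. Those field conventions, the helper name and the sample data are assumptions to check against the API reference. Each flagged ID would then be revoked with `DELETE /api/2.0/token-management/tokens/{token_id}`.

```python
import time

MS_PER_DAY = 24 * 60 * 60 * 1000


def select_tokens_to_revoke(token_infos, max_age_days, now_ms=None):
    """Return IDs of tokens that are older than max_age_days or never expire."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    max_age_ms = max_age_days * MS_PER_DAY

    flagged = []
    for info in token_infos:
        too_old = now_ms - info["creation_time"] > max_age_ms
        # Assumption: an expiry_time of -1 marks a token with no expiry.
        never_expires = info.get("expiry_time", -1) == -1
        if too_old or never_expires:
            flagged.append(info["token_id"])
    return flagged


# Made-up sample data; timestamps are relative to a fixed "now".
now = 1_700_000_000_000
sample_tokens = [
    {"token_id": "recent", "creation_time": now - 5 * MS_PER_DAY, "expiry_time": now + 30 * MS_PER_DAY},
    {"token_id": "stale", "creation_time": now - 200 * MS_PER_DAY, "expiry_time": now + 30 * MS_PER_DAY},
    {"token_id": "forever", "creation_time": now - 1 * MS_PER_DAY, "expiry_time": -1},
]
print(select_tokens_to_revoke(sample_tokens, max_age_days=90, now_ms=now))
```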
Trust but verify with Databricks
Databricks administrators can configure delivery of low-latency audit logs for activities performed by Databricks users. These logs can be joined with AWS-specific logs like S3 server access logs to provide a 360-degree view of who’s doing what and when.
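As a starting point for that kind of analysis, the sketch below counts audit events per user and action from JSON-lines records. It assumes each record has `serviceName`, `actionName` and a nested `userIdentity.email` field; the helper name and the sample records are made up for illustration. Joining with S3 server access logs would be a further step that is not shown here.

```python
import json
from collections import Counter


def summarize_audit_events(json_lines):
    """Count audit events by (user email, service.action)."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        user = event.get("userIdentity", {}).get("email", "unknown")
        action = f'{event.get("serviceName", "?")}.{event.get("actionName", "?")}'
        counts[(user, action)] += 1
    return counts


# Made-up sample records for illustration only.
sample_lines = [
    '{"serviceName": "clusters", "actionName": "create", "userIdentity": {"email": "a@example.com"}}',
    '{"serviceName": "clusters", "actionName": "create", "userIdentity": {"email": "a@example.com"}}',
    '{"serviceName": "notebook", "actionName": "runCommand", "userIdentity": {"email": "b@example.com"}}',
]
for (user, action), count in sorted(summarize_audit_events(sample_lines).items()):
    print(user, action, count)
```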