Our release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a unified data, analytics and AI platform that is truly multi-cloud. Databricks on GCP, a jointly-developed service that allows you to store all of your data on a simple, open lakehouse platform, is based on standard containers running on top of Google's Kubernetes Engine (GKE).
When we released Databricks on GCP, the feedback was "it just works!" However, some of you asked deeper questions about Databricks and Kubernetes, so we’ve decided to share reasons for using GKE, our learnings and some key implementation details.
Why Google Kubernetes Engine?
Open source software and containers
At Databricks, open source is core to who we are, which is why we’ve continued to create and contribute to major open source projects, such as Apache Spark™, MLflow,Delta Lake and Delta Sharing. As a company, we also contribute back to the community and use open source on a daily basis.
We have been using containers for many years. For example, in MLflow, users build machine learning (ML) models as Docker images, store them in a container registry and then deploy and run the model from the registry.
Another example are Databricks notebooks: version-controlled container images simplify the support for multiple Spark, Python and Scala versions, and containers lead to faster iterations in software development and more stable production systems.
Kubernetes and hyperscale
We are well aware that a container orchestration system, such as Kubernetes, brings its own challenges. The underlying concepts of Kubernetes and its abundance of features demand an experienced and knowledgeable data engineering team.
Databricks, however, has grown into a hyperscale environment within just a few years by successfully building on containers creating open source software. Our customers spin up millions of instances per day, and we are supporting hundreds of thousands of data scientists each month.
Security and simplicity
What matters most to us is delivering new features for data engineers and data scientists faster. When it came to designing Databricks on GCP, our engineering team looked at the best options for fulfilling our security and scalability requirements. Our goal was to simplify the implementation and focus less on lower-level infrastructure, dependencies and instance life-cycle. With Kubernetes, our engineers could leverage the strong momentum from the open source community to drive infrastructure logic and security.
GKE and other Google Cloud Services
We critically evaluated the tradeoff between the operational expertise required and the benefits gained from operating a large, upstream Kubernetes environment in production and ultimately decided against using a self-managed Kubernetes cluster.
The key reasons for selecting GKE instead are the fast adoption of new Kubernetes versions and Google's priority for infrastructure security. GKE from Google, the original creator of Kubernetes, is one of the most advanced managed Kubernetes services on the market.
On the one hand, Databricks integrates with all the key GCP cloud services like Google Cloud Storage, Google BigQuery and Google Looker. On the other hand, our implementation is running on top of GKE.
Databricks on Google Kubernetes Engine
Splitting a distributed system into a control plane and a user plane is a well-known design pattern. The task of the control plane is to manage and serve customer configuration. The data plane, which is often much larger, is for executing customer requests.
Databricks on GCP follows the same pattern. The Databricks operated control plane creates, manages and monitors the data plane in the GCP account of the customer. The data plane contains the driver and executor nodes of your Spark cluster.
GKE clusters, namespaces and custom resource definitions
When a Databricks account admin launches a new Databricks workspace, the corresponding data plane is created in the customer’s GCP account as a regional GKE cluster in a VPC (see Figure 1). There is a 1:1 relation between workspaces, GKE clusters and VPCs. Workspace users never interact with data plane resources directly. Instead, they do so indirectly via the control plane, where Databricks enforces access control and resource isolation among workspace users. Databricks also deallocates GKE compute resources intelligently based on customer usage patterns to save costs.
GKE cluster and node pools
The GKE cluster is bootstrapped with a system node pool dedicated to running workspace-wide trusted services. When launching a Databricks cluster, the user specifies the number of executor nodes, as well as the machine types for the driver node and the executor nodes. The cluster manager, which is part of the control plane, creates and maintains a GKE nodepool for each of those machine types; driver and executor nodes often run on different machine types, and therefore are served from different node pools.
Kubernetes offers namespaces to create virtual clusters with scoped names (hence the name). Individual Databricks clusters are separated from each other via Kubernetes namespaces in a single GKE cluster and a single Databricks workspace can contain hundreds of Databricks clusters. GCP network policies isolate the Databricks cluster network within the same GKE cluster and further improve the security. A node in a Databricks cluster can only communicate with other nodes in the same cluster (or use the NAT gateway to access the internet or other public GCP services).
Custom resource definitions
Kubernetes was designed from the ground up to allow the customization and extension of its API using Kubernetes custom resource definitions (CRD). For every Databricks cluster in a workspace, we deploy a Databricks runtime (DBR) as a Kubernetes CRD.
Nodepools, pods and sidecars
The Spark driver and executors are deployed as Kubernetes pods, which are running inside the nodes of the corresponding nodepool specified by a Kubernetes pod node selector. One GKE node is exclusively used by either a driver pod or by an executor pod. Cluster namespaces are configured with Kubernetes memory requests and limits.
On each Kubernetes node, Databricks also runs a few trusted daemon containers along with the driver or executor container. These daemons are trusted sidecar services that facilitate data access and log collection on the node. Driver or executor containers can only interact with the daemon containers on the same pod through restricted interfaces.
Frequently Asked Questions (FAQ)
Q: Can I deploy my own pods in the Databricks provided GKE cluster?
You cannot access the Databricks GKE cluster. It is restricted for maximum security and configured for minimal resource usage.
Q: Can I deploy Databricks on my own custom GKE cluster?
We don't support this at the moment.
Q: Can I access the Databricks GKE cluster with kubectl?
Although the data plane of the GKE cluster is running in the customer account, default access restrictions and firewall settings are in place to prevent unauthorized access.
Q: Is Databricks on GKE faster, (e.g. cluster startup times), than Databricks on VMs or other clouds?
We encourage you to do your own measurements since the answer to this question depends on many factors. One benefit of the Databricks multi-cloud offering is that you can run such tests quickly. Our initial tests have shown that for a large number of concurrent workers, cold startup time was faster on GKE compared to other cloud offerings. Instances with comparable local SSDs did run certain Spark workloads slightly faster compared to some other clouds with similar compute core/memory/disk spec.
Q: Why aren't you using one GKE cluster per Databricks cluster?
For efficiency reasons. Databricks clusters are created frequently, and some of them are short-lived (e.g. with short-running jobs).
Q: How long does it take to start up a cluster with 100 nodes?
Startup - even for large clusters of more than 100 nodes - happens in parallel, and thus the startup time does not depend on the cluster size. We recommend you measure the startup time for your individual setup and settings.
Q: How can I optimize how pods are assigned to a node for cost efficiency? I want to schedule several Spark executor pods to a larger node.
Pods are optimally configured by Databricks for their respective usage (driver or worker nodes).
Q: Can I bring my own VPC for the GKE cluster?
Please contact your Databricks account manager for our future roadmap if you are interested in this feature.
Q: Is it safe that Databricks is running multiple Databricks clusters within a single GKE cluster?
Databricks clusters are fully isolated against each other using Kubernetes namespaces and GCP network policies. Only Databricks clusters from the same Databricks workspace share a GKE cluster for reduced cost and faster provisioning. If you have several workspaces they will be running on their own GKE cluster.
Q: Doesn't GKE add extra network overhead compared to just having VMs?
Our initial tests on GCP with the iperf3 benchmarks on n1-standard-4 instances in us-west2/1 showed excellent inter-pod throughput of more than 9 Gbps. GCP in general provides a high throughput connection to the internet with very low latencies.
Q: Now that Databricks is fully containerized, can I pull the Databricks images and use them myself, (e.g. on my local Kubernetes cluster)?
Databricks does not currently support this.
Q: Does Databricks on GCP limit us to one AZ within a region? How does node allocation to GKE actually work?
A GKE cluster uses all the AZs in a region.
Q: What features does Databricks on GCP include?
Please check out this link for up-to-date information.
The authors would like to thank Silviu Tofan for his valuable input and support.