In this demo, we walk through an overview of the Databricks Platform, including the platform architecture and the Databricks Data Science & Engineering, Databricks Machine Learning and Databricks SQL environments.
Let’s take a look at the architecture of the Databricks platform. While understanding the details of all the components and how they’re integrated falls under the responsibility of a platform administrator, as a data engineer, it’s good to have a broad understanding of the structure and how it all comes together.
This diagram illustrates the Databricks architecture. The control plane consists of the backend services that Databricks manages in its own cloud account, aligned with the cloud service used by the customer: AWS, Azure, or GCP. Though the majority of your data does not live here, some elements, such as notebook commands and workspace configurations, are stored in the control plane and encrypted at rest. Through the control plane and the UI and APIs it provides, one can launch clusters, start jobs and get results, and interact with table metadata.
The Databricks web application delivers three different services catering to the specific needs of various personas: Databricks SQL, Databricks Machine Learning, and the Data Science & Engineering workspace, also known simply as the workspace.
A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads. You run these workloads as a set of commands in a notebook or as a job. Typical applications include production ETL pipelines, streaming analytics, ad hoc analytics, and machine learning. The clusters live in the data plane within your organization’s cloud account.
Although cluster management is a function of the control plane, as part of the services offered by the Databricks platform, the clusters themselves consist of one or more virtual machine instances over which computation workloads are distributed by Apache Spark™️. In the typical case, a cluster has a driver node alongside one or more worker nodes, and workloads are distributed across the available worker nodes by the driver. Databricks also provides a single-node mode, which is typically limited to development or testing with small workloads.
Databricks makes a distinction between all-purpose clusters and job clusters. All-purpose clusters are used to analyze data collaboratively using interactive notebooks. You can create an all-purpose cluster using the workspace, or programmatically using the command line interface or the REST API. You can manually terminate and restart an all-purpose cluster, and multiple users can share all-purpose clusters to do collaborative, interactive analysis. Job clusters run automated jobs in an expeditious and robust way.
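As a rough sketch of what creating an all-purpose cluster programmatically looks like, the snippet below builds a JSON payload of the kind accepted by the Clusters API (`POST /api/2.0/clusters/create`). The cluster name, runtime version, and node type here are illustrative placeholders, not values from this demo; check your own workspace for the options available in your cloud.

```python
import json

# Illustrative cluster spec for the Clusters API (POST /api/2.0/clusters/create).
# All values below are placeholders: pick a Databricks Runtime version and a
# node type that exist in your workspace's cloud (AWS, Azure, or GCP).
cluster_spec = {
    "cluster_name": "shared-analytics",       # visible in the workspace UI
    "spark_version": "13.3.x-scala2.12",      # example Databricks Runtime label
    "node_type_id": "i3.xlarge",              # example AWS instance type
    "num_workers": 2,                         # the driver node is provisioned in addition
    "autotermination_minutes": 60,            # auto-terminate when idle to save cost
}

payload = json.dumps(cluster_spec, indent=2)
print(payload)
```

A payload like this could be submitted with an authenticated HTTP POST, or saved to a file and passed to the Databricks CLI's cluster-creation command; either way, authentication against your workspace is required first.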
The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster. These properties ensure an isolated execution environment for each and every job. For job clusters, configuration information is retained for up to 30 clusters recently terminated by the job scheduler. For all-purpose clusters, configuration information is retained for up to 70 clusters terminated within the last 30 days. To retain configuration information beyond this period, an administrator must pin the cluster.
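To make the job-cluster lifecycle concrete, here is a hypothetical job definition of the shape accepted by the Jobs API (`POST /api/2.1/jobs/create`). The `new_cluster` block is what tells the scheduler to create a fresh job cluster for each run and terminate it when the run completes; the job name, notebook path, and cluster sizing are invented for illustration.

```python
import json

# Illustrative job definition for the Jobs API (POST /api/2.1/jobs/create).
# The notebook path and cluster sizing below are placeholders.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            # "new_cluster" (rather than an existing cluster id) means the
            # scheduler provisions an isolated job cluster for each run and
            # terminates it when the run finishes.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
        }
    ],
}

print(json.dumps(job_spec, indent=2))
```

Because each run gets its own cluster, one job's libraries or failures cannot leak into another's environment, which is the isolation property described above.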
Ready to get started?