Saving Time and Cost With Cluster Reuse in Databricks Jobs
February 4, 2022 in Product
With our launch of Jobs Orchestration, orchestrating pipelines in Databricks has become significantly easier. Splitting ETL or ML pipelines into multiple tasks makes them easier to create and manage. With this modular approach, teams can define and work on their respective responsibilities independently, and tasks can run in parallel to reduce overall execution time. This capability was a major step in transforming how our customers create, run, monitor, and manage sophisticated data and machine learning workflows across any cloud. Today, we are excited to share a further enhancement to our orchestration capabilities: the ability to reuse the same cluster across multiple tasks in a job run, saving even more time and money for our customers.
Until now, each task had its own cluster to accommodate different types of workloads. While this flexibility allows for fine-grained configuration, it can also add time and cost overhead: every cluster has to start up separately, and clusters can sit underutilized while tasks run in parallel.
To maintain this flexibility while further improving utilization, we are excited to announce cluster reuse. By sharing job clusters across multiple tasks, customers can reduce the time a job takes, reduce costs by eliminating startup overhead, and increase cluster utilization with parallel tasks.
When defining a task, customers will have the option to either configure a new cluster or choose an existing one. With cluster reuse, your list of existing clusters will now contain clusters defined in other tasks in the job. When multiple tasks share a job cluster, the cluster is initialized when the first task that uses it starts, and it stays on until the last task that uses it finishes. This way, there is no additional startup time after the cluster has been initialized, which reduces both time and cost, and the tasks still run on job clusters that are isolated from other workloads.
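As a rough illustration of what a shared job cluster can look like when a job is defined programmatically, here is a minimal sketch in Python. It assumes the request shape of the Jobs API 2.1, where a cluster is declared once under `job_clusters` and each task refers to it by `job_cluster_key`. The job name, task names, notebook paths, runtime version, and node type are hypothetical placeholders, and `tasks_by_cluster` is a small helper written for this example, not part of any Databricks library.

```python
import json
from collections import defaultdict

# Hypothetical job definition in the shape of a Jobs API 2.1 request body.
# All names, paths, and cluster settings below are placeholders.
job_spec = {
    "name": "example-etl-pipeline",
    "job_clusters": [
        {
            # Declared once, then referenced by key from each task.
            "job_cluster_key": "shared_etl_cluster",
            "new_cluster": {
                "spark_version": "9.1.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_etl_cluster",
            "notebook_task": {"notebook_path": "/Pipelines/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_etl_cluster",
            "notebook_task": {"notebook_path": "/Pipelines/transform"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "transform"}],
            "job_cluster_key": "shared_etl_cluster",
            "notebook_task": {"notebook_path": "/Pipelines/report"},
        },
    ],
}


def tasks_by_cluster(spec):
    """Group task keys by the job cluster they reference (illustrative helper)."""
    grouped = defaultdict(list)
    for task in spec["tasks"]:
        key = task.get("job_cluster_key")
        if key is not None:
            grouped[key].append(task["task_key"])
    return dict(grouped)


if __name__ == "__main__":
    # Show which tasks would share a single cluster startup.
    print(tasks_by_cluster(job_spec))
    # The spec serializes to JSON, ready to be sent as a request body.
    print(json.dumps(job_spec, indent=2)[:80] + "...")
```

In this sketch all three tasks point at `shared_etl_cluster`, so the cluster would start with `ingest` and stay up until `report` finishes. A task that needs a different configuration could still declare its own cluster instead of a `job_cluster_key`.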
We hope you are as excited as we are about this new functionality. Learn more about cluster reuse and start using shared job clusters now to save startup time and cost. Please reach out if you have any feedback for us.