Companies across every industry continue to prioritize optimization and the value of doing more with less. This is especially true of digital native companies in today’s data landscape, where demand for AI and data-intensive workloads keeps growing. These organizations manage thousands of resources across cloud and platform environments. To innovate and iterate quickly, many of these resources are democratized across teams or business units; however, higher velocity for data practitioners can lead to chaos unless it is balanced with careful cost management.
Digital native organizations frequently employ central platform, DevOps, or FinOps teams to oversee the costs and controls for cloud and platform resources. Formal practice of cost control and oversight, popularized by The FinOps Foundation™, is also supported by Databricks with features such as tagging, budgets, compute policies, and more. Nonetheless, the decision to prioritize cost management and establish structured ownership does not create cost maturity overnight. The methodologies and features covered in this blog enable teams to incrementally mature cost management within the Data Intelligence Platform.
What we’ll cover:
- Cost attribution: tagging classic and serverless compute so spend can be traced to teams, projects, and environments
- Cost reporting: monitoring costs with system tables, AI/BI dashboards, and budgets
- Cost controls: guardrails such as compute policies, statement timeouts, and AI Gateway rate limits
- Cost optimization: compute, workload, and storage optimizations
- Organizational best practices for FinOps ownership
Whether you’re an engineer, architect, or FinOps professional, this blog will help you maximize efficiency while minimizing costs, ensuring that your Databricks environment remains both high-performing and cost-effective.
We will now take an incremental approach to implementing mature cost management practices on the Databricks Platform. Think of it as a “Crawl, Walk, Run” journey from chaos to control, which we will walk through step by step.
The first step is to correctly assign expenses to the right teams, projects, or workloads. This involves efficiently tagging all the resources (including serverless compute) to gain a clear view of where costs are being incurred. Proper attribution enables accurate budgeting and accountability across teams.
Cost attribution can be done for all compute SKUs with a tagging strategy, whether for a classic or serverless compute model. Classic compute (workflows, Declarative Pipelines, SQL Warehouse, etc.) inherits tags on the cluster definition, while serverless adheres to Serverless Budget Policies (AWS | Azure | GCP).
In general, you can add tags to two kinds of resources:
- Compute resources: clusters, SQL warehouses, jobs, pipelines, and serverless usage (via Serverless Budget Policies)
- Unity Catalog securables: catalogs, schemas, tables, and other objects
Tagging both types of resources contributes to effective governance and management.
Refer to this article (AWS | Azure | GCP) for details about tagging different compute resources, and this article (AWS | Azure | GCP) for details about tagging Unity Catalog securables.
For classic compute, tags can be specified in the settings when creating the compute. Below are some examples showing how tags can be defined for different types of compute, using both the UI and the Databricks SDK.
SQL Warehouse Compute:
You can set the tags for a SQL Warehouse in the Advanced Options section.
With Databricks SDK:
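Below is a minimal sketch using the Databricks SDK for Python; the warehouse name, size, and tag values are illustrative:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import EndpointTags, EndpointTagPair

w = WorkspaceClient()

# Create a small SQL warehouse with cost-attribution tags attached.
w.warehouses.create(
    name="finops-demo-warehouse",  # illustrative name
    cluster_size="Small",
    max_num_clusters=1,
    auto_stop_mins=10,
    tags=EndpointTags(
        custom_tags=[
            EndpointTagPair(key="team", value="data-engineering"),
            EndpointTagPair(key="cost_center", value="cc-1234"),
        ]
    ),
)
```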
All-Purpose Compute:
With Databricks SDK:
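A minimal sketch with the Databricks SDK for Python; the cluster name and tag values are illustrative:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create an all-purpose cluster with auto-termination and cost-attribution tags.
w.clusters.create(
    cluster_name="finops-demo-interactive",  # illustrative name
    spark_version=w.clusters.select_spark_version(long_term_support=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),
    num_workers=1,
    autotermination_minutes=30,
    custom_tags={"team": "data-engineering", "cost_center": "cc-1234"},
).result()  # wait until the cluster is running
```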
Job Compute:
With Databricks SDK:
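A minimal sketch with the Databricks SDK for Python; the notebook path and tag values are illustrative. Tags on the job’s new cluster flow through to the billing records of each run:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

# Create a job whose dedicated job cluster carries cost-attribution tags.
w.jobs.create(
    name="finops-demo-job",
    tasks=[
        jobs.Task(
            task_key="nightly_etl",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/nightly"),  # illustrative path
            new_cluster=compute.ClusterSpec(
                spark_version=w.clusters.select_spark_version(long_term_support=True),
                node_type_id=w.clusters.select_node_type(local_disk=True),
                num_workers=2,
                custom_tags={"team": "data-engineering", "cost_center": "cc-1234"},
            ),
        )
    ],
)
```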
Declarative Pipelines:
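Tags for a pipeline are set on its cluster definition, either in the pipeline settings UI or programmatically. A minimal sketch with the Databricks SDK for Python; the notebook path and tag values are illustrative:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()

# Create a pipeline whose compute carries cost-attribution tags.
w.pipelines.create(
    name="finops-demo-pipeline",
    libraries=[
        pipelines.PipelineLibrary(
            notebook=pipelines.NotebookLibrary(path="/Workspace/pipelines/bronze_to_silver")  # illustrative
        )
    ],
    clusters=[
        pipelines.PipelineCluster(
            label="default",
            num_workers=1,
            custom_tags={"team": "data-engineering", "cost_center": "cc-1234"},
        )
    ],
)
```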
For serverless compute, you should assign tags with a Serverless Budget Policy. Creating a policy lets you specify a policy name and tags as string keys and values; users or service principals are then granted permission to use the policy, and any serverless usage they incur is automatically tagged with the policy’s tags.
You can refer to details about serverless Budget Policies in these articles (AWS | Azure | GCP).
Keep in mind that Budget Policies attribute (tag) serverless usage for reporting; they do not enforce spending limits on their own. You can also create Budget Policies programmatically.
With Terraform:
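The sketch below assumes a recent version of the Databricks Terraform provider that exposes the databricks_budget_policy resource; verify the exact resource name and schema against the provider documentation. The policy name and tag values are illustrative:

```hcl
resource "databricks_budget_policy" "data_engineering" {
  policy_name = "data-engineering-serverless" # illustrative

  custom_tags {
    key   = "team"
    value = "data-engineering"
  }

  custom_tags {
    key   = "cost_center"
    value = "cc-1234"
  }
}
```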
Next is cost reporting, or the ability to monitor costs with the context provided by Step 1. Databricks provides built-in system tables, like system.billing.usage, which is the foundation for cost reporting. System tables are also useful when customers want to customize their reporting solution.
For example, the Account Usage dashboard you’ll see next is a Databricks AI/BI dashboard, so you can view all of its queries and customize the dashboard to fit your needs very easily. And if you need to write ad hoc queries against your Databricks usage with very specific filters, the system tables are at your disposal.
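For instance, a simple ad hoc query against system.billing.usage can aggregate DBUs by a tag key; the "team" tag below is just an example of a key you might have applied:

```sql
-- DBU consumption over the last 30 days, broken down by SKU and the (example) "team" tag
SELECT
  usage_date,
  sku_name,
  custom_tags['team'] AS team,
  SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY ALL
ORDER BY dbus DESC;
```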
Once you have started tagging your resources and attributing costs to their cost centers, teams, projects, or environments, you can begin to discover the areas where costs are the highest. Databricks provides a Usage Dashboard you can simply import to your own workspace as an AI/BI dashboard, providing immediate out-of-the-box cost reporting.
A new version (2.0) of this dashboard is available in preview with several improvements shown below. Even if you have previously imported the Account Usage dashboard, please import the new version from GitHub today!
This dashboard provides a ton of useful information and visualizations, including usage broken down by SKU, workspace, and custom tags, along with trends over time.
The dashboard also allows you to filter by date ranges, workspaces, products, and even enter custom discounts for private rates. With so much packed into this dashboard, it really is your primary one-stop shop for most of your cost reporting needs.
For Lakeflow jobs, we recommend the Jobs System Tables AI/BI Dashboard to quickly see potential resource-based costs, as well as opportunities for optimization, such as jobs still running on interactive All-Purpose compute (more on that in the Cost Optimization section below).
For enhanced monitoring of Databricks SQL, refer to our SQL SME blog here. In this guide, our SQL experts will walk you through the Granular Cost Monitoring dashboard you can set up today to see SQL costs by user, source, and even query-level costs.
Likewise, we have a specialized dashboard for monitoring cost for Model Serving! This is helpful for more granular reporting on batch inference, pay-per-token usage, provisioned throughput endpoints, and more. For more information, see this related blog.
We talked about Serverless Budget Policies earlier as a way to attribute, or tag, serverless compute usage, but Databricks also offers Budgets (AWS | Azure | GCP), a separate feature. Budgets can be used to track account-wide spending, or apply filters to track the spending of specific teams, projects, or workspaces.
With budgets, you specify the workspace(s) and/or tag(s) you want the budget to match on, then set an amount (in USD), and you can have it email a list of recipients when the budget has been exceeded. This can be useful to reactively alert users when their spending has exceeded a given amount. Please note that budgets use the list price of the SKU.
Next, administrators need the ability to set guardrails so that data teams can be both self-sufficient and cost-conscious at the same time. Databricks simplifies this for both administrators and practitioners with Compute Policies (AWS | Azure | GCP).
Several attributes can be controlled with compute policies, including all cluster attributes as well as important virtual attributes such as dbus_per_hour. We’ll review a few of the key attributes to govern for cost control specifically.
Often, when creating compute policies to enable self-service cluster creation for teams, we want to control the maximum spending of those users. This is where one of the most important policy attributes for cost control applies: dbus_per_hour.

dbus_per_hour can be used with a range policy type to set lower and upper bounds on the DBU cost of clusters that users are able to create. However, this only enforces a maximum DBU rate per cluster that uses the policy, so a single user with permission to this policy could still create many clusters, each capped at the specified DBU limit.
To take this further, and prevent an unlimited number of clusters being created by each user, we can use another setting, max_clusters_by_user, which is actually a setting on the top-level compute policy rather than an attribute you would find in the policy definition.
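Here is a sketch of a policy definition capping each cluster at 10 DBUs per hour (the bound is illustrative); max_clusters_by_user itself is set on the policy object through the UI, API, or Terraform rather than inside this JSON definition:

```json
{
  "dbus_per_hour": {
    "type": "range",
    "maxValue": 10
  }
}
```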
Policies should also enforce which cluster type they can be used for, using the cluster_type virtual attribute, which can be one of "all-purpose", "job", or "dlt". We recommend using the fixed type to enforce exactly the cluster type the policy is designed for:
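For example, a policy intended only for job clusters could pin the type like this:

```json
{
  "cluster_type": {
    "type": "fixed",
    "value": "job"
  }
}
```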
A common pattern is to create separate policies for jobs and pipelines versus all-purpose clusters, setting max_clusters_by_user to 1 for all-purpose clusters (e.g., how Databricks’ default Personal Compute policy is defined) and allowing a higher number of clusters per user for jobs.
VM instance types can be conveniently controlled with an allowlist or regex type. This lets users create clusters with some flexibility in the instance type without being able to choose sizes that may be too expensive or outside their budget:
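For example, on AWS a policy might limit users to a small set of instance sizes (the instance types below are illustrative and cloud-specific):

```json
{
  "node_type_id": {
    "type": "allowlist",
    "values": ["i3.xlarge", "i3.2xlarge", "i3.4xlarge"],
    "defaultValue": "i3.2xlarge"
  }
}
```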
It’s important to stay up-to-date with newer Databricks Runtimes (DBRs), and for extended support periods, consider Long-Term Support (LTS) releases. Compute policies have several special values to easily enforce this in the spark_version attribute; here are a few to be aware of:
- auto:latest-lts: maps to the latest long-term support (LTS) Databricks Runtime version.
- auto:latest-lts-ml: maps to the latest LTS Databricks Runtime ML version.
- auto:latest and auto:latest-ml: map to the latest Generally Available (GA) Databricks Runtime version (or its ML variant, respectively), which may not be LTS.
We recommend controlling the spark_version in your policy using an allowlist type:
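For example, a sketch that allows only LTS runtimes (standard or ML):

```json
{
  "spark_version": {
    "type": "allowlist",
    "values": ["auto:latest-lts", "auto:latest-lts-ml"],
    "defaultValue": "auto:latest-lts"
  }
}
```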
Cloud attributes can also be controlled in the policy, such as enforcing instance availability of spot instances with fallback to on-demand. Note that whenever using spot instances, you should always configure first_on_demand to at least 1 so the driver node of the cluster is always on-demand.
On AWS:
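A sketch of the relevant attributes, pinning spot-with-fallback availability and at least one on-demand node for the driver:

```json
{
  "aws_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK"
  },
  "aws_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1
  }
}
```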
On Azure:
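A similar sketch using the Azure-specific attribute names:

```json
{
  "azure_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK_AZURE"
  },
  "azure_attributes.first_on_demand": {
    "type": "fixed",
    "value": 1
  }
}
```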
On GCP (note: GCP cannot currently support the first_on_demand attribute):
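A sketch using the GCP-specific availability value (preemptible with fallback to on-demand):

```json
{
  "gcp_attributes.availability": {
    "type": "fixed",
    "value": "PREEMPTIBLE_WITH_FALLBACK_GCP"
  }
}
```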
As seen earlier, tagging is crucial to an organization’s ability to allocate cost and report it at granular levels. There are two things to consider when enforcing consistent tags in Databricks: the tag keys and values you want to standardize on, and how to enforce them in a compute policy with the custom_tags.<tag_name> attribute.

In the compute policy, we can control multiple custom tags by suffixing the custom_tags attribute with the tag name. It is recommended to use as many fixed tags as possible to reduce manual input for users, while an allowlist is excellent for allowing multiple choices yet keeping values consistent.
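For example, a sketch that fixes a cost_center tag and restricts a team tag to an approved list (tag keys and values are illustrative):

```json
{
  "custom_tags.cost_center": {
    "type": "fixed",
    "value": "cc-1234"
  },
  "custom_tags.team": {
    "type": "allowlist",
    "values": ["data-engineering", "data-science", "analytics"]
  }
}
```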
Long-running SQL queries can be very expensive and can even disrupt other queries if too many begin to queue up. They are usually caused by unoptimized queries (poor filters, or no filters at all) or unoptimized tables.
Admins can control for this by configuring the Statement Timeout at the workspace level. To set a workspace-level timeout, go to the workspace admin settings, click Compute, then click Manage next to SQL warehouses. In the SQL Configuration Parameters setting, add a configuration parameter where the timeout value is in seconds.
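For example, adding STATEMENT_TIMEOUT 3600 in the SQL Configuration Parameters caps any single statement at one hour; the same parameter can also be set for an individual session:

```sql
-- Session-level timeout in seconds (a sketch; the workspace-level value is configured in admin settings)
SET STATEMENT_TIMEOUT = 3600;
```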
ML models and LLMs can also be abused with too many requests, incurring unexpected costs. Databricks provides usage tracking and rate limits with an easy-to-use AI Gateway on model serving endpoints.
You can set rate limits on the endpoint as a whole, or per user. This can be configured with the Databricks UI, SDK, API, or Terraform; for example, we can deploy a Foundation Model endpoint with a rate limit using Terraform:
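The sketch below serves a Unity Catalog–registered model for simplicity rather than a Foundation Model, and assumes a provider version that supports the ai_gateway block on databricks_model_serving; the endpoint name, model name, and limits are illustrative, so verify the exact schema in the provider documentation:

```hcl
resource "databricks_model_serving" "rate_limited_endpoint" {
  name = "finops-demo-endpoint" # illustrative

  config {
    served_entities {
      entity_name           = "main.default.my_model" # illustrative UC model
      entity_version        = "1"
      workload_size         = "Small"
      scale_to_zero_enabled = true
    }
  }

  ai_gateway {
    usage_tracking_config {
      enabled = true
    }
    rate_limits {
      calls          = 100      # max requests per renewal period
      key            = "user"   # enforce the limit per user
      renewal_period = "minute"
    }
  }
}
```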
For more examples of real-world compute policies, see our Solution Accelerator here: https://github.com/databricks-industry-solutions/cluster-policy
Lastly, we will look at some of the optimizations you can check for in your workspace, clusters, and storage layers. Most of these can be checked and/or implemented automatically, which we’ll explore. Several optimizations take place at the compute level. These include actions such as right-sizing the VM instance type, knowing when to use Photon or not, appropriate selection of compute type, and more.
As mentioned in Cost Controls, cluster costs can be optimized by running automated jobs with Job Compute, not All-Purpose Compute. Exact pricing may depend on promotions and active discounts, but Job Compute is typically 2-3x cheaper than All-Purpose.
Job Compute also provides new compute instances each time, isolating workloads from one another, while still permitting multitask workflows to reuse the compute resources for all tasks if desired. See how to configure compute for jobs (AWS | Azure | GCP).
Using Databricks System tables, the following query can be used to find jobs running on interactive All-Purpose clusters. This is also included as part of the Jobs System Tables AI/BI Dashboard you can easily import to your workspace!
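A minimal sketch of such a query (the dashboard’s exact logic may differ slightly):

```sql
-- Jobs that consumed All-Purpose (interactive) compute in the last 30 days
SELECT
  workspace_id,
  usage_metadata.job_id     AS job_id,
  usage_metadata.cluster_id AS cluster_id,
  SUM(usage_quantity)       AS dbus
FROM system.billing.usage
WHERE usage_metadata.job_id IS NOT NULL
  AND sku_name LIKE '%ALL_PURPOSE%'
  AND usage_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY ALL
ORDER BY dbus DESC;
```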
Photon is an optimized vectorized engine for Spark on the Databricks Data Intelligence Platform that provides extremely fast query performance. Photon increases the amount of DBUs the cluster costs by a multiple of 2.9x for job clusters, and approximately 2x for All-Purpose clusters. Despite the DBU multiplier, Photon can yield a lower overall TCO for jobs by reducing the runtime duration.
Interactive clusters, on the other hand, may have significant amounts of idle time when users are not running commands, and because Photon raises the DBU rate, that idle time can translate into higher costs, so Photon is not always the right choice for all-purpose compute. Please ensure all-purpose clusters have the auto-termination setting applied to minimize this idle compute cost. This also makes Serverless notebooks a great fit, as they minimize idle spend, run with Photon for the best performance, and can spin up the session in just a few seconds.
Similarly, Photon isn’t always beneficial for continuous streaming jobs that are up 24/7. Monitor whether you are able to reduce the number of worker nodes required when using Photon, as this lowers TCO; otherwise, Photon may not be a good fit for Continuous jobs.
Note: The following query can be used to find interactive clusters that are configured with Photon:
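(A minimal sketch against system.compute.clusters; the exact filters may vary for your account.)

```sql
-- Interactive (UI/API-created) clusters whose runtime includes Photon
SELECT DISTINCT
  workspace_id,
  cluster_id,
  cluster_name,
  dbr_version
FROM system.compute.clusters
WHERE cluster_source IN ('UI', 'API')  -- interactive clusters, not job or pipeline compute
  AND dbr_version LIKE '%photon%'
  AND delete_time IS NULL;             -- cluster definition not deleted
```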
There are too many strategies for optimizing data, storage, and Spark to cover here. Fortunately, Databricks has compiled these into the Comprehensive Guide to Optimize Databricks, Spark and Delta Lake Workloads, covering everything from data layout and skew to optimizing delta merges and more. Databricks also provides the Big Book of Data Engineering with more tips for performance optimization.
Organizational structure and ownership best practices are just as important as the technical solutions covered above.
Digital natives running highly effective FinOps practices that include the Databricks Platform usually prioritize clear ownership of platform costs within the organization, and one of the most successful structures for this is a FinOps center of excellence. A center of excellence has many benefits, such as centralizing core platform administration and empowering business units with safe, reusable assets such as policies and bundle templates.
The center of excellence often puts teams such as Data Platform, Platform Engineering, or DataOps at the center, or “hub,” in a hub-and-spoke model. This team is responsible for allocating and reporting costs with the Usage Dashboard. To deliver an optimal and cost-aware self-service environment for teams, the platform team should create compute policies and budget policies tailored to use cases and/or business units (the “spokes”). While not required, we recommend managing these artifacts with Terraform and VCS for strong consistency, versioning, and the ability to modularize.
This has been a fairly exhaustive guide to help you take control of your costs with Databricks, so we have covered several things along the way. To recap, the crawl-walk-run journey is: attribute costs with consistent tagging (including Serverless Budget Policies for serverless compute), report on spend with system tables, the Usage Dashboard, and budgets, put guardrails in place with compute policies, statement timeouts, and AI Gateway rate limits, and then continuously optimize your compute, workloads, and storage.
Get started today and create your first Compute Policy, or use one of our policy examples. Then, import the Usage Dashboard as your main stop for reporting and forecasting Databricks spending, and check off the optimizations we shared earlier for your clusters, workspaces, and data.
Databricks Delivery Solutions Architects (DSAs) accelerate Data and AI initiatives across organizations. They provide architectural leadership, optimize platforms for cost and performance, enhance developer experience, and drive successful project execution. DSAs bridge the gap between initial deployment and production-grade solutions, working closely with various teams, including data engineering, technical leads, executives, and other stakeholders to ensure tailored solutions and faster time to value. To benefit from a custom execution plan, strategic guidance, and support throughout your data and AI journey from a DSA, please contact your Databricks Account Team.