Understanding the value of your AI and data investments is crucial—yet over 52% of enterprises fail to measure Return on Investment (ROI) rigorously [Futurum]. Complete ROI visibility requires connecting platform usage and cloud infrastructure into a clear financial picture. Often, the data is available but fragmented, as today’s data platforms must support a growing range of storage and compute architectures.
On Databricks, customers are managing multicloud, multi-workload and multi-team environments. In these environments, having a consistent, comprehensive view of cost is essential for making informed decisions.
At the core of cost visibility on platforms like Databricks is the concept of Total Cost of Ownership (TCO).
On multicloud data platforms like Databricks, TCO consists of two core components: the Databricks platform costs (the DBUs consumed by the products you use) and the cloud infrastructure costs (the compute, storage and networking charges billed by your cloud provider).
Understanding TCO is simplified when using serverless products. Because compute is managed by Databricks, the cloud infrastructure costs are bundled into the Databricks costs, giving you centralized cost visibility directly in Databricks system tables (though storage costs will still be with the cloud provider).
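As a quick illustration of that centralized visibility, here is a minimal sketch that estimates serverless spend per workspace and day directly from the system tables. It uses system.billing.usage and system.billing.list_prices; note that list prices do not reflect negotiated discounts, and the SKU filter is a simplification.

```python
# Minimal sketch: estimate serverless spend from Databricks system tables.
# List prices do not include negotiated discounts, so treat results as estimates.
serverless_cost = spark.sql("""
    SELECT
      u.workspace_id,
      u.usage_date,
      u.sku_name,
      SUM(u.usage_quantity * p.pricing.default) AS estimated_list_cost
    FROM system.billing.usage u
    JOIN system.billing.list_prices p
      ON u.sku_name = p.sku_name
     AND u.cloud = p.cloud
     AND u.usage_start_time >= p.price_start_time
     AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
    WHERE u.sku_name LIKE '%SERVERLESS%'          -- simplification: match serverless SKUs by name
    GROUP BY u.workspace_id, u.usage_date, u.sku_name
    ORDER BY u.usage_date DESC
""")
display(serverless_cost)
```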
Understanding TCO for classic compute products, however, is more complex. Here, customers manage compute directly with the cloud provider, meaning both Databricks platform costs and cloud infrastructure costs need to be reconciled. In these cases, there are two distinct data sources to bring together: Databricks system tables, which capture DBU usage and workload-level metadata, and the cloud provider’s billing data, which captures the underlying infrastructure charges.
Together, these sources form the full TCO view. As your environment grows across many clusters, jobs, and cloud accounts, understanding these datasets becomes a critical part of cost observability and financial governance.
The complexity of measuring your Databricks TCO is compounded by the disparate ways cloud providers expose and report cost data. Understanding how to join these datasets with system tables to produce accurate cost KPIs requires deep knowledge of cloud billing mechanics that many Databricks-focused platform admins may not have. Below, we take a deep dive into measuring TCO for Azure Databricks and Databricks on AWS.
Because Azure Databricks is a first-party service within the Microsoft Azure ecosystem, Databricks-related charges appear directly in Azure Cost Management alongside other Azure services, complete with Databricks-specific tags. These costs are visible both in the Azure Cost analysis UI and in exported Cost Management data.
However, Azure Cost Management data will not contain the deeper workload-level metadata and performance metrics found in Databricks system tables. Thus, many organizations seek to bring Azure billing exports into Databricks.
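To give a sense of what that joining effort involves, here is a minimal sketch that attributes exported Azure VM spend to clusters via the tags Azure Databricks applies to its managed VMs, then joins it to system.billing.usage. The bronze table name (finops.bronze.azure_cost_export) and the export columns and tag format (tags, date, costInBillingCurrency) are assumptions; actual export schemas vary by export type and version, so the parsing must be adapted to your export.

```python
# Minimal sketch: attribute Azure VM spend to Databricks clusters and join it to DBU usage.
# Export table name, column names and tag format are assumptions -- adjust to your export.
azure_tco = spark.sql("""
    WITH vm_costs AS (
      SELECT
        get_json_object(concat('{', tags, '}'), '$.ClusterId') AS cluster_id,  -- assumed tag format
        to_date(`date`)                                        AS usage_date,
        SUM(costInBillingCurrency)                             AS infra_cost
      FROM finops.bronze.azure_cost_export      -- hypothetical bronze table holding the export
      WHERE tags LIKE '%ClusterId%'
      GROUP BY 1, 2
    ),
    dbu_usage AS (
      SELECT usage_metadata.cluster_id AS cluster_id,
             usage_date,
             SUM(usage_quantity)       AS dbus
      FROM system.billing.usage
      WHERE usage_metadata.cluster_id IS NOT NULL
      GROUP BY 1, 2
    )
    SELECT * FROM vm_costs FULL OUTER JOIN dbu_usage USING (cluster_id, usage_date)
""")
display(azure_tco)
```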
Yet fully joining these two data sources is time-consuming and requires deep domain knowledge, work that most customers simply don't have the time to define, maintain and replicate. Several challenges contribute to this:
On AWS, Databricks costs do appear in the Cost and Usage Report (CUR) and in AWS Cost Explorer, but unlike Azure they are represented at a more aggregated SKU level. Moreover, Databricks costs appear in the CUR only when Databricks is purchased through the AWS Marketplace; otherwise, the CUR reflects only AWS infrastructure costs.
As a result, understanding how to co-analyze the AWS CUR alongside system tables is even more critical for customers with AWS environments. This allows teams to analyze infrastructure spend, DBU usage and discounts together with cluster- and workload-level context, creating a more complete TCO view across AWS accounts and regions.
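As a sketch of what that co-analysis can look like, the example below pulls Databricks-tagged EC2 line items out of a CUR 2.0 export and lines them up with DBU usage per cluster and day. The bronze table name (finops.bronze.aws_cur) and the tag keys in the resource_tags map are assumptions; tag key naming depends on which cost allocation tags are activated and how the export is configured.

```python
# Minimal sketch: line up Databricks-tagged EC2 spend from CUR 2.0 with DBU usage per cluster/day.
# Bronze table name and resource_tags keys are assumptions -- adjust to your CUR export.
cur_by_cluster = spark.sql("""
    SELECT
      resource_tags['user_ClusterId']          AS cluster_id,     -- assumed tag key naming
      DATE(line_item_usage_start_date)         AS usage_date,
      SUM(line_item_unblended_cost)            AS ec2_cost
    FROM finops.bronze.aws_cur                  -- hypothetical bronze table holding CUR 2.0 data
    WHERE line_item_product_code = 'AmazonEC2'
      AND resource_tags['user_Vendor'] = 'Databricks'             -- Databricks-managed instances
    GROUP BY 1, 2
""")

dbu_by_cluster = spark.sql("""
    SELECT usage_metadata.cluster_id AS cluster_id,
           usage_date,
           SUM(usage_quantity)       AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.cluster_id IS NOT NULL
    GROUP BY 1, 2
""")

tco_view = cur_by_cluster.join(dbu_by_cluster, ["cluster_id", "usage_date"], "full_outer")
display(tco_view)
```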
Yet joining the AWS CUR with system tables can also be challenging. Common pain points include:
In production-scale Databricks environments, cost questions quickly move beyond overall spend. Teams want to understand cost in context—how infrastructure and platform usage connect to real workloads and decisions. Common questions include:
Answering these questions requires bringing together financial data from cloud providers with operational metadata from Databricks. Yet as described above, teams need to maintain bespoke pipelines and a detailed knowledge base of cloud and Databricks billing to accomplish this.
To support this need, Databricks is introducing the Cloud Infra Cost Field Solution, an open source solution that automates ingestion and unified analysis of cloud infrastructure and Databricks usage data, inside the Databricks Platform.
By providing a unified foundation for TCO analysis across Databricks serverless and classic compute environments, the Field Solution helps organizations gain clearer cost visibility and understand architectural trade-offs. Engineering teams can track cloud spend and discounts, while finance teams can identify the business context and ownership of top cost drivers.
In the next section, we’ll walk through how the solution works and how to get started.
Although the components may have different names, the Cloud Infra Cost Field Solutions for Azure and AWS customers share the same principles and can be broken down into the following components:
Both the AWS and Azure Field Solutions are excellent for organizations that operate within a single cloud, but they can also be combined for multicloud Databricks customers using Delta Sharing.
The Cloud Infra Cost Field Solution for Azure Databricks consists of the following architecture components:
Azure Databricks Solution Architecture
To deploy this solution, admins must have the following permissions across Azure and Databricks:
The GitHub repository provides more detailed setup instructions; however, at a high level, the solution for Azure Databricks has the following steps:
[Azure] Configure an Azure Cost Management export to deliver Azure billing data to the Storage Account and confirm that data is exporting successfully
Storage Account with Azure Cost Management Export Configured
AI/BI Dashboard Displaying Azure Databricks TCO
The solution for Databricks on AWS consists of several architecture components that work together to ingest AWS Cost & Usage Report (CUR) 2.0 data and persist it in Databricks using the medallion architecture.
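As a rough illustration of the ingestion layer, the sketch below incrementally loads CUR 2.0 parquet files from S3 with Auto Loader into the hypothetical bronze table used in the earlier example. The bucket path, schema/checkpoint locations and table name are assumptions; the actual Field Solution pipelines may be structured differently.

```python
# Minimal sketch: incrementally ingest CUR 2.0 parquet files into a bronze table with Auto Loader.
# Paths and table names are hypothetical placeholders.
from pyspark.sql import functions as F

cur_path = "s3://<billing-bucket>/cur2/data/"  # hypothetical data export location

(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/Volumes/finops/bronze/_schemas/aws_cur")   # hypothetical
    .load(cur_path)
    .withColumn("_ingested_at", F.current_timestamp())
    .writeStream
    .option("checkpointLocation", "/Volumes/finops/bronze/_checkpoints/aws_cur")      # hypothetical
    .trigger(availableNow=True)
    .toTable("finops.bronze.aws_cur"))
```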
To deploy this solution, the following permissions and configurations must be in place across AWS and Databricks:
The GitHub repository provides more detailed setup instructions; however, at a high level, the solution for Databricks on AWS has the following steps.
As the Azure and AWS solutions demonstrate, this kind of solution enables many real-world use cases, such as:
As a practical example, a FinOps practitioner at a large organization with thousands of workloads might be tasked with finding low-hanging fruit for optimization by looking for workloads that incur significant cost but show low CPU and/or memory utilization. Since the organization’s TCO information is now surfaced via the Cloud Infra Cost Field Solution, the practitioner can join that data to the Node Timeline System Table (AWS, AZURE, GCP) to surface these workloads and accurately quantify the cost savings once the optimizations are complete; a sketch of such a query follows below. The questions that matter most will depend on each customer’s business needs. For example, General Motors uses this type of solution to answer many of the questions above and more to ensure they are getting the maximum value from their lakehouse architecture.
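Here is a minimal sketch of that query. It assumes the Field Solution produces a per-cluster daily TCO table (finops.gold.cluster_daily_tco is a hypothetical name and schema) and uses the real system.compute.node_timeline system table; the thresholds are purely illustrative.

```python
# Minimal sketch: find clusters with high daily TCO but low average CPU utilization.
# finops.gold.cluster_daily_tco and its columns are hypothetical stand-ins for the solution's output.
candidates = spark.sql("""
    WITH utilization AS (
      SELECT
        cluster_id,
        DATE(start_time)                            AS usage_date,
        AVG(cpu_user_percent + cpu_system_percent)  AS avg_cpu_pct,
        AVG(mem_used_percent)                       AS avg_mem_pct
      FROM system.compute.node_timeline
      GROUP BY 1, 2
    )
    SELECT t.cluster_id, t.usage_date, t.total_cost, u.avg_cpu_pct, u.avg_mem_pct
    FROM finops.gold.cluster_daily_tco t            -- hypothetical output table of the solution
    JOIN utilization u USING (cluster_id, usage_date)
    WHERE t.total_cost > 100                        -- illustrative cost threshold
      AND u.avg_cpu_pct < 20                        -- illustrative utilization threshold
    ORDER BY t.total_cost DESC
""")
display(candidates)
```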
After implementing the Cloud Infra Cost Field Solution, organizations gain a single, trusted TCO view that combines Databricks and related cloud infrastructure spend, eliminating the need for manual cost reconciliation across platforms. Examples of questions you can answer using the solution include:
Platform and FinOps teams can drill into full costs by workspace, workload and business unit directly in Databricks, making it far easier to align usage with budgets, accountability models and FinOps practices. Because all underlying data is available as governed tables, teams can build their own cost applications, such as dashboards and internal apps, or use built-in AI assistants like Databricks Genie, accelerating insight generation and turning FinOps from a periodic reporting exercise into an always-on, operational capability.
Deploy the Cloud Infra Cost Field Solution today from GitHub (link here, available on AWS and Azure), and get full visibility into your total Databricks spend. With full visibility in place, you can optimize your Databricks costs, including considering serverless for automated infrastructure management.
The dashboard and pipeline created as part of this solution offer a fast and effective way to begin analyzing Databricks spend alongside the rest of your infrastructure costs. However, every organization allocates and interprets charges differently, so you may choose to further tailor the models and transformations to your needs. Common extensions include joining infrastructure cost data with additional Databricks System Tables (AWS | AZURE | GCP) to improve attribution accuracy, building logic to separate or reallocate shared VM costs when using instance pools, modeling VM reservations differently or incorporating historical backfills to support long-term cost trending. As with any hyperscaler cost model, there is substantial room to customize the pipelines beyond the default implementation to align with internal reporting, tagging strategies and FinOps requirements.
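For example, one common tagging-strategy extension is to roll usage up by a custom cost-allocation tag. The sketch below assumes a 'cost_center' tag key, which is purely illustrative; substitute whatever tag convention your organization enforces.

```python
# Minimal sketch: summarize DBU usage by a custom cost-allocation tag.
# The 'cost_center' tag key is an assumption -- use your organization's tag convention.
by_cost_center = spark.sql("""
    SELECT
      custom_tags['cost_center']            AS cost_center,   -- assumed tag key
      DATE_TRUNC('month', usage_date)       AS month,
      SUM(usage_quantity)                   AS dbus
    FROM system.billing.usage
    GROUP BY 1, 2
    ORDER BY month DESC, dbus DESC
""")
display(by_cost_center)
```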
Databricks Delivery Solutions Architects (DSAs) accelerate Data and AI initiatives across organizations. They provide architectural leadership, optimize platforms for cost and performance, enhance developer experience, and drive successful project execution. DSAs bridge the gap between initial deployment and production-grade solutions, working closely with various teams, including data engineering, technical leads, executives, and other stakeholders to ensure tailored solutions and faster time to value. To benefit from a custom execution plan, strategic guidance and support throughout your data and AI journey from a DSA, please contact your Databricks Account Team.