CUSTOMER STORY

From data chaos to a trust score for every table

100K+ tables

Now classified and governed automatically, eliminating manual audit cycles

Exabyte scale

Data Governance Score gives every dataset a measurable trust rating tied to AI readiness

10,000+

Databricks employees, all using the platform daily, from engineering to finance

The Databricks internal lakehouse holds over 100,000 tables spanning engineering, go-to-market, HR, finance and product telemetry. Before Unity Catalog, governance was inconsistent: access patterns varied by team, classifications were incomplete, and no one could confidently say which datasets were safe for AI training. The Data Platform team rebuilt governance from first principles, making it automatic, measurable and ready for the agentic AI era. Here is how they did it.

When no one can answer 'is this data governed?'

Databricks was scaling fast and its data was scaling faster. Over 100,000 tables sat across dozens of teams, workspaces and catalogs. Each team managed access differently. Classifications were inconsistent or missing entirely.

The friction was not theoretical. Security, legal and engineering teams spent hours chasing spreadsheets to answer basic questions during compliance reviews. New AI projects stalled because no one could confirm whether a dataset met privacy requirements. When a team wanted to train a model, there was no systematic way to check whether the training data was classified, governed or even current.

“Everyone had read and write access to every single table, every single job,” says Bruce Wong, Senior Director of Engineering and Head of Data Platform at Databricks. “There wasn’t just no single source of truth. There were many sources, and there was no truth.” Wong describes what the team inherited as a “data swamp,” siloed, ungoverned and unable to scale.

The challenge was not a lack of good intentions. Teams cared about data quality. But the governance model that worked at 10 people and a gigabyte does not work at 10,000 people and an exabyte. Without a unified system, governance was a patchwork of manual processes that could not keep up.

Meanwhile, the Data Platform team was overhauling the data estate, from SaaS data ingestion through Lakeflow Connect to multicloud portability to federated analytics across business domains. To keep up with the company’s growth, each line of business needed reliable, governed data to operate. The platform team needed a solution that would work not just for the engineers building pipelines, but for every analyst, marketer, finance lead and sales operator across the company. Li Yang, Senior Engineering Manager at Databricks, knew this challenge intimately, having spent a decade at other tech companies navigating similar scaling pains. “In a B2B world, trust isn’t a feature. Trust is the product,” says Yang. “We needed to move away from a model that felt like trying to fix an airplane while flying it.”

Classification, access controls and a score for every dataset

The team started as alpha customers of Unity Catalog with a guiding principle: every dataset should be trusted, governed, and AI-ready by default—through guardrails, not gates. Unity Catalog enabled them to make policy and governance decisions at the catalog level rather than managing individual users and tables, a paradigm shift that fundamentally changed how they approached the lakehouse.

First, they built a universal classification model inside Unity Catalog. Every table is tagged by data sensitivity (Public, Internal, Confidential, Restricted, Highly Restricted) and by domain (Engineering, GTM, HR, Finance, etc.). These tags map to governance zones that determine access enforcement. Second, they automated access controls. An internal platform called Fortress, built on Unity Catalog’s security APIs, enforces purpose-bound, time-limited access. No more permanent entitlements. Every request states a reason, expires on schedule and is fully auditable.
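
The purpose-bound, time-limited access pattern described above can be illustrated with a minimal Python sketch. This is not Fortress or a Unity Catalog API: the `AccessGrant` type, the `request_access` and `is_active` helpers, and the per-tier durations in `MAX_GRANT_DURATION` are all hypothetical, chosen only to show how a stated reason and an expiry can be enforced on every grant. Only the five sensitivity tier names come from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Sensitivity tiers named in the article. The durations are hypothetical
# placeholders, not Databricks policy.
MAX_GRANT_DURATION = {
    "Public": timedelta(days=365),
    "Internal": timedelta(days=90),
    "Confidential": timedelta(days=30),
    "Restricted": timedelta(days=7),
    "Highly Restricted": timedelta(days=1),
}


@dataclass(frozen=True)
class AccessGrant:
    """An auditable record: who, what, why, and until when."""
    user: str
    table: str
    reason: str
    expires_at: datetime


def request_access(user: str, table: str, sensitivity: str,
                   reason: str, now: datetime) -> AccessGrant:
    """Issue a grant that always carries a reason and an expiry."""
    if sensitivity not in MAX_GRANT_DURATION:
        raise ValueError(f"unknown sensitivity tier: {sensitivity!r}")
    if not reason.strip():
        raise ValueError("every request must state a reason")
    expires_at = now + MAX_GRANT_DURATION[sensitivity]
    return AccessGrant(user, table, reason.strip(), expires_at)


def is_active(grant: AccessGrant, now: datetime) -> bool:
    """A grant is valid only before its expiry; nothing is permanent."""
    return now < grant.expires_at


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    grant = request_access("ana", "hr.payroll", "Restricted",
                           "quarterly compliance audit", now)
    print(grant.expires_at, is_active(grant, now))
```

Because the grant is an immutable record with a mandatory reason and expiry, an audit in this sketch amounts to listing grants, and revocation happens by default when the clock passes `expires_at`.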

To measure trust at scale, the team built the Data Governance Score, which provides a continuous metric from 0 to 100 for every dataset. The score evaluates three pillars: documentation (are columns described and tables annotated?), reliability (do data quality checks pass?), and governance (is the data classified and access-controlled?). The Data Governance Score propagates downstream through lineage: if an upstream table loses trust, every table that depends on it reflects that change automatically. Amit Pahwa, a Staff Software Engineer on the Databricks Data Platform team, helped engineer the Data Governance Score to move the company away from relying on tribal knowledge. He notes that without structured data, “trust is the difference between like going by facts or going by your gut.”
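
The article does not publish the scoring formula, so the following is only a sketch of how a three-pillar score with lineage-aware propagation might work. It assumes equal weights for the three pillars and assumes that a table's effective score is capped at the lowest effective score among its upstream tables. The `Table` type and the `own_score` and `effective_scores` functions are hypothetical names, and the lineage graph is assumed to be acyclic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    documentation: float  # 0.0-1.0: how well columns and tables are described
    reliability: float    # 0.0-1.0: share of data quality checks passing
    governance: float     # 0.0-1.0: classified and access-controlled
    upstream: list[str] = field(default_factory=list)  # lineage parents


def own_score(table: Table) -> float:
    """Equal-weight average of the three pillars on a 0-100 scale.

    The equal weighting is an assumption, not the published formula.
    """
    pillars = table.documentation + table.reliability + table.governance
    return 100.0 * pillars / 3.0


def effective_scores(tables: dict[str, Table]) -> dict[str, float]:
    """Cap each table's score at the lowest score found upstream of it.

    Assumes the lineage graph has no cycles.
    """
    cache: dict[str, float] = {}

    def visit(name: str) -> float:
        if name not in cache:
            table = tables[name]
            score = own_score(table)
            for parent in table.upstream:
                score = min(score, visit(parent))
            cache[name] = score
        return cache[name]

    for name in tables:
        visit(name)
    return cache


if __name__ == "__main__":
    lineage = {
        "raw": Table("raw", 0.5, 0.5, 0.5),
        "silver": Table("silver", 1.0, 1.0, 1.0, upstream=["raw"]),
        "gold": Table("gold", 1.0, 1.0, 1.0, upstream=["silver"]),
    }
    print(effective_scores(lineage))
```

In this sketch, a perfectly documented `gold` table still inherits the weaker score of `raw`, which is one way to read "every table that depends on it reflects that change automatically." A weighted blend of a table's own score and its upstream scores would be an equally plausible design.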

Beyond the core governance layer, additional data platform workstreams extend the platform’s reach. Lakeflow Connect handles secure SaaS ingestion at scale, replacing fragile connectors with governed pipelines that land data directly into the medallion architecture. A multicloud platform ensures workloads are portable and secure across AWS, Azure and GCP. A federated analytics layer gives each business domain, from finance to marketing to sales operations, self-service access to governed data through tools like AI/BI Genie and Databricks Apps. A dedicated metric store built on Unity Catalog Metric Views ensures that teams across the company share a single definition for every key metric.

The implementation of Fortress was a cultural shift for the company. “Before Fortress, requesting data access felt like a black box for users,” says Yang. “Now, it’s a self-service, transparent process where policy is enforced automatically, freeing up engineers to focus on innovation rather than manually approving tickets.”

Agentic AI demands the governance they already built

The Data Governance Score gave the Data Platform team something they had never had: real-time visibility into which domains were compliant and which needed attention. Every one of the 100,000-plus tables is now scanned continuously and assigned a 0-to-100 trust rating. Instead of compliance reviews triggering fire drills, the team spots weak areas in real time and addresses them proactively. Security and legal teams spend their time preventing risks rather than chasing spreadsheets.

The broader impact extends beyond compliance. Trusted data accelerates every downstream workstream. When a team wants to train a model, they check the governance score and know immediately whether the data meets the bar—no manual review, no waiting for approvals. The metric store ensures that when leadership asks a question about pipeline, revenue or usage, the answer is the same regardless of who runs the query or which tool they use.

But the transformation that matters most is the one the team did not plan for. Wong frames it bluntly: “Data governance is table stakes for agentic AI.” A human will never browse 100,000 tables. An AI agent will find the stale data, the over-permissioned table or the unclassified dataset that a human would never stumble across. “AI is more likely to find bad data and give you an answer based on bad data than it is to actually hallucinate at this point,” Wong warns. The governance infrastructure the team spent years building, from the classification model to the Data Governance Score and the lineage-aware trust propagation, now serves as the guardrail layer for agents across the entire company.

Today, every function at Databricks, from engineering and finance to marketing, sales and even facilities, is using agents built on this governed foundation. The amount of data generated by agentic AI is growing at a rate Wong expects will eclipse that of human-generated data. Because the governance structures were designed for humans at scale, they translate directly to agents at scale. The team that rebuilt governance from first principles is now the team enabling the company’s AI future.