• Databricks Data Classification makes it easy to continuously discover sensitive data and eliminate compliance blind spots across your entire data estate.
• Data classification leverages agentic AI to automatically identify and tag PII at scale, keeping sensitive data visible, auditable, and governed as new tables and columns are created.
• Teams can use data classification to automate protection with ABAC, enforce consistent access policies, and confidently share data without increasing risk.
As organizations scale their data platforms, sensitive information often hides in plain sight. New tables land every day, the regulatory landscape grows increasingly complex, and the stakes are higher than ever. According to the GDPR Enforcement Tracker Report, GDPR fines alone exceeded €5.6 billion in 2025, a growth of €1.17 billion since 2024.
Manual discovery methods simply don’t scale. What worked for hundreds of tables fails at thousands. The result? Compliance blind spots, costly audits, and stalled democratization of data. The fundamental problem is that you simply can’t protect what you can’t find.
Today, we’re excited to announce the Public Preview of Databricks Data Classification on AWS, Azure, and GCP.
Data Classification uses an agentic AI system to automatically discover and tag sensitive data across all your catalogs. It provides continuous visibility into where personally identifiable information (PII) resides, enabling you to stay compliant, automate protection, and confidently share data across teams, even as your data grows.
Data Classification delivers comprehensive, automated PII detection across our expanding data environment, ensuring sensitive information is clearly identified and enabling consistent protection. This approach not only helps secure sensitive assets but also reduces manual workloads. As we're rolling this out more broadly, we're looking forward to freeing up our teams for higher-value initiatives. — Gregg Rinsler, Sr. Director of Data Governance, FanDuel
With automated classification in place, your teams can shift from manual classification to strategic governance:
Every data team's currency is trust, which is consistency over time. Data Classification helps deliver that trust by scanning our data estate for PII and automating remediation workflows. The result is verified, compliant data that teams can confidently rely upon. — Sam Shah, VP of Engineering, Databricks Data Team

Data Classification is designed to bring automated, agentic discovery that covers all your data. Here’s how we do it:
Agentic AI for precise classification: Combines proven pattern recognition, metadata, and large language models, achieving up to 60% higher accuracy than regex-only tools. Your data never leaves your environment, in line with Databricks AI security controls (AWS | Azure | GCP).
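To make the hybrid approach concrete, here is a minimal sketch of how pattern matching and an LLM signal can be combined: regex patterns handle well-structured values, and an LLM judgment on metadata (stubbed here as a precomputed score) arbitrates when patterns are inconclusive. The tag names follow this post's examples; the function, thresholds, and LLM stub are illustrative assumptions, not the actual Databricks implementation.

```python
import re

# Illustrative regex patterns for structured PII values.
PATTERNS = {
    "class.email_address": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "class.phone_number": re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def classify_column(name, sample_values, llm_score=None):
    """Return (tag, confidence) for a column, or (None, 0.0).

    Patterns run first; an LLM score over metadata (column name,
    comments) is a hypothetical fallback for ambiguous columns.
    """
    for tag, pattern in PATTERNS.items():
        matches = sum(bool(pattern.match(v)) for v in sample_values)
        if sample_values and matches / len(sample_values) >= 0.8:
            return tag, matches / len(sample_values)
    if llm_score is not None and llm_score >= 0.7:
        return "class.name", llm_score  # hypothetical LLM verdict
    return None, 0.0

tag, conf = classify_column("email", ["a@ex.com", "b@ex.org", "c@ex.net"])
```

In a real system the LLM step is what lifts accuracy on free-text and oddly named columns, where regex alone produces the false positives and misses that the 60% figure refers to.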
Efficient and intelligent scanning for enterprise scale: Scans your entire catalog once, then rescans only new or changed tables and columns. Unity Catalog lineage prioritizes critical datasets for incremental scanning, so PII is caught as it appears. Since our initial Beta launch, we’ve significantly improved detection speed and reduced scanning costs by up to 75%. This system is battle-tested to ensure high performance as your data platform grows.
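The incremental strategy above can be sketched as a simple planner: scan everything on the first pass, then select only tables that are new or whose version has advanced since the last scan. The table registry and version numbers here are hypothetical stand-ins for what Unity Catalog tracks internally.

```python
def plan_scan(tables, last_scanned):
    """Return the subset of tables that need (re)scanning.

    tables: dict of table name -> current version
    last_scanned: dict of table name -> version at last scan
    """
    to_scan = []
    for name, version in tables.items():
        if name not in last_scanned:        # new table: needs a full scan
            to_scan.append(name)
        elif version > last_scanned[name]:  # changed since last scan
            to_scan.append(name)
    return to_scan

current = {"sales.orders": 12, "sales.customers": 7, "hr.employees": 3}
previous = {"sales.orders": 12, "sales.customers": 5}
# Only the changed and new tables are rescanned; unchanged ones are skipped.
assert plan_scan(current, previous) == ["sales.customers", "hr.employees"]
```

Skipping unchanged tables is where the cost savings come from: at enterprise scale, the vast majority of tables are untouched between scans.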
Review and validation: Get complete visibility into the columns containing PII and who currently has access to them. Our focused review UI surfaces high-confidence detections with sample data, letting you easily bulk-apply tags. Full results are stored in system tables for custom reporting or tagging.
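Because full results land in system tables, you can build your own review logic on top of them. The sketch below assumes a hypothetical record shape for detection results (the actual system table and column names may differ; consult the documentation) and filters for high-confidence detections worth bulk-tagging.

```python
# Hypothetical shape of classification results pulled from system tables.
detections = [
    {"table": "sales.customers", "column": "email",
     "tag": "class.email_address", "confidence": 0.97},
    {"table": "sales.customers", "column": "notes",
     "tag": "class.name", "confidence": 0.41},
    {"table": "hr.employees", "column": "phone",
     "tag": "class.phone_number", "confidence": 0.93},
]

def high_confidence(detections, threshold=0.9):
    """Select detections confident enough to bulk-apply as tags."""
    return [d for d in detections if d["confidence"] >= threshold]

to_tag = high_confidence(detections)
# Each selected detection could then be applied as a column tag,
# e.g. via an ALTER TABLE ... SET TAGS statement in SQL.
```

Lower-confidence detections (like the 0.41 example above) are exactly what the review UI is for: a human confirms or dismisses them before any tag is applied.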
Data Classification is transforming our compliance approach by automating PII detection. We use classification results along with an authorization workflow via Databricks Apps to enable Just-In-Time access controls. This allows us to keep sensitive data accessible only when needed. We eliminated the manual effort and instead created automated detection and protection across all of our data residing in the Databricks Platform. — Abhijit Joshi, Staff Data Engineer, Oportun

Once you know where sensitive data lives, it’s easier to protect, and access can scale safely.
Scale governance with ABAC policies: Attribute-Based Access Control (ABAC) policies automatically mask or encrypt sensitive columns. For example, set up a policy that masks all columns tagged as `class.name`, `class.email_address`, and `class.phone_number` for everyone except your security team. Once configured, this policy automatically applies to data tagged as sensitive, ensuring consistent data protection that scales with your business.

Use ABAC to securely open up access: Consider the customer transactions table in the example above, which might contain both sensitive columns (e.g., customer_name, email, phone) and non-sensitive columns (e.g., transaction_id or customer_id). ABAC policies mask only the sensitive columns while leaving non-sensitive fields open. There is no need to block entire tables or maintain complex view logic.
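The column-level behavior described above can be sketched as follows: columns whose tags fall in the sensitive set are masked unless the caller belongs to an exempt group, while untagged columns pass through untouched. The tag names mirror this post's examples; the policy function, group names, and mask string are illustrative assumptions, not the ABAC engine itself.

```python
# Tags treated as sensitive, matching the policy example in this post.
SENSITIVE_TAGS = {"class.name", "class.email_address", "class.phone_number"}

def apply_policy(row, column_tags, user_groups, exempt_group="security-team"):
    """Return the row with sensitive columns masked unless the user is exempt."""
    if exempt_group in user_groups:
        return dict(row)  # security team sees everything
    return {
        col: "****" if column_tags.get(col) in SENSITIVE_TAGS else value
        for col, value in row.items()
    }

tags = {"customer_name": "class.name", "email": "class.email_address",
        "transaction_id": None, "customer_id": None}
row = {"transaction_id": "T-1001", "customer_id": "C-42",
       "customer_name": "Ada Lovelace", "email": "ada@example.com"}

masked = apply_policy(row, tags, user_groups={"analysts"})
# transaction_id and customer_id pass through; customer_name and email are masked
```

This is why ABAC scales better than view logic: the policy is written once against tags, and any newly classified column is covered automatically without touching the table or its consumers.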

Here's what's on our roadmap in the coming months:
Ready to transform manual processes into automated Data Classification? Get started with our resources below:
