Today we are excited to announce that Unity Catalog, a unified governance solution for all data assets on the Lakehouse, will be generally available on AWS and Azure in the coming weeks. Until then, you can apply for the public preview or reach out to a member of your Databricks account team.
In a previous blog, we set out our vision for a governed lakehouse and how Unity Catalog can help customers simplify governance at scale. This blog will explore the most recent updates to Unity Catalog and our growing partner ecosystem.
What’s new with Unity Catalog for the Data+AI Summit 2022?
Automated Data Lineage for all workloads
Unity Catalog now automatically tracks data lineage across queries executed in any language. Lineage is captured at both the table and column level, and the notebooks, dashboards, and jobs that read or write the data are tracked as well. Lineage opens up several use cases, including assessing the impact that changes to a table will have on your data consumers and auto-generating documentation that consumers can use to understand data in the lakehouse. For more information, see our recent blog post.
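As a rough illustration of the impact-analysis use case, the query below lists the tables that read directly from a given source table. It is a sketch that assumes lineage is exposed as a queryable table, `system.access.table_lineage`, with `source_table_full_name` and `target_table_full_name` columns, and that this table is enabled in your account. The table name `main.sales.orders` is hypothetical.

```sql
-- Assumption: lineage events are queryable in system.access.table_lineage.
-- List the direct downstream tables of a hypothetical source table.
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_full_name = 'main.sales.orders'
  AND target_table_full_name IS NOT NULL;
```

This covers a single hop only. Following the results further downstream means repeating the lookup for each returned table.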
Built-in Data Search and Discovery
Unity Catalog now includes a built-in search capability. Once data is registered in Unity Catalog, end users can easily search across metadata fields including table names, column names, and comments to find the data they need for their analysis. This search capability automatically leverages the governance model put in place by Unity Catalog. Users will only see search results for data they have access to, which serves as a productivity boost for the user, and a critical control for data administrators who want to ensure that sensitive data is protected.
Search and Discovery in Unity Catalog
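A similar permission-aware lookup can be approximated in SQL. This is not the built-in search feature itself. It is a sketch that assumes the Unity Catalog information schema (`system.information_schema`) is available in your workspace and that, like search, it lists only objects the current user is allowed to see. The search term `churn` is a placeholder.

```sql
-- Find columns whose name or comment mentions a placeholder search term.
-- Assumption: results are limited to objects the current user can access.
SELECT table_catalog, table_schema, table_name, column_name, comment
FROM system.information_schema.columns
WHERE lower(column_name) LIKE '%churn%'
   OR lower(comment) LIKE '%churn%'
ORDER BY table_catalog, table_schema, table_name, column_name;
```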
Simplified access controls with privilege inheritance
Unity Catalog offers a simple model to control access to data via a UI or SQL. We have now extended this model so that data admins can set up access to thousands of tables with a single click or SQL statement. This is achieved through a privilege inheritance model, which lets admins set access policies on whole catalogs or schemas of objects. For example, executing the following SQL statement will give the ml_team read access to all current tables and views in the main catalog, and to any that are created in the future:
GRANT SELECT ON CATALOG main TO ml_team
This also serves as a way to set safe access defaults on catalogs and schemas. A common pattern may be to give a team a schema to store their data. Now an admin can set a policy on that schema so that by default all team members can read objects created by others.
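The schema-level pattern described above might look like the following sketch. The schema name `main.ml_sandbox` is hypothetical, and the full set of privileges a team needs (for example, usage privileges on the parent catalog and schema) can depend on your Unity Catalog release, so treat this as an illustration rather than a complete policy.

```sql
-- Hypothetical schema for the ML team; the name is an assumption for illustration.
CREATE SCHEMA IF NOT EXISTS main.ml_sandbox;

-- One statement covers every current and future table and view in the schema.
GRANT SELECT ON SCHEMA main.ml_sandbox TO ml_team;

-- Review the privileges granted on the schema.
SHOW GRANTS ON SCHEMA main.ml_sandbox;
```

Granting at the schema level rather than the catalog level keeps the default narrow: team members can read each other's objects in their own schema without gaining access to the rest of the catalog.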
Azure Managed Identities in Unity Catalog
We are excited that Unity Catalog now supports using an Azure Managed Identity to access both managed storage and external storage in a Unity Catalog metastore. Managed Identities are a Microsoft Azure construct that provides an identity for applications to use when connecting to resources that support Azure Active Directory (AAD). Up to this point, Unity Catalog relied on Service Principals as the identity used to gain access to data in Azure Data Lake Storage (ADLS). Managed Identities have two major benefits over Service Principals for this use case. First, Managed Identities do not require maintaining credentials or rotating secrets. Second, they offer a way to connect to ADLS accounts that are protected by a storage firewall.
Upgrade your Hive Metastore to Unity Catalog
Unity Catalog now offers a seamless upgrade experience from your existing Hive Metastore, so you can take advantage of all the new features described above! Users can select thousands of tables to upgrade at once within our purpose-built user interface. The upgrade tool works by copying table metadata from existing Hive Metastores to a Unity Catalog metastore. It also automatically resolves any DBFS mount points used in the table definitions, so that data can be securely accessed across your entire Databricks account. For those who prefer code over UIs, we also make the SQL syntax (‘CREATE TABLE LIKE…’) available for running against a Databricks cluster or SQL Warehouse.
Upgrade Hive Metastore
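For the SQL route, upgrading a single external table might look like the sketch below. The catalog, schema, and table names are hypothetical, and the `COPY LOCATION` clause is our assumption about how the abbreviated syntax above is completed for external tables. Check the documentation for your runtime before relying on it.

```sql
-- Hypothetical names: upgrade hive_metastore.default.events into the main catalog.
-- Assumption: COPY LOCATION reuses the source table's storage location, so only
-- the table definition (metadata) is copied and no data files are rewritten.
CREATE TABLE main.default.events
LIKE hive_metastore.default.events
COPY LOCATION;
```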
Better together with our governance and catalog partners
In addition to all the features and capabilities you’ve read about, we also have a healthy and vibrant ecosystem of partners who are joining us in supporting Unity Catalog with their products. The ecosystem is growing every day.
“Privacera integrates with Unity Catalog by leveraging the new APIs built by the Databricks team and through a policy translation layer built by Privacera. The integration is transparent to data consumers and IT administrators and supports the same fine-grained access control functionality that is supported in Privacera integration with legacy Databricks High Concurrency clusters.” –Don Bosco Durai
Read more about Privacera and Unity Catalog.
“With Unity Catalog, physical data policy enforcement is native to Databricks, less invasive to data consumers, and no longer tied to plugins specifically built for different Spark runtimes – enforcement done correctly. Meanwhile, Immuta continues to solve management challenges by providing active data monitoring, metadata discovery/centralization, scalable policy orchestration (table-, row-, column-, and cell-level controls) to include leveraging Unity Catalog’s lineage features to simplify policy enforcement, and compliance reporting/alerting.” – Steve Touw
Read more about Immuta and Unity Catalog.
“Alation and Databricks help organizations to gain data intelligence, eliminate silos, and promote governance capabilities to drive digital transformation projects. Alation enables organizations to nurture data as an asset – helping to enhance data discovery, aid understanding, promote trust and ensure compliance with relevant policies. Leveraging the data captured by the Unity metastore, Alation will enhance our existing integration with Databricks by easily including metadata from multiple workspaces. Together Databricks and Alation will ultimately provide catalog, lineage and policy management and enforcement for the Lakehouse. Alation is thrilled to partner with Databricks and looking forward to working jointly to enable data scientists, engineers, and analysts to quickly turn data into business insights.” – Ibrahim “Ibby” Rahmani
“Many of Collibra’s most strategic customers have found great value from the power of Databricks. This has been the focus of our technical integration with Unity Catalog. Collibra’s enterprise catalog brings value to business and governance personas and, thus, we think that Unity Catalog’s tactical platform focus is a perfect pairing. There are also benefits at the metadata ingestion level because there is no longer a need to have a Databricks cluster running to pull metadata. We feel that lineage, direct from a platform API like Unity Catalog, is better quality and easier to update over time as processing changes.” –Vaughn Micciche
“Atlan connects to Databricks Unity Catalog’s API to extract all relevant metadata powering discovery, governance, and insights inside Atlan. This integration allows Atlan to generate lineage for tables, views, and columns for all the jobs and languages that run on Databricks. By pairing this with metadata extracted from other tools in the data stack (e.g. BI, transformation, ELT), Atlan can create true end-to-end lineage. Thanks to Unity Catalog’s simplified delivery system, which sends complete lineage through its API, this entire experience is near instantaneous with drastically reduced compute and cost for customers. This allows Databricks customers to holistically understand the flow of their data, gain deeper insight into the data populating their models, run RCA exercises, and even power programmatic governance at scale with Atlan’s metadata activation engine.” –Amit Prabhu