Platform blog

How to Seamlessly Upgrade Your Hive Metastore Objects to the Unity Catalog Metastore Using SYNC

Share this post

In this blog, we explore how you can seamlessly upgrade your Hive metastore* schemas and external tables to the Unity Catalog metastore using the new SYNC command. SYNC command can also be used to push updates from the source schemas and external tables in Hive metastore to the Unity Catalog metastore schemas and tables, which have been previously upgraded. SYNC command is currently in public preview on AWS and Azure.

*Note: Hive metastore could be your default or external metastore or even AWS Glue metastore. For simplicity, we have used the term "Hive metastore" throughout this document

Common use cases for upgrading and syncing your Hive Metastore to Unity Catalog

Unity Catalog, now generally available on AWS and Azure, offers a multitude of out-of-box centralized governance features such as unified access and audit controls for all data assets in your data Lakehouse, automated data lineage for all workloads, built-in data search and discovery, privilege inheritance for simplified access management, open cross-platform data sharing, and many more.

One of the common questions that come to mind is how you can easily upgrade or migrate your tables and schemas registered in the existing Hive metastore to the Unity Catalog metastore and keep Unity Catalog in sync with the Hive metastore. While you would want to take advantage of all the rich features Unity Catalog has to offer, there can be various scenarios where you need the Hive metastore objects to co-exist even after migrating the objects to the Unity Catalog metastore. For example, you might have an ETL pipeline that writes data to tables stored in Hive metastore and you need to perform a detailed impact analysis before gradually migrating the tables to the Unity Catalog metastore. Until such time, you need to keep your Hive metastore and the unity catalog metastore in sync.

Here are the common questions we heard from our customers:

  1. How do you migrate our data workloads from two-level namespaces (Schema and tables/views) to the Unity Catalog's 3-level namespaces (Catalog, Schema, Tables/Views)?
  2. Do you need to copy data from the existing location to a new location for the table in the unity catalog metastore or just need to create a new schema and table in the unity catalog metastore and point to the existing location?
  3. How can we maintain access to Hive metastore tables while beginning to leverage Unity Catalog, and keep changes to the schema in sync?
  4. Can we have an assessment on what steps would be required to move our HMS objects to Unity Catalog metastore?

Introducing SYNC Command in Unity Catalog

To facilitate the seamless migration of your schemas and external tables from your existing Hive metastore to the Unity Catalog metastore, we have introduced a utility called SYNC. SYNC command helps you migrate your existing Hive metastore to the Unity Catalog metastore and also helps to keep both your metastores in sync on an ongoing basis until you completely migrate all your dependent applications from Hive metastore to the Unity Catalog metastore. Instead of allocating resources to build a custom solution, SYNC provides you with an easy out of the box solution to keep your existing Hive metastore and the Unity Catalog metastore in sync.

Key features of SYNC

  • Ability to upgrade an external table from Hive metastore to the Unity Catalog metastore and keep metadata of the two tables in sync.
  • Ability to upgrade all eligible tables in Hive metastore schema to the Unity Catalog metastore and keep the metadata in sync. It uses multithreading while upgrading several tables in parallel
  • Dry run mode to display the result of the SYNC command without creating or updating the target tables.
  • Ability to run SYNC multiple times on the same schema or tables to keep the source and target metastore in sync.

How does it work

The SYNC command abstracts all the complexities of migrating a schema and external tables from the Hive metastore to the Unity Catalog metastore and keeping them in sync. Once executed, it analyses the source and target tables or schemas and performs the below operations:

  1. If the target table does not exist, the sync operation creates a target table with the same name as the source table in the provided target schema. The owner of the target table will default to the user who is running the SYNC command
  2. If the target table exists, and if the table is determined to be created by a previous SYNC command or upgraded via Web Interface, the sync operation will update the table such that its schema matches with the schema of the source table.

The command outputs one row per table which is upgraded and includes a status_code and description column. The status_code column indicates the status of the upgrade for that table and the description provides a descriptive message for each table.

Getting started with the SYNC command

Prerequisites

The users running the sync command should:

  • Be the owner of the source table in case of using "SYNC TABLE"
  • Be the owner of the source schema in case of using "SYNC SCHEMA"

Note: The current version of SYNC only supports upgrades of External Tables. Please refer to the documentation for upgrading your Hive metastore Managed Tables and views to the Unity Catalog metastore. You can also use the table clone command to create a copy of an existing Hive metastore managed table at a specific version to the Unity Catalog metastore. Read this blog to learn further about table clones in Databricks.

Usage

There are two options for the upgrade using SYNC:

  • SYNC TABLE: It upgrades a table from Hive metastore to the Unity Catalog metastore
  • SYNC SCHEMA: It upgrades all eligible tables in a Schema from Hive metastore to the Unity Catalog metastore

The SYNC command upgrades tables or schemas from Hive metastore to the Unity Catalog metastore. It can be used to create new tables in the Unity Catalog metastore from existing tables in Hive metastore. It can be used to push updates from the source tables in Hive metastore to the Unity Catalog metastore tables, which have been previously upgraded using the SYNC command or via WebUI.

An optional DRY RUN clause can be used to evaluate the upgradability of the table to Unity Catalog. In the DRY RUN mode, the command checks if the given source table can be upgraded to the Unity Catalog metastore and provides a status_code and descriptive error message in case it cannot upgrade. If the table can be upgraded from Hive metastore to the Unity Catalog metastore then the status code will show ’DRY_RUN_SUCCESS’ in the DRY RUN mode and SUCCESS when the table is successfully synced.

Syntax


SYNC { SCHEMA target_schema | TABLE target_tbl }
FROM { source_schema | source_table }
[DRY RUN]

Please visit our documentation to look up details on the parameters of SYNC command.

Note: The user who runs the SYNC command will be the owner of the newly created tables

SYNC Examples

Note: We are using sample data for this example. Databricks also provides a variety of data sets that are already mounted to DBFS in your Databricks workspace. You can find more details here.

Upgrade external table to Unity Catalog

Create Hive metastore schema

use catalog hive_metastore;
drop database if exists hmsdb_sync cascade;
create database hmsdb_sync;
Create a Unity Catalog schema

use catalog main;
drop database if exists main.ucdb_sync cascade;
create database main.ucdb_sync;
Create External Table in Hive metastore

-- create an external delta table in Hive metastore
drop table if exists hive_metastore.hmsdb_sync.people_delta;
create table hive_metastore.hmsdb_sync.people_delta 
location "<<Your Object Storage Location>>"
as
select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100000;
Select the Table to verify

select * from hive_metastore.hmsdb_sync.people_delta;
Execute Dry Run

sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta DRY RUN;
Observe the Results of the Dry Run

Result of SYNC table Dry Run

Upgrade the Table and observe the result

sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta;

Result of SYNC table

Describe both source and target tables and compare

describe extended hive_metastore.hmsdb_sync.people_delta;
desc extended main.ucdb_sync.people_delta;

Describe Hive Metastore table and UC tables

Describe the Hive Metastore table and UC tables
Select from the Target table to verify the data

select * from main.ucdb_sync.people_delta;

Upgraded Unity Catalog table with sample data

Upgrade the schema and all eligible tables in one go

sync schema main.ucdb_schema_sync from hive_metastore.hmsdb_schema_sync DRY RUN;

Result of SYNC schema Dry Run


sync schema main.ucdb_schema_sync from hive_metastore.hmsdb_schema_sync;

Result of SYNC schema

Conclusion

In this blog, we have shown how you can use the SYNC command to abstract the complexity of upgrading your Hive metastore objects to Unity Catalog metastore. To learn more about the SYNC command and how to get started, please visit the guides (AWS, Azure). Please refer to the Notebook to try different options with SYNC and keep your Hive metastore schemas and external tables and your Unity Catalog metastore in sync.

SYNC can be run multiple times to ensure Hive metastore objects and the Unity Catalog metastore objects are in sync. SYNC makes it seamless and easy for customers to adopt Unity Catalog and benefit from unified governance features. If you no longer need your Hive metastore schemas and tables, you can drop them. Dropping an external table does not modify the data files on your cloud tenant.

 

Try Databricks for free

Related posts

Platform blog

What's new with Databricks Unity Catalog at the Data+AI Summit 2022

Update: Unity Catalog is now generally available on AWS and Azure. Today we are excited to announce that Unity Catalog, a unified governance...
Platform blog

Announcing Gated Public Preview of Unity Catalog on AWS and Azure

Update: Unity Catalog is now generally available on AWS and Azure. At the Data and AI Summit 2021, we announced Unity Catalog, a...
Platform blog

Announcing the Availability of Data Lineage With Unity Catalog

Update: Data Lineage is now in public preview on AWS and Azure. We are excited to announce that data lineage for Unity Catalog...
See all Platform Blog posts