
Cataloging data for a lakehouse

Providing seamless access across the platform requires a strong catalog server

Using AWS Glue as a catalog for Databricks

To discover data across all of your services, you need a strong catalog for finding and accessing data. AWS Glue is a serverless, Apache Hive-compatible metastore that lets you easily share table metadata across AWS services, applications and AWS accounts. Databricks and Delta Lake are integrated with AWS Glue, so you can discover data in your organization, register data in Delta Lake and discover data between Databricks instances.


Databricks comes pre-integrated with AWS Glue



Simplifies manageability by using the same AWS Glue catalog across multiple Databricks workspaces.



Integrates security by using AWS Identity and Access Management (IAM) credential pass-through for metadata in AWS Glue. For a detailed explanation, see the Databricks blog introducing Databricks AWS IAM Credential Pass-Through.



Provides easier access to metadata across Amazon services and to data cataloged in AWS Glue.

Databricks Delta Lake integration with AWS core services

This reference implementation illustrates the uniquely positioned Databricks Delta Lake integration with AWS core services to help you solve your most complex data lake challenges. Delta Lake runs on top of S3, and it is integrated with Amazon Kinesis, AWS Glue, Amazon Athena, Amazon Redshift and Amazon QuickSight, just to name a few.

If you are new to Delta Lake, you can learn more here.


Amazon Athena and Presto support for Delta Lake

When an external table is defined in the Hive metastore using manifest files, Presto and Amazon Athena can use the list of files in the manifest file rather than finding the files by directory listing. These tables can be queried just like tables with data stored in formats like Parquet.

Integrating Databricks with AWS Glue

Step 1

How to configure a Databricks cluster to access the AWS Glue Catalog


First launch the Databricks computation cluster with the necessary AWS Glue Catalog IAM role. The IAM role and policy requirements are clearly outlined in a step-by-step manner in the Databricks AWS Glue as Metastore documentation.
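As an illustration only, a Glue-access policy statement typically allows actions like the following; the Databricks AWS Glue as Metastore documentation linked above enumerates the exact set required, so treat this abbreviated policy as a sketch rather than a complete configuration:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:UpdateTable",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchCreatePartition"
      ],
      "Resource": "*"
    }
  ]
}
```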

In this example, create an AWS IAM role called Field_Glue_Role, which also has delegated access to your S3 bucket. Attach the role to the cluster configuration, as depicted in the demo video.




Next, set the Spark configuration properties of the cluster before cluster startup, as shown in the how-to video.
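Per the Databricks documentation, the key property that switches the cluster's metastore to the AWS Glue Data Catalog is shown below; the catalog-ID line is only needed when the Glue catalog lives in a different AWS account:

```ini
# Use the AWS Glue Data Catalog as the cluster's metastore
spark.databricks.hive.metastore.glueCatalog.enabled true
# Optional: target a Glue catalog in another AWS account
# spark.hadoop.hive.metastore.glue.catalogid <aws-account-id-of-the-catalog>
```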



Step 2

Setting up the AWS Glue database using a Databricks notebook



Before creating an AWS Glue database, attach your notebook to the cluster created in the previous step and test your setup with the command shown here.
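A minimal sanity check is to list the databases the cluster can see; with the Glue catalog enabled, this should return the databases registered in the AWS Glue Data Catalog:

```sql
SHOW DATABASES;
```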



Then list the databases in the AWS Glue console and validate that it displays the same set.



Create a new AWS Glue database directly from the notebook and verify that it has been created successfully by re-issuing the SHOW DATABASES command. The AWS Glue database can also be viewed from the data pane.
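Creating the database from the notebook and re-checking might look like the following sketch; the database name glue_demo_db is an assumption for illustration:

```sql
CREATE DATABASE IF NOT EXISTS glue_demo_db;
SHOW DATABASES;
```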

Step 3

Create a Delta Lake table and manifest file using the same metastore


Create and catalog

Create and catalog the table directly from the notebook into the AWS Glue data catalog. Refer to Populating the AWS Glue data catalog for creating and cataloging tables using crawlers.

The demo data set here is from a movie recommendation site called MovieLens, which comprises movie ratings. Create a DataFrame with this Python code.



Then register the DataFrame as a temporary table and access it using this SQL command.


Delta Lake

Now create a Delta Lake table using the temporary table created in the previous step and this SQL command.
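A sketch of that statement, assuming the temporary view from the previous step is named movies_temp and using the S3 location implied by the manifest path quoted later in this article:

```sql
CREATE TABLE movies_delta
USING DELTA
LOCATION 's3a://aws-airlifts/movies_delta/'
AS SELECT * FROM movies_temp;
```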

Note: Creating a Delta Lake table is straightforward, as described in the Delta Lake Quickstart Guide.


Generating a manifest for Amazon Athena

Now generate the manifest file required by Amazon Athena using the following steps.

1. Generate manifests by running this Scala method. Remember to prefix the cell with %scala if you have created a Python, SQL or R notebook.

2. Create a table in the Hive metastore connected to Athena using the special format SymlinkTextInputFormat and the manifest file location.

In the sample code, the manifest file is created in the s3a://aws-airlifts/movies_delta/_symlink_format_manifest/ file location.
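The two numbered steps above might be sketched as follows. DeltaTable.generate("symlink_format_manifest") is the documented Delta Lake API for producing the manifest; the table schema in the Athena DDL is an assumption, and note that Athena expects an s3:// URI where Spark code uses s3a://.

```scala
// Step 1 (run in a %scala cell): generate the symlink manifest for the Delta table
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath("s3a://aws-airlifts/movies_delta/")
deltaTable.generate("symlink_format_manifest")
```

```sql
-- Step 2 (run in Athena): define an external table over the manifest.
-- The column list is illustrative; replace it with the table's real schema.
CREATE EXTERNAL TABLE movies_delta (
  userId INT,
  movieId INT,
  rating DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://aws-airlifts/movies_delta/_symlink_format_manifest/';
```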

Step 4

Query the Delta Lake table using Amazon Athena

Amazon Athena

Amazon Athena is a serverless service with no infrastructure to manage or maintain, so you can query the Delta table without a running Databricks cluster.

From the Amazon Athena console, select the database, then preview the table as shown in the video.
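A preview query in Athena might look like this, assuming the external table created in Step 3 is named movies_delta:

```sql
SELECT * FROM movies_delta LIMIT 10;
```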



Integrating AWS Glue provides a powerful serverless metastore strategy for all enterprises using the AWS ecosystem. Elevate the reliability of data lakes with Delta Lake and provide seamless, serverless data access by integrating with Amazon Athena. The Databricks Lakehouse Platform powers the data lake strategy on AWS that enables data analysts, data engineers and data scientists to get performant and reliable data access.





