Cataloging data for a lakehouse

Providing seamless access across the platform requires a strong catalog server

Using AWS Glue as a catalog for Databricks

To discover data across all your services, you need a strong catalog that lets you find and access it. AWS Glue provides a serverless metastore, compatible with the Apache Hive metastore, that lets you share table metadata across AWS services, applications or AWS accounts. Databricks and Delta Lake integrate with AWS Glue so that you can discover data in your organization, register data in Delta Lake and discover data across Databricks instances.


Databricks comes pre-integrated with AWS Glue


Simplifies manageability by using the same AWS Glue catalog across multiple Databricks workspaces.


Integrates security by using AWS Identity and Access Management (IAM) credential passthrough for metadata in AWS Glue. For a detailed explanation, see the Databricks blog introducing Databricks AWS IAM Credential Pass-Through.


Provides easier access to metadata across the Amazon services and access to data cataloged in AWS Glue.

Databricks Delta Lake integration with AWS core services

This reference implementation illustrates the uniquely positioned Databricks Delta Lake integration with AWS core services to help you solve your most complex data lake challenges. Delta Lake runs on top of S3, and it is integrated with Amazon Kinesis, AWS Glue, Amazon Athena, Amazon Redshift and Amazon QuickSight, just to name a few.

If you are new to Delta Lake, you can learn more here.


Integrating Databricks with AWS Glue


How to configure a Databricks cluster to access the AWS Glue Catalog


First, launch the Databricks compute cluster with the necessary AWS Glue Catalog IAM role. The IAM role and policy requirements are outlined step by step in the Databricks AWS Glue as Metastore documentation.

In this example, create an AWS IAM role called Field_Glue_Role, which also has delegated access to the S3 bucket that holds the data. Attach the role to the cluster configuration, as depicted in the demo video.
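As a rough illustration of what the Glue permissions on such a role might contain, the sketch below builds a policy document in Python. The action list is a small subset chosen for illustration, and the helper name `glue_catalog_policy` and the `Sid` value are hypothetical. The Databricks AWS Glue as Metastore documentation remains the authoritative source for the full policy.

```python
import json


def glue_catalog_policy(resource="*"):
    """Build an illustrative IAM policy document for basic Glue catalog access.

    The action list is an assumed subset, not the complete documented policy.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantCatalogAccessToGlue",  # hypothetical statement id
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:CreateDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                ],
                "Resource": [resource],
            }
        ],
    }


# Print the JSON that could be pasted into the IAM console's policy editor.
print(json.dumps(glue_catalog_policy(), indent=2))
```

In practice the `Resource` entry would likely be narrowed from `*` to specific catalog, database and table ARNs.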


Next, the Spark Configuration properties of the cluster configuration must be set prior to the cluster startup as shown in the how to update video.
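The video is not reproduced here, so as a minimal sketch: the property that enables the Glue Data Catalog as the cluster's metastore is `spark.databricks.hive.metastore.glueCatalog.enabled`. The helper below (its name is an illustrative assumption) renders a configuration dictionary in the one-pair-per-line form used by the cluster's Spark configuration field.

```python
# Spark properties to set before the cluster starts.
GLUE_SPARK_CONF = {
    # Use the AWS Glue Data Catalog as the metastore for this cluster.
    "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
}


def render_spark_conf(conf):
    """Return one 'key value' line per property, sorted by key for stable output."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(render_spark_conf(GLUE_SPARK_CONF))
```

Because the property is read at startup, changing it on a running cluster would only take effect after a restart.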


Setting up the AWS Glue database using a Databricks notebook


Before creating an AWS Glue database, attach the cluster created in the previous step to your notebook and test your setup by listing the databases with the SHOW DATABASES command.
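A minimal sketch of that check from a Python notebook cell, assuming `spark` is the SparkSession that Databricks notebooks predefine. The helper name `list_databases` is illustrative.

```python
def list_databases(spark):
    """Return the database names visible through the cluster's metastore."""
    rows = spark.sql("SHOW DATABASES").collect()
    # The name is the first column; its label differs between Spark versions,
    # so index by position rather than by column name.
    return sorted(row[0] for row in rows)
```

In a notebook, `print(list_databases(spark))` should then show the databases already registered in the Glue catalog.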


Then list the databases in the AWS Glue console and confirm that the same list is displayed.


Create a new AWS Glue database directly from the notebook and verify that it has been created successfully by re-issuing the SHOW DATABASES command. The AWS Glue database can also be viewed via the data pane.
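The create-then-verify step could be sketched as follows. The database name `glue_demo_db` and both helper names are hypothetical, and `spark` is again assumed to be the notebook's predefined SparkSession.

```python
import re

# Accept only plain identifiers so the name can be placed in SQL safely.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_database_sql(name):
    """Build the CREATE DATABASE statement for a validated database name."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return f"CREATE DATABASE IF NOT EXISTS {name}"


def create_and_verify(spark, name):
    """Create the database, then confirm that SHOW DATABASES lists it."""
    spark.sql(create_database_sql(name))
    names = [row[0] for row in spark.sql("SHOW DATABASES").collect()]
    # Metastore names are stored in lower case, so compare in lower case.
    return name.lower() in names
```

For example, `create_and_verify(spark, "glue_demo_db")` would return `True` once the database appears in the catalog.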


Create a Delta Lake table and manifest file using the same metastore

Create and catalog

Create and catalog the table directly from the notebook into the AWS Glue data catalog. Refer to Populating the AWS Glue data catalog for creating and cataloging tables using crawlers.

The demo data set here is from a movie recommendation site called MovieLens and consists of movie ratings. Create a DataFrame from this data with Python.
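The original code is not reproduced in this text. A minimal sketch, assuming the ratings are available as a CSV file with a header row, might look like this. The helper name `load_ratings` and the file location passed to it are hypothetical.

```python
def load_ratings(spark, csv_path):
    """Read a ratings CSV into a DataFrame.

    Assumes the file has a header row; column types are inferred from the data.
    """
    return spark.read.csv(csv_path, header=True, inferSchema=True)
```

In a notebook, `df = load_ratings(spark, "<path to the ratings CSV>")` followed by `df.printSchema()` would show the inferred columns.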


Then register the DataFrame as a temporary table and query it with SQL.
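One way to sketch this step in Python: the view name `movies` and the helper name are assumptions, and `df` stands for the DataFrame holding the ratings.

```python
def register_temp_view(df, view_name="movies"):
    """Expose the DataFrame to SQL and return a query that previews it."""
    # A temporary view lives only for the current Spark session.
    df.createOrReplaceTempView(view_name)
    return f"SELECT * FROM {view_name} LIMIT 10"
```

The returned statement can be passed to `spark.sql(...)` or pasted into a `%sql` cell.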

Delta Lake

Now create a Delta Lake table from the temporary table created in the previous step, using SQL.
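A hedged sketch of such a statement, built in Python so that it can be passed to `spark.sql(...)`. The table name `movies_delta` and the view name `movies` are assumptions. The storage location is the Delta table path implied by the manifest location given in this walkthrough.

```python
def create_delta_table_sql(table, source_view, location):
    """Build a CREATE TABLE ... AS SELECT statement that writes a Delta table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING DELTA LOCATION '{location}' "
        f"AS SELECT * FROM {source_view}"
    )


print(create_delta_table_sql("movies_delta", "movies", "s3a://aws-airlifts/movies_delta/"))
```

Because the cluster uses AWS Glue as its metastore, a table created this way would be registered in the Glue catalog as well.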

Note: Creating a Delta Lake table is straightforward, as described in the Delta Lake Quickstart Guide.

Generating a manifest for Amazon Athena

Now generate the manifest file required by Amazon Athena using the following steps.

1. Generate manifests by running this Scala method. Remember to prefix the cell with %scala if you have created a Python, SQL or R notebook.

2. Create a table in the Hive metastore connected to Athena using the special format SymlinkTextInputFormat and the manifest file location.

In the sample code, the manifest file is created in the s3a://aws-airlifts/movies_delta/_symlink_format_manifest/ file location.
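The Scala method and the table definition are not reproduced in this text. As a stand-in, the sketch below builds both statements in Python: it uses the Delta Lake SQL `GENERATE` command in place of the Scala call, and a Hive-style DDL that reads the manifest through SymlinkTextInputFormat. The table name, the column list and the helper names are assumptions for illustration. The columns must match the schema of the actual Delta table.

```python
def generate_manifest_sql(delta_path):
    """Build the statement that writes symlink manifests for a Delta table path."""
    return f"GENERATE symlink_format_manifest FOR TABLE delta.`{delta_path}`"


def manifest_table_ddl(table, columns, delta_path):
    """Build the DDL for an external table that reads the Delta table via its manifest.

    `columns` is a list of (name, type) pairs mirroring the Delta table's schema.
    """
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
    manifest = delta_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest}'"
    )


# Hypothetical schema for a ratings table; replace with the real column list.
ASSUMED_COLUMNS = [("userId", "int"), ("movieId", "int"), ("rating", "double")]

print(generate_manifest_sql("s3a://aws-airlifts/movies_delta/"))
print(manifest_table_ddl("movies_athena", ASSUMED_COLUMNS, "s3a://aws-airlifts/movies_delta/"))
```

The manifest lists the Parquet files that make up the current version of the table, so it would need to be regenerated after the Delta table changes for queries through it to see the new data.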


Query the Delta Lake table using Amazon Athena

Amazon Athena

Athena is a serverless service, so there is no infrastructure to manage or maintain. You can therefore query the Delta table without a running Databricks cluster.

From the Amazon Athena console, select the database, then preview the table as shown in the video.


Integrating AWS Glue provides a powerful serverless metastore strategy for all enterprises using the AWS ecosystem. Elevate the reliability of data lakes with Delta Lake and provide seamless, serverless data access by integrating with Amazon Athena. The Databricks Lakehouse Platform powers the data lake strategy on AWS that enables data analysts, data engineers and data scientists to get performant and reliable data access.
