Cataloging data for a lakehouse
Providing seamless access across the platform requires a strong catalog server
Using AWS Glue as a catalog for Databricks
To discover data across all your services, you need a strong catalog that lets you find and access that data. AWS Glue provides a serverless, Apache Hive-compatible metastore that lets you share table metadata across AWS services, applications or AWS accounts. Databricks and Delta Lake integrate with AWS Glue so that you can discover data in your organization, register data in Delta Lake and discover data across Databricks instances.
Benefits
Databricks comes pre-integrated with AWS Glue
Simple
Simplifies manageability by using the same AWS Glue catalog across multiple Databricks workspaces.
Secure
Integrated security by using Identity and Access Management Credential Pass-Through for metadata in AWS Glue. For a detailed explanation, see the Databricks blog introducing Databricks AWS IAM Credential Pass-Through.
Collaborative
Provides easier access to metadata across Amazon services and to data cataloged in AWS Glue.
Databricks Delta Lake integration with AWS core services
This reference implementation illustrates the uniquely positioned Databricks Delta Lake integration with AWS core services to help you solve your most complex data lake challenges. Delta Lake runs on top of S3, and it is integrated with Amazon Kinesis, AWS Glue, Amazon Athena, Amazon Redshift and Amazon QuickSight, just to name a few.
If you are new to Delta Lake, you can learn more here.
Amazon Athena and Presto support for Delta Lake
When an external table is defined in the Hive metastore using manifest files, Presto and Amazon Athena can use the list of files in the manifest file rather than finding the files by directory listing. These tables can be queried just like tables with data stored in formats like Parquet.
Integrating Databricks with AWS Glue
Step 1
How to configure a Databricks cluster to access the AWS Glue Catalog
Launch
First launch the Databricks computation cluster with the necessary AWS Glue Catalog IAM role. The IAM role and policy requirements are clearly outlined in a step-by-step manner in the Databricks AWS Glue as Metastore documentation.
In this example, create an AWS IAM role called Field_Glue_Role that also has delegated access to your S3 bucket. Attach the role to the cluster configuration, as depicted in the demo video.
Update
Next, set the Spark configuration properties in the cluster configuration before the cluster starts, as shown in the how-to-update video.
See how to update the Databricks Cluster Spark Configuration properties
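As a minimal sketch of the property involved: the Databricks documentation for using AWS Glue as a metastore enables the catalog with the Spark setting below. The helper name render_spark_conf is hypothetical and exists only to show the "key value" line format that the cluster's Spark configuration field expects.

```python
# Spark property that tells a Databricks cluster to use the AWS Glue
# Data Catalog as its metastore.
GLUE_SPARK_CONF = {
    "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
}


def render_spark_conf(conf):
    """Render properties as 'key value' lines (hypothetical helper)."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Paste the printed line into the cluster's Spark configuration before startup.
print(render_spark_conf(GLUE_SPARK_CONF))
```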
Step 2
Setting up the AWS Glue database using a Databricks notebook
Attach
Before creating an AWS Glue database, attach the cluster created in the previous step to your notebook, and test your setup with the command shown here.
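The original command is not reproduced in this text; it is presumably the SHOW DATABASES statement that this walkthrough re-issues later. A minimal Python sketch, assuming the notebook's predefined spark session is passed in (the function name list_databases is hypothetical):

```python
def list_databases(spark):
    """Return the database names visible through the cluster's metastore."""
    rows = spark.sql("SHOW DATABASES").collect()
    # Read the single result column by position, because its name can
    # differ between runtime versions.
    return [row[0] for row in rows]
```

In a notebook attached to the cluster, list_databases(spark) should return the databases registered in AWS Glue if the role and Spark configuration are set up correctly.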
Validate
Then open the AWS Glue console, list the databases there, and validate that the same list is displayed.
Verify
Create a new AWS Glue database directly from the notebook, and verify that it has been created successfully by re-issuing the SHOW DATABASES command. The AWS Glue database can also be viewed via the data pane.
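The create-and-verify step above can be sketched as follows. The database name glue_demo_db and the function name are hypothetical, and the notebook's spark session is assumed to be passed in.

```python
def create_and_verify_database(spark, name):
    """Create a database through the metastore, then confirm it is listed."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {name}")
    listed = [row[0] for row in spark.sql("SHOW DATABASES").collect()]
    # Hive-compatible metastores store database names in lowercase.
    return name.lower() in listed
```

For example, create_and_verify_database(spark, "glue_demo_db") returns True once the new database appears in the listing.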
Step 3
Create a Delta Lake table and manifest file using the same metastore
Create and catalog
Create and catalog the table directly from the notebook into the AWS Glue data catalog. Refer to Populating the AWS Glue data catalog for creating and cataloging tables using crawlers.
The demo data set here is from a movie recommendation site called MovieLens and consists of movie ratings. Create a DataFrame with this Python code.
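The original Python cell is not reproduced in this text. A minimal sketch of loading such a file, assuming the ratings are stored as a CSV file with a header row; the function name and the path you pass in are hypothetical:

```python
def load_movies(spark, path):
    """Read a CSV file with a header row into a Spark DataFrame."""
    # inferSchema asks Spark to sample the file and pick column types
    # instead of reading every column as a string.
    return spark.read.csv(path, header=True, inferSchema=True)
```

In a notebook, this would be called as load_movies(spark, "<path-to-your-movielens-csv>"), with the placeholder replaced by the location of your own copy of the data.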
Register
Then register the DataFrame as a temporary table and access it using this SQL command.
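A sketch of the registration step, assuming a DataFrame and the notebook's spark session are passed in. The view name movies and the row limit are hypothetical choices, not values from the original notebook.

```python
def register_and_preview(df, spark, view_name="movies", limit=10):
    """Register df as a temporary view and return a preview query result."""
    # The view lives only for the current Spark session.
    df.createOrReplaceTempView(view_name)
    return spark.sql(f"SELECT * FROM {view_name} LIMIT {limit}")
```

The returned DataFrame can be shown in the notebook, or the same SELECT can be run in a SQL cell.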
Delta Lake
Now create a Delta Lake table using the temporary table created in the previous step and this SQL command.
Note: It’s very easy to create a Delta Lake table as described in the Delta Lake Quickstart Guide
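A sketch of the statement this step describes, built as a string so it can be passed to spark.sql or pasted into a SQL cell. The table name movies_delta and view name movies are hypothetical; the location is inferred from the manifest path that this walkthrough gives for the sample code.

```python
def build_create_delta_table_sql(table, source_view, location):
    """Build a CREATE TABLE ... AS SELECT statement that writes Delta files."""
    return (
        f"CREATE TABLE {table}\n"
        f"USING DELTA\n"
        f"LOCATION '{location}'\n"
        f"AS SELECT * FROM {source_view}"
    )


statement = build_create_delta_table_sql(
    table="movies_delta",          # hypothetical table name
    source_view="movies",          # hypothetical temporary view name
    location="s3a://aws-airlifts/movies_delta/",
)
print(statement)
```

In a notebook, spark.sql(statement) would create the table and register it in the AWS Glue data catalog through the shared metastore.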
Generating a manifest for Amazon Athena
Now generate the manifest file required by Amazon Athena using the following steps.
1. Generate manifests by running this Scala method. Remember to prefix the cell with %scala if you created a Python, SQL or R notebook.
2. Create a table in the Hive metastore connected to Athena using the special format SymlinkTextInputFormat and the manifest file location.
In the sample code, the manifest file is created in the s3a://aws-airlifts/movies_delta/_symlink_format_manifest/ file location.
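The two steps above can be sketched as follows. The original uses a Scala method; this sketch swaps in the equivalent Delta Lake SQL command (GENERATE symlink_format_manifest) so that it can be issued through spark.sql from a Python notebook. The function names, the table name movies_athena and the column list are hypothetical, and the URI scheme your query engine expects for the location is an assumption to check.

```python
def build_generate_manifest_sql(delta_path):
    """SQL that writes manifest files under <delta_path>/_symlink_format_manifest/."""
    return f"GENERATE symlink_format_manifest FOR TABLE delta.`{delta_path}`"


def build_manifest_table_sql(table, columns, delta_path):
    """DDL for an external table that reads the Delta table via its manifest."""
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    manifest_location = delta_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list})\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest_location}'"
    )


delta_path = "s3a://aws-airlifts/movies_delta/"
# Hypothetical columns; replace them with the schema of your Delta table.
columns = [("userId", "INT"), ("movieId", "INT"), ("rating", "DOUBLE")]
print(build_generate_manifest_sql(delta_path))
print(build_manifest_table_sql("movies_athena", columns, delta_path))
```

The manifest must be regenerated after the Delta table changes, otherwise queries through the manifest table keep reading the older set of files.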
Step 4
Query the Delta Lake table using Amazon Athena
Amazon Athena
Athena is a serverless service with no infrastructure to manage or maintain. You can therefore query the Delta table without a running Databricks cluster.
From the Amazon Athena console, select the database, then preview the table as shown in the video.
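The walkthrough uses the console, but the same preview can be submitted programmatically. A minimal sketch, assuming an Athena client created with boto3.client("athena") is passed in; the function name, database, table and result location are hypothetical.

```python
def preview_with_athena(athena_client, database, table, output_location, limit=10):
    """Submit a preview query to Athena and return its query execution ID."""
    response = athena_client.start_query_execution(
        QueryString=f'SELECT * FROM "{table}" LIMIT {limit}',
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The query runs asynchronously, so the returned ID is what you would later use to check the query status and fetch the results.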
Conclusion
Integrating AWS Glue provides a powerful serverless metastore strategy for all enterprises using the AWS ecosystem. Elevate the reliability of data lakes with Delta Lake and provide seamless, serverless data access by integrating with Amazon Athena. The Databricks Lakehouse Platform powers the data lake strategy on AWS that enables data analysts, data engineers and data scientists to get performant and reliable data access.