Databricks runs on AWS and integrates with the major services you already use, such as S3, EC2 and Redshift, enabling you to build a lakehouse architecture simply and seamlessly.
Databricks Lakehouse on AWS overview
The Databricks Lakehouse Platform sits at the heart of the AWS ecosystem, and easily integrates with popular Data + AI services like Kinesis streams, S3 buckets, Glue, Athena, Redshift, QuickSight and much more. In this demo, we’ll show you how Databricks integrates with each of these services in a simple, seamless way.
Connecting to EC2, S3, Glue and IAM
When we start up a Spark cluster on Databricks, we can configure it to use the Glue Data Catalog, and attach an IAM instance profile that grants its EC2 instances access to S3 buckets and other AWS services.
One of the first things to do when working with Databricks on AWS is to set up a Spark cluster in your Virtual Private Cloud, which can autoscale up and down to control cloud costs as your data workloads change. Databricks Spark clusters use EC2 instances on the back end, and you can configure them to use the AWS Glue Data Catalog. You can also attach an AWS instance profile to your cluster to control and manage access to S3 buckets and other resources.
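As a rough sketch, the Glue Data Catalog integration is typically switched on through the cluster's Spark configuration; the key below is the documented Databricks setting, while the IAM instance profile itself is attached separately through the cluster's settings (the profile name here is a placeholder):

```
# Cluster Spark config: use the AWS Glue Data Catalog as the metastore
spark.databricks.hive.metastore.glueCatalog.enabled true

# Instance profile (attached via cluster settings, not Spark config):
#   arn:aws:iam::<account-id>:instance-profile/my-databricks-profile   <- placeholder
```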
Ingesting Kinesis streams into Delta Lake
Now that our autoscaling Spark cluster is up and running, let’s start by ingesting real-time data from a Kinesis stream in just a few lines of code, using Spark Structured Streaming and the built-in Databricks–Kinesis connector. First, we’ll view some of the raw data from our streaming DataFrame. Next, we can save it as a Delta Lake Bronze table stored in S3, using the code you see here. Delta Lake is the foundation of a lakehouse architecture, providing ACID transactions on cloud object storage, as well as tables that unify batch and streaming data processing to simplify your data architecture.
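The ingestion step above can be sketched in PySpark. This is a minimal sketch, not the demo's exact notebook: the stream name, region, and S3 paths are placeholders, the `kinesis` source is the Databricks-provided connector (so the streaming functions only run on a Databricks cluster), and the small `decode_record` helper assumes the producer writes UTF-8 JSON payloads:

```python
import json


def decode_record(data: bytes) -> dict:
    """Kinesis delivers each record's payload as raw bytes; this helper
    assumes the producer wrote UTF-8 encoded JSON."""
    return json.loads(data.decode("utf-8"))


def start_bronze_ingest(spark,
                        stream_name="demo-stream",              # placeholder
                        region="us-west-2",                     # placeholder
                        table_path="s3://my-bucket/bronze/events",
                        checkpoint="s3://my-bucket/_checkpoints/events"):
    """Read a Kinesis stream and continuously append it to a Delta
    Lake Bronze table in S3."""
    raw = (spark.readStream
           .format("kinesis")                  # built-in Databricks connector
           .option("streamName", stream_name)
           .option("region", region)
           .option("initialPosition", "latest")
           .load())
    return (raw.writeStream
            .format("delta")                   # Bronze table in Delta format
            .option("checkpointLocation", checkpoint)
            .start(table_path))
```

The returned streaming query keeps running until stopped, so the Bronze table stays continuously up to date with the stream.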
Viewing Delta Lake tables in the Glue console
We can view the table we just created from within Databricks by running the SHOW TABLES command in a notebook, or by clicking the Data tab and navigating to the database where the tables are stored. Since we set up our cluster to integrate with the AWS Glue Data Catalog, we can also view these Delta Lake tables directly in the Glue console. When we search for them, you can see that all of the tables we viewed in Databricks are now present in Glue.
Databricks – Redshift integration
Databricks also makes it easy to work with data stored in your Redshift data warehouse. Here, we’re writing some sample data to Redshift using the built-in Databricks Redshift connector. We can read from Redshift using the same connector. Alternatively, you can use a PostgreSQL JDBC driver or the Redshift Data API from Databricks to do the same. Or we can jump into the Redshift console and query the table we just created from Databricks.
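The read and write paths through the Redshift connector look roughly like the sketch below. The option names (`url`, `dbtable`, `query`, `tempdir`, `aws_iam_role`) follow the Databricks Redshift connector, which stages data through an S3 directory; the cluster host, table, role and paths are placeholders, and the Spark functions assume a Databricks cluster:

```python
def redshift_jdbc_url(host, port, database):
    """Build the JDBC URL the connector's `url` option expects."""
    return f"jdbc:redshift://{host}:{port}/{database}"


def write_to_redshift(df, jdbc_url, table,
                      tempdir="s3://my-bucket/redshift-staging/",  # placeholder
                      iam_role="arn:aws:iam::123456789012:role/redshift-copy"):
    """Append a DataFrame to a Redshift table via an S3 staging area."""
    (df.write
       .format("redshift")
       .option("url", jdbc_url)
       .option("dbtable", table)
       .option("tempdir", tempdir)       # S3 staging area used by COPY
       .option("aws_iam_role", iam_role)
       .mode("append")
       .save())


def read_from_redshift(spark, jdbc_url, query,
                       tempdir="s3://my-bucket/redshift-staging/",
                       iam_role="arn:aws:iam::123456789012:role/redshift-copy"):
    """Run a query in Redshift and load the result as a DataFrame."""
    return (spark.read
            .format("redshift")
            .option("url", jdbc_url)
            .option("query", query)
            .option("tempdir", tempdir)   # S3 staging area used by UNLOAD
            .option("aws_iam_role", iam_role)
            .load())
```

Because the connector moves data through S3 with Redshift's bulk COPY/UNLOAD commands, it scales far better than row-by-row JDBC transfers.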
Databricks – QuickSight integration
Finally, we can also connect Amazon QuickSight to Databricks to explore our data visually and create attractive dashboards and reports.
As we’ve seen, Databricks provides a simple, open and collaborative lakehouse platform that deeply integrates with your AWS services. Download the notebooks used in this demo from the Databricks Demo Hub by clicking the link in the description below. Or visit databricks.com/try to get started with a free trial of Databricks on AWS today.