
We are excited to announce the official launch of the Google Pub/Sub connector for the Databricks Lakehouse Platform. This new connector adds to an extensive ecosystem of external data source connectors, enabling you to effortlessly subscribe to Google Pub/Sub directly from Databricks and process and analyze data in real time.

With the Google Pub/Sub connector you can easily tap into the wealth of real-time data flowing through Pub/Sub topics. Whether it's streaming data from IoT devices, user interactions, or application logs, the ability to subscribe to Pub/Sub streams opens up a world of possibilities for your real-time analytics and machine learning use cases:

[Figure: Google Pub/Sub connector]

You can also use the Pub/Sub connector to drive low-latency operational use cases fueled by real-time data from Google Cloud:

[Figure: Pub/Sub connector]

In this blog, we'll discuss the key benefits of the Google Pub/Sub connector coupled with Structured Streaming on the Databricks Lakehouse Platform.

Exactly-Once Processing Semantics

Data integrity is critical when processing streaming data. The Google Pub/Sub connector in Databricks ensures exactly-once processing semantics, such that every record is processed without duplicates (on the subscribe side) or data loss, and when combined with a Delta Lake sink you get exactly-once delivery for the entire data pipeline. This means you can trust your data pipelines to operate reliably, accurately, and with low latency, even when handling high volumes of real-time data.

Simple to Configure

Simplicity is critical to effectively enable data engineers to work with streaming data. That's why we designed the Google Pub/Sub connector with a user-friendly syntax allowing you to easily configure your connection with Spark Structured Streaming using Python or Scala. By providing subscription, topic, and project details, you can quickly establish a connection and start consuming your Pub/Sub data streams right away.

Code example:

authOptions = {
    "clientId": clientId,
    "clientEmail": clientEmail,
    "privateKey": privateKey,
    "privateKeyId": privateKeyId,
}

query = (spark.readStream
    .format("pubsub")
    # a Pub/Sub subscription will be created if none exists with this ID
    .option("subscriptionId", "mysub")   # required
    .option("topicId", "mytopic")        # required
    .option("projectId", "myproject")    # required
    .options(**authOptions)
    .load())
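
Once the stream is defined, a typical next step is to write it to a Delta table; combined with a checkpoint, this is what gives the end-to-end exactly-once behavior described above. A minimal sketch (the paths are illustrative):

# A minimal sketch (paths are illustrative): checkpointing the stream into a
# Delta sink lets Structured Streaming recover without duplicating or
# dropping records.
(query.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/pubsub/_checkpoint")
    .start("/tmp/pubsub/bronze"))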

Enhanced Data Security and Granular Access Control

Data security is a top priority for any organization. Databricks recommends using secrets to securely authorize your Google Pub/Sub connection. To establish a secure connection, the following options are required (a sketch for reading them from a secret scope follows the list):

  • clientId
  • clientEmail
  • privateKey
  • privateKeyId
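
As a minimal sketch, here is one way to populate these options from a Databricks secret scope instead of hardcoding credentials in a notebook (the scope and key names below are illustrative):

# A minimal sketch: read the service account credentials from a Databricks
# secret scope (the scope and key names are illustrative).
authOptions = {
    "clientId": dbutils.secrets.get(scope="gcp-pubsub", key="clientId"),
    "clientEmail": dbutils.secrets.get(scope="gcp-pubsub", key="clientEmail"),
    "privateKey": dbutils.secrets.get(scope="gcp-pubsub", key="privateKey"),
    "privateKeyId": dbutils.secrets.get(scope="gcp-pubsub", key="privateKeyId"),
}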

Once the secure connection is established, fine-tuning access control is essential for maintaining data privacy and control. The Pub/Sub connector supports role-based access control (RBAC), allowing you to grant specific permissions to different users or groups, ensuring your data is accessed and processed only by authorized individuals. The table below describes the roles required for the configured credentials:

Role | Required / Optional | How it's used
roles/pubsub.viewer or roles/viewer | Required | Check whether the subscription exists, and get the subscription
roles/pubsub.subscriber | Required | Fetch data from the subscription
roles/pubsub.editor or roles/editor | Optional | Enables creation of a subscription if one doesn't exist, and enables the deleteSubscriptionOnStreamStop option to delete the subscription on stream termination

Schema Matching

Knowing what records you've read and how they map to the DataFrame schema for the stream is straightforward. The Google Pub/Sub connector maps incoming records onto a fixed DataFrame schema, which simplifies the overall development process. The schema of the records fetched from Pub/Sub is as follows:

Field | Type
messageId | StringType
payload | ArrayType[ByteType]
attributes | StringType
publishTimestampInMillis | LongType

NOTE: The record must contain either a non-empty payload field or at least one attribute.
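
Because the payload arrives as a byte array, a common first step is to decode it before parsing. The following is a sketch that assumes UTF-8 (for example, JSON) payloads and reuses the query DataFrame from the configuration example above:

# A sketch, assuming UTF-8 (e.g., JSON) payloads. Spark's ByteType is signed,
# so mask each value to 0-255 before assembling the bytes and decoding.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def decode_payload(byte_array):
    if byte_array is None:
        return None
    return bytes((b & 0xFF) for b in byte_array).decode("utf-8")

decoded = query.withColumn("payloadStr", decode_payload("payload"))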

Flexibility for Latency vs Cost

Spark Structured Streaming allows you to process data from your Google Pub/Sub sources incrementally while adding the option to control the trigger interval for each micro-batch. This gives you the greatest flexibility to control costs while ingesting and processing data at the frequency you need to maintain data freshness. The Trigger.AvailableNow() option consumes all the available records as an incremental batch while providing the option to configure batch size with options such as maxBytesPerTrigger (note that sizing options vary by data source). For more information on configuration options for Pub/Sub with incremental data processing in Spark Structured Streaming please refer to our product documentation.
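
For example, the following sketch drains all currently available records as a single incremental run and then stops; the subscription details, paths, and batch-size cap are illustrative, and authOptions is the dictionary defined earlier:

# A sketch, assuming the authOptions dictionary defined earlier: cap each
# micro-batch on the read side, then process everything available and stop.
capped = (spark.readStream
    .format("pubsub")
    .option("subscriptionId", "mysub")
    .option("topicId", "mytopic")
    .option("projectId", "myproject")
    .option("maxBytesPerTrigger", "10m")  # illustrative value; see the docs for supported formats
    .options(**authOptions)
    .load())

(capped.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/pubsub/_checkpoint")  # illustrative path
    .trigger(availableNow=True)  # Python form of Trigger.AvailableNow()
    .start("/tmp/pubsub/bronze"))  # illustrative path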

Monitoring Your Streaming Metrics

To provide insight into your data streams, Spark Structured Streaming reports progress metrics. For the Pub/Sub source, these include the number of records fetched and ready to process, the size of those records, and the number of duplicates seen since the stream started. This lets you track the progress and performance of your streaming jobs and tune them over time. The following is an example:

"metrics" : {
  "numDuplicatesSinceStreamStart" : "1",
  "numRecordsReadyToProcess" : "1",
  "sizeOfRecordsReadyToProcess" : "8"
}
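
These metrics are exposed through the standard StreamingQuery progress API, so you can also inspect them programmatically. A minimal sketch, assuming stream is the StreamingQuery returned by writeStream.start():

# A minimal sketch: read the Pub/Sub source metrics from the most recent
# micro-batch progress report (assumes stream is a running StreamingQuery).
progress = stream.lastProgress
if progress and progress.get("sources"):
    print(progress["sources"][0].get("metrics", {}))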

Customer Success

Xiaomi is a consumer electronics and smart manufacturing company with smartphones and smart hardware connected by an IoT platform at its core. It is one of the largest smartphone manufacturers in the world, the youngest member of the Fortune Global 500, and was one of our early preview customers for the Pub/Sub connector. Pan Zhang, Senior Staff Software Engineer in Xiaomi's International Internet Business Department, discusses the technical and business impact of the Pub/Sub connector on his team:

"We are thrilled about this connector as it revolutionizes the way we handle data ingestion and integration. Partnering with Databricks has significantly reduced the effort required in these critical areas. This collaboration has elevated our data capabilities, enabling us to stay at the forefront of technological advancements and drive innovation across our ecosystem. Furthermore, this collaboration has proven instrumental in reducing Time to Market (TTM) for Xiaomi international internet business when integrating with Google Cloud Platform (GCP) services. Pub/Sub connector has streamlined the process, enabling us to swiftly harness the power of GCP services and deliver innovative solutions to our customers. By accelerating the integration process, we can expedite the deployment of new features and enhancements, ultimately providing our users with a seamless and efficient experience."

Melexis is a Belgian microelectronics manufacturer that specializes in automotive sensors and advanced mixed-signal semiconductor solutions and is another one of our customers that has already seen early success with the Google Pub/Sub connector. David Van Hemelen, Data & Analytics Team Lead at Melexis, summarizes his team's satisfaction with the Pub/Sub connector and Databricks partnership at large:

"We wanted to stream and process large volumes of manufacturing data from our pub/sub setup into the Databricks platform so that we could leverage real-time insights for production monitoring, equipment efficiency analysis and expand our AI initiatives. The guidance provided by the Databricks team and the pub/sub connector itself exceeded our expectations in terms of functionality and performance making this a smooth and fast-paced implementation project."

Get Started Today

The Google Pub/Sub connector is available starting in Databricks Runtime 13.1. For more detail on how to get started, please refer to the "Subscribe to Google Pub/Sub" documentation.

At Databricks, we are constantly working to enhance our integrations and provide even more powerful data streaming capabilities to power your real-time analytics, AI, and operational applications from the unified Lakehouse Platform. We will also shortly announce support for Google Pub/Sub in Delta Live Tables, so stay tuned for even more updates coming soon!

In the meantime, browse dozens of customer success stories around streaming on the Databricks Lakehouse Platform, or test-drive Databricks for free on your choice of cloud.

