
We are excited to announce the official launch of the Google Pub/Sub connector for the Databricks Lakehouse Platform. This new connector adds to an extensive ecosystem of external data source connectors, enabling you to effortlessly subscribe to Google Pub/Sub directly from Databricks and process and analyze data in real time.

With the Google Pub/Sub connector you can easily tap into the wealth of real-time data flowing through Pub/Sub topics. Whether it's streaming data from IoT devices, user interactions, or application logs, the ability to subscribe to Pub/Sub streams opens up a world of possibilities for your real-time analytics and machine learning use cases:

[Figure: Google Pub/Sub connector]

You can also use the Pub/Sub connector to drive low-latency operational use cases fueled by real-time data from Google Cloud:

[Figure: Pub/Sub connector]

In this blog, we'll discuss the key benefits of the Google Pub/Sub connector coupled with Structured Streaming on the Databricks Lakehouse Platform.

Exactly-Once Processing Semantics

Data integrity is critical when processing streaming data. The Google Pub/Sub connector in Databricks ensures exactly-once processing semantics, such that every record is processed without duplicates (on the subscribe side) or data loss, and when combined with a Delta Lake sink you get exactly-once delivery for the entire data pipeline. This means you can trust your data pipelines to operate reliably, accurately, and with low latency, even when handling high volumes of real-time data.

Simple to Configure

Simplicity is critical to effectively enable data engineers to work with streaming data. That's why we designed the Google Pub/Sub connector with a user-friendly syntax allowing you to easily configure your connection with Spark Structured Streaming using Python or Scala. By providing subscription, topic, and project details, you can quickly establish a connection and start consuming your Pub/Sub data streams right away.

Code example:

authOptions = {
    "clientId": clientId,
    "clientEmail": clientEmail,
    "privateKey": privateKey,
    "privateKeyId": privateKeyId,
}

query = (spark.readStream
    .format("pubsub")
    # a Pub/Sub subscription will be created if none exists with this ID
    .option("subscriptionId", "mysub")   # required
    .option("topicId", "mytopic")        # required
    .option("projectId", "myproject")    # required
    .options(**authOptions)
    .load())
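
Once the stream is defined, a typical next step is to write it to a Delta table; combined with a checkpoint, this is what gives the end-to-end exactly-once behavior described above. A minimal sketch (the paths are illustrative):

# A minimal sketch (paths are illustrative): checkpointing the stream into a
# Delta sink lets Structured Streaming recover without duplicating or
# dropping records.
(query.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/pubsub/_checkpoint")
    .start("/tmp/pubsub/bronze"))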

Enhanced Data Security and Granular Access Control

Data security is a top priority for any organization. Databricks recommends using secrets to securely authorize your Google Pub/Sub connection. To establish a secure connection, the following options are required (a sketch for reading them from a secret scope follows the list):

  • clientId
  • clientEmail
  • privateKey
  • privateKeyId
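
As a minimal sketch, here is one way to populate these options from a Databricks secret scope instead of hardcoding credentials in a notebook (the scope and key names below are illustrative):

# A minimal sketch: read the service account credentials from a Databricks
# secret scope (the scope and key names are illustrative).
authOptions = {
    "clientId": dbutils.secrets.get(scope="gcp-pubsub", key="clientId"),
    "clientEmail": dbutils.secrets.get(scope="gcp-pubsub", key="clientEmail"),
    "privateKey": dbutils.secrets.get(scope="gcp-pubsub", key="privateKey"),
    "privateKeyId": dbutils.secrets.get(scope="gcp-pubsub", key="privateKeyId"),
}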

Once the secure connection is established, fine-tuning access control is essential for maintaining data privacy and control. The Pub/Sub connector supports role-based access control (RBAC), allowing you to grant specific permissions to different users or groups, ensuring your data is accessed and processed only by authorized individuals. The table below describes the roles required for the configured credentials:

Role | Required / Optional | How it's used
roles/pubsub.viewer or roles/viewer | Required | Check whether the subscription exists, and get the subscription
roles/pubsub.subscriber | Required | Fetch data from the subscription
roles/pubsub.editor or roles/editor | Optional | Enables creation of a subscription if one doesn't exist, and enables the deleteSubscriptionOnStreamStop option to delete the subscription on stream termination

Schema Matching

Knowing what records you've read and how they map to the DataFrame schema for the stream is straightforward. The Google Pub/Sub connector maps incoming records onto a fixed DataFrame schema, which simplifies the overall development process. The schema of the records fetched from Pub/Sub is as follows:

Field | Type
messageId | StringType
payload | ArrayType[ByteType]
attributes | StringType
publishTimestampInMillis | LongType

NOTE: The record must contain either a non-empty payload field or at least one attribute.
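
Because the payload arrives as a byte array, a common first step is to decode it before parsing. The following is a sketch that assumes UTF-8 (for example, JSON) payloads and reuses the query DataFrame from the configuration example above:

# A sketch, assuming UTF-8 (e.g., JSON) payloads. Spark's ByteType is signed,
# so mask each value to 0-255 before assembling the bytes and decoding.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def decode_payload(byte_array):
    if byte_array is None:
        return None
    return bytes((b & 0xFF) for b in byte_array).decode("utf-8")

decoded = query.withColumn("payloadStr", decode_payload("payload"))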

Flexibility for Latency vs Cost

Spark Structured Streaming allows you to process data from your Google Pub/Sub sources incrementally while adding the option to control the trigger interval for each micro-batch. This gives you the greatest flexibility to control costs while ingesting and processing data at the frequency you need to maintain data freshness. The Trigger.AvailableNow() option consumes all the available records as an incremental batch while providing the option to configure batch size with options such as maxBytesPerTrigger (note that sizing options vary by data source). For more information on configuration options for Pub/Sub with incremental data processing in Spark Structured Streaming please refer to our product documentation.
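
For example, the following sketch drains all currently available records as a single incremental run and then stops; the subscription details, paths, and batch-size cap are illustrative, and authOptions is the dictionary defined earlier:

# A sketch, assuming the authOptions dictionary defined earlier: cap each
# micro-batch on the read side, then process everything available and stop.
capped = (spark.readStream
    .format("pubsub")
    .option("subscriptionId", "mysub")
    .option("topicId", "mytopic")
    .option("projectId", "myproject")
    .option("maxBytesPerTrigger", "10m")  # illustrative value; see the docs for supported formats
    .options(**authOptions)
    .load())

(capped.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/pubsub/_checkpoint")  # illustrative path
    .trigger(availableNow=True)  # Python form of Trigger.AvailableNow()
    .start("/tmp/pubsub/bronze"))  # illustrative path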

Monitoring Your Streaming Metrics

To provide insight into your data streams, Spark Structured Streaming reports progress metrics. For the Pub/Sub source, these include the number of records fetched and ready to process, the size of those records, and the number of duplicates seen since the stream started. This lets you track the progress and performance of your streaming jobs and tune them over time. The following is an example:

"metrics" : {
  "numDuplicatesSinceStreamStart" : "1",
  "numRecordsReadyToProcess" : "1",
  "sizeOfRecordsReadyToProcess" : "8"
}
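
These metrics are exposed through the standard StreamingQuery progress API, so you can also inspect them programmatically. A minimal sketch, assuming stream is the StreamingQuery returned by writeStream.start():

# A minimal sketch: read the Pub/Sub source metrics from the most recent
# micro-batch progress report (assumes stream is a running StreamingQuery).
progress = stream.lastProgress
if progress and progress.get("sources"):
    print(progress["sources"][0].get("metrics", {}))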

Customer Success

Xiaomi is a consumer electronics and smart manufacturing company with smartphones and smart hardware connected by an IoT platform at its core. It is one of the largest smartphone manufacturers in the world, the youngest member of the Fortune Global 500, and was one of our early preview customers for the Pub/Sub connector. Pan Zhang, Senior Staff Software Engineer in Xiaomi's International Internet Business Department, discusses the technical and business impact of the Pub/Sub connector on his team:

"We are thrilled about this connector as it revolutionizes the way we handle data ingestion and integration. Partnering with Databricks has significantly reduced the effort required in these critical areas. This collaboration has elevated our data capabilities, enabling us to stay at the forefront of technological advancements and drive innovation across our ecosystem. Furthermore, this collaboration has proven instrumental in reducing Time to Market (TTM) for Xiaomi international internet business when integrating with Google Cloud Platform (GCP) services. Pub/Sub connector has streamlined the process, enabling us to swiftly harness the power of GCP services and deliver innovative solutions to our customers. By accelerating the integration process, we can expedite the deployment of new features and enhancements, ultimately providing our users with a seamless and efficient experience."

Melexis is a Belgian microelectronics manufacturer that specializes in automotive sensors and advanced mixed-signal semiconductor solutions and is another one of our customers that has already seen early success with the Google Pub/Sub connector. David Van Hemelen, Data & Analytics Team Lead at Melexis, summarizes his team's satisfaction with the Pub/Sub connector and Databricks partnership at large:

"We wanted to stream and process large volumes of manufacturing data from our pub/sub setup into the Databricks platform so that we could leverage real-time insights for production monitoring, equipment efficiency analysis and expand our AI initiatives. The guidance provided by the Databricks team and the pub/sub connector itself exceeded our expectations in terms of functionality and performance making this a smooth and fast-paced implementation project."

Get Started Today

The Google Pub/Sub connector is available starting in Databricks Runtime 13.1. For more detail on how to get started, please refer to the "Subscribe to Google Pub/Sub" documentation.

At Databricks, we are constantly working to enhance our integrations and provide even more powerful data streaming capabilities to power your real-time analytics, AI, and operational applications from the unified Lakehouse Platform. We will also shortly announce support for Google Pub/Sub in Delta Live Tables, so stay tuned for even more updates coming soon!

In the meantime, browse dozens of customer success stories around streaming on the Databricks Lakehouse Platform, or test-drive Databricks for free on your choice of cloud.

