Apache Spark™ Structured Streaming Real-Time Mode, announced in the summer of 2025, unlocks sub-second latency use cases across industries. This article explores an AdTech use case and how real-time mode, coupled with another recent Spark enhancement (transformWithState), can be used to achieve sub-second latency delivery of advertising event data.
Because real-time mode eliminates the need for an external streaming engine in low-latency use cases, Databricks customers can now benefit from a simplified stack, unified governance, and seamless integration with other Databricks capabilities. As Databricks delivers on its mission to democratize data and AI, enterprises can realize business value faster, with less complexity and risk.
Picture this: you’re streaming your favorite ad-supported show. The moment an ad break begins, a rapid chain of events unfolds in the background—ad requests are fired based on viewer profiles, auctions run in real-time, and winning creatives are delivered in milliseconds. But serving the ad is just the start.
Every impression must be stitched together with downstream signals, such as clicks, view completions, or purchases. Only when these events are correlated in real time can advertisers measure true performance and ensure accurate billing—an attribution challenge that demands both speed and precision. In the programmatic advertising sector, effectively managing ad inventory necessitates sub-second latency.
Advertising contracts typically mandate a specific number of impressions (for instance, a shoe company seeking one million household views for a new ad). As soon as this contractual obligation is fulfilled, subsequent ad slots can be sold to a different bidder. This impression-matching operation occurs on a massive scale. Top streaming platforms handle millions of simultaneous ad events, frequently accumulating hundreds of terabytes of data daily, all while adhering to stringent latency standards.
Modern programmatic advertising isn't just about moving data; it's about correlating independent, high-velocity event streams in near real-time. A typical pipeline must ingest and harmonize three distinct signals:
The Correlation Challenge: The complexity lies in stitching these disparately timed events together. This is not a simple stateless join; it is a sophisticated stateful operation that must:
For years, engineers faced a brutal choice: the simplicity of Spark for analytics, or the speed of specialized engines for operations. You couldn't have both. To hit sub-second SLAs, teams were forced to glue together fragmented stacks—often maintaining massive dedicated state stores just to track ad impressions.
Real-world complexity: One major media provider currently maintains a 170-node HBase cluster solely to manage state for their legacy ad platform. This isn't just expensive; it creates a governance nightmare where operational data is siloed away from the analytical lakehouse.
Now Spark can address the entire spectrum of streaming use cases. With Real-Time Mode (RTM) in Apache Spark Structured Streaming, combined with the powerful transformWithState operator, we're witnessing the convergence of analytical and operational streaming capabilities within a single, unified platform. For the first time, teams can build sophisticated, low-latency ad attribution pipelines entirely within the Spark ecosystem.
Real-time mode represents a fundamental architectural evolution in Spark Structured Streaming. Unlike traditional micro-batching, real-time mode processes events continuously as they arrive, achieving p99 latencies as low as single-digit milliseconds. Databricks benchmarks demonstrate remarkable performance characteristics:
This breakthrough enables Spark to compete directly with specialized streaming engines for the most demanding operational workloads, while maintaining the familiar DataFrame APIs and rich ecosystem integration that make Spark so powerful for analytical use cases.
Teams can enable Real-Time Mode with a single configuration change — no rewrites or replatforming required — while keeping the same Structured Streaming APIs they use today. All you need to do is apply the new trigger interval, as follows:
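The snippet below is a minimal sketch of that change in PySpark. The Kafka topic names, checkpoint path, and the realTime trigger argument are illustrative assumptions based on the public preview; verify the exact trigger syntax and the meaning of the interval (it governs checkpoint frequency, not batching) against the current Databricks documentation.

```python
# Hypothetical example: switching an existing Structured Streaming query to
# real-time mode. Everything except the trigger line is a standard
# micro-batch pipeline; topic names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rtm-demo").getOrCreate()

events = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "ad-impressions")
        .load()
)

query = (
    events.selectExpr("key", "value")          # enrichment/attribution logic omitted
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "ad-impressions-enriched")
        .option("checkpointLocation", "/tmp/checkpoints/rtm-demo")
        .trigger(realTime="5 minutes")         # the only change vs. micro-batch mode
        .outputMode("append")
        .start()
)
```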
The transformWithState operator provides the sophisticated state management capabilities required for complex event correlation, a core requirement for the Ad Attribution use case. Key features of transformWithState include Object-Oriented State Management, Composite Data Types, Timer-Driven Logic, Automatic TTL Support, and Schema Evolution.
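As a concrete illustration, the sketch below shows how impression-to-click correlation might look with the Python flavor of this API (transformWithStateInPandas with a StatefulProcessor). The column names, the single value-state layout, and the commented usage are assumptions for illustration rather than a reference implementation; consult the Spark 4.0 documentation for the exact method signatures.

```python
# Hedged sketch: correlate an impression with a later click for the same ad_id
# using per-key state. Column names (ad_id, event_type, event_ts) are assumed.
import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, StringType, LongType

output_schema = StructType([
    StructField("ad_id", StringType()),
    StructField("impression_ts", LongType()),
    StructField("click_ts", LongType()),
])

class AttributionProcessor(StatefulProcessor):
    def init(self, handle: StatefulProcessorHandle) -> None:
        # One value-state entry per ad_id holding the impression timestamp.
        imp_schema = StructType([StructField("impression_ts", LongType())])
        self.impression = handle.getValueState("impression", imp_schema)

    def handleInputRows(self, key, rows, timer_values):
        for batch in rows:                      # rows is an iterator of pandas DataFrames
            for _, row in batch.iterrows():
                if row["event_type"] == "impression":
                    # Remember the impression so a later click can be attributed.
                    self.impression.update((int(row["event_ts"]),))
                elif row["event_type"] == "click" and self.impression.exists():
                    imp_ts = self.impression.get()[0]
                    self.impression.clear()
                    yield pd.DataFrame({
                        "ad_id": [key[0]],
                        "impression_ts": [imp_ts],
                        "click_ts": [int(row["event_ts"])],
                    })

    def close(self) -> None:
        pass

# Assumed usage, with an `events` DataFrame keyed by ad_id:
# attributed = events.groupBy("ad_id").transformWithStateInPandas(
#     statefulProcessor=AttributionProcessor(),
#     outputStructType=output_schema,
#     outputMode="append",
#     timeMode="None",
# )
```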
A production ad attribution pipeline built on real-time mode follows this high-level architecture:
For an in-depth technical walkthrough of example code that you can run to observe RTM's performance, read this Databricks Community blog. The source code in that post produces a benchmark chart like the one shown below; the numbers illustrate the significant latency difference between micro-batch mode (MBM) and the new real-time mode (RTM):
Effective ad attribution requires sophisticated state management strategies:
Because Real-Time Mode fundamentally changes the execution model from periodic micro-batches to continuous processing, engineers must adapt their observability strategy to prioritize tail latency and stability over simple averages.
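One practical starting point is a StreamingQueryListener that forwards query progress into your metrics system, where you can alert on p99 behavior rather than mean latency. This is a sketch using the standard listener API; exactly which progress fields real-time mode populates, and how per-record latency is surfaced, is an assumption to verify against the runtime documentation.

```python
# Hedged sketch: forward streaming progress metrics so dashboards can track
# tail latency and stability instead of averages. Field availability under
# real-time mode should be verified; the metric sink here is just print().
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

class TailLatencyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")

    def onQueryProgress(self, event):
        p = event.progress
        # Emit these to your metrics backend and alert on percentile (p99) behavior.
        print(p.name, p.numInputRows, p.durationMs)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}, exception: {event.exception}")

spark = SparkSession.builder.getOrCreate()
spark.streams.addListener(TailLatencyListener())
```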
The introduction of real-time mode in Apache Spark™ Structured Streaming marks a pivotal moment in the evolution of stream processing. For the first time, organizations can build sophisticated, millisecond-latency operational workloads – like real-time ad attribution – entirely within the Spark ecosystem.
This breakthrough eliminates the architectural fragmentation that has long plagued data engineering teams. You no longer have to choose between Spark's analytical power and specialized engines' operational performance. With real-time mode and transformWithState, you can build unified pipelines that seamlessly bridge analytical and operational requirements.
The implications extend beyond technical capabilities. Teams can now:
Ready to revolutionize your real-time processing architecture? Real-time mode is available now in Public Preview on Databricks Runtime 16.4 LTS and above. We recommend using the latest DBR LTS version to take advantage of the latest platform improvements. Start building next-generation streaming applications today and discover the power of millisecond-latency processing with the simplicity and scale of Apache Spark™.
What is Structured Streaming in Apache Spark™?
Structured Streaming is Apache Spark’s stream processing framework that processes unbounded data incrementally using the same DataFrame and Dataset APIs as batch workloads, with Spark managing execution and fault tolerance.
Where can I find the Structured Streaming Programming Guide for Spark 4.0.0 and later?
As of Spark 4.0.0, the guide has been divided into smaller pages available here.
What is the difference between Kafka Streams and Spark Structured Streaming?
Kafka Streams and Spark Structured Streaming are two separate systems that serve different use cases. Kafka Streams is a lightweight stream processing library designed to build applications that process data stored in Kafka topics, typically embedded directly within Kafka client applications. Spark Structured Streaming is a distributed stream processing engine designed for large-scale analytics, stateful processing, and integration with diverse data sources and sinks beyond Kafka, including files, databases, and data lakes.
What is the difference between Spark Streaming (DStreams) and Apache Spark Structured Streaming?
Spark Structured Streaming is the modern and recommended streaming engine in Apache Spark. It uses the DataFrame and Dataset APIs to provide a unified programming model for batch and streaming workloads. The older Spark Streaming API, based on DStreams, is considered legacy; Spark recommends migrating to Structured Streaming and avoiding new development with DStreams.