This talk will cover the details of Continuous Processing in Structured Streaming and my work implementing the initial version in Spark 2.3, as well as the updates for 2.4. DStreams was Spark’s first attempt at streaming, and through DStreams Spark became the first framework to provide both batch and streaming functionality in one unified execution engine.
Streaming execution happens through a “micro-batch” model, in which the underlying execution engine simply runs on batches of data over and over again. The DStreams design tightly couples the user-facing APIs with the execution model, and as a result it was very difficult to accomplish certain tasks important in streaming, e.g. using event time and handling late data, without breaking the user-facing APIs. Structured Streaming is the second (and latest) major streaming effort in Spark. Its design decouples the frontend (user-facing APIs) from the backend (execution), allowing us to change the execution model without any change to the user APIs.
However, the (historical) minimum possible latency for any record in DStreams or Structured Streaming was bounded by the time it takes to launch a task. This limitation is a result of the fact that the engine requires both the starting and the ending offset to be known before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time plus the task-launching time. Continuous Processing removes this constraint and allows users to achieve sub-millisecond end-to-end latencies with the new execution engine.
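Because the frontend and backend are decoupled, opting into the continuous execution model is just a trigger change on an otherwise ordinary streaming query. A minimal sketch, assuming Spark 2.3+ (the "rate" source, console sink, and 1-second checkpoint interval here are illustrative choices, not prescribed by the talk):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder
  .appName("continuous-processing-sketch")
  .getOrCreate()

// Any continuous-compatible source works; "rate" is a built-in test source.
val df = spark.readStream
  .format("rate")
  .load()

val query = df.writeStream
  .format("console")
  // Trigger.Continuous sets the checkpoint interval, not a batch interval:
  // records flow through long-running tasks as they arrive.
  .trigger(Trigger.Continuous("1 second"))
  .start()
```

Note that with `Trigger.Continuous`, the interval controls how often the engine checkpoints progress, unlike a micro-batch trigger where the interval bounds how often data is processed at all.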
This talk will take a technical deep dive into its capabilities and what it took to implement, and will discuss future developments.
Session hashtag: #Dev4SAIS
Jose is a software engineer working on the Spark execution engine and Delta Lake. He holds a bachelor’s degree in computer science from Caltech.