
What is Spark Streaming?

How Spark Streaming processes micro-batches of real-time data with DStreams and why Structured Streaming is now the preferred engine


Summary

  • Understand what Apache Spark Streaming is, how it extends the core Spark API, and why it is now a legacy engine superseded by Structured Streaming
  • See how Spark Streaming ingests data from sources like Kafka, Flume and Amazon Kinesis, processes it in micro-batches and pushes results to files, databases or dashboards using DStreams
  • Explore the key benefits Spark Streaming introduced, such as unified batch and streaming processing, fault tolerance and integration with MLlib and Spark SQL

Apache Spark Streaming is the previous generation of Apache Spark’s streaming engine. It is a legacy project and no longer receives updates. Apache Spark now includes a newer, easier-to-use streaming engine called Structured Streaming, which you should use for your streaming applications and pipelines. See Structured Streaming.

What is Spark Streaming?

Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that lets data engineers and data scientists process real-time data from sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. Processed data can be pushed out to file systems, databases, and live dashboards.

Its key abstraction is a Discretized Stream, or DStream for short, which represents a stream of data divided into small batches. DStreams are built on RDDs, Spark’s core data abstraction, which allows Spark Streaming to integrate seamlessly with other Spark components like MLlib and Spark SQL.

Spark Streaming differs from systems that either have a processing engine designed only for streaming, or expose similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming give it unique benefits over traditional streaming systems.
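The micro-batch idea behind DStreams can be sketched in plain Python. This is a toy simulation, not the Spark API: the stream is discretized into small batches, and the same batch-style logic (here, a word count analogous to `flatMap` + `reduceByKey`) runs over each batch in turn.

```python
from collections import Counter

def micro_batches(records, batch_size):
    """Discretize a stream of records into fixed-size micro-batches."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def word_count(batch):
    """Per-batch transformation: split lines into words and count them."""
    words = (word for line in batch for word in line.split())
    return Counter(words)

# Hypothetical stream of text lines arriving over time.
stream = ["spark streaming", "spark sql", "streaming data", "spark mllib"]

# Each micro-batch is processed with ordinary batch logic.
results = [word_count(batch) for batch in micro_batches(stream, batch_size=2)]
```

Because every batch is handled by the same engine and code path as a batch job, the streaming program is just batch logic applied repeatedly, which is the core of Spark Streaming's unified model.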


Four Major Aspects of Spark Streaming

  • Fast recovery from failures and stragglers
  • Better load balancing and resource usage
  • Combining of streaming data with static datasets and interactive queries
  • Native integration with advanced processing libraries (SQL, machine learning, graph processing)
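The third aspect above can be illustrated with a small sketch. This is plain Python rather than the Spark API, and the dataset and field names are hypothetical: each micro-batch of streaming events is joined against a static lookup table, the kind of enrichment DStreams support via RDD joins.

```python
# Hypothetical static dataset loaded once, e.g. a user table.
static_users = {1: "alice", 2: "bob"}

def enrich(batch, lookup):
    """Join each streaming event with the static dataset by user_id."""
    return [
        {**event, "user": lookup.get(event["user_id"], "unknown")}
        for event in batch
    ]

# One micro-batch of incoming events.
batch = [
    {"user_id": 1, "action": "click"},
    {"user_id": 3, "action": "view"},
]

enriched = enrich(batch, static_users)
```

In Spark Streaming, the static side would be an RDD or table kept in memory, reused as each micro-batch arrives, so streaming data and data at rest are processed with one set of abstractions.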


This unification of disparate data processing capabilities is the key reason behind Spark Streaming’s rapid adoption. It makes it very easy for developers to use a single framework to satisfy all their processing needs.
