Data streaming is the continuous collection, processing and analysis of data as it is generated, allowing organizations to act on information in real time. Over the last several years, the need for real-time data has grown exponentially. Organizations are increasingly building applications and platforms that leverage data streams to deliver real-time analytics and machine learning to drive business growth. By continuously collecting, processing and analyzing data, leaders can gain immediate insights, enable faster decision-making and make more accurate predictions.
Companies may leverage real-time data streaming to monitor business transactions in operational systems, detect potential fraud and inform dynamic pricing models. Meanwhile, the proliferation of the Internet of Things (IoT) means that everyday devices and sensors transmit enormous quantities of raw data, and immediate access to those datasets can help troubleshoot potential issues or make location-specific recommendations.
To handle their data, organizations have traditionally relied on batch processing, which refers to the collection and processing of data in large chunks, or “batches,” at specified intervals. Today, companies may leverage batch processing when they require timely, but not real-time, data. This includes applications such as:
However, many organizations now need access to data as it’s collected. Streaming data helps organizations make timely decisions by ensuring data is processed quickly, accurately and in near real time. By processing data within seconds or milliseconds, streaming is an ideal solution for use cases such as:
While organizations may recognize their need for streaming data, it can be difficult to transition from batch to streaming data because of:

Streaming and real-time processing are closely related concepts, and they are often used interchangeably. However, they do have subtle but important distinctions.
“Streaming data” refers to the continuous flow of records generated as events occur — data in motion. It is a data pipeline approach where data is processed in small chunks or events as they are generated. “Real-time processing,” on the other hand, emphasizes the immediacy of analysis and response, aiming to deliver insights with minimal delay after data is received. In other words, a streaming data system ingests real-time data and processes it as it arrives.
It is important to note that, even within the scope of “real-time streaming,” there is a further distinction between “real time” and “near real time,” primarily with respect to latency.
Real-time processing: Real-time processing refers to systems that analyze and act on data with negligible delays, usually within milliseconds of data generation. These systems are designed for scenarios where immediate action is critical, such as:
Near real-time processing: This involves a slight delay, usually measured in seconds. This approach is suitable for situations where an instantaneous response is not necessary, but timely updates are still preferred, such as:
While stream processing can be the right choice for some organizations, it can be costly and resource-intensive to run. One way to gain the benefit of data streaming without continuous data processing is via incrementalization. This method processes only newly added, modified or changed data rather than a complete dataset.
One example of how incrementalization can be run is via materialized views in Databricks. A materialized view is a database object that stores the results of a query as a physical table. Unlike regular database views, which are virtual and derive their data from the underlying tables, materialized views contain precomputed data that is incrementally updated on a schedule or on demand. This precomputation of data allows for faster query response times and improved performance in certain scenarios.
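The core idea behind a materialized view — keeping a precomputed query result up to date by applying only new rows — can be illustrated outside of any database. The sketch below is a minimal pure-Python illustration of incremental view maintenance, not Databricks' implementation; the class and column names are hypothetical.

```python
from collections import defaultdict

class MaterializedSum:
    """A toy materialized view: precomputed per-key sums, refreshed incrementally."""

    def __init__(self):
        self.totals = defaultdict(float)  # the precomputed query result

    def refresh(self, new_rows):
        """Apply only newly added rows instead of recomputing over the full table."""
        for key, amount in new_rows:
            self.totals[key] += amount

    def query(self, key):
        """Reads hit the precomputed result, so they return quickly."""
        return self.totals[key]

view = MaterializedSum()
view.refresh([("store_a", 10.0), ("store_b", 5.0)])  # initial load
view.refresh([("store_a", 2.5)])                     # incremental refresh: 1 row, not 3
print(view.query("store_a"))  # 12.5
```

The second `refresh` touches only the single changed row, which is the efficiency gain incrementalization provides over recomputing the entire dataset.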
Materialized views are useful because they allow a pipeline to work on small increments of data rather than entire datasets. Overall, incrementalization of data within a pipeline can boost efficiency by:
This is especially ideal for large-scale pipelines, where processing updates can lead to faster analysis and decision-making.
As organizations implement real-time data streams, there are some important factors to consider within the data processing architecture. How you design your system involves important trade-offs and depends on your organization’s workload demands and desired business outcomes. Some features to consider include:
Not all streaming architectures are created equal, and it is important to find the right balance to meet the demands of your workload as well as your budget. Think of it as accessing your data at the right time — when you need it — instead of in real time.
Apache Spark™ Structured Streaming is the core technology that unlocks data streaming on the Databricks Data Intelligence Platform, providing a unified API for batch and stream processing. Spark is an open source processing engine, and Structured Streaming divides continuous data streams into small, manageable batches for processing. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs and run them in a streaming fashion. This can reduce latency and allow for incremental processing, with latencies as low as 250ms.
Structured Streaming: Data is treated as an infinite table and processed incrementally. Spark collects incoming data over a short time interval, forms a batch and then processes it like traditional batch jobs. This approach combines the simplicity of batch processing with near real-time capabilities, and features checkpoints that enable fault tolerance and failure recovery.
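The micro-batch-plus-checkpoint pattern described above can be sketched without Spark itself. The following is a simplified pure-Python illustration, assuming a hypothetical `process_stream` function whose "batch job" is just a sum; real Structured Streaming checkpoints offsets to durable storage rather than a local JSON file.

```python
import json
import os
import tempfile

def process_stream(events, batch_size, checkpoint_path):
    """Group an incoming event stream into small batches, process each like a
    traditional batch job, and checkpoint progress so a restart can resume."""
    # Fault tolerance: recover the last committed offset, if any.
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]

    batch, results = [], []
    for i, event in enumerate(events):
        if i < offset:  # already processed before the restart
            continue
        batch.append(event)
        if len(batch) == batch_size:
            results.append(sum(batch))  # the "batch job" for this micro-batch
            with open(checkpoint_path, "w") as f:
                json.dump({"offset": i + 1}, f)  # commit progress
            batch = []
    # A leftover partial batch simply waits for the next trigger in this sketch.
    return results

ckpt = os.path.join(tempfile.mkdtemp(), "offsets.json")
print(process_stream(range(1, 9), 4, ckpt))  # [10, 26]
print(process_stream(range(1, 9), 4, ckpt))  # [] -- resumes past the committed offset
```

The second call returns nothing because the checkpoint records that all eight events were already committed, which is the failure-recovery behavior checkpoints enable.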
Incremental pipeline processing: Spark’s approach to the data pipeline is designed to efficiently use resources. The pipeline begins with the ingestion of raw data, which is then filtered, aggregated or mapped on its way to the data sink. Each stage processes data incrementally as it moves through the pipeline, screening for anomalies or errors before the data is stored in a database.
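The staged, record-at-a-time flow described above — ingest, filter out anomalies, map, then write to the sink — can be modeled with Python generators. This is a conceptual sketch, not Spark code; the stage names and the Fahrenheit-to-Celsius transform are illustrative assumptions.

```python
def ingest(raw_records):
    """Source stage: yields records one at a time as they arrive."""
    yield from raw_records

def filter_errors(records):
    """Filter stage: drop anomalous records before they reach the sink."""
    for r in records:
        if r.get("value") is not None and r["value"] >= 0:
            yield r

def map_to_celsius(records):
    """Map stage: transform each record on its way to the sink."""
    for r in records:
        yield {**r, "value": (r["value"] - 32) * 5 / 9}

def run_pipeline(raw_records, sink):
    """Wire the stages together; records flow through lazily, one by one."""
    for record in map_to_celsius(filter_errors(ingest(raw_records))):
        sink.append(record)  # stand-in for the data sink (e.g., a database table)

sink = []
run_pipeline(
    [{"sensor": "s1", "value": 212.0},
     {"sensor": "s2", "value": None},  # anomaly: filtered out mid-pipeline
     {"sensor": "s3", "value": 32.0}],
    sink,
)
# sink now holds s1 at 100.0 and s3 at 0.0; the anomalous record never reaches it
```

Because each stage is a generator, no stage waits for the whole dataset: a record moves through every step individually, which is the incremental behavior the pipeline relies on.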
For workloads that demand high responsiveness, Spark features a Continuous Processing mode that offers real-time capabilities by processing each record individually as it arrives. You can learn more about managing streaming data on Databricks here.
Streaming ETL (extract, transform, load) helps organizations process and analyze data in real time or near real time to meet the demands of data-driven applications and workflows. ETL has traditionally been run in batches; streaming ETL, by contrast, ingests data as it is generated to ensure data is ready for analysis almost immediately.
Streaming ETL minimizes latency by processing data incrementally, allowing for continuous updates rather than waiting for a full batch to accumulate. It also reduces the risks associated with data that is out of date or irrelevant, ensuring decisions are based on the latest available information.
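The three ETL steps applied per record rather than per batch can be sketched as follows. This is a minimal pure-Python illustration under assumed field names (`user`, `amount`), not a Databricks API; a real pipeline would load into a governed table rather than a list.

```python
import json
from datetime import datetime, timezone

def extract(raw_line):
    """Extract: parse a raw event as soon as it arrives."""
    return json.loads(raw_line)

def transform(event):
    """Transform: clean and normalize the single event, not a whole batch."""
    return {
        "user": event["user"].strip().lower(),
        "amount": round(float(event["amount"]), 2),
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }

def streaming_etl(raw_lines, table):
    """Load each record the moment it is transformed, so downstream queries
    always see the latest available information instead of waiting on a batch."""
    for line in raw_lines:
        table.append(transform(extract(line)))

table = []
streaming_etl(['{"user": " Alice ", "amount": "19.999"}'], table)
print(table[0]["user"], table[0]["amount"])  # alice 20.0
```

Each event is queryable immediately after it is loaded, which is the latency advantage streaming ETL has over interval-based batch ETL.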
Any ETL tool must be able to scale as a business grows. Databricks launched DLT (Delta Live Tables) as the first ETL framework that uses a simple declarative approach to building reliable data pipelines. Your teams can use languages and tools they already know, such as SQL and Python, to build and run your batch and streaming data pipelines in one place with controllable and automated refresh settings. This not only saves time but also reduces operational complexity. No matter where you plan to send your data, building streaming data pipelines on the Databricks Data Intelligence Platform ensures you don’t lose time between raw and cleaned data.
As we’ve seen, data streaming offers continuous processing of data at low latency and the ability to transmit real-time analytics as events occur. Access to real-time (or near real-time) raw data can be critical for business operations, as it gives decision-makers access to the latest and most relevant data. Some of the advantages of streaming analytics include:
As artificial intelligence (AI) and ML models develop and mature, traditional batch processing can struggle to keep pace with the size and diversity of data these applications require. Delays in data transmission can lead to inaccurate responses and an uptick in application inefficiency.
Streaming data provides a continuous flow of real-time information based on the most current available data, ensuring AI/ML models adapt and make predictions as events happen. There are two ways streaming data helps prepare AI models:
Organizations across sectors leverage the insights of AI built on streaming datasets. Health and wellness retailers leverage real-time reporting on customer data to help pharmacists provide personalized recommendations and advice. Telecommunications companies can use real-time machine learning to detect fraudulent activity like illegal device unlocks and identity theft. Meanwhile, retailers can leverage streaming data to automate real-time pricing based on inventory and market factors.
While streaming data is crucial for these models, it’s important to note that integrating AI/ML with data streaming presents a unique set of challenges. Some of these challenges include:
Databricks is addressing these problems through Mosaic AI, which provides customers with unified tooling to build, deploy, evaluate and govern AI and ML solutions. Users receive accurate outputs customized with enterprise data and can train and serve their own custom large language models (LLMs) at 10x lower cost.
Deploying data streaming within your organization can require a good deal of effort. Databricks makes it easier by simplifying data streaming. The Databricks Data Intelligence Platform delivers real-time analytics, machine learning and applications — all on one platform. By building streaming applications on Databricks, you can:
Databricks is helping customers move beyond the traditional bifurcation of batch versus streaming data with the Data Intelligence Platform. By integrating real-time analytics, machine learning (ML) and applications on one platform, organizations benefit from simplified data processing in a singular platform that handles both batch and streaming data.
With the Databricks Data Intelligence Platform, users can:
Additionally, with the help of DLT, customers receive automated tooling to simplify data ingestion and ETL, preparing datasets for deployment across real-time analytics, ML and operational applications.
Spark Structured Streaming lies at the heart of Databricks’ real-time capabilities. Widely adopted by hundreds of thousands of individuals and organizations, it provides a single, unified API for batch and stream processing, making it easy for data engineers and developers to build real-time applications without changing code or learning new skills.
Across the world, organizations have leveraged data streaming on the Databricks Data Intelligence Platform to optimize their operational systems, manage digital payment networks, explore new innovations in renewable energy and help protect consumers from fraud.
Databricks offers all of these tightly integrated capabilities to support your real-time use cases on one platform.
