Data pipeline architecture is the end-to-end design of how data is collected, processed, stored and delivered from source systems to the people, applications and models that use it. The word “architecture” refers to the blueprint, not the pipeline itself. It covers the choices about how data flows, where it gets transformed and which tools handle each step along the way.
Good architecture is matched to the use case rather than picked off a shelf. A data pipeline built for real-time fraud detection looks very different from one that produces a nightly sales report, even though both move data from source to destination. This glossary page covers the core layers every pipeline shares, the common stage models, the major architectural patterns and the best practices that keep pipelines reliable as they scale.
A data pipeline moves data through a series of stages, and each stage has a specific job: gather the data, clean it up, store it and make it usable. Architecture is the plan for how those stages connect. It defines what happens to the data at each step, in what order and under what rules.
Architecture decisions sit at two levels. The logical design defines which stages exist and what each one does: this is “the what.” The physical design defines which specific tools and infrastructure run each stage: this is “the how.” Orchestration (the automatic scheduling and coordination of each step) and monitoring don’t belong to any single stage. They run across the whole pipeline. Modern platforms have also collapsed an old divide. With Lakeflow, Databricks unifies batch and streaming pipelines on a single foundation, so teams don’t have to build and maintain two parallel systems.
Regardless of the pattern a team chooses, every data pipeline is built on the same four layers. Each layer answers a different question about the data: how it gets in, how it becomes useful, where it lives and who consumes it.
Ingestion pulls data into the pipeline from source systems: databases, applications, APIs, files in cloud storage, event streams and sensors. Data ingestion comes in two flavors. Batch ingestion pulls data on a schedule, such as every hour or every night. Streaming ingestion captures data continuously as events happen. Many pipelines also use change data capture (CDC), a method that tracks row-level changes in a source database so the pipeline moves only what’s new or updated instead of reloading everything.
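A minimal sketch of the CDC idea, under stated assumptions: the event shape (`op`, `key`, `row`) and the `apply_changes` helper are hypothetical and not the format of any particular CDC tool. It shows how a pipeline might apply only the changed rows to a keyed copy of a source table instead of reloading everything.

```python
from typing import Any

# Hypothetical change-event shape (an assumption for illustration):
# {"op": "insert" | "update" | "delete", "key": <primary key>, "row": <row dict or None>}
def apply_changes(target: dict[Any, dict], events: list[dict]) -> dict[Any, dict]:
    """Apply row-level change events, in order, to a keyed copy of the source table."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            target[key] = event["row"]   # upsert the latest image of the row
        elif op == "delete":
            target.pop(key, None)        # tolerate deletes for rows never seen
        else:
            raise ValueError(f"unknown op: {op!r}")
    return target

target = {1: {"id": 1, "status": "new"}}
events = [
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 1, "row": None},
]
apply_changes(target, events)
```

Only three events cross the wire here, regardless of how many unchanged rows the source table holds.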
The processing layer is where raw data gets cleaned, reshaped, enriched and prepared for use. Typical work includes fixing missing values, standardizing formats, joining datasets and applying business logic, the same tasks at the heart of ETL. Processing follows the same split as ingestion. Batch processing works on large chunks of data together, while stream processing handles records one at a time or in tiny micro-batches as they arrive.
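As an illustration, the sketch below applies the same cleaning and enrichment logic in both modes. The record fields, the `customers` lookup and the function names are all hypothetical, chosen only to make the example concrete.

```python
from datetime import datetime
from typing import Iterable, Iterator

def clean_record(raw: dict) -> dict:
    """Fill missing values and standardize formats for one hypothetical order record."""
    return {
        "order_id": int(raw["order_id"]),
        "customer_id": raw["customer_id"],
        "country": (raw.get("country") or "UNKNOWN").strip().upper(),
        "amount": round(float(raw.get("amount") or 0.0), 2),
        # Standardize US-style dates to ISO 8601.
        "order_date": datetime.strptime(raw["order_date"], "%m/%d/%Y").date().isoformat(),
    }

def enrich(record: dict, customers: dict) -> dict:
    """Join the record against a customer lookup to add a segment."""
    segment = customers.get(record["customer_id"], {}).get("segment", "unknown")
    return {**record, "segment": segment}

def process_batch(rows: list[dict], customers: dict) -> list[dict]:
    """Batch: transform the whole chunk together."""
    return [enrich(clean_record(r), customers) for r in rows]

def process_stream(rows: Iterable[dict], customers: dict) -> Iterator[dict]:
    """Streaming: transform each record as it arrives."""
    for r in rows:
        yield enrich(clean_record(r), customers)

customers = {"c1": {"segment": "enterprise"}}
raw_rows = [
    {"order_id": "7", "customer_id": "c1", "country": " us ",
     "amount": None, "order_date": "03/05/2024"},
]
batch_result = process_batch(raw_rows, customers)
stream_result = list(process_stream(iter(raw_rows), customers))
```

Keeping the transformation in one pair of functions means the batch and streaming paths cannot drift apart in business logic.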
Storage is where processed data lands so it can be queried, analyzed or fed to models. The destination is typically a data lake, a data warehouse or a lakehouse (a single system that combines the strengths of both). Format matters as much as location. Open formats like Delta Lake and Apache Iceberg let multiple tools read the same data without copying it from system to system. Delta Lake also adds reliability features such as ACID transactions (a guarantee that writes either fully succeed or fully fail, preventing corruption) and time travel (the ability to query older versions of a table).
The final layer delivers prepared data to the people and systems that need it: analysts running SQL queries, business users working in dashboards, data scientists training models and applications calling APIs. Destinations range from BI tools to ML platforms to operational systems, with a data warehouse often sitting at the center of analytics workloads. Across all four layers, orchestration and observability do the connective work: scheduling jobs, tracking data quality and raising alerts when something breaks.
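The orchestration role can be sketched with Python's standard-library `graphlib`. The task names, the dependency graph and the `run_pipeline` helper are assumptions for illustration, not the behavior of any specific orchestrator: tasks run in dependency order, and a failure raises an alert and halts the run.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "process": {"ingest"},
    "store": {"process"},
    "serve": {"store"},
    "quality_check": {"store"},
}

def run_pipeline(tasks: dict[str, Callable[[], None]],
                 dependencies: dict[str, set]) -> tuple[list[str], list[str]]:
    """Run tasks in dependency order; record an alert and stop on the first failure."""
    completed, alerts = [], []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            alerts.append(f"{name} failed: {exc}")
            break  # simplification: halt the whole run rather than skip only downstream tasks
        completed.append(name)
    return completed, alerts

def broken_process() -> None:
    raise RuntimeError("schema mismatch")

healthy = {name: (lambda: None) for name in dependencies}
ok_completed, ok_alerts = run_pipeline(healthy, dependencies)

failing = {**healthy, "process": broken_process}
bad_completed, bad_alerts = run_pipeline(failing, dependencies)
```

In the failing run, nothing after `process` executes, so storage and serving never see partially processed data.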
Different sources describe data pipelines as having three, four or five stages, which causes plenty of confusion. The reality is simpler. All three models describe the same underlying work at different levels of detail.
| Model | Stages | When you'll see it used |
|---|---|---|
| 3-stage | Sources → Processing → Destination | High-level explanations, executive overviews, intro-level content |
| 4-stage | Ingestion → Processing → Storage → Serving | Most common in modern data engineering. Balances clarity and detail |
| 5-stage | Collection → Ingestion → Processing → Storage → Analysis | Detailed technical breakdowns. Splits “getting data” into collection (from the source) and ingestion (into the pipeline) |
The number of stages is a labeling choice. The work the pipeline performs is the same.
Architectural patterns are the established designs teams choose from when building pipelines. The right one depends on latency requirements, data volume and how the data will be used downstream.
Batch architecture processes data in scheduled chunks: every hour, every night or every week. It fits reporting, historical analysis, ML training data and any use case where minutes or hours of delay are acceptable. Batch pipelines are simpler to build, cheaper to run and easier to debug than their streaming counterparts. The trade-off is freshness. When decisions depend on what happened seconds ago, batch can’t keep up.
Streaming architecture processes data continuously, record by record, as it’s generated. It serves use cases where sub-minute response matters: fraud detection, real-time personalization and IoT monitoring. The trade-off is cost. Streaming pipelines typically cost more to run and operate than batch pipelines because they require always-on infrastructure.
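To make the batch versus streaming trade-off concrete, here is a rough sketch with hypothetical names and a made-up threshold. The same per-account total is computed both ways: the batch function answers only after the whole chunk is read, while the streaming function can raise an alert on the event that crosses the limit.

```python
from collections import defaultdict
from typing import Iterable, Iterator

THRESHOLD = 1000.0  # hypothetical alert limit, chosen for illustration

def batch_totals(events: list[dict]) -> dict[str, float]:
    """Batch: read the whole chunk, then report totals once."""
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e["account"]] += e["amount"]
    return dict(totals)

def streaming_alerts(events: Iterable[dict]) -> Iterator[tuple[str, float]]:
    """Streaming: update state per event and alert as soon as the limit is crossed."""
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e["account"]] += e["amount"]
        if totals[e["account"]] > THRESHOLD:
            yield e["account"], totals[e["account"]]

events = [
    {"account": "a", "amount": 600.0},
    {"account": "b", "amount": 50.0},
    {"account": "a", "amount": 500.0},
]
totals = batch_totals(events)
alerts = list(streaming_alerts(iter(events)))
```

Both paths agree on the numbers. What differs is when the answer becomes available, which is the freshness-for-cost trade described above.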
Lambda architecture runs two parallel paths. A batch path delivers accurate historical data, a streaming path delivers fast, fresh data and a serving layer merges the results. The design works, but it carries a well-known downside. Maintaining two pipelines means duplicate code, duplicate logic and double the operational burden.
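A toy sketch of the Lambda shape, assuming a hypothetical page-view event with an integer timestamp and a fixed batch cutoff. Note that the counting logic appears twice, once per path, which is exactly the duplication the pattern is known for.

```python
from collections import Counter

# Hypothetical page-view events; "ts" is a simple integer timestamp.
events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "pricing"},
    {"ts": 3, "page": "home"},
    {"ts": 4, "page": "home"},
]
BATCH_CUTOFF = 2  # the last timestamp covered by the most recent batch run (assumed)

def batch_view(events: list[dict], cutoff: int) -> Counter:
    """Batch path: accurate counts over everything up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)

def speed_view(events: list[dict], cutoff: int) -> Counter:
    """Streaming path: fresh counts for events newer than the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)

def serve(batch: Counter, speed: Counter) -> Counter:
    """Serving layer: merge both views into one answer."""
    return batch + speed

merged = serve(batch_view(events, BATCH_CUTOFF), speed_view(events, BATCH_CUTOFF))
```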
Kappa architecture simplifies Lambda by using a single streaming pipeline for everything. When historical analysis is needed, the stream is replayed from the beginning. Kappa suits teams that want streaming-grade freshness without the cost of maintaining two parallel systems.
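The replay idea can be sketched as follows, assuming an in-memory list stands in for a durable, replayable event log. The `replay` helper and both processing functions are hypothetical. There is one code path: historical results come from running it again from offset zero, possibly with new logic.

```python
from typing import Callable

# Append-only event log; in practice this would be a durable, replayable stream.
log = [
    {"user": "ana", "amount": 20.0},
    {"user": "ben", "amount": 5.0},
    {"user": "ana", "amount": 7.5},
]

def replay(log: list[dict], process: Callable[[dict, dict], None],
           start_offset: int = 0) -> dict:
    """Rebuild state by running one processing function over the log from an offset."""
    state: dict = {}
    for event in log[start_offset:]:
        process(state, event)
    return state

def count_orders(state: dict, event: dict) -> None:
    """Original logic: number of orders per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1

def sum_spend(state: dict, event: dict) -> None:
    """Revised logic: total spend per user, derived by replaying the same log."""
    state[event["user"]] = state.get(event["user"], 0.0) + event["amount"]

order_counts = replay(log, count_orders)
spend = replay(log, sum_spend)
```

The approach depends on the log retaining enough history to replay, which is a storage and retention decision rather than a code one.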
Medallion architecture is a popular pattern on lakehouse platforms that organizes data into three quality tiers: Bronze (raw, as ingested), Silver (cleaned and conformed) and Gold (curated, business-ready). As Databricks documentation puts it, “the medallion architecture uses three layers: bronze, silver, and gold, each serving a distinct purpose in the pipeline.” Each tier can run as its own pipeline, which makes scheduling, monitoring and troubleshooting easier because problems stay isolated to a single layer.
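A small sketch of the three tiers using plain Python lists in place of tables. The order fields and the `to_silver` and `to_gold` functions are hypothetical; on a real lakehouse each tier would be a table written by its own pipeline step.

```python
from collections import defaultdict

# Bronze: raw rows exactly as ingested, including a duplicate and a missing amount.
bronze = [
    {"order_id": 1, "country": "us ", "amount": "10.5"},
    {"order_id": 1, "country": "us ", "amount": "10.5"},
    {"order_id": 2, "country": "DE", "amount": None},
    {"order_id": 3, "country": "de", "amount": "4"},
]

def to_silver(bronze_rows: list[dict]) -> list[dict]:
    """Silver: drop incomplete rows, deduplicate by order_id and conform types."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row.get("order_id") is None or row.get("amount") is None:
            continue
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": row["order_id"],
            "country": row["country"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return silver

def to_gold(silver_rows: list[dict]) -> dict[str, float]:
    """Gold: business-ready aggregate, here revenue by country."""
    revenue: dict[str, float] = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is left untouched, a bug in the Silver rules can be fixed and the downstream tiers rebuilt without going back to the source systems.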
ETL and ELT differ in when data gets transformed. ETL (extract, transform, load) transforms data before loading it into storage. ELT (extract, load, transform) loads raw data first and transforms it inside the destination. Modern cloud platforms such as Databricks, Snowflake and BigQuery have made ELT the dominant pattern because cloud storage and compute are now cheap and elastic enough to transform data in place. For a deeper comparison, see ETL vs. ELT.
| | ETL | ELT |
|---|---|---|
| Order | Extract → Transform → Load | Extract → Load → Transform |
| Where transformation happens | In a separate processing tool, before storage | Inside the destination (lakehouse or warehouse) |
| Typical use case | Legacy on-prem warehouses, strict pre-load validation | Modern cloud lakehouses and warehouses |
| Strengths | Cleaner data lands in storage. Predictable schemas | Flexible, scalable, keeps raw data available for reprocessing |
| Trade-offs | Less flexible. Harder to reuse raw data later | Requires capable compute at the destination |
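The ordering difference can be shown with Python's built-in `sqlite3` module standing in for the destination. The table names and sample rows are hypothetical. In the ETL function the cleaning happens in application code before the load; in the ELT function raw rows are loaded first and the destination's own SQL engine does the transformation.

```python
import sqlite3

# Raw rows as they might arrive from a source: everything is text, formats are messy.
raw_rows = [("1", " us ", "10.50"), ("2", "DE", "4.00")]

def etl(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """ETL: transform outside the destination, then load only the cleaned result."""
    cleaned = [(int(i), c.strip().upper(), float(a)) for i, c, a in rows]
    conn.execute("CREATE TABLE orders_etl (id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

def elt(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """ELT: load raw data first, then transform inside the destination with SQL."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, country TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.execute("""
        CREATE TABLE orders_elt AS
        SELECT CAST(id AS INTEGER)   AS id,
               UPPER(TRIM(country))  AS country,
               CAST(amount AS REAL)  AS amount
        FROM raw_orders
    """)

conn = sqlite3.connect(":memory:")
etl(conn, raw_rows)
elt(conn, raw_rows)
etl_result = conn.execute("SELECT * FROM orders_etl ORDER BY id").fetchall()
elt_result = conn.execute("SELECT * FROM orders_elt ORDER BY id").fetchall()
raw_kept = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths produce the same cleaned table, but only the ELT path leaves `raw_orders` in the destination for later reprocessing.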
A data pipeline and ETL are not the same thing: ETL is one type of data pipeline, but not every data pipeline is ETL. A data pipeline is the broad category: any system that moves data from one place to another. ETL is a specific approach within that category, defined by transforming data before it lands in storage. Pipelines can also be ELT, streaming, replication-only (moving data with no transformation at all) or reverse ETL (sending warehouse data back into operational systems).
These 10 design principles separate pipelines that scale from pipelines that break.
Pipeline architecture determines whether teams can trust their data, whether decisions rest on fresh information and whether AI and ML projects make it from prototype to production. It’s the difference between a data platform that compounds in value and one that generates support tickets.
Brittle architecture creates real costs: stale dashboards, conflicting metrics, failed ML deployments and engineers who spend more time firefighting than building. The modern lakehouse approach addresses the root cause. By unifying batch and streaming, analytics and AI, and governance on a single platform like the Databricks Platform, teams remove the fragile handoffs between systems that make traditional architectures break.
Databricks delivers every layer of pipeline architecture in one platform. Lakeflow Connect handles ingestion from databases, SaaS applications, file sources and event streams. Lakeflow Spark Declarative Pipelines builds batch and streaming ETL pipelines with data quality checks built in, and Lakeflow Jobs orchestrates and schedules pipeline runs across the platform. Underneath, Delta Lake provides the open storage format along with reliability features like ACID transactions and time travel, while Unity Catalog applies governance, lineage and access control across every stage.
Because batch and streaming pipelines run on the same engine and write to the same storage, teams don’t need to maintain Lambda-style parallel systems. One pipeline definition can serve both the nightly report and the real-time dashboard.
Put simply, data pipeline architecture is the plan for how data gets from where it’s created to where it’s useful. The plan covers how data is collected, how it’s cleaned and prepared, where it’s stored and how it’s delivered to the people and applications that need it.
Lambda runs two parallel pipelines, one batch and one streaming, and merges their results in a serving layer. Kappa uses a single streaming pipeline for everything and replays the stream when historical analysis is needed. Kappa is simpler to operate, while Lambda persists in environments where batch and streaming paths evolved separately.
Use streaming when the value of data drops within seconds or minutes, as in fraud detection, live personalization or equipment monitoring. Use batch for everything else, including reporting, historical analysis and ML training data. Batch is simpler and cheaper, so it’s the sensible default until a use case proves it needs real-time data.
Logical architecture defines the stages of a pipeline and what each one does, independent of any tool. Physical architecture maps those stages onto specific technologies and infrastructure. Teams usually settle the logical design first, then choose the platforms that implement it.
Data pipeline architecture is the design behind how data moves and becomes useful. The right architecture is the one that balances freshness, cost and reliability for the specific job at hand, whether that’s a nightly sales report or a fraud check that runs in milliseconds.