What is data pipeline architecture?

by Databricks Staff

  • A well-designed data pipeline architecture separates ingestion, transformation, storage, and serving into distinct layers, with the choice of pattern (batch, streaming, medallion, Kappa, etc.) driven by your latency and cost requirements, not convention.
  • ELT has largely replaced ETL as the dominant approach because modern cloud platforms make it practical to load raw data first and transform it in place, preserving flexibility for reprocessing and downstream reuse.
  • Databricks unifies batch and streaming pipelines on a single platform (Lakeflow + Delta Lake + Unity Catalog), eliminating the duplicate infrastructure and governance gaps that make traditional Lambda-style architectures brittle.

Data pipeline architecture is the end-to-end design of how data is collected, processed, stored and delivered from source systems to the people, applications and models that use it. The word “architecture” refers to the blueprint, not the pipeline itself. It covers the choices about how data flows, where it gets transformed and which tools handle each step along the way.

Good architecture is matched to the use case rather than picked off a shelf. A data pipeline built for real-time fraud detection looks very different from one that produces a nightly sales report, even though both move data from source to destination. This glossary page covers the core layers every pipeline shares, the common stage models, the major architectural patterns and the best practices that keep pipelines reliable as they scale.

How does data pipeline architecture work?

A data pipeline moves data through a series of stages, and each stage has a specific job: gather the data, clean it up, store it and make it usable. Architecture is the plan for how those stages connect. It defines what happens to the data at each step, in what order and under what rules.

Architecture decisions sit at two levels. The logical design defines which stages exist and what each one does: this is “the what.” The physical design defines which specific tools and infrastructure run each stage: this is “the how.” Orchestration (the automatic scheduling and coordination of each step) and monitoring don’t belong to any single stage. They run across the whole pipeline. Modern platforms have also collapsed an old divide. With Lakeflow, Databricks unifies batch and streaming pipelines on a single foundation, so teams don’t have to build and maintain two parallel systems.

The core layers of a data pipeline

Regardless of the pattern a team chooses, every data pipeline is built on the same four layers. Each layer answers a different question about the data: how it gets in, how it becomes useful, where it lives and who consumes it.

Ingestion

Ingestion pulls data into the pipeline from source systems: databases, applications, APIs, files in cloud storage, event streams and sensors. Data ingestion comes in two flavors. Batch ingestion pulls data on a schedule, such as every hour or every night. Streaming ingestion captures data continuously as events happen. Many pipelines also use change data capture (CDC), a method that tracks row-level changes in a source database so the pipeline moves only what’s new or updated instead of reloading everything.
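
As a rough illustration of change data capture, the sketch below applies a list of row-level change events to an in-memory copy of a table. The event shape (`op`, `key`, `row`) and the `apply_changes` helper are hypothetical and chosen for this example only; real CDC tools define their own formats.

```python
from typing import Any

# Hypothetical change-event shape for this sketch:
# {"op": "insert" | "update" | "delete", "key": ..., "row": {...}}
def apply_changes(target: dict[Any, dict], events: list[dict]) -> dict[Any, dict]:
    """Apply row-level change events to a keyed copy of the target table."""
    result = dict(target)  # work on a copy so the input is left untouched
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: the latest version of the row wins.
            result[key] = event["row"]
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result

customers = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
changes = [
    {"op": "update", "key": 2, "row": {"name": "Grace H."}},
    {"op": "insert", "key": 3, "row": {"name": "Edsger"}},
    {"op": "delete", "key": 1},
]
synced = apply_changes(customers, changes)
```

Only the three changed rows are processed here. The unchanged parts of the table are never re-read from the source, which is the point of CDC.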

Processing and transformation

This layer is where raw data gets cleaned, reshaped, enriched and prepared for use. Typical work includes fixing missing values, standardizing formats, joining datasets and applying business logic, the same tasks at the heart of ETL. Processing follows the same split as ingestion. Batch processing works on large chunks of data together, while stream processing handles records one at a time or in tiny micro-batches as they arrive.
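
A minimal sketch of this kind of work might look like the following. The field names, the day/month/year input format and the region lookup are assumptions made for illustration.

```python
from datetime import datetime

def clean_order(raw: dict) -> dict:
    """Standardize one raw order record (hypothetical field names)."""
    return {
        "order_id": int(raw["order_id"]),
        # Fix missing values: a missing or empty country becomes "UNKNOWN".
        "country": (raw.get("country") or "UNKNOWN").strip().upper(),
        # Standardize formats: parse day/month/year into an ISO 8601 date.
        "order_date": datetime.strptime(raw["order_date"], "%d/%m/%Y").date().isoformat(),
        "amount": round(float(raw["amount"]), 2),
    }

def enrich(order: dict, regions: dict[str, str]) -> dict:
    """Join against a small lookup table to add a region column."""
    return {**order, "region": regions.get(order["country"], "OTHER")}

regions = {"DE": "EMEA", "US": "AMER"}
raw_orders = [
    {"order_id": "1", "country": " de ", "order_date": "03/02/2024", "amount": "19.999"},
    {"order_id": "2", "country": None, "order_date": "04/02/2024", "amount": "5"},
]
prepared = [enrich(clean_order(r), regions) for r in raw_orders]
```

The same two functions could run over a nightly file or over records as they arrive. The batch versus streaming split is about when the logic runs, not what it does.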

Storage

Storage is where processed data lands so it can be queried, analyzed or fed to models. The destination is typically a data lake, a data warehouse or a lakehouse, a single system that combines the strengths of both. Format matters as much as location. Open formats like Delta Lake and Apache Iceberg let multiple tools read the same data without copying it from system to system. Delta Lake also adds reliability features such as ACID transactions (a guarantee that writes either fully succeed or fully fail, preventing corruption) and time travel (the ability to query older versions of a table).
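
To make all-or-nothing writes and time travel concrete, here is a toy in-memory table. It is not Delta Lake's API or its implementation; the `VersionedTable` class and its methods are invented for this sketch and only mimic the two behaviors described above.

```python
from typing import Optional

class VersionedTable:
    """Toy table: every successful commit creates a new immutable version."""

    def __init__(self) -> None:
        # Version 0 is the empty table.
        self._versions: list[tuple] = [()]

    def commit(self, rows: list[dict]) -> int:
        # Validate the whole batch before publishing anything, so one bad
        # row aborts the write and leaves the current version untouched.
        for row in rows:
            if "id" not in row:
                raise ValueError("row is missing required field 'id'")
        self._versions.append(self._versions[-1] + tuple(rows))
        return len(self._versions) - 1

    def read(self, version: Optional[int] = None) -> list[dict]:
        # Reading without a version returns the latest snapshot.
        snapshot = self._versions[-1 if version is None else version]
        return list(snapshot)

table = VersionedTable()
v1 = table.commit([{"id": 1}])
try:
    table.commit([{"id": 2}, {"name": "no id"}])  # fails as a whole
except ValueError:
    pass
v2 = table.commit([{"id": 3}])
```

After the failed write, row `{"id": 2}` is nowhere in the table, and `table.read(v1)` still returns the table as it stood after the first commit.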

Serving and consumption

The final layer delivers prepared data to the people and systems that need it: analysts running SQL queries, business users working in dashboards, data scientists training models and applications calling APIs. Destinations range from BI tools to ML platforms to operational systems, with a data warehouse often sitting at the center of analytics workloads. Across all four layers, orchestration and observability do the connective work: scheduling jobs, tracking data quality and raising alerts when something breaks.

How many stages are in a data pipeline? (3 vs. 4 vs. 5)

Different sources describe data pipelines as having three, four or five stages, which causes plenty of confusion. The reality is simpler. All three models describe the same underlying work at different levels of detail.

| Model | Stages | When you'll see it used |
| --- | --- | --- |
| 3-stage | Sources → Processing → Destination | High-level explanations, executive overviews, intro-level content |
| 4-stage | Ingestion → Processing → Storage → Serving | Most common in modern data engineering. Balances clarity and detail |
| 5-stage | Collection → Ingestion → Processing → Storage → Analysis | Detailed technical breakdowns. Splits “getting data” into collection (from the source) and ingestion (into the pipeline) |

The number of stages is a labeling choice. The work the pipeline performs is the same.

Common data pipeline architecture patterns

Architectural patterns are the established designs teams choose from when building pipelines. The right one depends on latency requirements, data volume and how the data will be used downstream.

Batch architecture

Batch architecture processes data in scheduled chunks: every hour, every night or every week. It fits reporting, historical analysis, ML training data and any use case where minutes or hours of delay are acceptable. Batch pipelines are simpler to build, cheaper to run and easier to debug than their streaming counterparts. The trade-off is freshness. When decisions depend on what happened seconds ago, batch can’t keep up.

Streaming architecture

Streaming architecture processes data continuously, record by record, as it’s generated. It serves use cases where sub-minute response matters: fraud detection, real-time personalization and IoT monitoring. The trade-off is cost. Streaming pipelines typically cost more to run and operate than batch pipelines because they require always-on infrastructure.
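
Many stream processors approximate record-by-record handling with small micro-batches. The sketch below is one way to picture that, assuming a hypothetical `micro_batches` helper that groups an event iterator by count. Real engines usually trigger on time as well as size.

```python
from itertools import islice
from typing import Iterable, Iterator

def micro_batches(events: Iterable[int], size: int) -> Iterator[list[int]]:
    """Yield successive small batches from a possibly unbounded stream."""
    it = iter(events)
    # islice pulls at most `size` items; an empty list means the stream ended.
    while batch := list(islice(it, size)):
        yield batch

running_total = 0
totals = []
for batch in micro_batches(range(1, 8), size=3):
    # State is updated incrementally, so a fresh answer is available
    # after every micro-batch instead of once per scheduled run.
    running_total += sum(batch)
    totals.append(running_total)
```

A batch job over the same seven events would produce only the final total. The streaming version exposes an up-to-date total after each small batch, at the price of a process that has to stay running.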

Lambda architecture

Lambda architecture runs two parallel paths. A batch path delivers accurate historical data, a streaming path delivers fast, fresh data and a serving layer merges the results. The design works, but it carries a well-known downside. Maintaining two pipelines means duplicate code, duplicate logic and double the operational burden.
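
The serving-layer merge can be sketched as follows, assuming the batch path has produced page-view counts up to its last run and the streaming path holds counts for events since then. The `serve` function and the sample views are hypothetical.

```python
def serve(batch_view: dict[str, int], speed_view: dict[str, int]) -> dict[str, int]:
    """Combine precomputed batch counts with counts for more recent events."""
    merged = dict(batch_view)
    for key, recent in speed_view.items():
        merged[key] = merged.get(key, 0) + recent
    return merged

batch_view = {"page_a": 100, "page_b": 40}   # accurate, but hours old
speed_view = {"page_a": 3, "page_c": 1}      # fresh, covers only recent events
result = serve(batch_view, speed_view)
```

The merge itself is small. The burden comes from the two upstream paths, which must each implement the same counting logic and stay in agreement.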

Kappa architecture

Kappa architecture simplifies Lambda by using a single streaming pipeline for everything. When historical analysis or reprocessing is needed, the stored event stream is replayed from the beginning through that same pipeline. Kappa suits teams that want streaming-grade freshness without the cost of maintaining two parallel systems.
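
A small sketch of the replay idea, assuming the events are kept in an append-only log. Here the log is a plain list, and `replay` and the two handlers are hypothetical names for this example.

```python
from typing import Callable

# Append-only log of (user, amount) purchase events.
event_log = [("alice", 30), ("bob", 20), ("alice", 5)]

def replay(log: list, handler: Callable[[dict, tuple], None], offset: int = 0) -> dict:
    """Rebuild state by running every event from `offset` through one handler."""
    state: dict = {}
    for event in log[offset:]:
        handler(state, event)
    return state

def total_spend(state: dict, event: tuple) -> None:
    user, amount = event
    state[user] = state.get(user, 0) + amount

def purchase_count(state: dict, event: tuple) -> None:
    user, _ = event
    state[user] = state.get(user, 0) + 1

spend = replay(event_log, total_spend)
counts = replay(event_log, purchase_count)  # new logic, same log, full history
```

Adding `purchase_count` later needs no separate batch job. The new handler is pointed at the start of the log and catches up to the present.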

Medallion architecture (lakehouse pattern)

Medallion architecture is a popular pattern on lakehouse platforms that organizes data into three quality tiers: Bronze (raw, as ingested), Silver (cleaned and conformed) and Gold (curated, business-ready). As Databricks documentation puts it, “the medallion architecture uses three layers: bronze, silver, and gold, each serving a distinct purpose in the pipeline.” Each tier can run as its own pipeline, which makes scheduling, monitoring and troubleshooting easier because problems stay isolated to a single layer.
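
The three tiers can be pictured as successive functions over the data. This is a simplified sketch with made-up records and rules, not a Databricks API. On a lakehouse each tier would typically be a table rather than a Python list.

```python
# Bronze: raw records exactly as ingested, problems included.
bronze = [
    {"id": "1", "store": "north", "amount": "10.50"},
    {"id": "2", "store": "North ", "amount": "4.50"},
    {"id": "2", "store": "North ", "amount": "4.50"},  # duplicate delivery
    {"id": "3", "store": "south", "amount": "oops"},   # malformed amount
]

def to_silver(rows: list[dict]) -> list[dict]:
    """Clean and conform: parse types, normalize text, drop duplicates."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline might quarantine the row instead
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        out.append({"id": int(row["id"]), "store": row["store"].strip().lower(), "amount": amount})
    return out

def to_gold(rows: list[dict]) -> dict[str, float]:
    """Curate: aggregate to a business-ready metric (revenue per store)."""
    totals: dict[str, float] = {}
    for row in rows:
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is left untouched, a fix to the Silver rules can be applied by re-running `to_silver` over the same raw data.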

ETL vs. ELT: how transformation order shapes architecture

ETL and ELT differ in when data gets transformed. ETL (extract, transform, load) transforms data before loading it into storage. ELT (extract, load, transform) loads raw data first and transforms it inside the destination. Modern cloud platforms such as Databricks, Snowflake and BigQuery have made ELT the dominant pattern because cloud storage and compute are now cheap and elastic enough to transform data in place. For a deeper comparison, see ETL vs. ELT.
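
To show the ELT order in miniature, the sketch below uses Python's built-in `sqlite3` module as a stand-in destination. The table and column names are invented, and a real lakehouse or warehouse would replace SQLite. Raw rows are loaded untouched, then transformed with SQL inside the destination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw records as they arrived, text types and all.
conn.execute("CREATE TABLE raw_sales (sold_on TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("2024-01-01", "10.5"), ("2024-01-01", "4.5"), ("2024-01-02", "7")],
)

# Transform: build the cleaned table with SQL running in the destination.
conn.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT sold_on, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY sold_on
    """
)

daily = conn.execute("SELECT sold_on, total FROM daily_sales ORDER BY sold_on").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
```

The raw table is still there after the transformation, so a different aggregate can be built later without going back to the source system. In an ETL design, the cast and the sum would happen before anything was written to the destination.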

| | ETL | ELT |
| --- | --- | --- |
| Order | Extract → Transform → Load | Extract → Load → Transform |
| Where transformation happens | In a separate processing tool, before storage | Inside the destination (lakehouse or warehouse) |
| Typical use case | Legacy on-prem warehouses, strict pre-load validation | Modern cloud lakehouses and warehouses |
| Strengths | Cleaner data lands in storage. Predictable schemas | Flexible, scalable, keeps raw data available for reprocessing |
| Trade-offs | Less flexible. Harder to reuse raw data later | Requires capable compute at the destination |

Is ETL the same as a data pipeline?

No. ETL is one type of data pipeline, but not every data pipeline is ETL. A data pipeline is the broad category: any system that moves data from one place to another. ETL is a specific approach within that category, defined by transforming data before it lands in storage. Pipelines can also be ELT, streaming, replication-only (moving data with no transformation at all) or reverse ETL (sending warehouse data back into operational systems).

Best practices for data pipeline architecture

These 10 design principles separate pipelines that scale from pipelines that break.

  1. Separate ingestion from transformation. Keep raw data landing and data cleaning in different stages so issues in one don’t cascade into the other.
  2. Design for idempotency. A pipeline should be safe to re-run without creating duplicate records or corrupting results. This is critical for handling failures and backfills.
  3. Build in data quality checks. Strong data quality checks validate schema, value ranges, null counts and freshness at each stage, and they fail loudly when something is wrong rather than letting bad data flow downstream.
  4. Plan for schema drift. Source systems change. Pipelines should detect when columns are added, removed or renamed and handle the change gracefully instead of breaking.
  5. Use open storage formats. Formats like Delta Lake and Apache Iceberg prevent lock-in and let multiple tools read the same data without copies.
  6. Decouple pipeline layers. Splitting medallion tiers (Bronze, Silver and Gold) into separate pipelines makes each one easier to schedule, monitor and troubleshoot independently.
  7. Version control everything. Store pipeline code and configuration in Git so changes are reviewed, traceable and reversible.
  8. Treat governance as a first-class concern. Apply consistent permissions, lineage tracking and audit controls across every stage with a tool like Unity Catalog, rather than bolting them on at the end.
  9. Right-size streaming vs. batch. Use streaming only where freshness genuinely matters, and default to batch everywhere else to control cost.
  10. Monitor end to end. Track data freshness, volume, quality and pipeline run times so problems are caught before downstream users notice them.
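
Principles 2 and 3 can be sketched together. The example below assumes a hypothetical `order_id` key and two made-up validation rules. It writes by key so a re-run overwrites rather than duplicates, and it validates the whole batch before writing anything.

```python
def check_quality(rows: list[dict]) -> None:
    """Fail loudly on the first rule violation (hypothetical rules)."""
    for row in rows:
        if row.get("order_id") is None:
            raise ValueError(f"null order_id in {row!r}")
        if not 0 <= row["amount"] <= 1_000_000:
            raise ValueError(f"amount out of range in {row!r}")

def idempotent_load(target: dict, rows: list[dict]) -> None:
    """Validate first, then write each row under its key."""
    check_quality(rows)  # nothing is written if any row is bad
    for row in rows:
        # Keyed write: loading the same row twice leaves one copy.
        target[row["order_id"]] = row

warehouse: dict = {}
batch = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 5.0}]
idempotent_load(warehouse, batch)
idempotent_load(warehouse, batch)  # simulated retry after a failure
```

After the second call the target still holds two rows. An append-only load without a key would hold four, which is the duplicate problem that idempotent design avoids.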

Why data pipeline architecture matters

Pipeline architecture determines whether teams can trust their data, whether decisions rest on fresh information and whether AI and ML projects make it from prototype to production. It’s the difference between a data platform that compounds in value and one that generates support tickets.

Brittle architecture creates real costs: stale dashboards, conflicting metrics, failed ML deployments and engineers who spend more time firefighting than building. The modern lakehouse approach addresses the root cause. By unifying batch and streaming, analytics and AI, and governance on a single platform such as Databricks, teams remove the fragile handoffs between systems that make traditional architectures break.

Data pipeline architecture on Databricks

Databricks delivers every layer of pipeline architecture in one platform. Lakeflow Connect handles ingestion from databases, SaaS applications, file sources and event streams. Lakeflow Spark Declarative Pipelines builds batch and streaming ETL pipelines with data quality checks built in, and Lakeflow Jobs orchestrates and schedules pipeline runs across the platform. Underneath, Delta Lake provides the open storage format along with reliability features like ACID transactions and time travel, while Unity Catalog applies governance, lineage and access control across every stage.

Because batch and streaming pipelines run on the same engine and write to the same storage, teams don’t need to maintain Lambda-style parallel systems. One pipeline definition can serve both the nightly report and the real-time dashboard.

Frequently asked questions

What is data pipeline architecture in simple terms?

It’s the plan for how data gets from where it’s created to where it’s useful. The plan covers how data is collected, how it’s cleaned and prepared, where it’s stored and how it’s delivered to the people and applications that need it.

What is the difference between Lambda and Kappa architecture?

Lambda runs two parallel pipelines, one batch and one streaming, and merges their results in a serving layer. Kappa uses a single streaming pipeline for everything and replays the stream when historical analysis is needed. Kappa is simpler to operate, while Lambda persists in environments where batch and streaming paths evolved separately.

When should you use batch vs. streaming pipelines?

Use streaming when the value of data drops within seconds or minutes, as in fraud detection, live personalization or equipment monitoring. Use batch for everything else, including reporting, historical analysis and ML training data. Batch is simpler and cheaper, so it’s the sensible default until a use case proves it needs real-time data.

What’s the difference between logical and physical pipeline architecture?

Logical architecture defines the stages of a pipeline and what each one does, independent of any tool. Physical architecture maps those stages onto specific technologies and infrastructure. Teams usually settle the logical design first, then choose the platforms that implement it.

Match your architecture to the job

Data pipeline architecture is the design behind how data moves and becomes useful. The right architecture is the one that balances freshness, cost and reliability for the specific job at hand, whether that’s a nightly sales report or a fraud check that runs in milliseconds.

See how Databricks unifies batch and streaming pipelines, storage and governance on one platform.
