Spark Declarative Pipelines: Why Data Engineering Needs to Become End-to-End Declarative

Published: February 23, 2026

Announcements / 6 min read

Summary

  • Why hand-built pipelines break down as data volume and complexity grow
  • How Spark Declarative Pipelines replace glue code with pipeline-aware execution
  • What changes when Spark handles dependencies, incrementality, and recovery

Data engineering teams are under pressure to deliver higher quality data faster, but the work of building and operating pipelines is getting harder, not easier. We interviewed hundreds of data engineers and studied millions of real-world workloads and found something surprising: data engineers spend the majority of their time not on writing code but on the operational burden generated by stitching together tools. The reason is simple: existing data engineering frameworks force data engineers to manually handle orchestration, incremental data processing, data quality and backfills - all common tasks for production pipelines. As data volumes and use cases grow, this operational burden compounds, turning data engineering into a bottleneck for the business rather than an accelerator.

This isn’t the first time the industry has hit this wall. Early data processing required writing a new program for every question, which didn’t scale. SQL changed that by making individual queries declarative: you specify what result you want, and the engine figures out how to compute it. SQL databases now underpin every business.

But data engineering isn’t about running a single query. Pipelines repeatedly update multiple interdependent datasets over time. Because SQL engines stop at the query boundary, everything beyond it - incremental processing, dependency management, backfills, data quality, retries - still has to be hand-assembled. At scale, reasoning about execution order, parallelism, and failure modes quickly becomes the dominant source of complexity.

What’s missing is a way to declare the pipeline as a whole. Spark Declarative Pipelines (SDP) extend declarative data processing from individual queries to entire pipelines, letting Apache Spark plan and execute them end to end. Instead of manually moving data between steps, you declare what datasets you want to exist and SDP is responsible for how to keep them correct over time. For example, in a pipeline that computes weekly sales, SDP infers dependencies between datasets, builds a single execution plan, and updates results in the right order. It automatically processes only new or changed data, expresses data quality rules inline, and handles backfills and late-arriving data without manual intervention. Because SDP understands query semantics, it can validate pipelines upfront, execute safely in parallel, and recover correctly from failures: capabilities that require first-class, pipeline-aware declarative APIs built directly into Apache Spark.

End-to-end declarative data engineering in SDP brings powerful benefits:

  • Greater productivity: Data engineers can focus on writing business logic instead of glue code.
  • Lower costs: The framework automatically handles orchestration and incremental data processing, making it more cost-efficient than hand-written pipelines.
  • Lower operational burden: Common use cases such as backfills, data quality and retries are integrated and automated.

To illustrate the benefits of end-to-end declarative data engineering, let’s start with a weekly sales pipeline written in PySpark. Because PySpark is not end-to-end declarative, we must manually encode execution order, incremental processing, and data quality logic, and rely on an external orchestrator such as Airflow for retries, alerting, and monitoring (omitted here for brevity).
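The original code listing was not preserved here, so below is a minimal sketch of what such a hand-written pipeline typically looks like. The table names (`sales_events`, `raw_sales`, `weekly_sales`), columns, and high-water-mark scheme are illustrative assumptions, not from the original post:

```python
# Hand-written weekly sales pipeline (illustrative sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weekly_sales").getOrCreate()

# 1. Manual incremental processing: compute the high-water mark ourselves.
last_ts = spark.read.table("raw_sales").agg(F.max("event_ts")).collect()[0][0]
new_events = spark.read.table("sales_events")
if last_ts is not None:
    new_events = new_events.filter(F.col("event_ts") > F.lit(last_ts))

# 2. Manual data quality: split good and bad records into separate tables.
good = new_events.filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
bad = new_events.subtract(good)
good.write.mode("append").saveAsTable("raw_sales")
bad.write.mode("append").saveAsTable("raw_sales_quarantine")

# 3. Manual dependency ordering: weekly_sales must only run after raw_sales
#    has been updated; an external orchestrator enforces this in production.
(spark.read.table("raw_sales")
    .groupBy(F.date_trunc("week", F.col("event_ts")).alias("week"))
    .agg(F.sum("amount").alias("total_sales"))
    .write.mode("overwrite").saveAsTable("weekly_sales"))
```

Every numbered step above is logic the engine could derive from the queries themselves, but here it must be written, tested, and maintained by hand.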

This pipeline expressed as a SQL dbt project suffers from many of the same limitations: we must still hand-code incremental data processing, data quality is handled separately, and we still rely on an orchestrator such as Airflow for retries and failure handling:
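The original dbt listing was also stripped from this page; the sketch below shows the shape of an equivalent incremental dbt model, with illustrative names. The incremental filter still has to be written by hand via dbt's `is_incremental()` macro:

```sql
-- models/weekly_sales.sql (illustrative)
{{ config(materialized='incremental') }}

select
    date_trunc('week', event_ts) as week,
    sum(amount) as total_sales
from {{ ref('raw_sales') }}
{% if is_incremental() %}
  -- Manual high-water-mark filter: the engine does not track this for us.
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
group by 1
```

Data quality checks would live in separate test definitions, and scheduling, retries, and alerting remain the orchestrator's job.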

Let’s rewrite this pipeline in SDP to explore its benefits. First, let’s install SDP and create a new pipeline:
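The commands were not preserved in this copy of the post; the following is a plausible reconstruction. SDP ships with Apache Spark 4.1, so installing PySpark 4.1+ provides the `spark-pipelines` CLI; the project name is illustrative:

```shell
# Install PySpark 4.1+, which includes the spark-pipelines CLI.
pip install "pyspark>=4.1.0"

# Scaffold a new pipeline project (creates a pipeline spec file and a
# directory for transformation code).
spark-pipelines init --name weekly_sales
```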

Next, define your pipeline with the following code. Note that we comment out the expect_or_drop data quality expectation API as we are working with the community to open source it:
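The pipeline definition itself was lost in extraction; below is an illustrative reconstruction using SDP's Python decorator API. Dataset names and columns are assumptions carried over from the earlier sketch, and `expect_or_drop` is commented out as the post describes:

```python
# transformations/weekly_sales.py (illustrative)
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.table
# @dp.expect_or_drop("valid_amount", "amount IS NOT NULL AND amount > 0")
def raw_sales():
    # A streaming table: SDP tracks progress and reads only new records
    # on each update - no hand-written high-water-mark logic.
    return spark.readStream.table("sales_events")

@dp.materialized_view
def weekly_sales():
    # Reading raw_sales is enough for SDP to infer the dependency and
    # schedule this dataset after raw_sales is updated.
    return (spark.read.table("raw_sales")
            .groupBy(F.date_trunc("week", F.col("event_ts")).alias("week"))
            .agg(F.sum("amount").alias("total_sales")))
```

Note that there is no orchestration code here at all: the dependency graph, execution order, and incremental bookkeeping are derived from the queries.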

To run the pipeline, type the following command in your terminal:
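The command was omitted from this copy; running the pipeline with the SDP CLI looks like this (assuming the scaffolded project layout from the earlier step):

```shell
spark-pipelines run
```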

We can even validate our pipeline upfront without running it first with this command - it’s handy for catching syntax errors and schema mismatches:
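The validation command was also stripped; the CLI's dry-run mode analyzes all dataset definitions and the dependency graph without executing any updates:

```shell
spark-pipelines dry-run
```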

Backfills become much simpler - to backfill the raw_sales table, run this command:
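The backfill command did not survive extraction. A full refresh of a single dataset would look roughly like the following; the exact flag name is an assumption based on the CLI's refresh options, so check `spark-pipelines run --help` for the precise syntax:

```shell
# Fully reprocess raw_sales from the beginning of its source (flag name
# is an assumption, not confirmed from the original post).
spark-pipelines run --full-refresh raw_sales
```

Because SDP knows that weekly_sales depends on raw_sales, downstream datasets are brought up to date as part of the same run.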

The code is much simpler - just 20 lines that deliver everything the PySpark and dbt versions require external tools to provide. We also get these powerful benefits:

  • Automatic incremental data processing. The framework tracks which data has been processed and only reads new or changed records. No MAX queries, no checkpoint files, no conditional logic needed.
  • Integrated data quality. The @dp.expect_or_drop decorator quarantines bad records automatically. In PySpark, we manually split and wrote good/bad records to separate tables. In dbt, we needed a separate model and manual handling.
  • Automatic dependency tracking. The framework detects that weekly_sales depends on raw_sales and orchestrates execution order automatically. No external orchestrator needed.
  • Integrated retries and monitoring. The framework handles failures and provides observability through a built-in UI. No external tools required.

SDP in Apache Spark 4.1 has the following capabilities, which make it a great choice for data pipelines:

  • Python and SQL APIs for defining datasets
  • Support for batch and streaming queries
  • Automatic dependency tracking between datasets, and efficient parallel updates
  • CLI to scaffold, validate, and run pipelines locally or in production

We are excited about SDP’s roadmap, which is being developed in the open with the Spark community. Upcoming Spark releases will build on this foundation with support for continuous execution and more efficient incremental processing. We also plan to bring core capabilities like Change Data Capture (CDC) into SDP, shaped by real-world use cases and community feedback. Our aim is to make SDP a shared, extensible foundation for building reliable batch and streaming pipelines across the Spark ecosystem.
