Data engineering teams are under pressure to deliver higher-quality data faster, but the work of building and operating pipelines is getting harder, not easier. We interviewed hundreds of data engineers and studied millions of real-world workloads, and found something surprising: data engineers spend the majority of their time not writing code but shouldering the operational burden of stitching tools together. The reason is simple: existing data engineering frameworks force engineers to manually handle orchestration, incremental data processing, data quality, and backfills - all common tasks in production pipelines. As data volumes and use cases grow, this operational burden compounds, turning data engineering into a bottleneck for the business rather than an accelerator.
This isn’t the first time the industry has hit this wall. Early data processing required writing a new program for every question, which didn’t scale. SQL changed that by making individual queries declarative: you specify what result you want, and the engine figures out how to compute it. SQL databases now underpin every business.
But data engineering isn’t about running a single query. Pipelines repeatedly update multiple interdependent datasets over time. Because SQL engines stop at the query boundary, everything beyond it - incremental processing, dependency management, backfills, data quality, retries - still has to be hand-assembled. At scale, reasoning about execution order, parallelism, and failure modes quickly becomes the dominant source of complexity.
What’s missing is a way to declare the pipeline as a whole. Spark Declarative Pipelines (SDP) extends declarative data processing from individual queries to entire pipelines, letting Apache Spark plan and execute them end to end. Instead of manually moving data between steps, you declare which datasets you want to exist, and SDP is responsible for keeping them correct over time. For example, in a pipeline that computes weekly sales, SDP infers the dependencies between datasets, builds a single execution plan, and updates results in the right order. It automatically processes only new or changed data, lets you express data quality rules inline, and handles backfills and late-arriving data without manual intervention. Because SDP understands query semantics, it can validate pipelines upfront, execute safely in parallel, and recover correctly from failures. These capabilities require first-class, pipeline-aware declarative APIs built directly into Apache Spark.
End-to-end declarative data engineering in SDP brings powerful benefits, which are easiest to see with a concrete example.
To illustrate the benefits of end-to-end declarative data engineering, let’s start with a weekly sales pipeline written in PySpark. Because PySpark is not end-to-end declarative, we must manually encode execution order, incremental processing, and data quality logic, and rely on an external orchestrator such as Airflow for retries, alerting, and monitoring (omitted here for brevity).
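Here is a minimal sketch of what such an imperative pipeline typically looks like; the table names, source path, and the `etl_state` bookkeeping table are illustrative assumptions, not the original blog's code:

```python
# Sketch of a hand-rolled weekly sales pipeline in PySpark. All table and
# path names (the sales JSON path, etl_state, raw_sales, weekly_sales,
# raw_sales_quarantine) are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Incremental processing, by hand: look up the high-water mark saved by
#    the previous run and read only newer records.
last_loaded = spark.read.table("etl_state").agg(F.max("loaded_date")).first()[0]
sales = spark.read.json("/data/sales").where(F.col("sale_date") > F.lit(last_loaded))

# 2. Data quality, by hand: split good and bad records into separate tables.
good = sales.where(F.col("amount") > 0)
bad = sales.where((F.col("amount") <= 0) | F.col("amount").isNull())
good.write.mode("append").saveAsTable("raw_sales")
bad.write.mode("append").saveAsTable("raw_sales_quarantine")

# 3. Orchestration, by hand: this aggregate must run only AFTER step 2
#    succeeds; retries, alerting, and monitoring live in an external
#    Airflow DAG (omitted here).
(spark.read.table("raw_sales")
    .groupBy(F.weekofyear("sale_date").alias("week"))
    .agg(F.sum("amount").alias("total_sales"))
    .write.mode("overwrite").saveAsTable("weekly_sales"))

# 4. Bookkeeping, by hand: advance the high-water mark for the next run.
spark.sql("INSERT INTO etl_state SELECT current_date() AS loaded_date")
```

Every numbered step is logic that exists only to keep the pipeline correct over time; none of it is the actual business transformation.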
The same pipeline expressed as a SQL dbt project suffers from many of the same limitations: we must still hand-code incremental data processing, data quality is handled separately, and we still rely on an orchestrator such as Airflow for retries and failure handling:
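A sketch of the equivalent dbt incremental model (the model, source, and column names are assumptions for illustration):

```sql
-- models/raw_sales.sql: the incremental logic is still written by hand
{{ config(materialized='incremental') }}

select sale_date, amount
from {{ source('shop', 'sales') }}
-- "data quality" here is just a filter; quarantining bad rows needs a
-- separate model and manual handling
where amount > 0
{% if is_incremental() %}
  -- hand-rolled high-water mark; rerun and backfill safety is on us
  and sale_date > (select max(sale_date) from {{ this }})
{% endif %}
```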
Let’s rewrite this pipeline in SDP to explore its benefits. First, let’s install SDP and create a new pipeline:
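A sketch of the setup steps, assuming a pip install of Apache Spark 4.1 (the version pin and project name are illustrative):

```shell
# SDP ships with Apache Spark 4.1+
pip install "pyspark>=4.1.0"

# Scaffold a new pipeline project; init creates a pipeline.yml spec and a
# transformations/ directory for source files
spark-pipelines init --name weekly_sales
cd weekly_sales
```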
Next, define your pipeline with the following code. Note that we comment out the expect_or_drop data quality expectation API as we are working with the community to open source it:
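A sketch of the pipeline definition; the dataset names, columns, and source path are illustrative assumptions:

```python
# transformations/weekly_sales.py -- dataset names, columns, and the source
# path below are illustrative assumptions.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.active()

@dp.table
# @dp.expect_or_drop("valid_amount", "amount > 0")  # commented out until the
# expectations API is open sourced
def raw_sales():
    # Streaming table: SDP ingests only new source files on each run.
    return spark.readStream.format("json").load("/data/sales")

@dp.materialized_view
def weekly_sales():
    # Reading raw_sales here is how SDP infers the dependency, so it can
    # order and parallelize execution automatically.
    return (
        spark.read.table("raw_sales")
        .groupBy(F.weekofyear("sale_date").alias("week"))
        .agg(F.sum("amount").alias("total_sales"))
    )
```

Note there is no high-water-mark bookkeeping, no manual write ordering, and no orchestrator glue: we declare the datasets, and SDP owns how they are kept up to date.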
To run the pipeline, type the following command in your terminal:
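From the project directory containing `pipeline.yml`:

```shell
# Build the dependency graph from all definitions and update every dataset
# in the correct order
spark-pipelines run
```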
We can even validate our pipeline upfront without running it first with this command - it’s handy for catching syntax errors and schema mismatches:
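The validation step looks like this:

```shell
# Analyze the pipeline graph and queries without updating any data
spark-pipelines dry-run
```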
Backfills become much simpler - to backfill the raw_sales table, run this command:
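A sketch of the backfill command, assuming the CLI exposes a per-dataset full-refresh option (the exact flag name may differ in your Spark version):

```shell
# Assumed flag: a full refresh recomputes raw_sales from scratch, and SDP
# propagates the change to downstream datasets such as weekly_sales
spark-pipelines run --full-refresh raw_sales
```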
The code is much simpler - just 20 lines that deliver everything the PySpark and dbt versions require external tools to provide. We also get these powerful benefits:
- The @dp.expect_or_drop decorator quarantines bad records automatically. In PySpark, we had to manually split good and bad records and write them to separate tables; in dbt, we needed a separate model and manual handling.
- SDP infers that weekly_sales depends on raw_sales and orchestrates execution order automatically. No external orchestrator is needed.

These capabilities in Apache Spark 4.1 make SDP a great choice for data pipelines.
We are excited about SDP’s roadmap, which is being developed in the open with the Spark community. Upcoming Spark releases will build on this foundation with support for continuous execution, and more efficient incremental processing. We also plan to bring core capabilities like Change Data Capture (CDC) into SDP, shaped by real-world use cases and community feedback. Our aim is to make SDP a shared, extensible foundation for building reliable batch and streaming pipelines across the Spark ecosystem.