How to choose between Lakehouse, Spark Declarative Pipelines, or PySpark, and when to combine them
by Rafael Aielo
Your team has hundreds of stored procedures, a couple of schedulers, permissions scattered across roles and schemas, and a cloud data warehouse renewal deadline coming up. Nobody agrees on what to move first. Some want to rewrite everything in PySpark. Others want to move SQL as-is and call it done. Lost in the conversation: the metadata, lineage, and permissions that move with the code, plus the opportunity to consolidate them on the way.
Neither extreme works. The teams that succeed at data warehouse migration look at each workload individually and pick the right tool for the job. This post suggests a decision framework for selection: when to use Lakehouse (Databricks SQL), Spark Declarative Pipelines, or PySpark, and how to phase the work so you ship results instead of stalling on a plan.
On Databricks, you can migrate ETL pipelines in three primary ways, often used together.
This is the most direct path for SQL-heavy teams. It covers a spectrum from simple to complex. It runs on SQL warehouses, which are Photon-accelerated by default and fully compatible with ANSI and Spark SQL (%sql). Choose Serverless for variable or unpredictable workloads (fast startup, scales to zero, pay per second). Choose Classic for steady workloads or when you need specific networking or cost controls.
A straightforward SQL task:
When the logic requires control-of-flow (conditionals), loops, variables, error handling, or parameter-driven execution, stored procedures give you that procedural layer. They are governed via Unity Catalog and can be called from Workflows with parameters.
The rule of thumb: if your legacy code is a single SQL statement, migrate it as a SQL task. If it has procedural logic (variables, loops, parameters, error handling), wrap it in a stored procedure, governed by Unity Catalog and callable from Workflows. Do not wrap simple SQL in a procedure just because the original system required it.
Which is part of Lakeflow, take a different approach. You declare what your pipeline should produce and the engine handles execution order, retries, and scaling. You get built-in data quality constraints, automatic dependency resolution, and unified batch-plus-streaming in the same definition.
Under the hood, Enzyme decides when to incrementally update versus fully recompute derived tables. Autoscaling adjusts capacity to data volume changes without manual tuning. Companies like Block lean on this declarative model to simplify pipeline orchestration as usage grows.
Which gives you full control. They run on job clusters and handle the workloads that don't fit a SQL Warehouse or a declarative pipeline.
Reach for PySpark when the workload needs complex business logic, ML feature engineering, API integrations, or custom validation. The example below scores transactions with a model registered in Unity Catalog:
Reach for Spark SQL in a notebook when the language is still SQL but the workload may exceed SQL Warehouse fit: very large tables, heavy shuffles, long-running batch ETL where you want explicit control over partitioning, broadcast joins, or caching.
Enable Photon on the job cluster for compute-bound SQL or DataFrame work: large joins, aggregations, window functions, scans over big columnar tables. Photon is a native, vectorized engine that accelerates these patterns without code changes, including Pandas UDFs (Arrow-based). Skip Photon when row-wise Python UDFs dominate, datasets are small, or the job is pure I/O.
Notebooks also fit well in hybrid pipelines: ingestion in SDP, enrichment in a notebook task.
The table below is a starting point for team conversations, not a hard rule.
| Criteria | Lakehouse(tasks and stored procs) | Spark Declarative Pipelines | PySpark + Spark SQL notebooks |
|---|---|---|---|
| Team profile | SQL-heavy, DBAs, DW engineers | Data engineers and SQL teams building managed pipelines | Python/Spark developers, ML engineers |
| Type of logic | SQL ETL: simple tasks for single statements, stored procs for procedural logic | Declarative pipelines, CDC, SCD | Complex logic, custom UDFs, ML prep |
| SQL migration speed | High for SQL ANSI-like workloads | Medium: pipeline redesign, but SQL reuse | Variable: may require significant refactoring |
| Pipeline orchestration | Workflows with SQL tasks or CALL procedure | Embedded in pipelines | Workflows with notebook tasks |
| Batch vs. streaming | Primarily batch | Unified batch and streaming | Batch and streaming via Structured Streaming |
| Data quality | Manual SQL checks | Declarative constraints | Custom validation in code |
Find your team in the column and your workload complexity in the row. The cell may suggest where to start.
| Workload complexity | SQL-first team | Hybrid team | Code-first team |
|---|---|---|---|
| Low (batch loads, aggregations, MERGE) | SQL tasks in Workflows | SQL tasks or SDP | PySpark or SDP |
| Medium (multi-step pipelines, CDC, data quality) | Stored procedures or SDP | SDP | SDP or PySpark |
| High (ML prep, Custom UDFs, APIs, dense business logic) | SDP + PySpark assist | PySpark + SDP for ingestion | PySpark |
Rather than deciding "which approach for everything," decide "what to do next" in each phase.
Phase 1 — Assess. Collect metrics from your legacy data warehouse: CPU time, runtime, frequency, source and target tables. Classify workloads by complexity. Use migration tools, when possible, to build an inventory scored by value versus difficulty. Where you find this data depends on the source. On Teradata, query DBC.QryLog. On SQL Server, use sys.dm_exec_query_stats. On Oracle, AWR reports. On Snowflake, QUERY_HISTORY. The specifics may vary. If you have an integration tool in place, you can leverage its metadata to identify relationships between tables, or rely on an LLM to help build this lineage. The output is a map, not a rewrite plan. The goal remains the same: rank workloads by resource consumption and dependency level so you know where to start. Done well, this assessment takes days using migration tools, not weeks of manual scripting.
Phase 2 — Quick wins. Pick workloads that combine low migration risk with use cases with high business visibility. That might mean starting with heavy SQL jobs that are easy to convert, or with reporting pipelines that put the new platform in front of stakeholders early. Simple statements will become SQL tasks in Workflows. Procedural logic becomes stored procedures. Use transpilers and AI-assisted conversion for the initial translation. Run both systems side by side and compare row counts, checksums, sample records. The point is to build confidence, both technical and organizational.
Walgreens, for example, retired on-premises Teradata in a phased migration and now processes around 40,000 data events per second on the lakehouse, powering supply-chain optimization across nearly 9,000 stores.
Phase 3 — Modernize. Now redesign the pipelines worth modernizing. Candidates: flows where data quality constraints and lineage reduce manual checks, batch jobs that benefit from streaming tables and CDC, pipelines where materialized views reduce complexity, and the metadata, permissions, and audit that previously lived in separate tools, now consolidated under Unity Catalog. A common pattern is keeping the legacy procedure as a fallback while the new pipeline runs in parallel until it passes validation. Modernized pipelines often cut batch windows from hours to minutes and remove the need for separate DQ tooling.
Phase 4 — Optimize. Consolidate redundant ETL pipelines that only existed to work around old DW limitations. Move complex hot spots to PySpark when it simplifies logic. Revisit batch vs. streaming boundaries now that you have a unified engine. This is where the migration pays off: the legacy platform is off, redundant pipelines are gone, and the architecture runs on one system instead of two.
Migration tooling automates the mechanical work but does not replace architecture decisions. Three typical roles:
A pragmatic approach: let tooling cover 60-80% of the initial conversion and reserve engineer time for the patterns you actually want to modernize. This avoids a one-to-one port of technical debt.
Successful migrations actively retire systems, not just convert code: standalone scheduler servers, custom DQ frameworks, separate lineage and metadata tools, vendor-specific stored-procedure compilers, and hand-rolled validation harnesses. The migration is not done until those systems are off and the bills stop.
If SQL ETL stays fragmented across engines, layers, and tools, the platform stays fragmented, even if the data sits in open formats.
There is no single right way to migrate data pipelines. Lakehouse gets you there fast: simple tasks for simple logic, stored procedures when you need procedural control. SDP gives you modern ETL pipelines with quality and lineage built in. Notebooks handle the rest, whether you reach for PySpark or Spark SQL. Phase the work, start with quick wins, and use every accelerator you can.
Explore the Databricks Migration Guide for technical walkthroughs, or try it yourself with a Databricks Free Edition.
Subscribe to our blog and get the latest posts delivered to your inbox.