Skip to main content
Data Warehousing

A Decision Framework for ETL Migration to Databricks

How to choose between Lakehouse, Spark Declarative Pipelines, or PySpark, and when to combine them

by Rafael Aielo

  • Three paths, not one: Lakehouse, Spark Declarative Pipelines (SDP), and PySpark or Spark SQL notebooks address different migration scenarios. Most organizations end up using a combination.
  • Phase for outcomes: A four-stage approach (assess, quick wins, modernize, optimize) lets you retire legacy systems incrementally instead of betting on a big-bang cutover.
  • Let the tooling do the heavy lifting: Lakebridge, partner transpilers, and AI-assisted code conversion automate much of the mechanical translation so your team can focus on validation and optimization.

Your team has hundreds of stored procedures, a couple of schedulers, permissions scattered across roles and schemas, and a cloud data warehouse renewal deadline coming up. Nobody agrees on what to move first. Some want to rewrite everything in PySpark. Others want to move SQL as-is and call it done. Lost in the conversation: the metadata, lineage, and permissions that move with the code, plus the opportunity to consolidate them on the way.

Neither extreme works. The teams that succeed at data warehouse migration look at each workload individually and pick the right tool for the job. This post suggests a decision framework for selection: when to use Lakehouse (Databricks SQL), Spark Declarative Pipelines, or PySpark, and how to phase the work so you ship results instead of stalling on a plan.

Three paths, one migration

On Databricks, you can migrate ETL pipelines in three primary ways, often used together.

Lakehouse (Databricks SQL)

This is the most direct path for SQL-heavy teams. It covers a spectrum from simple to complex. It runs on SQL warehouses, which are Photon-accelerated by default and fully compatible with ANSI and Spark SQL (%sql). Choose Serverless for variable or unpredictable workloads (fast startup, scales to zero, pay per second). Choose Classic for steady workloads or when you need specific networking or cost controls.

A straightforward SQL task:

When the logic requires control-of-flow (conditionals), loops, variables, error handling, or parameter-driven execution, stored procedures give you that procedural layer. They are governed via Unity Catalog and can be called from Workflows with parameters.

The rule of thumb: if your legacy code is a single SQL statement, migrate it as a SQL task. If it has procedural logic (variables, loops, parameters, error handling), wrap it in a stored procedure, governed by Unity Catalog and callable from Workflows. Do not wrap simple SQL in a procedure just because the original system required it.

Spark Declarative Pipelines (SDP)

Which is part of Lakeflow, take a different approach. You declare what your pipeline should produce and the engine handles execution order, retries, and scaling. You get built-in data quality constraints, automatic dependency resolution, and unified batch-plus-streaming in the same definition.

Under the hood, Enzyme decides when to incrementally update versus fully recompute derived tables. Autoscaling adjusts capacity to data volume changes without manual tuning. Companies like Block lean on this declarative model to simplify pipeline orchestration as usage grows.

PySpark and Spark SQL notebooks

Which gives you full control. They run on job clusters and handle the workloads that don't fit a SQL Warehouse or a declarative pipeline.

Reach for PySpark when the workload needs complex business logic, ML feature engineering, API integrations, or custom validation. The example below scores transactions with a model registered in Unity Catalog:

Reach for Spark SQL in a notebook when the language is still SQL but the workload may exceed SQL Warehouse fit: very large tables, heavy shuffles, long-running batch ETL where you want explicit control over partitioning, broadcast joins, or caching.

Enable Photon on the job cluster for compute-bound SQL or DataFrame work: large joins, aggregations, window functions, scans over big columnar tables. Photon is a native, vectorized engine that accelerates these patterns without code changes, including Pandas UDFs (Arrow-based). Skip Photon when row-wise Python UDFs dominate, datasets are small, or the job is pure I/O.

Notebooks also fit well in hybrid pipelines: ingestion in SDP, enrichment in a notebook task.

Decision matrix

The table below is a starting point for team conversations, not a hard rule.

CriteriaLakehouse(tasks and stored procs)Spark Declarative PipelinesPySpark + Spark SQL notebooks
Team profileSQL-heavy, DBAs, DW engineersData engineers and SQL teams building managed pipelinesPython/Spark developers, ML engineers
Type of logicSQL ETL: simple tasks for single statements, stored procs for procedural logicDeclarative pipelines, CDC, SCDComplex logic, custom UDFs, ML prep
SQL migration speedHigh for SQL ANSI-like workloadsMedium: pipeline redesign, but SQL reuseVariable: may require significant refactoring
Pipeline orchestrationWorkflows with SQL tasks or CALL procedureEmbedded in pipelinesWorkflows with notebook tasks
Batch vs. streamingPrimarily batchUnified batch and streamingBatch and streaming via Structured Streaming
Data qualityManual SQL checksDeclarative constraintsCustom validation in code

Quick decision grid

Find your team in the column and your workload complexity in the row. The cell may suggest where to start.

Workload complexitySQL-first teamHybrid teamCode-first team
Low
(batch loads, aggregations, MERGE)
SQL tasks in WorkflowsSQL tasks or SDPPySpark or SDP
Medium
(multi-step pipelines, CDC, data quality)
Stored procedures or SDPSDPSDP or PySpark
High
(ML prep, Custom UDFs, APIs, dense business logic)
SDP + PySpark assistPySpark + SDP for ingestionPySpark

Four phases instead of big-bang

Rather than deciding "which approach for everything," decide "what to do next" in each phase.

Phase 1 — Assess. Collect metrics from your legacy data warehouse: CPU time, runtime, frequency, source and target tables. Classify workloads by complexity. Use migration tools, when possible, to build an inventory scored by value versus difficulty. Where you find this data depends on the source. On Teradata, query DBC.QryLog. On SQL Server, use sys.dm_exec_query_stats. On Oracle, AWR reports. On Snowflake, QUERY_HISTORY. The specifics may vary. If you have an integration tool in place, you can leverage its metadata to identify relationships between tables, or rely on an LLM to help build this lineage. The output is a map, not a rewrite plan. The goal remains the same: rank workloads by resource consumption and dependency level so you know where to start. Done well, this assessment takes days using migration tools, not weeks of manual scripting.

Phase 2 — Quick wins. Pick workloads that combine low migration risk with use cases with high business visibility. That might mean starting with heavy SQL jobs that are easy to convert, or with reporting pipelines that put the new platform in front of stakeholders early. Simple statements will become SQL tasks in Workflows. Procedural logic becomes stored procedures. Use transpilers and AI-assisted conversion for the initial translation. Run both systems side by side and compare row counts, checksums, sample records. The point is to build confidence, both technical and organizational.

Walgreens, for example, retired on-premises Teradata in a phased migration and now processes around 40,000 data events per second on the lakehouse, powering supply-chain optimization across nearly 9,000 stores.

Phase 3 — Modernize. Now redesign the pipelines worth modernizing. Candidates: flows where data quality constraints and lineage reduce manual checks, batch jobs that benefit from streaming tables and CDC, pipelines where materialized views reduce complexity, and the metadata, permissions, and audit that previously lived in separate tools, now consolidated under Unity Catalog. A common pattern is keeping the legacy procedure as a fallback while the new pipeline runs in parallel until it passes validation. Modernized pipelines often cut batch windows from hours to minutes and remove the need for separate DQ tooling.

Phase 4 — Optimize. Consolidate redundant ETL pipelines that only existed to work around old DW limitations. Move complex hot spots to PySpark when it simplifies logic. Revisit batch vs. streaming boundaries now that you have a unified engine. This is where the migration pays off: the legacy platform is off, redundant pipelines are gone, and the architecture runs on one system instead of two.

Where migration tools and AI fit

Migration tooling automates the mechanical work but does not replace architecture decisions. Three typical roles:

  • Profiling and assessment. Discover stored procedures, SQL scripts, and ETL jobs. Map dependencies. Lakebridge ships an Analyzer component that scans legacy data warehouse platforms and builds an inventory of objects, usage patterns, and complexity.
  • Code conversion. Translate SQL and ETL from Teradata, Oracle, SQL Server, DataStage, Informatica, and SSIS into Lakehouse or declarative pipelines. Lakebridge's Converter handles stored procedures and ETL flows, with public guidance citing up to 80% automation and roughly 2x faster project timelines.
  • Validation. Compare results across systems with automated checks on schemas, row counts, and aggregates. Lakebridge includes a validator. The Databricks migration methodology treats reconciliation as a first-class phase, not an afterthought.

A pragmatic approach: let tooling cover 60-80% of the initial conversion and reserve engineer time for the patterns you actually want to modernize. This avoids a one-to-one port of technical debt.

What gets removed

Successful migrations actively retire systems, not just convert code: standalone scheduler servers, custom DQ frameworks, separate lineage and metadata tools, vendor-specific stored-procedure compilers, and hand-rolled validation harnesses. The migration is not done until those systems are off and the bills stop.

Three anti-patterns that stall migrations

  • Picking one path for everything without considering team skills, risk profile, and workload type. The SQL-only team misses modernization opportunities. The PySpark-only team rewrites simple SQL for no reason.
  • Measuring only "percent migrated" while ignoring parallel-run duration, validation time, and actual retirement of legacy systems. Fifty percent migrated means nothing if the old data warehouse platform is still running at full cost.
  • Recreating old schedulers and intermediate layers on the lakehouse instead of using workflows and declarative pipelines. Migration is a chance to simplify pipeline orchestration. Take it.

If SQL ETL stays fragmented across engines, layers, and tools, the platform stays fragmented, even if the data sits in open formats.

There is no single right way to migrate data pipelines. Lakehouse gets you there fast: simple tasks for simple logic, stored procedures when you need procedural control. SDP gives you modern ETL pipelines with quality and lineage built in. Notebooks handle the rest, whether you reach for PySpark or Spark SQL. Phase the work, start with quick wins, and use every accelerator you can.

Explore the Databricks Migration Guide for technical walkthroughs, or try it yourself with a Databricks Free Edition.

Get the latest posts in your inbox

Subscribe to our blog and get the latest posts delivered to your inbox.