Amazon Redshift has been a reliable data warehouse for many organizations; however, as data volumes and workloads grow, costs rise quickly, performance tuning becomes more challenging, and the pace of innovation has slowed. Teams that once relied on Redshift now face scale limits and higher maintenance overhead, with fewer new capabilities to support modern analytics.
That’s why more data teams are moving to Databricks. The lakehouse combines the openness of a data lake with the performance and reliability of a warehouse. It scales elastically, handles all types of workloads, and unifies governance across analytics and AI.
This blog shares practical strategies for planning your Redshift migration, avoiding common pitfalls, and helping your team make the most of the transition.
Before discussing migration strategies, it’s essential to understand the core differences between Amazon Redshift and Databricks, not just in features but in architecture.
We achieved 30-40% cost savings thanks to better performance and reduced data duplication. We now have the flexibility to scale up on demand to meet urgent SLAs and don’t have users stepping on each other’s toes with shared compute. All resulting in happier developers.—Kavya Atmakuri, Data Engineer and Team Lead at HP
Redshift is a managed, cloud-based warehouse built around a proprietary storage format and a cluster-based compute model. Because storage and compute are tied together, scaling one often means scaling both, which can drive up costs and create bottlenecks.
Databricks uses a lakehouse architecture that separates storage and compute. Data is stored once, in open formats such as Delta or Iceberg, and can be accessed by multiple teams without copying or moving it. This design supports modern workloads from analytics to AI on a unified and governed platform.
With Databricks, teams can store data once in open formats, scale compute independently of storage, and govern analytics and AI workloads in one place.
Both platforms support ANSI SQL, but differ in syntax and optimizer behavior. Databricks follows open standards and provides extensions for stored procedures, temporary tables, and multi-statement transactions.
Redshift runs on fixed clusters where storage and compute are coupled. To handle spikes, you can resize the cluster or enable concurrency scaling, both of which incur additional costs. Workload queues need manual tuning to avoid contention. Sort and distribution keys help performance, but are hard to manage as data grows.
Databricks separates storage and compute. Data lives in open formats and can be read by multiple SQL warehouses without copies. Warehouses scale out and in automatically. The Photon engine accelerates queries across mixed workloads, so you size for steady state and let elasticity absorb peaks.
Databricks reads and writes open columnar formats designed for large scans. Schema evolution, time travel, and ACID transactions are built in. This reduces the need for the heavy maintenance jobs Redshift users often depend on.
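For instance, schema evolution and time travel are plain SQL statements rather than maintenance jobs (the table name and version numbers below are hypothetical):

```sql
-- Evolve the schema in place; existing data files are not rewritten
ALTER TABLE analytics.events ADD COLUMN discount_pct DOUBLE;

-- Query the table as it looked before the change (time travel)
SELECT count(*) FROM analytics.events VERSION AS OF 42;
SELECT count(*) FROM analytics.events TIMESTAMP AS OF '2024-01-31';
```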
In Redshift, performance tuning means adjusting keys, queues, and cluster sizes. In Databricks, it means right-sizing warehouses, monitoring workload patterns, and iterating with real metrics. The goal shifts from cluster tuning to efficiency and cost visibility.
A successful migration starts with a clear inventory and a phased plan. Avoid a “big bang” approach. The goal is to reduce risk and build confidence with each phase.
Automated tools like Lakebridge from Databricks and tools from partners like Datafold can accelerate this process. They are purpose-built to manage all phases of a SQL migration, including surveying your existing landscape to help with these steps.
Now that we've completed our prep work, let's focus on developing a data migration strategy that addresses schema translation, data movement, and performance optimization.
Migration success depends on both technical planning and understanding the characteristics of your data.
Avoid copying Redshift schemas directly. For example, some numeric column types in the DDL generated from a Redshift table may differ from the types in the source Parquet files. If you don’t use the source column types during DDL creation, you will be forced to add unnecessary casting to your ingestion pipelines after the data is migrated from Redshift.
Translating schemas thoughtfully ensures compatibility and avoids performance or data accuracy issues down the line.
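As an illustration, here is a sketch of that translation for a hypothetical orders table, keeping column types aligned with the source Parquet files:

```sql
-- Redshift source DDL (for reference):
--   CREATE TABLE sales.orders (
--     order_id BIGINT,
--     amount   NUMERIC(18,4),
--     order_ts TIMESTAMP
--   ) DISTKEY (order_id) SORTKEY (order_ts);

-- Databricks target DDL: drop DISTKEY/SORTKEY and match the Parquet types
-- (keep DECIMAL(18,4) only if the exported files actually use that type)
CREATE TABLE sales.orders (
  order_id BIGINT,
  amount   DECIMAL(18,4),
  order_ts TIMESTAMP
)
CLUSTER BY (order_ts);   -- liquid clustering takes over the sort key's role
```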
Beyond translating the schema, this migration also provides a chance to re-evaluate the data model design selectively. This is particularly beneficial when a strong business case exists, such as for heavily utilized data models.
For more details, check out the Amazon Redshift to Databricks Migration Guide.
Migrating data from Redshift to Databricks is straightforward since both platforms are cloud-based. Common extraction and loading approaches include native export (UNLOAD to cloud object storage), query federation, and the Spark Redshift connector.
For large tables, native export tools are recommended; federation and the Spark Redshift connector work best for smaller datasets.
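For the export path, a typical sequence looks like the sketch below (bucket, IAM role, and table names are hypothetical):

```sql
-- In Redshift: export the table to Parquet on S3
UNLOAD ('SELECT * FROM sales.orders')
TO 's3://my-migration-bucket/redshift-export/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
FORMAT AS PARQUET;

-- In Databricks: load the exported files into the target Delta table
COPY INTO sales.orders
FROM 's3://my-migration-bucket/redshift-export/orders/'
FILEFORMAT = PARQUET;
```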
During the migration, a portion of the data pipeline queries can be optimized to improve performance and/or reduce costs.
In Redshift, skewed distribution keys or poorly chosen sort keys can significantly degrade performance, and they are complex and time-consuming to fix.
Databricks offers a comprehensive set of capabilities to automate table maintenance for optimal performance. This includes predictive optimization for Unity Catalog-managed tables, which automatically runs key operations (e.g., removing unused data files, optimizing data file layout) to maximize query performance. Turning on predictive optimization should be the default for most cases.
One of predictive optimization’s key operations for improving performance is liquid clustering, which replaces table partitioning to simplify data layout decisions and optimize query performance. Liquid clustering provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.
When automatic liquid clustering is enabled, key selection and clustering operations run asynchronously as a maintenance operation, further simplifying table management. Let's look at an example to see why this is so powerful.
Consider a table partitioned by transaction date: skewed dates lead to uneven data distribution across partitions, which hinders query performance. With automatic liquid clustering, the data layout is balanced automatically, improving performance for time-based queries.
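The corresponding SQL is short. The statements below are a sketch with hypothetical object names; predictive optimization can also be enabled at the catalog or account level:

```sql
-- Let Databricks handle maintenance (file compaction, vacuuming) for the schema
ALTER SCHEMA sales ENABLE PREDICTIVE OPTIMIZATION;

-- New table with automatic liquid clustering instead of date partitioning;
-- key selection and clustering run asynchronously as maintenance operations
CREATE TABLE sales.transactions (
  txn_id   BIGINT,
  txn_date DATE,
  amount   DECIMAL(18,2)
)
CLUSTER BY AUTO;

-- Existing tables can be converted, and keys can evolve without rewriting data
ALTER TABLE sales.transactions_legacy CLUSTER BY AUTO;
```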
Designing with Databricks’ processing model in mind ensures that your workloads scale efficiently and remain maintainable post-migration.
Migrating code and logic is often the most complex and critical part of a Redshift to Databricks transition. It involves translating SQL, refactoring stored procedures, and rethinking ETL workflows to align with Databricks’ distributed and declarative model. Getting this right ensures that your migrated environment performs well and remains maintainable over time.
Both Redshift and Databricks support ANSI SQL, but Redshift includes proprietary syntax and custom functions. Lakebridge, for example, can automatically scan your SQL environment, classify code complexity, and perform bulk translations: it identifies statements that require review, automatically translates compatible ones, and highlights the rest for manual refinement.
Databricks Assistant adds another layer of support by helping developers generate or modify SQL directly in the workspace. You can paste a Redshift query and have the assistant explain it, suggest syntax equivalents, or rewrite it for Databricks SQL. This accelerates migration while maintaining accuracy.
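As an illustrative before and after (a hypothetical query; some Redshift functions also exist as aliases in Databricks SQL, so the exact rewrite varies):

```sql
-- Redshift original:
--   SELECT GETDATE() AS now_ts,
--          DATEADD(day, -7, GETDATE()) AS week_ago,
--          NVL(region, 'unknown') AS region
--   FROM sales.orders;

-- Databricks SQL rewrite toward ANSI-aligned functions:
SELECT current_timestamp()                   AS now_ts,
       current_timestamp() - INTERVAL 7 DAYS AS week_ago,
       coalesce(region, 'unknown')           AS region
FROM sales.orders;
```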
These tools help reduce manual effort and identify areas that require rework or architectural changes post-migration.
Stored procedures are often among the most customized components in a Redshift environment. Redshift uses PL/pgSQL, while Databricks implements SQL/PSM, the open standard for procedural SQL defined by ISO/IEC 9075.
Most core programming constructs (loops, conditionals, variable declarations, and exception handling) map directly. However, Redshift-specific packages and helper functions (for example, those referencing system tables or administrative metadata) will need to be refactored. Anything that relies on Redshift’s PL/pgSQL extensions will need to be rewritten or replaced with equivalent SQL logic.
For many workloads, stored procedures can be easily migrated into Databricks with minimal effort. In some cases, procedures that perform heavy computation or multi-step data transformations may benefit from being converted into Lakeflow Declarative Pipelines or Jobs, allowing them to run more efficiently and be monitored centrally.
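As a minimal sketch of a migrated procedure (hypothetical tables and logic; exact procedure syntax may vary with your runtime version):

```sql
CREATE OR REPLACE PROCEDURE refresh_daily_sales(run_date DATE)
LANGUAGE SQL
AS BEGIN
  -- Conditional guard clause
  IF run_date > current_date() THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'run_date is in the future';
  END IF;

  -- Idempotent reload of one day's aggregates
  DELETE FROM mart.daily_sales WHERE sales_date = run_date;

  INSERT INTO mart.daily_sales
  SELECT order_date AS sales_date, sum(amount) AS total_amount
  FROM sales.orders
  WHERE order_date = run_date
  GROUP BY order_date;
END;

-- Invocation
CALL refresh_daily_sales(DATE'2025-01-31');
```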
This is where teams often see the biggest benefit from migration. Databricks provides a unified environment for building, scheduling, and maintaining data pipelines, removing much of the operational overhead that Redshift’s external orchestration systems require.
Notebooks are ideal for lightweight or exploratory pipelines. They combine SQL, Python, and Scala in a single environment. Parameters can be passed at runtime, and version control integration ensures reproducibility. They are great for small-scale jobs or quick prototyping.
Lakeflow Declarative Pipelines
Declarative Pipelines take the complexity out of traditional ETL code. You define your transformations in SQL or Python, and Databricks handles the operational aspects automatically—scheduling, dependency management, and incremental loading.
They also include built-in data quality checks, error handling, and support for both batch and streaming. Declarative Pipelines support Slowly Changing Dimensions (SCD Type 1 and Type 2) out of the box, which normally require custom logic in Redshift.
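A short sketch in the SQL interface (hypothetical paths and table names): an incremental ingest step, an expectation that enforces data quality, and an SCD Type 2 dimension with no hand-written MERGE logic.

```sql
-- Ingest raw customer records incrementally with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_customers
AS SELECT * FROM STREAM read_files('s3://my-bucket/customers/', format => 'json');

-- Data quality expectation: drop rows without a customer_id
CREATE OR REFRESH STREAMING TABLE silver_customers (
  CONSTRAINT has_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT customer_id, name, email, update_ts
FROM STREAM(bronze_customers);

-- SCD Type 2 dimension maintained automatically
CREATE OR REFRESH STREAMING TABLE dim_customers;
APPLY CHANGES INTO dim_customers
FROM STREAM(silver_customers)
KEYS (customer_id)
SEQUENCE BY update_ts
STORED AS SCD TYPE 2;
```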
Jobs provide orchestration across multiple tasks or languages. You can chain SQL, Python, or notebook tasks and define dependencies, retries, and alerting policies. This enables you to consolidate scheduling and orchestration in Databricks, rather than relying on external tools or custom scripts.
Migrating to Databricks also means rethinking workload management. Redshift uses queues and manual configuration to allocate resources. Databricks handles this automatically through warehouse sizing and elastic scaling.
Each SQL warehouse defines its compute size and the minimum and maximum number of clusters it can scale to. Databricks automatically routes queries across these clusters, balancing the load in real time. This eliminates the need to tune Workload Management (WLM) queues.
To set up your environment effectively, start with a conservative warehouse size, set minimum and maximum cluster counts based on expected concurrency, and adjust using the workload metrics you observe.
Elastic scaling ensures that performance remains consistent even when workload patterns change. Instead of managing queues or resizing clusters manually, you simply adjust warehouse parameters and let Databricks handle the rest.
Once workloads are live in Databricks, focus on validation, optimization, and team readiness. Treat this as part of the migration, not a separate phase.
Migration does not end at cutover. The goal is ongoing improvement. Utilize built-in observability, warehouse metrics, and governance tools to maintain an efficient and transparent environment. As the platform evolves, adopt new capabilities gradually without disruption.
Migrating from Redshift to Databricks is more than a platform change. It is a shift in how teams manage, process, and use data. With careful planning, phased execution, and collaboration between teams, you can minimize risk and gain a platform built for openness, scale, and long-term growth.
For detailed steps and templates, download the Amazon Redshift to Databricks Migration Guide or contact Databricks Professional Services for an assessment.