Amazon Redshift has been a reliable data warehouse for many organizations; however, as data volumes and workloads grow, costs rise quickly, performance tuning becomes more challenging, and the pace of innovation has slowed. Teams that once relied on Redshift now face scale limits and higher maintenance overhead, with fewer new capabilities to support modern analytics.
That’s why more data teams are moving to Databricks. The lakehouse combines the openness of a data lake with the performance and reliability of a warehouse. It scales elastically, handles all types of workloads, and unifies governance across analytics and AI.
This blog shares practical strategies for planning your Redshift migration, avoiding common pitfalls, and helping your team make the most of the transition.
Before discussing migration strategies, it’s essential to understand the core differences between Amazon Redshift and Databricks, not just in features but in architecture.
We achieved 30-40% cost savings thanks to better performance and reduced data duplication. We now have the flexibility to scale up on demand to meet urgent SLAs and don’t have users stepping on each other’s toes with shared compute. All resulting in happier developers.—Kavya Atmakuri, Data Engineer and Team Lead at HP
Redshift is a managed, cloud-based warehouse built around a proprietary storage format and a cluster-based compute model. Because storage and compute are tied together, scaling one often means scaling both, which can drive up costs and create bottlenecks.
Databricks uses a lakehouse architecture that separates storage and compute. Data is stored once, in open formats such as Delta or Iceberg, and can be accessed by multiple teams without copying or moving it. This design supports modern workloads from analytics to AI on a unified and governed platform.
With Databricks, teams can store data once in open formats, scale compute independently of storage, and govern analytics and AI workloads in one place.
Both platforms support ANSI SQL, but differ in syntax and optimizer behavior. Databricks follows open standards and provides extensions for stored procedures, temporary tables, and multi-statement transactions.
Redshift runs on fixed clusters where storage and compute are coupled. To handle spikes, you can resize the cluster or enable concurrency scaling, both of which incur additional costs. Workload queues need manual tuning to avoid contention. Sort and distribution keys help performance, but are hard to manage as data grows.
Databricks separates storage and compute. Data lives in open formats and can be read by multiple SQL warehouses without copies. Warehouses scale out and in automatically. The Photon engine accelerates queries across mixed workloads, so you size for steady state and let elasticity absorb peaks.
Databricks reads and writes open columnar formats designed for large scans. Schema evolution, time travel, and ACID transactions are built in. This reduces the need for the heavy maintenance jobs Redshift users often depend on.
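For instance, schema evolution and time travel are plain SQL statements rather than maintenance jobs (the table name and version numbers below are hypothetical):

```sql
-- Evolve the schema in place; existing data files are not rewritten
ALTER TABLE analytics.events ADD COLUMN discount_pct DOUBLE;

-- Query the table as it looked before the change (time travel)
SELECT count(*) FROM analytics.events VERSION AS OF 42;
SELECT count(*) FROM analytics.events TIMESTAMP AS OF '2024-01-31';
```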
In Redshift, performance tuning means adjusting keys, queues, and cluster sizes. In Databricks, it means right-sizing warehouses, monitoring workload patterns, and iterating with real metrics. The goal shifts from cluster tuning to efficiency and cost visibility.
A successful migration starts with a clear inventory and a phased plan. Avoid a “big bang” approach. The goal is to reduce risk and build confidence with each phase.
Automated tools like Lakebridge from Databricks and tools from partners like Datafold can accelerate this process. They are purpose-built to manage all phases of a SQL migration, including surveying your existing landscape to help with these steps.
Now that we've completed our prep work, let's focus on developing a data migration strategy that addresses schema translation, data movement, and performance optimization.
Migration success depends on both technical planning and understanding the characteristics of your data.
Avoid copying Redshift schemas directly. For example, some numeric column types in the DDL generated from a Redshift table may differ from the types in the source Parquet files. If you don’t use the source column types during DDL creation, you will be forced to add unnecessary casting to your ingestion pipelines after the data is migrated from Redshift.
Translating schemas thoughtfully ensures compatibility and avoids performance or data accuracy issues down the line.
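As an illustration, here is a sketch of that translation for a hypothetical orders table, keeping column types aligned with the source Parquet files:

```sql
-- Redshift source DDL (for reference):
--   CREATE TABLE sales.orders (
--     order_id BIGINT,
--     amount   NUMERIC(18,4),
--     order_ts TIMESTAMP
--   ) DISTKEY (order_id) SORTKEY (order_ts);

-- Databricks target DDL: drop DISTKEY/SORTKEY and match the Parquet types
-- (keep DECIMAL(18,4) only if the exported files actually use that type)
CREATE TABLE sales.orders (
  order_id BIGINT,
  amount   DECIMAL(18,4),
  order_ts TIMESTAMP
)
CLUSTER BY (order_ts);   -- liquid clustering takes over the sort key's role
```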
Beyond translating the schema, this migration also provides a chance to re-evaluate the data model design selectively. This is particularly beneficial when a strong business case exists, such as for heavily utilized data models.
For more details, check out the Amazon Redshift to Databricks Migration Guide.
Migrating data from Redshift to Databricks is straightforward since both platforms are cloud-based. Common extraction and loading approaches include native export (UNLOAD to cloud object storage), query federation, and the Spark Redshift connector.
For large tables, native export tools are recommended; federation and the Spark Redshift connector work best for smaller datasets.
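For the export path, a typical sequence looks like the sketch below (bucket, IAM role, and table names are hypothetical):

```sql
-- In Redshift: export the table to Parquet on S3
UNLOAD ('SELECT * FROM sales.orders')
TO 's3://my-migration-bucket/redshift-export/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
FORMAT AS PARQUET;

-- In Databricks: load the exported files into the target Delta table
COPY INTO sales.orders
FROM 's3://my-migration-bucket/redshift-export/orders/'
FILEFORMAT = PARQUET;
```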
During the migration, a portion of the data pipeline queries can be optimized to improve performance and/or reduce costs.
In Redshift, skewed distribution keys or poorly chosen sort keys can significantly degrade performance, and they are complex and time-consuming to fix.
Databricks offers a comprehensive set of capabilities to automate table maintenance for optimal performance. This includes predictive optimization for Unity Catalog-managed tables, which automatically runs key operations (e.g., removing unused data files, optimizing data file layout) to maximize query performance. Turning on predictive optimization should be the default for most cases.
One of predictive optimization’s key operations for improving performance is liquid clustering, which replaces table partitioning to simplify data layout decisions and optimize query performance. Liquid clustering provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.
When automatic liquid clustering is enabled, key selection and clustering operations run asynchronously as a maintenance operation, further simplifying table management. Let's look at an example to see why this is so powerful.
Consider a table partitioned by transaction date: skewed dates lead to uneven data distribution across partitions, which hinders query performance. With automatic liquid clustering, the data layout is balanced automatically, improving performance for time-based queries.
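The corresponding SQL is short. The statements below are a sketch with hypothetical object names; predictive optimization can also be enabled at the catalog or account level:

```sql
-- Let Databricks handle maintenance (file compaction, vacuuming) for the schema
ALTER SCHEMA sales ENABLE PREDICTIVE OPTIMIZATION;

-- New table with automatic liquid clustering instead of date partitioning;
-- key selection and clustering run asynchronously as maintenance operations
CREATE TABLE sales.transactions (
  txn_id   BIGINT,
  txn_date DATE,
  amount   DECIMAL(18,2)
)
CLUSTER BY AUTO;

-- Existing tables can be converted, and keys can evolve without rewriting data
ALTER TABLE sales.transactions_legacy CLUSTER BY AUTO;
```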
Designing with Databricks’ processing model in mind ensures that your workloads scale efficiently and remain maintainable post-migration.
Migrating code and logic is often the most complex and critical part of a Redshift to Databricks transition. It involves translating SQL, refactoring stored procedures, and rethinking ETL workflows to align with Databricks’ distributed and declarative model. Getting this right ensures that your migrated environment performs well and remains maintainable over time.
Both Redshift and Databricks support ANSI SQL, but Redshift includes proprietary syntax and custom functions. Lakebridge, for example, can automatically scan your SQL environment, classify code complexity, and perform bulk translations: it identifies statements that require review, automatically translates compatible ones, and highlights the rest for manual refinement.
Databricks Assistant adds another layer of support by helping developers generate or modify SQL directly in the workspace. You can paste a Redshift query and have the assistant explain it, suggest syntax equivalents, or rewrite it for Databricks SQL. This accelerates migration while maintaining accuracy.
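As an illustrative before and after (a hypothetical query; some Redshift functions also exist as aliases in Databricks SQL, so the exact rewrite varies):

```sql
-- Redshift original:
--   SELECT GETDATE() AS now_ts,
--          DATEADD(day, -7, GETDATE()) AS week_ago,
--          NVL(region, 'unknown') AS region
--   FROM sales.orders;

-- Databricks SQL rewrite toward ANSI-aligned functions:
SELECT current_timestamp()                   AS now_ts,
       current_timestamp() - INTERVAL 7 DAYS AS week_ago,
       coalesce(region, 'unknown')           AS region
FROM sales.orders;
```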
These tools help reduce manual effort and identify areas that require rework or architectural changes post-migration.
Stored procedures are often among the most customized components in a Redshift environment. Redshift uses PL/pgSQL, while Databricks implements SQL/PSM, the open standard for procedural SQL defined by ISO/IEC 9075.
Most core programming constructs (loops, conditionals, variable declarations, and exception handling) map directly. However, Redshift-specific packages and helper functions (for example, those referencing system tables or administrative metadata) will need to be refactored. Anything that relies on Redshift’s PL/pgSQL extensions will need to be rewritten or replaced with equivalent SQL logic.
For many workloads, stored procedures can be easily migrated into Databricks with minimal effort. In some cases, procedures that perform heavy computation or multi-step data transformations may benefit from being converted into Lakeflow Declarative Pipelines or Jobs, allowing them to run more efficiently and be monitored centrally.
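As a minimal sketch of a migrated procedure (hypothetical tables and logic; exact procedure syntax may vary with your runtime version):

```sql
CREATE OR REPLACE PROCEDURE refresh_daily_sales(run_date DATE)
LANGUAGE SQL
AS BEGIN
  -- Conditional guard clause
  IF run_date > current_date() THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'run_date is in the future';
  END IF;

  -- Idempotent reload of one day's aggregates
  DELETE FROM mart.daily_sales WHERE sales_date = run_date;

  INSERT INTO mart.daily_sales
  SELECT order_date AS sales_date, sum(amount) AS total_amount
  FROM sales.orders
  WHERE order_date = run_date
  GROUP BY order_date;
END;

-- Invocation
CALL refresh_daily_sales(DATE'2025-01-31');
```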
This is where teams often see the biggest benefit from migration. Databricks provides a unified environment for building, scheduling, and maintaining data pipelines, removing much of the operational overhead that Redshift’s external orchestration systems require.
Notebooks are ideal for lightweight or exploratory pipelines. They combine SQL, Python, and Scala in a single environment. Parameters can be passed at runtime, and version control integration ensures reproducibility. They are great for small-scale jobs or quick prototyping.
Lakeflow Declarative Pipelines
Declarative Pipelines take the complexity out of traditional ETL code. You define your transformations in SQL or Python, and Databricks handles the operational aspects automatically—scheduling, dependency management, and incremental loading.
They also include built-in data quality checks, error handling, and support for both batch and streaming. Declarative Pipelines support Slowly Changing Dimensions (SCD Type 1 and Type 2) out of the box, which normally require custom logic in Redshift.
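A short sketch in the SQL interface (hypothetical paths and table names): an incremental ingest step, an expectation that enforces data quality, and an SCD Type 2 dimension with no hand-written MERGE logic.

```sql
-- Ingest raw customer records incrementally with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_customers
AS SELECT * FROM STREAM read_files('s3://my-bucket/customers/', format => 'json');

-- Data quality expectation: drop rows without a customer_id
CREATE OR REFRESH STREAMING TABLE silver_customers (
  CONSTRAINT has_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT customer_id, name, email, update_ts
FROM STREAM(bronze_customers);

-- SCD Type 2 dimension maintained automatically
CREATE OR REFRESH STREAMING TABLE dim_customers;
APPLY CHANGES INTO dim_customers
FROM STREAM(silver_customers)
KEYS (customer_id)
SEQUENCE BY update_ts
STORED AS SCD TYPE 2;
```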
Jobs provide orchestration across multiple tasks or languages. You can chain SQL, Python, or notebook tasks and define dependencies, retries, and alerting policies. This enables you to consolidate scheduling and orchestration in Databricks, rather than relying on external tools or custom scripts.
Migrating to Databricks also means rethinking workload management. Redshift uses queues and manual configuration to allocate resources. Databricks handles this automatically through warehouse sizing and elastic scaling.
Each SQL warehouse defines its compute size and the minimum and maximum number of clusters it can scale to. Databricks automatically routes queries across these clusters, balancing the load in real time. This eliminates the need to tune Workload Management (WLM) queues.
To set up your environment effectively, start with a conservative warehouse size, set minimum and maximum cluster counts based on expected concurrency, and adjust using the workload metrics you observe.
Elastic scaling ensures that performance remains consistent even when workload patterns change. Instead of managing queues or resizing clusters manually, you simply adjust warehouse parameters and let Databricks handle the rest.
Once workloads are live in Databricks, focus on validation, optimization, and team readiness. Treat this as part of the migration, not a separate phase.
Migration does not end at cutover. The goal is ongoing improvement. Utilize built-in observability, warehouse metrics, and governance tools to maintain an efficient and transparent environment. As the platform evolves, adopt new capabilities gradually without disruption.
Migrating from Redshift to Databricks is more than a platform change. It is a shift in how teams manage, process, and use data. With careful planning, phased execution, and collaboration between teams, you can minimize risk and gain a platform built for openness, scale, and long-term growth.
For detailed steps and templates, download the Amazon Redshift to Databricks Migration Guide or contact Databricks Professional Services for an assessment.