What is Change Data Capture?
Change Data Capture (CDC) is a data integration technique that identifies and records row-level changes made to a dataset, such as inserts, updates, and deletes. Instead of repeatedly extracting entire tables, CDC captures only the modified records and applies them to downstream systems. This incremental approach keeps analytics platforms, operational applications and machine learning pipelines aligned with current information without the cost or delay of full refreshes.
Traditional batch pipelines rely on periodic ingestion jobs that perform full scans or reload large datasets. These workflows are simple and cost effective, but inefficient at scale since they add latency and repeatedly process unchanged data. CDC addresses these limitations by continuously detecting modifications through mechanisms such as transaction logs, triggers, timestamps, or native change feeds, allowing data lakehouse architecture platforms to operate with fresher data and reduced compute overhead.
How CDC Works in the ETL Process
Within an ETL pipeline, CDC is the mechanism that extracts only the data that has changed since the last load. Rather than executing scheduled full-table extractions, CDC captures new or modified rows as they occur in the source database. The change events collected from logs, triggers, or snapshot deltas form an incremental stream that represents the dataset's ongoing evolution as it moves through the extract, transform, and load (ETL) stages.
Once events enter the pipeline, the ETL process takes over, with any cleaning, enrichment, or validation executed on each changed record, not the entire dataset. The final load step applies only these incremental updates to the target table or repository, resulting in lightweight, continuous ingestion. This approach reduces I/O and keeps downstream systems closely aligned with the source.
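To make the idea concrete, here is a minimal, tool-agnostic sketch in plain Python: a stream of hypothetical change events (the op/key/row shape is invented for illustration, not a specific tool's format) is applied incrementally to an in-memory "target table" keyed by primary key, rather than rebuilding the table from scratch.

```python
# Minimal sketch: apply CDC-style change events to a target keyed by primary key.
# The event shape (op, key, row) is illustrative, not a specific tool's format.
change_events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "email": "a@example.com"}},
    {"op": "update", "key": 1, "row": {"id": 1, "email": "a@new-domain.com"}},
    {"op": "delete", "key": 2, "row": None},
]

target = {2: {"id": 2, "email": "b@example.com"}}  # current state of the target table

for event in change_events:
    if event["op"] in ("insert", "update"):
        target[event["key"]] = event["row"]   # upsert only the changed row
    elif event["op"] == "delete":
        target.pop(event["key"], None)        # remove the deleted row

print(target)  # {1: {'id': 1, 'email': 'a@new-domain.com'}}
```

Only the three change events are processed; the unchanged rows in the target are never touched, which is exactly what keeps CDC-based loads lightweight.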
By enabling continuous extraction, transformation, and loading, CDC modernizes ETL from a batch-oriented workflow into a real-time pipeline. Analytics, dashboards and machine learning pipelines consistently reflect the latest data without relying on long-running jobs or maintenance windows, enabled by streaming analytics.
Why CDC Matters for Modern Data Architecture
Modern data ecosystems depend on timely and accurate information flowing between operational systems, analytics platforms and machine learning pipelines. In environments such as e-commerce, banking or logistics, data changes constantly as new data is created through actions like purchases, profile updates or adjusted inventory. Without CDC, those updates remain isolated in source systems until the next batch ETL job, leaving dashboards, reports and models relying on outdated datasets.
CDC solves this problem by enabling real-time synchronization, keeping all connected systems aligned to the same single source of truth.
This process also supports zero-downtime migrations, which is a key part of cloud modernization. Instead of freezing writes or performing risky cutovers, CDC continuously replicates changes between the old and new systems, allowing a seamless migration.
CDC vs. Traditional ETL: Key Differences
While traditional ETL pipelines remain central to many analytic workloads, they operate very differently from CDC. ETL typically moves data in scheduled batches, such as hourly, nightly, or at another fixed interval. Each run extracts data from the source system, transforms it, and reloads it into downstream platforms powered by Databricks Data Engineering. This model is predictable, but it can introduce latency and requires the system to scan entire tables or large partitions, even when only a small portion of records has changed.
By capturing changes as they happen, CDC eliminates the gap between when data changes in the source system and when it becomes available for analytics or operations.
The importance of CDC becomes even clearer when comparing how CDC and ETL handle data movement. While traditional ETL often relies on full table scans or bulk reloads, CDC transmits only incremental changes. This significantly reduces compute overhead and improves overall efficiency in your data pipeline.
Batch ETL also depends on maintenance windows to ensure consistent reads. CDC removes this dependency by capturing changes without interrupting normal database activity. This makes CDC a strong fit for systems that require highly current data, such as real-time dashboards, recommendation engines or operational analytics. However, ETL remains suitable for large historical backfills or periodic transformations, and together, CDC and ETL can form a complementary ingestion strategy in modern architectures.
CDC in Modern Data Ecosystems
CDC allows data to flow continuously and reliably across data warehouses, lakehouses, and streaming platforms. Because every change is captured in the order it occurs, dashboards and applications remain synchronized with operational systems. CDC also supports auditability and governance by preserving a clear record of how data evolved, which is a key requirement for regulated industries such as finance and healthcare, particularly when implementing data warehouse to lakehouse migration strategies.
CDC Implementation Methods: Comparison and Selection
CDC vs. SCD: Understanding the Relationship
CDC and Slowly Changing Dimensions (SCD) serve different roles within a data pipeline. CDC is responsible for detecting and extracting row-level changes from a source system, while SCD determines how those changes are stored in the target system.
When CDC identifies a change, such as a customer updating their address, SCD Type 1 overwrites the existing record because historical values are not needed. SCD Type 2 instead creates a new versioned record with start and end timestamps to preserve the full history. In other words, CDC supplies the incremental change events; SCD applies the rules that shape how those events are represented, either as current-state snapshots or historical timelines.
Organizations can implement CDC in several ways, depending on system performance, complexity and business needs. The most common methods differ in how they detect changes:
Log-based CDC: This process reads directly from database transaction logs such as the MySQL binlog, PostgreSQL WAL or Oracle redo logs. Because it works at the database level instead of querying live tables, it minimizes impact on production systems while still capturing all inserts, updates and deletes in real time. Frameworks like Debezium, often paired with Apache Kafka integration, use this method to deliver reliable, high-volume data streams.
Trigger-based CDC: This method uses database triggers or stored procedures to record changes in shadow tables. Though it introduces minor write overhead, it offers precise control and can include custom logic or transformations, which can be useful for regulated workloads.
Query-based CDC: This method identifies modified records using timestamps or version numbers; a minimal sketch follows this list. It's simple and works well for smaller or legacy systems, but it may miss deletes and can be less efficient at scale.
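As a rough illustration of the query-based approach, the sketch below uses Spark's JDBC reader to pull only rows modified since the last recorded watermark. The table, columns, connection details and watermark handling are placeholder assumptions, and as noted above, this pattern cannot see hard deletes.

```python
# Hedged sketch of query-based CDC with Spark over JDBC.
# Table name, watermark column and connection settings are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("query-based-cdc").getOrCreate()

last_watermark = "2024-01-01 00:00:00"  # normally loaded from pipeline state

# Push the incremental filter down to the source database as a subquery.
incremental_query = (
    f"(SELECT * FROM orders WHERE updated_at > '{last_watermark}') AS changed_rows"
)

changed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/shop")  # placeholder URL
    .option("dbtable", incremental_query)
    .option("user", "cdc_reader")
    .option("password", "...")
    .load()
)

# Advance the watermark to the newest change seen in this batch.
new_watermark = changed.agg(F.max("updated_at")).first()[0]
```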
Once changes are captured, SCD patterns define how they're applied in the target system. There are two common types:
SCD Type 1 overwrites existing records to keep only the most recent version. This is great for corrections or non-critical updates, such as fixing a misspelled customer name or updating a user's email address. In Spark Declarative Pipelines, this can be configured with just a few lines of code, while Lakeflow handles sequencing, dependencies and out-of-order events automatically.
SCD Type 2 preserves full history with automatic management of _START_AT and _END_AT columns, supporting audits and time-based analysis; backed by ACID transactions with Delta Lake, past states remain available for querying. This is ideal for tasks like tracking a customer's address over time, monitoring product price changes, or maintaining audit trails for compliance.
By combining CDC methods with Spark Declarative Pipelines, users can create low-maintenance, production-ready CDC pipelines that scale across both batch and streaming environments.
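As a hedged sketch of what this looks like in the Lakeflow / Delta Live Tables Python interface (the source view, key and column names here are placeholders, and exact naming, such as apply_changes versus the newer AUTO CDC functions, depends on your runtime version), an SCD pipeline might be declared like this:

```python
# Runs inside a Lakeflow / Delta Live Tables pipeline; names are placeholders.
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",          # Delta table maintained by the pipeline
    source="customers_cdc_feed",        # view or table of raw change events
    keys=["customer_id"],               # primary key used to match records
    sequence_by=col("event_ts"),        # resolves out-of-order events deterministically
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2,               # 1 = overwrite latest, 2 = keep full history
)
```

With stored_as_scd_type=2 the framework manages the history columns described above; switching it to 1 keeps only the current version of each row.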
CDC Implementation: Step-by-Step Deployment
Effective CDC starts with planning and preparation. First, assess your business and system requirements for things like data volume, latency tolerance and update frequency. High-throughput systems may need streaming ingestion, while slower-moving sources may rely on periodic updates. Next, confirm source system access and permissions to ensure the ability to read transaction logs or snapshots. Finally, design target schemas that can store both current and historical data using partitioning or versioning strategies.
Databricks simplifies CDC through Lakeflow Declarative Pipelines, which provides scalable, incremental data processing, including an ACID-compliant storage layer with Delta Lake, allowing a single data copy to serve both batch and streaming workloads for consistency and cost savings.
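For context, a source like the customers_cdc_feed referenced in the earlier sketch can itself be defined declaratively. A minimal, hedged example using Databricks Auto Loader (the path and table name are placeholders) might look like this:

```python
# Runs inside a Lakeflow / Delta Live Tables pipeline; path and name are placeholders.
import dlt

@dlt.table(name="customers_cdc_feed", comment="Raw change events landed as JSON files")
def customers_cdc_feed():
    # `spark` is provided by the pipeline runtime.
    return (
        spark.readStream.format("cloudFiles")     # Databricks Auto Loader
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/cdc/customers/")      # placeholder landing path
    )
```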
Lakeflow builds on this with the AUTO CDC APIs, which automatically manage sequencing, resolve out-of-order records and maintain schema consistency. Users can sequence data by timestamp, ID or a composite key for deterministic ordering.
For systems without native change feeds, AUTO CDC FROM SNAPSHOT compares consecutive snapshots – such as tables or exports from Oracle and MySQL – to detect changes efficiently.
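A hedged sketch of the snapshot-based variant follows; the target, snapshot table and key names are placeholders, and the exact function name and options should be checked against your Lakeflow version.

```python
# Runs inside a Lakeflow / Delta Live Tables pipeline; names are placeholders.
import dlt

dlt.create_streaming_table("products_silver")

dlt.apply_changes_from_snapshot(
    target="products_silver",
    source="raw.products_snapshot",   # periodically refreshed full snapshot of the source
    keys=["product_id"],
    stored_as_scd_type=2,             # derive inserts, updates and deletes between snapshots
)
```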
Compared to manual methods like MERGE INTO or foreachBatch, AUTO CDC is a low-code alternative with built-in support for DELETE and TRUNCATE operations provided by Databricks Lakeflow Connect. Integrated with Delta tables, these pipelines can stream updates into Kafka, Iceberg or data warehouses, supporting diverse analytics and streaming use cases.
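For comparison, here is a minimal sketch of the manual MERGE approach using the Delta Lake Python API; the table and column names are placeholders, and the incoming DataFrame is assumed to already hold one latest change per key.

```python
# Manual CDC apply with Delta Lake's MERGE; names are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates_df = spark.table("staging.customer_changes")        # latest change per key
target = DeltaTable.forName(spark, "main.customers")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.operation = 'DELETE'")   # honor delete events
    .whenMatchedUpdateAll()                                  # apply updates
    .whenNotMatchedInsertAll()                               # apply inserts
    .execute()
)
```

Everything the AUTO CDC APIs automate, such as sequencing, deduplication and out-of-order handling, must be written and maintained by hand around a statement like this.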
Together, Delta Lake and Lakeflow make CDC declarative, reliable and production-ready, aligning with the Databricks lakehouse vision for real-time, unified analytics.
Platform-Specific CDC Implementation
CDC behavior varies across source databases:
SQL Server: SQL Server's native CDC features automatically capture inserts, updates, and deletes from a source table into dedicated change tables within the database. These tables include metadata such as the operation type and commit timestamp, making it easy to determine which rows changed in a given interval. SQL Server also provides retention controls to prevent unbounded growth while ensuring downstream systems have sufficient time to read captured events; a brief example of enabling table-level capture appears below. Organizations can leverage SQL Server to Databricks migration strategies to modernize their data infrastructure.
Oracle: Oracle enables CDC through technologies such as LogMiner and GoldenGate, which read redo logs to detect committed changes without impacting the source workload. These tools allow high-volume, low-latency replication, and teams can follow Oracle to Databricks migration best practices for successful implementation.
MySQL: MySQL exposes change events through its binary log, allowing CDC tools to consume row-level updates efficiently.
PostgreSQL: PostgreSQL uses its Write-Ahead Log to enable logical decoding, which surfaces change events that downstream consumers can process.
Across all platforms, the pattern is consistent: the source database writes changes to logs or change tables, and CDC tools extract those events to feed downstream pipelines.
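To make the SQL Server case concrete, here is a hedged sketch of enabling its native change capture for a single table from Python. The connection string, schema and table names are placeholders, and in practice this requires appropriate permissions and a running SQL Server Agent.

```python
# Hedged sketch: enable SQL Server's native CDC for one table via T-SQL.
# Connection details, schema and table names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=source-host;"
    "DATABASE=shop;UID=cdc_admin;PWD=...;TrustServerCertificate=yes"
)
cursor = conn.cursor()

cursor.execute("EXEC sys.sp_cdc_enable_db")          # enable CDC at the database level
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'orders',
        @role_name     = NULL
""")
conn.commit()
```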
CDC Optimization: Performance and Data Quality
Once running, CDC pipelines must be tuned for performance, quality and resilience. Strong data quality management keeps pipelines dependable.
This starts with parallelization and partitioning, which split data by region, date or key so that multiple streams can be processed in parallel. Adjusting batch size and resource allocation further balances latency and cost; for instance, smaller batches reduce lag while larger ones improve throughput.
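As one hedged example of this latency/throughput trade-off on Databricks (paths, table names and option values are placeholders, and Auto Loader is assumed as the ingest source), the per-batch file limit and trigger interval of a streaming CDC ingest can be tuned explicitly:

```python
# Hedged sketch: tuning batch size and trigger interval for a streaming CDC ingest.
# Paths, table names and option values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = (
    spark.readStream.format("cloudFiles")            # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)    # smaller batches -> lower lag
    .load("/Volumes/raw/cdc/orders/")
)

(
    changes.writeStream
    .trigger(processingTime="1 minute")              # longer interval -> larger, cheaper batches
    .option("checkpointLocation", "/Volumes/checkpoints/orders_cdc/")
    .toTable("bronze.orders_changes")
)
```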
When moving data between multiple systems, CDC ensures consistency across target systems without the resource-intensive overhead of full replication. By processing only the changes from source systems, you maintain low latency for downstream consumers while ensuring downstream applications receive updated data for time-sensitive decisions.
By regularly monitoring key metrics such as commit latency and failure counts, users can catch performance issues early. Furthermore, well-defined retention policies prevent unnecessary growth in change tables, and automated schema evolution maintains compatibility as source structures shift. Built-in Databricks validations confirm updates meet schema requirements, while audit trails track every insert, update and delete for transparency.
Of course, working with change data introduces its own challenges, such as several updates to the same record arriving within a single microbatch. To solve this and ensure accuracy, Databricks groups records by primary key and applies only the latest change using a sequencing column. Out-of-order updates are handled through deterministic sequencing, and soft-delete patterns mark records as inactive before cleanup jobs remove them later. These strategies preserve data integrity without interrupting operations.
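A hedged sketch of that deduplication step in PySpark (the table, key and sequencing column names are placeholders) keeps only the newest event per primary key before the changes are applied:

```python
# Hedged sketch: keep only the latest change per key within a microbatch.
# Table, key and sequencing column names are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

raw_changes = spark.table("staging.customer_changes")

latest_first = Window.partitionBy("customer_id").orderBy(F.col("event_ts").desc())

deduped = (
    raw_changes
    .withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")   # one row per key: the most recent change
    .drop("rn")
)
```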
Advanced Use Cases and Future Considerations
CDC extends beyond simple replication. Organizations use CDC to connect multiple systems and clouds, synchronize distributed environments, and power real-time analytics. Because CDC preserves the order of events, it maintains consistent state across platforms without heavy batch jobs.
CDC also supports machine learning feature pipelines by delivering continuous updates that keep training and inference aligned, reducing online/offline skew. Feature stores such as the Databricks Feature Store rely on CDC data for accurate, time-aware lookups, enabling advanced feature engineering for machine learning.
As architectures evolve, automation through Lakeflow Jobs and Spark Declarative Pipelines simplifies orchestration and monitoring. Serverless CDC reduces operational overhead, open table formats like Delta and Iceberg increase flexibility, and event-driven designs leverage CDC as the backbone for fast, reliable data movement.
CDC and Event Streaming: The Kafka Connection
As with CDC and SCD, CDC and Apache Kafka address different parts of the data movement pipeline, but they are highly complementary. CDC captures the changes, while Kafka is a distributed data streaming platform designed to transport and process event data at scale. The two are often used together within a data pipeline.
In a typical architecture, a log-based CDC tool such as Debezium reads change events directly from database transaction logs. Instead of writing these events to a target table immediately, Debezium publishes them into Kafka topics, where they become part of a durable event stream. Kafka Connect provides the integration layer that makes this possible, allowing sources like MySQL, PostgreSQL, or SQL Server to feed new data into Kafka without custom code. Once the CDC events are in Kafka, other systems, such as data warehouses or lakehouses, consume them and apply the latest data as it arrives.
Additionally, services can subscribe to change events rather than repeatedly polling the database, which reduces latency and improves scalability. Because CDC ensures that the latest data enters Kafka as soon as it is generated, downstream processes always operate on current information, whether they are updating materialized views, triggering workflows, or performing real-time analytics. In this way, CDC supplies the change events, and Kafka acts as the system that distributes those events efficiently across the organization's data ecosystem.
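As a hedged sketch of the consuming side (the broker address and topic name are placeholders, the Spark Kafka connector is assumed to be available, and the connector is assumed to emit the unwrapped JSON envelope; with Debezium's default schema wrapper the fields sit under a payload key), a Spark Structured Streaming job can read those change events straight from Kafka:

```python
# Hedged sketch: read Debezium-style change events from Kafka with Spark Structured Streaming.
# Broker address and topic name are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "shop.public.orders")    # typically one topic per captured table
    .load()
)

# Pull the operation type and row images out of the JSON envelope.
# get_json_object returns nested objects as JSON strings for downstream parsing.
changes = (
    events.selectExpr("CAST(value AS STRING) AS json")
    .select(
        F.get_json_object("json", "$.op").alias("op"),          # c = create, u = update, d = delete
        F.get_json_object("json", "$.before").alias("before"),  # row image before the change
        F.get_json_object("json", "$.after").alias("after"),    # row image after the change
    )
)
```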
FAQs
Common Questions About Change Data Capture
What is the CDC process in ETL?
CDC is the mechanism that identifies and delivers only the rows that have changed since the last extraction. Instead of scanning or reloading entire tables, CDC captures inserts, updates, and deletes directly from the source system and sends them downstream as incremental events. This allows the transformation and loading stages to run continuously rather than in fixed batch intervals. As each new event flows through the pipeline, it is validated, transformed, and applied to the target system in near real time.
What is the difference between ETL and CDC?
ETL is a broad data workflow that extracts data from source systems, transforms it for consistency, and loads it into downstream data warehouses or lakehouses. Traditional ETL often relies on batch processing, where full tables or large partitions are moved at scheduled intervals. CDC instead focuses on identifying and transmitting only the changes that occur between ETL cycles. CDC does not replace ETL, but enhances it by making the extraction step incremental and continuous. This reduces compute overhead, eliminates dependence on batch windows, and ensures downstream systems receive timely updates without full reloads.
What is the difference between CDC and SCD?
CDC and SCD operate at different layers of a data pipeline. CDC captures changes from the source system, while SCD is a modeling pattern that determines how those captured changes should be stored in the target system. For instance, when CDC detects an update, SCD Type 1 overwrites the existing record, while SCD Type 2 adds a new versioned row with start and end timestamps to maintain full history.
What is the difference between CDC and Kafka?
CDC and Kafka serve complementary purposes. CDC is the technique used to capture row-level changes from source databases, while Kafka is a distributed event-streaming platform designed to store, transport, and process those events at scale. In many modern architectures, CDC tools like Debezium use log-based capture to detect new data in the source system, then publish the resulting events into Kafka topics. From there, downstream applications, services, or data platforms consume the latest data in real time.
Conclusion
Change Data Capture has become a core capability for modern data teams. Whether powering real-time dashboards, feeding machine learning models or enabling seamless data migrations through cloud data warehouse modernization and lakehouse architecture, CDC helps keep systems aligned and data trustworthy, with real-time data synchronization.
The success of this process depends on thoughtful design: select the right method, plan for scale and monitor for quality. From here, assess how these principles fit your architecture, and begin with a small proof of concept, then refine as you grow. With the right approach, CDC moves beyond a pipeline task to become a lasting business advantage.
Want more tips and best practices for modern data engineering? Get the Data Engineering toolkit, a selection of resources for reliable data pipelines.


