Spark DSV2: Growing Up Fast
Overview
| Experience | In Person |
|---|---|
| Track | Data Engineering & Streaming |
| Industry | Enterprise Technology |
| Technologies | Databricks SQL |
| Skill Level | Intermediate |
Spark’s DataSource V2 integration has taken a major step forward, beginning with the addition of a procedure catalog and row identifier support to enable richer table management and row-level operations across modern data sources. Building on this foundation, recent updates improve MERGE INTO with safer schema evolution and enhance partition filtering for more efficient query planning and execution. DML summaries now provide clearer visibility into write behavior, while key cache fixes resolve long-standing correctness issues in Spark execution. DataSource V2 has also been extended with first-class SQL features such as table constraints, complex default values, and generated columns, laying the groundwork for more advanced table semantics. We’ll also look ahead to deeper Change Data Feed (CDF) integration in Spark to support robust incremental processing, highlighting what’s available today and what’s coming next as Spark continues to close the gap with traditional warehouse systems.
Session Speakers
Szehon Ho
Software Engineer
Databricks
Anton Okolnychyi
Senior Staff Software Engineer
Databricks