SESSION

Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Data Lakehouse Architecture
INDUSTRY: Retail and CPG - Food, Travel and Hospitality
TECHNOLOGIES: Apache Spark, Delta Lake, SQL Analytics / BI / Visualizations
SKILL LEVEL: Intermediate
DURATION: 40 min

Efficient ACID upserts on tables are essential for today’s lakehouse. Important use cases, such as the GDPR right to be forgotten and change data capture, rely heavily on them. Although Delta Lake, Apache Iceberg, and Apache Hudi are widely adopted, upserts slow down as data volume scales up, particularly in copy-on-write mode. Slow upserts can even become a blocker to meeting compliance deadlines. We introduce partial copy-on-write within Parquet, using a row-level index to skip unnecessary column chunks efficiently. "Partial" here means performing copy-on-write only for the chunks that need it and skipping unrelated ones. Generally, only a small portion of a file needs to be updated, so most data chunks can be skipped. We have observed speedups of up to 20x compared with existing upserts.
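
The idea of partial copy-on-write can be illustrated with a toy model. The sketch below is not the speakers' implementation and does not use any Parquet API: `Chunk`, `build_row_index`, and `partial_copy_on_write` are hypothetical names, a "chunk" is just a tuple of rows held in memory, and only updates to existing keys are handled. It shows only the core mechanism described above: an index from row key to chunk identifies the chunks that must be rewritten, and every other chunk is carried over untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Toy stand-in for an encoded block of rows (hypothetical, not a Parquet type)."""
    rows: tuple  # tuple of (key, value) pairs


def build_row_index(chunks):
    """Map each row key to the position of the chunk that holds it."""
    return {key: i for i, chunk in enumerate(chunks) for key, _ in chunk.rows}


def partial_copy_on_write(chunks, row_index, updates):
    """Return (new_chunks, rewritten_count) after updating existing keys.

    Only chunks that contain an updated key are rebuilt. Every other chunk
    is reused as-is, which models copying its bytes without decoding them.
    """
    dirty = {row_index[key] for key in updates if key in row_index}
    new_chunks = []
    for i, chunk in enumerate(chunks):
        if i in dirty:
            rows = tuple((k, updates.get(k, v)) for k, v in chunk.rows)
            new_chunks.append(Chunk(rows))
        else:
            new_chunks.append(chunk)  # untouched: no re-encoding needed
    return new_chunks, len(dirty)


# Example: 4 chunks of 3 rows each (keys 0..11); a single row is updated.
chunks = [Chunk(tuple((k, f"v{k}") for k in range(s, s + 3))) for s in range(0, 12, 3)]
index = build_row_index(chunks)
new_chunks, rewritten = partial_copy_on_write(chunks, index, {7: "updated"})
print(rewritten, "of", len(chunks), "chunks rewritten")  # prints "1 of 4 chunks rewritten"
```

In this toy run, the update to key 7 touches only the third chunk, so the other three are reused by reference. A full copy-on-write would instead rebuild all four, which is the cost the partial approach avoids when updates are sparse.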

SESSION SPEAKERS

Xinli Shang

Engineering Manager
Uber

Mingmin Chen

Director of Engineering
Uber Technologies, Inc