Data + AI Summit 2023
JUNE 26-29, 2023
SAN FRANCISCO + VIRTUAL

Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

On Demand

Type

  • Session

Format

  • In-Person

Track

  • Data Lakes, Data Warehouses and Data Lakehouses

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 151

Duration

  • 35 min

Overview

Modern data lake architectures rely on object storage as the single source of truth. We use object stores to hold a growing volume of data that is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack the semantics for atomicity, rollbacks, and reproducibility that data quality and resiliency require.



lakeFS, an open source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.
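
To make the Git-like concepts above concrete, here is a minimal, purely illustrative sketch in Python. It is not lakeFS code and does not use the lakeFS API: `ToyRepo`, `Commit`, and all method names are hypothetical, and the merge is a simplified "source wins" overlay with no conflict detection or delete handling. It only shows the idea that a branch is a pointer to an immutable snapshot of object metadata, so branching copies no data and a rollback just moves the pointer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Commit:
    """A snapshot mapping logical paths to physical addresses (hypothetical)."""
    objects: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


@dataclass
class ToyRepo:
    """Toy in-memory model: branch names point at commits, as in Git."""
    branches: Dict[str, Commit] = field(default_factory=dict)
    staged: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.branches["main"] = Commit(objects={}, message="initial")
        self.staged["main"] = {}

    def create_branch(self, name: str, source: str) -> None:
        # Branching copies a pointer, not the underlying objects.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch: str, path: str, address: str) -> None:
        # Writes are staged on the branch until committed.
        self.staged[branch][path] = address

    def commit(self, branch: str, message: str) -> Commit:
        head = self.branches[branch]
        snapshot = {**head.objects, **self.staged[branch]}
        new_head = Commit(objects=snapshot, message=message, parent=head)
        self.branches[branch] = new_head
        self.staged[branch] = {}
        return new_head

    def merge(self, source: str, dest: str) -> Commit:
        # Simplified merge: overlay source on dest, source wins on overlap.
        src, dst = self.branches[source], self.branches[dest]
        merged = {**dst.objects, **src.objects}
        new_head = Commit(merged, f"merge {source} into {dest}", parent=dst)
        self.branches[dest] = new_head
        return new_head

    def rollback(self, branch: str) -> None:
        # Rolling back moves the branch pointer to the previous commit.
        parent = self.branches[branch].parent
        if parent is not None:
            self.branches[branch] = parent

    def list_paths(self, branch: str) -> list:
        # Committed view of the branch, sorted for stable output.
        return sorted(self.branches[branch].objects)


repo = ToyRepo()
repo.put("main", "events/part-0.parquet", "blk-001")
repo.commit("main", "load events")

repo.create_branch("etl", "main")
repo.put("etl", "events/part-1.parquet", "blk-002")
repo.commit("etl", "add part 1")

isolated = repo.list_paths("main")   # main does not see the etl write yet
repo.merge("etl", "main")
merged = repo.list_paths("main")     # both objects visible after the merge
repo.rollback("main")
rolled_back = repo.list_paths("main")  # back to the pre-merge snapshot
```

In this sketch, writes on the `etl` branch stay invisible to `main` until the merge, and the rollback restores the earlier snapshot without touching any stored object. A real system would also need conflict detection, deletes, and a metadata layout that scales; none of that is modeled here.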



In this talk, you'll learn about the challenges of using object storage for data lakes and how lakeFS helps you solve them.



By the end of the session, you'll understand how lakeFS scales its Git-like data model to petabytes of data across billions of objects, without affecting throughput or performance. We will also demo branching, writing data with Spark, and merging on a billion-object repository.

Session Speakers

Oz Katz

CTO @ Treeverse, co-creator of lakeFS

Treeverse LTD
