Building a Minimalistic Open Lakehouse Using Open Source Projects: Apache Spark™, Project Nessie and Iceberg
A lakehouse architecture combines several components: storage, file format, table format, and catalog. What truly makes a lakehouse 'open' is that data is stored in open source table formats such as Iceberg and Delta and in open file formats such as Parquet, and that the technology itself is open source, so the community can adopt it quickly and easily. Like any new technology, implementing a lakehouse may seem daunting at first. However, once we break the architecture down into its open components, it becomes easy to adopt and scale.
Through this session, the idea is to help data engineers who are getting a foothold in the world of data lakehouses learn about one and implement it easily. We will go through a notebook-style presentation to show beginners how to build a minimalistic, functional lakehouse using Apache Spark, Project Nessie and Iceberg.
In this session, we will cover:
- Configuring the three different components
- Creating tables from raw data files
- Ingesting new data from various sources into the tables, querying it and making updates
- Capabilities such as time travel and compaction
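
As a rough preview of the steps listed above, here is a minimal sketch in Python. It is not the session's actual notebook. The Nessie endpoint, warehouse path, catalog name `nessie`, table `demo.trips`, raw file path, and the `fare` column are all hypothetical placeholders. The sketch also assumes the Iceberg Spark runtime and Nessie Spark extensions jars are on the classpath, and that the Spark version supports `TIMESTAMP AS OF` (3.3 or later).

```python
# Hypothetical defaults: a local Nessie server and a local warehouse directory.
NESSIE_URI = "http://localhost:19120/api/v1"
WAREHOUSE = "file:///tmp/lakehouse"


def lakehouse_conf(catalog="nessie", uri=NESSIE_URI, warehouse=WAREHOUSE, ref="main"):
    """Return Spark settings that register an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # SQL extensions for Iceberg (UPDATE, CALL, ...) and Nessie (branches, tags).
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie REST endpoint (assumed)
        f"{prefix}.ref": ref,              # Nessie branch to read from and write to
        f"{prefix}.warehouse": warehouse,  # where data and metadata files are stored
    }


def build_session(conf):
    """Create a SparkSession from the settings; requires pyspark and the jars above."""
    from pyspark.sql import SparkSession  # lazy import: the rest runs without Spark

    builder = SparkSession.builder.appName("minimal-lakehouse")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def lakehouse_sql(catalog="nessie", table="demo.trips", raw_path="/tmp/raw/trips.parquet"):
    """Return one illustrative SQL statement per step, keyed by step name."""
    namespace = table.split(".")[0]
    full = f"{catalog}.{table}"
    return {
        "namespace": f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}",
        # Create an Iceberg table directly from a raw Parquet file.
        "create": f"CREATE TABLE IF NOT EXISTS {full} USING iceberg "
                  f"AS SELECT * FROM parquet.`{raw_path}`",
        # Ingest more data, then correct some rows (the fare column is hypothetical).
        "ingest": f"INSERT INTO {full} SELECT * FROM parquet.`{raw_path}`",
        "update": f"UPDATE {full} SET fare = 0 WHERE fare < 0",
        # List snapshots, then query the table as of an earlier point in time.
        "snapshots": f"SELECT snapshot_id, committed_at FROM {full}.snapshots",
        "time_travel": f"SELECT * FROM {full} TIMESTAMP AS OF '2024-01-01 00:00:00'",
        # Compact small data files with Iceberg's rewrite_data_files procedure.
        "compact": f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
    }


if __name__ == "__main__":
    for step, statement in lakehouse_sql().items():
        print(f"{step}: {statement}")
```

With a Nessie server running and the jars available, the statements could be executed in order with `spark = build_session(lakehouse_conf())` followed by `spark.sql(statement)` for each entry. Run as a script, the block only prints the statements, so it can be inspected without a cluster.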