The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). With this context in place, we will dive deeper into the specifics of the Parquet format: representation on disk, physical data organization (row groups, column chunks and pages) and encoding schemes. Equipped with this background, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and to Parquet in general.
This talk serves both as an approachable refresher on columnar storage and as a guide on how to leverage the Parquet format to speed up analytical workloads in Spark, using tangible tips and tricks.
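To make the min/max skipping idea mentioned above a little more concrete, here is a minimal sketch in plain Python. It does not use Parquet or Spark; the `RowGroup` class and the `write_row_groups` and `scan_greater_than` helpers are hypothetical names invented for illustration, and the sketch only assumes that each row group stores the minimum and maximum of a column alongside its values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group holding one integer column."""
    values: List[int]
    min_value: int  # statistic recorded when the group is written
    max_value: int  # statistic recorded when the group is written


def write_row_groups(values: List[int], rows_per_group: int) -> List[RowGroup]:
    """Split values into fixed-size groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, skipping groups that cannot match.

    Returns the matching values and the number of groups actually read.
    """
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            # No value in this group can exceed the threshold: skip it.
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Sorted input gives non-overlapping min/max ranges per group.
sorted_groups = write_row_groups(list(range(100)), rows_per_group=25)
matches, groups_read = scan_greater_than(sorted_groups, threshold=80)
print(len(matches), "matching rows;", groups_read, "of", len(sorted_groups), "groups read")
```

In this sketch the input is sorted, so the per-group ranges do not overlap and most groups can be skipped from their statistics alone. If the same values were shuffled before writing, the ranges would tend to overlap and far fewer groups could be skipped, which is why data layout matters for this kind of filtering.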
Boudewijn Braams is a Software Engineer at Databricks in Amsterdam. He is part of the Storage & I/O team, one of the teams focussing on Databricks Runtime performance. In this team, he has worked on Parquet robustness, the Delta caching layer and cloud storage connectors. Prior to starting full-time, he did his Master's thesis at Databricks in the form of an internship. This research explored early filtering techniques like predicate pushdown in the context of Parquet and the Databricks Runtime. He holds a joint Master's degree in Computer Science from the University of Amsterdam and the Vrije Universiteit.