An open-source columnar storage file format that delivers exceptional compression and query performance for big data analytics workloads
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes that improve performance when handling complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.
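As an illustrative sketch, the snippet below writes a small Parquet file with the pyarrow library (the file name and data are placeholders, not from the article), showing compression and dictionary encoding being applied at write time:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small table; the low-cardinality "country" column dictionary-encodes well.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "US", "DE", "US"],
})

# Compression and dictionary encoding are applied per column chunk at write time.
pq.write_table(table, "events.parquet", compression="snappy", use_dictionary=True)
```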
Apache Parquet is implemented using the record-shredding and assembly algorithm described in Google's Dremel paper, which accommodates the complex, nested data structures that Parquet can store. Parquet is optimized to work with complex data in bulk and offers several efficient options for data compression and encoding. This approach is especially effective for queries that need to read only certain columns from a large table: Parquet reads just the needed columns, greatly minimizing I/O.
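As a sketch of that column-pruning behavior, again with pyarrow (reusing the illustrative file from the example above), a reader can request only the columns a query actually needs:

```python
import pyarrow.parquet as pq

# Only the listed column is read from disk; pages for the other columns are
# never touched, which is where the I/O savings on wide tables come from.
table = pq.read_table("events.parquet", columns=["country"])
print(table)
```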
CSV is a simple and common format that is used by many tools such as Excel, Google Sheets, and numerous others. Even though CSV files are the default format for many data processing pipelines, they have some disadvantages:

- CSV is row-oriented, so a query that needs only a few columns must still scan every row in full.
- CSV has no built-in compression or columnar encodings, so files are larger and more expensive to store and scan.
- CSV carries no schema, so column names and types must be inferred or supplied on every read.
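Converting an existing CSV file to Parquet is typically a one-liner; here is an illustrative sketch with pandas, which delegates to a Parquet engine such as pyarrow under the hood (the file names are placeholders):

```python
import pandas as pd

# Read the row-oriented CSV, then write it back out as columnar, compressed Parquet.
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet", compression="snappy", index=False)
```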
Parquet has helped its users reduce storage requirements by at least one-third on large datasets; it also greatly improves scan and deserialization time, and hence overall cost. The following table compares the savings and the speedup obtained by converting data from CSV into Parquet.
| Dataset | Size on Amazon S3 | Query Run Time | Data Scanned | Cost |
| --- | --- | --- | --- | --- |
| Data stored as CSV files | 1 TB | 236 seconds | 1.15 TB | $5.75 |
| Data stored in Apache Parquet format | 130 GB | 6.78 seconds | 2.51 GB | $0.01 |
| Savings | 87% less when using Parquet | 34x faster | 99% less data scanned | 99.7% savings |
The open source Delta Lake project builds upon and extends the Parquet format, adding functionality such as ACID transactions on cloud object storage, time travel, schema evolution, and simple DML commands (UPDATE, DELETE, MERGE, and INSERT). Delta Lake implements many of these features through an ordered transaction log, which makes data warehousing functionality possible on cloud object storage. Learn more in the Databricks blog post Diving into Delta Lake: Unpacking the Transaction Log.
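As an illustrative sketch of Delta Lake layered on Parquet, the PySpark snippet below writes a Delta table, runs a DML command against it, and reads an earlier version via time travel. The table path and data are placeholders, and it assumes the delta-spark package is available on the classpath:

```python
from pyspark.sql import SparkSession

# Delta Lake requires these two Spark configs in addition to the delta-spark package.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A Delta table is a set of Parquet data files plus an ordered _delta_log transaction log.
df = spark.createDataFrame([(1, "US"), (2, "DE")], ["user_id", "country"])
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# ACID DML against the table; the change is recorded in the transaction log.
spark.sql("UPDATE delta.`/tmp/events_delta` SET country = 'GB' WHERE user_id = 2")

# Time travel: read the table as it existed at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events_delta")
```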