Apache Spark’s Built-in File Sources in Depth


In the Spark 3.0 release, all the built-in file source connectors (including Parquet, ORC, JSON, Avro, CSV, and Text) are re-implemented using the new Data Source API V2. We will give a technical overview of how Spark reads and writes these file formats based on user-specified data layouts. The talk will also explain the differences between the Hive SerDe and the native connectors, and share experience on tuning the connectors and choosing data layouts for the best performance.


See More Spark + AI Summit Europe 2019 Videos

About Gengliang Wang


Gengliang Wang is a software engineer at Databricks. He is an active Spark contributor, and his main interest is in Spark SQL. Previously, he worked on building backend web services at LinkedIn and Hulu.