This talk outlines data lake design patterns that can yield large performance gains for downstream consumers. We will cover how to optimize Parquet data lakes and the additional features that Databricks Delta provides. Topics include:
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren’t mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
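As a rough illustration of the file-size and compaction topics above, here is a minimal sketch of the arithmetic involved in picking an output file count. The helper name `target_file_count` and the 1 GB target are assumptions made for illustration; they are not taken from the talk.

```python
# Hypothetical helper: estimate how many files a compaction job should write.
# The 1 GB default target is an illustrative assumption, not a figure from the talk.
ONE_GB = 1024 ** 3


def target_file_count(total_bytes: int, target_file_bytes: int = ONE_GB) -> int:
    """Return the number of output files needed so each is at most target_file_bytes."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always write at least one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


if __name__ == "__main__":
    # 10 GB spread over thousands of small files would be rewritten as 10 files.
    print(target_file_count(10 * ONE_GB))
```

In a Spark job, a count like this could be passed to `df.repartition(n)` before writing the data back out, so that many small files are rewritten as fewer, larger ones.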
Matt loves writing Spark open source code and is the author of the spark-style-guide, spark-daria, quinn, and spark-fast-tests. He's obsessed with eliminating UDFs from codebases, perfecting method signatures of the public interface, and writing readable tests that execute quickly. Matt spends most of his time in Colombia and Mexico and wants to move to Brazil and learn Portuguese soon. He loves dancing and small talk. In a past life, Matt worked as an economic consultant and passed all three Chartered Financial Analyst exams.