Session
Faster queries in local laptop mode for Apache Spark
Overview
| | |
|---|---|
| Experience | In Person |
| Track | Data Warehousing |
| Industry | Enterprise Technology |
| Technologies | Databricks SQL |
| Skill Level | Intermediate |
Apache Spark is optimized for distributed, large-scale processing, but this architecture introduces significant overhead for small, local datasets, so queries over less than 100 MB of data often take several seconds. In this talk, we present a suite of optimizations that dramatically improve Spark's local-mode performance on three fronts: smarter query compilation and task scheduling that eliminate unnecessary shuffles, an Arrow-based df.cache implementation that cuts cache load and query times in half, and a novel shuffle-free execution mode that uses Java virtual threads for lightweight in-process data transfer. We will walk through the design, benchmarks, and engineering trade-offs of each approach, and show how together they position Apache Spark as a practical gateway engine from laptop-scale exploration to petabyte-scale production.
Session Speakers
Daniel Tenedorio
Sr. Staff Software Engineer
Databricks