Session

Faster queries in local laptop mode for Apache Spark

Overview

Experience: In Person
Track: Data Warehousing
Industry: Enterprise Technology
Technologies: Databricks SQL
Skill Level: Intermediate
Apache Spark is optimized for distributed, large-scale processing, but this architecture introduces significant overhead for small, local datasets, so queries over less than 100 MB of data often take several seconds. In this talk, we present a suite of optimizations that dramatically improve Spark's local-mode performance across three fronts: smarter query compilation and task scheduling that eliminate unnecessary shuffles, an Arrow-based df.cache implementation that cuts cache load and query times in half, and a novel shuffle-free execution mode that uses Java virtual threads for lightweight in-process data transfer. We will walk through the design, benchmarks, and engineering trade-offs of each approach, and show how together they position Apache Spark as a practical gateway engine from laptop-scale exploration to petabyte-scale production.

Session Speakers

Daniel Tenedorio

Sr. Staff Software Engineer
Databricks