
What is Tungsten?

How the Tungsten project optimizes Spark’s execution engine with smarter memory use, cache-aware algorithms, and code generation to push performance closer to bare metal

by Databricks Staff

  • Tungsten is a project in Apache Spark that focuses on making Spark applications more efficient by improving how they use memory and CPU so performance gets closer to modern hardware limits.
  • The Tungsten project includes work on explicit memory management, cache-aware computation, code generation and reducing virtual function calls to cut overhead in the execution engine.
  • By keeping more intermediate data in CPU registers and optimizing tight loops, Tungsten delivers significant speedups for Spark SQL and DataFrame workloads that are increasingly limited by CPU and memory rather than I/O.

What is the Tungsten Project?

Tungsten is the codename for the umbrella project of changes to Apache Spark’s execution engine that substantially improve the efficiency of memory and CPU use for Spark applications, pushing performance closer to the limits of modern hardware.


The Tungsten Project Includes These Initiatives:

  • Memory Management and Binary Processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
  • Cache-aware computation: algorithms and data structures that exploit the memory hierarchy
  • Code generation: using code generation to exploit modern compilers and CPUs
  • No virtual function dispatches: eliminating virtual function dispatches cuts per-call CPU overhead, which has a profound impact on performance when an operation is dispatched billions of times
  • Intermediate data in CPU registers vs. memory: Tungsten Phase 2 places intermediate data into CPU registers; reading data from registers takes orders of magnitude fewer cycles than reading it from memory
  • Loop unrolling and SIMD: optimizing Apache Spark’s execution engine to take advantage of modern compilers’ and CPUs’ ability to efficiently compile and execute simple for loops (as opposed to complex function call graphs)
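The code-generation and virtual-dispatch bullets above can be illustrated with a minimal Scala sketch (these are illustrative classes, not Spark’s actual internals): an interpreted expression tree evaluates each row through virtual `eval` calls on operator objects, while a "generated" version collapses the same computation into one tight loop whose intermediates live in local variables, candidates for CPU registers, and which the JIT can unroll or vectorize.

```scala
// Interpreted style: each operator is an object, and evaluating a row
// walks the tree through virtual `eval` calls per node, per row.
sealed trait Expr { def eval(row: Array[Long]): Long }
final case class Col(i: Int) extends Expr {
  def eval(row: Array[Long]): Long = row(i)
}
final case class Add(l: Expr, r: Expr) extends Expr {
  // one virtual dispatch per child, repeated for every row
  def eval(row: Array[Long]): Long = l.eval(row) + r.eval(row)
}

object Sketch {
  // Sum col0 + col1 over all rows by walking the tree for each row.
  def interpreted(rows: Array[Array[Long]]): Long = {
    val expr: Expr = Add(Col(0), Col(1))
    var sum = 0L
    for (row <- rows) sum += expr.eval(row) // dispatches billions of times at scale
    sum
  }

  // "Generated" style: the same computation fused into a simple loop,
  // as whole-stage code generation would emit. No virtual calls; the
  // running sum stays in a local (register-friendly) variable.
  def generated(rows: Array[Array[Long]]): Long = {
    var sum = 0L
    var i = 0
    while (i < rows.length) {
      sum += rows(i)(0) + rows(i)(1)
      i += 1
    }
    sum
  }
}
```

Both functions compute the same result; the point is the shape of the machine code each one leads to, which is exactly the difference the initiatives above target.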

The focus on CPU efficiency is motivated by the fact that Spark workloads are increasingly bottlenecked by CPU and memory use rather than by I/O and network communication, a trend shown by recent research on the performance of big data workloads.
 

