
What is Tungsten?

How the Tungsten project optimizes Spark’s execution engine with smarter memory use, cache-aware algorithms and code generation to push performance closer to bare metal


Summary

  • See how Tungsten focuses on improving memory and CPU efficiency in Spark’s execution engine to get closer to modern hardware limits for Spark applications.
  • Learn about Tungsten’s key initiatives, including explicit memory management, cache-aware computation, code generation and reducing virtual function dispatches.
  • Understand how techniques like keeping intermediate data in CPU registers, loop unrolling and SIMD support deliver major speedups for Spark SQL and DataFrame workloads.

What is the Tungsten Project?

Tungsten is the codename for the umbrella project that reworks Apache Spark’s execution engine. It focuses on substantially improving memory and CPU efficiency for Spark applications, pushing performance closer to the limits of modern hardware.


The Tungsten Project Includes These Initiatives:

  • Memory management and binary processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
  • Cache-aware computation: algorithms and data structures to exploit memory hierarchy
  • Code generation: using code generation to exploit modern compilers and CPUs
  • No virtual function dispatches: each virtual call costs multiple CPU instructions, so removing these calls can have a profound impact on performance when they would otherwise be dispatched billions of times
  • Intermediate data in CPU registers instead of memory: Tungsten Phase 2 places intermediate data in CPU registers, which takes orders of magnitude fewer cycles to read than memory
  • Loop unrolling and SIMD: optimizing Apache Spark’s execution engine to take advantage of the ability of modern compilers and CPUs to efficiently compile and execute simple for loops (as opposed to complex function call graphs)
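
The code generation, virtual dispatch, register and simple-loop items above can be illustrated with a small Java sketch. It is not Spark's API or the code Spark actually generates; every class and method name here (FusedLoopSketch, RowIterator, Scan, Filter, Project, interpretedSum, fusedSum) is hypothetical. The first version evaluates "filter even values, multiply by 3, sum" through a chain of iterator operators, so each row passes through several interface calls. The second version is roughly what a code generator could emit for the same query: one plain for loop with the running sum held in a local variable that the JIT compiler is free to keep in a register.

```java
import java.util.NoSuchElementException;
import java.util.function.LongPredicate;
import java.util.function.LongUnaryOperator;

// Illustrative sketch only: all names are hypothetical, not part of Spark.
final class FusedLoopSketch {

    // Iterator-style operator interface: each row costs interface (virtual) calls.
    interface RowIterator {
        boolean hasNext();
        long next();
    }

    // Leaf operator that reads values from an in-memory array.
    static final class Scan implements RowIterator {
        private final long[] data;
        private int pos = 0;
        Scan(long[] data) { this.data = data; }
        public boolean hasNext() { return pos < data.length; }
        public long next() { return data[pos++]; }
    }

    // Passes through only the rows that satisfy the predicate.
    static final class Filter implements RowIterator {
        private final RowIterator child;
        private final LongPredicate predicate;
        private long buffered;
        private boolean hasBuffered = false;
        Filter(RowIterator child, LongPredicate predicate) {
            this.child = child;
            this.predicate = predicate;
        }
        public boolean hasNext() {
            // Pull from the child until a matching row is found or input ends.
            while (!hasBuffered && child.hasNext()) {
                long value = child.next();
                if (predicate.test(value)) {
                    buffered = value;
                    hasBuffered = true;
                }
            }
            return hasBuffered;
        }
        public long next() {
            if (!hasNext()) throw new NoSuchElementException();
            hasBuffered = false;
            return buffered;
        }
    }

    // Applies a function to every row produced by its child.
    static final class Project implements RowIterator {
        private final RowIterator child;
        private final LongUnaryOperator fn;
        Project(RowIterator child, LongUnaryOperator fn) {
            this.child = child;
            this.fn = fn;
        }
        public boolean hasNext() { return child.hasNext(); }
        public long next() { return fn.applyAsLong(child.next()); }
    }

    // Operator-at-a-time evaluation: many interface calls per row.
    static long interpretedSum(long[] data) {
        RowIterator plan =
            new Project(new Filter(new Scan(data), v -> v % 2 == 0), v -> v * 3);
        long sum = 0;
        while (plan.hasNext()) {
            sum += plan.next();
        }
        return sum;
    }

    // Fused evaluation: the whole pipeline collapsed into one simple loop,
    // with intermediate values (v, sum) kept in local variables.
    static long fusedSum(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            long v = data[i];
            if (v % 2 == 0) {
                sum += v * 3;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        // Both strategies compute the same result; only the execution shape differs.
        System.out.println(interpretedSum(data) == fusedSum(data));
    }
}
```

Both methods return the same answer. The difference is in the shape of the work: the fused loop has no per-row operator calls and is the kind of simple loop that compilers can unroll or vectorize, whereas the iterator chain depends on the JIT to inline through several interface boundaries. Actual speedups depend on the JVM, the data and the query, so this sketch makes no performance claim of its own.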

The focus on CPU efficiency is motivated by the fact that Spark workloads are increasingly bottlenecked by CPU and memory use rather than IO and network communication. The trend is shown by recent research on the performance of big data workloads.
 
