Session

Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026

Overview

Experience: In Person
Track: Data Engineering & Streaming
Industry: Enterprise Technology
Technologies: Lakeflow
Skill Level: Intermediate

PySpark has long provided one of the most powerful and flexible ways to bring Python logic to large-scale distributed data processing. In this talk, we focus on two major advances in the PySpark UDF ecosystem: performance optimizations and improved debuggability. First, we introduce Arrow-based execution for Python, including Native Arrow UDFs and Arrow UDTFs, which operate directly on columnar Arrow data without Pandas conversion overhead. This design reduces memory usage, improves support for complex types, and delivers faster execution compared to earlier Python UDF approaches. Second, we present new debuggability enhancements for Python running inside Spark tasks, including built-in faulthandler integration for clearer crash diagnostics and improved profiling support to better understand CPU and memory behavior of UDFs. Together, these improvements make PySpark UDFs not only faster and more scalable, but also significantly easier to diagnose and optimize in production environments.
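
To give a feel for the crash diagnostics mentioned above, here is a minimal sketch using Python's standard `faulthandler` module on its own, outside Spark. It is not the Spark integration itself: `dump_current_stack` and `parse_price` are hypothetical names introduced only for illustration, and the function body stands in for whatever logic a UDF might run. The sketch shows the kind of low-level traceback text `faulthandler` produces, which is what a worker-side integration would presumably surface when a Python process crashes.

```python
import faulthandler
import tempfile
from typing import Tuple


def dump_current_stack() -> str:
    """Return the traceback text faulthandler writes for the calling thread."""
    # faulthandler writes straight to a file descriptor, so a real
    # temporary file is used here rather than an in-memory buffer.
    with tempfile.TemporaryFile(mode="w+") as log:
        faulthandler.dump_traceback(file=log, all_threads=False)
        log.seek(0)
        return log.read()


def parse_price(raw: str) -> Tuple[float, str]:
    """Hypothetical UDF body: parse a price and capture the stack while running."""
    stack_text = dump_current_stack()
    value = float(raw.strip().lstrip("$"))
    return value, stack_text


if __name__ == "__main__":
    price, stack = parse_price(" $12.50 ")
    print(price)
    print(stack)
```

In a real crash (for example a segfault in a native extension), `faulthandler.enable()` would emit a similar dump automatically before the process dies, which is why it is useful for diagnosing failures that ordinary Python exceptions never report.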

Session Speakers

Tian Gao

Senior Software Engineer
Databricks

Yicong Huang

Software Engineer