Session
Faster, Leaner, and Easier to Debug: PySpark UDFs in 2026
Overview
| Experience | In Person |
|---|---|
| Track | Data Engineering & Streaming |
| Industry | Enterprise Technology |
| Technologies | Lakeflow |
| Skill Level | Intermediate |
PySpark has long provided one of the most powerful and flexible ways to bring Python logic to large-scale distributed data processing. In this talk, we focus on two major advances in the PySpark UDF ecosystem: performance optimizations and improved debuggability.

First, we introduce Arrow-based execution for Python, including Native Arrow UDFs and Arrow UDTFs, which operate directly on columnar Arrow data without Pandas conversion overhead. This design reduces memory usage, improves support for complex types, and delivers faster execution compared to earlier Python UDF approaches.

Second, we present new debuggability enhancements for Python running inside Spark tasks, including built-in faulthandler integration for clearer crash diagnostics and improved profiling support to better understand CPU and memory behavior of UDFs.

Together, these improvements make PySpark UDFs not only faster and more scalable, but also significantly easier to diagnose and optimize in production environments.
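The faulthandler integration described above builds on Python's standard-library `faulthandler` module, which dumps the traceback of every thread when the interpreter crashes hard (segfault, fatal signal, deadlock). A minimal stdlib-only sketch of what that module provides, with no Spark involved; in PySpark this behavior is toggled per worker via a Spark configuration (e.g. `spark.python.worker.faulthandler.enabled` in recent releases):

```python
import faulthandler
import tempfile

# Enable crash-time traceback dumps for the whole process. This is the kind
# of call a Python worker would make on startup when the integration is on;
# on a fatal crash, faulthandler writes every thread's stack to stderr.
faulthandler.enable()

# dump_traceback() writes the current stack of every thread on demand to a
# real file (it needs a file descriptor, so an in-memory buffer won't do).
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump = f.read()

print("faulthandler enabled:", faulthandler.is_enabled())
print("thread dump captured:", "Current thread" in dump)
```

Because the dump is produced by C code rather than Python, it still works when the interpreter itself is wedged, which is what makes it useful for diagnosing native crashes inside Spark tasks.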
Session Speakers
Tian Gao
Senior Software Engineer
Databricks
Yicong Huang
Software Engineer