Optimizing PySpark UDFs with Unified Profiling: Enhanced Performance and Memory Insights

Published: June 9, 2025

Open Source · 3 min read

Summary

  • Introducing PySpark UDF Unified Profiling – Learn how performance and memory profiling for UDFs in Databricks Runtime 17.0 helps optimize execution and resource usage.
  • Enhancing Performance and Debugging – Explore how to track function calls, execution time, and memory consumption to identify bottlenecks and improve efficiency.
  • Replacing Legacy Profiling with a Unified Approach – Understand the benefits of the new SparkSession-based profiling, its compatibility with Spark Connect, and how to enable, visualize, and manage profiling results.

We are excited to release Unified Profiling for PySpark User-Defined Functions (UDFs) as part of Databricks Runtime 17.0 (release notes). Unified Profiling for PySpark UDFs lets developers profile the performance and memory usage of their PySpark UDFs, including tracking function calls, execution time, memory usage, and other metrics. This enables PySpark developers to easily identify and address bottlenecks, leading to faster and more resource-efficient UDFs.

The unified profilers are enabled by setting the Runtime SQL configuration “spark.sql.pyspark.udf.profiler” to “perf” or “memory” for the performance or memory profiler, respectively, as shown below.
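For example, from a notebook cell (a minimal sketch; spark is the notebook's SparkSession):

    # Enable the performance profiler for PySpark UDFs in this session.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    # ...or enable the memory profiler instead:
    # spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")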

Replacement for Legacy Profiling

Legacy profiling [1, 2] was implemented at the SparkContext level and, thus, did not work with Spark Connect. The new profiling is SparkSession-based, applies to Spark Connect, and can be enabled or disabled at runtime. It maximizes API parity with legacy profiling by providing “show” and “dump” commands to visualize profile results and save them to a workspace folder. Additionally, it offers convenience APIs to help manage and reset profile results on demand. Lastly, it supports registered UDFs, which were not supported by the legacy profiling.
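For illustration, here is a minimal sketch of the SparkSession-based API applied to a registered UDF; the plus_one UDF and the query are assumptions for this example, not the blog's original code:

    from pyspark.sql.functions import udf

    # Enable the performance profiler for this SparkSession.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    @udf("long")
    def plus_one(v):
        return v + 1

    # Registered UDFs are covered by unified profiling (the legacy profiler skipped them).
    spark.udf.register("plus_one", plus_one)
    spark.sql("SELECT plus_one(id) AS v FROM range(10)").collect()

    # Visualize the collected results for this session.
    spark.profile.show(type="perf")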

PySpark Performance Profiler

The PySpark performance profiler leverages Python's built-in profilers to extend profiling capabilities to the driver and UDFs executed on executors in a distributed manner.

Let's dive into an example to see the PySpark performance profiler in action. We run the following code on Databricks Runtime 17.0 notebooks.
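The notebook code appears as a screenshot in the original post; a self-contained sketch along the same lines (the add1 name matches the UDF id seen in the Spark plan later, while the rest of the snippet is an assumption) might look like this:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Turn on the performance profiler for this session.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    @pandas_udf("long")
    def add1(x: pd.Series) -> pd.Series:
        return x + 1

    df = spark.range(10)
    added = df.select(add1("id"))

    # Executing the query runs add1 on the executors and records its profile.
    added.show()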

Calling added.show() executes the UDF and collects its profile; the performance profiling results can then be displayed as shown below.
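Assuming the session-scoped profiling API, the collected report is rendered with:

    # Print the per-UDF, cProfile-style performance report.
    spark.profile.show(type="perf")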

The output includes information such as the number of function calls, total time spent in the given function, and the filename, along with the line number to aid navigation. This information is essential for identifying tight loops in your PySpark programs and enabling you to make decisions to improve performance.

It's important to note that the UDF id in these results corresponds directly to the id in the Spark plan: it appears in the “ArrowEvalPython [add1(...)#50L]” node revealed by calling the explain method on the DataFrame.

Finally, we can dump the profiling results to a folder and clear the result profiles as shown below.
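A sketch of dumping and clearing the results, assuming the dump API takes the destination path (the folder shown is hypothetical):

    # Persist the collected profiles to a folder, one file per UDF id.
    spark.profile.dump("/Workspace/Users/<user>/udf_profiles")

    # Reset the accumulated profiling results.
    spark.profile.clear()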

PySpark Memory Profiler

The PySpark memory profiler is based on the memory-profiler package, which PySpark has long supported for profiling driver-side code. PySpark extends its usage to UDFs executed on executors in a distributed manner.

To enable memory profiling on a cluster, we should install the memory-profiler on the cluster as shown below.
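For example, from a notebook cell (a %pip install applies to the notebook's environment across the cluster); alternatively, attach memory-profiler as a cluster library:

    # Install memory-profiler for the notebook's environment across the cluster.
    %pip install memory_profiler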

The memory profiling example reuses the code above, modifying only its last couple of lines: the profiler configuration is switched from “perf” to “memory” and the UDF is executed again.
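A sketch of the modified lines, assuming the same df and add1 from the earlier example:

    # Switch the profiler to memory mode and execute the UDF again.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
    added = df.select(add1("id"))
    added.show()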

Then we obtain memory profiling results as shown below.
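Assuming the same session-scoped API, the memory report is displayed with:

    # Print the per-UDF, line-by-line memory report.
    spark.profile.show(type="memory")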

The output includes several columns that give you a comprehensive view of how your code performs in terms of memory usage. "Mem usage" reveals the memory usage after executing that line. "Increment" details the change in memory usage from the previous line, helping you spot where memory usage spikes. "Occurrences" indicates how many times each line was executed.

As with the performance profiling results, the UDF id here corresponds directly to the id in the Spark plan: it appears in the “ArrowEvalPython [add1(...)#4L]” node revealed by calling the explain method on the DataFrame, as shown below.
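A sketch of inspecting the plan (the expression id, #4L here, will vary between runs):

    # Reveal the physical plan; the ArrowEvalPython node carries the UDF's expression id.
    added.explain()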

Please note that for this functionality to work, the memory-profiler package must be installed on your cluster.

Conclusion

PySpark Unified Profiling, which includes performance and memory profiling for UDFs, is available in Databricks Runtime 17.0. Unified Profiling provides a streamlined method for observing important aspects such as function call frequency, execution durations, and memory consumption. It simplifies the process of pinpointing and resolving bottlenecks, paving the way for the development of faster and more resource-efficient UDFs.

Ready to explore more? Check out the PySpark API documentation for detailed guides and examples.
