We are excited to release Unified Profiling for PySpark User-Defined Functions (UDFs) as part of Databricks Runtime 17.0 (release notes). Unified Profiling for PySpark UDFs lets developers profile both the performance and the memory usage of their PySpark UDFs, tracking function call counts, execution time, memory consumption, and other metrics. This enables PySpark developers to easily identify and address bottlenecks, leading to faster and more resource-efficient UDFs.
The unified profilers are enabled by setting the runtime SQL configuration “spark.sql.pyspark.udf.profiler” to “perf” or “memory” for the performance or memory profiler, respectively, as shown below.
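For instance, a minimal sketch of toggling the profilers at runtime with the standard `spark.conf` runtime-configuration API (`spark` is the notebook's SparkSession):

```python
# Enable the performance profiler for subsequent UDF executions.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

# ...or enable the memory profiler instead:
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")

# Unset the config to disable UDF profiling again.
spark.conf.unset("spark.sql.pyspark.udf.profiler")
```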
Legacy profiling [1, 2] was implemented at the SparkContext level and, thus, did not work with Spark Connect. The new profiling is SparkSession-based, applies to Spark Connect, and can be enabled or disabled at runtime. It maximizes API parity with legacy profiling by providing “show” and “dump” commands to visualize profile results and save them to a workspace folder. Additionally, it offers convenience APIs to help manage and reset profile results on demand. Lastly, it supports registered UDFs, which were not supported by the legacy profiling.
The PySpark performance profiler leverages Python's built-in profilers to extend profiling capabilities to the driver and UDFs executed on executors in a distributed manner.
Let's dive into an example to see the PySpark performance profiler in action. We run the following code on Databricks Runtime 17.0 notebooks.
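A minimal sketch of such an example follows; the pandas UDF `add1` and the DataFrame `added` match the names referenced in the results below.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

# A simple pandas UDF; its name (add1) appears in the profile
# results and in the Spark plan.
@pandas_udf("long")
def add1(x: pd.Series) -> pd.Series:
    return x + 1

added = df.select(add1("id"))

# Turn on the performance profiler, then execute the UDF.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()

# Display the collected performance profile results.
spark.profile.show(type="perf")
```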
Running added.show() executes the UDF and collects the profiles; the spark.profile.show(type="perf") command then displays the performance profiling results shown below.
The output includes information such as the number of function calls, total time spent in the given function, and the filename, along with the line number to aid navigation. This information is essential for identifying tight loops in your PySpark programs and enabling you to make decisions to improve performance.
It's important to note that the UDF id in these results correlates directly with the one found in the Spark plan: observe “ArrowEvalPython [add1(...)#50L]”, which is revealed when calling the explain method on the DataFrame.
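For example (the exact id suffix, such as #50L, varies from query to query):

```python
# The physical plan contains an ArrowEvalPython node, e.g.
# ArrowEvalPython [add1(...)#50L], whose id matches the one
# reported in the profile results.
added.explain()
```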
Finally, we can dump the profiling results to a folder and clear the result profiles as shown below.
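For instance, using the session-based profiling APIs (the destination folder below is illustrative; on Databricks this could be a workspace folder path):

```python
# Save the collected profile results to a folder.
spark.profile.dump("/tmp/udf_profiles")  # illustrative path

# Reset the accumulated profile results on demand.
spark.profile.clear()
```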
The PySpark memory profiler is based on the memory-profiler package, which on its own can profile the driver, as seen here. PySpark extends its usage to profile UDFs, which are executed on executors in a distributed manner.
To enable memory profiling, we first install the memory-profiler package on the cluster as shown below.
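From a notebook, one way to do this is with pip (here as a notebook-scoped library; it can also be installed as a cluster library):

```python
%pip install memory-profiler
```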
The memory profiling example modifies the last two lines of the code above: it sets “spark.sql.pyspark.udf.profiler” to “memory” instead of “perf”, and displays the results with spark.profile.show(type="memory").
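A sketch of the updated lines, reusing the `added` DataFrame from the earlier example:

```python
# Turn on the memory profiler, then execute the UDF again.
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()

# Display the collected memory profile results.
spark.profile.show(type="memory")
```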
Then we obtain memory profiling results as shown below.
The output includes several columns that give you a comprehensive view of how your code performs in terms of memory usage. "Mem usage" reveals the memory usage after executing that line. "Increment" details the change in memory usage from the previous line, helping you spot where memory usage spikes. "Occurrences" indicates how many times each line was executed.
As with the performance profiling results, the UDF id in these results correlates directly with the one found in the Spark plan: observe “ArrowEvalPython [add1(...)#4L]”, which is revealed when calling the explain method on the DataFrame as shown below.
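Again a sketch, reusing the `added` DataFrame from above:

```python
# The id in the ArrowEvalPython node (e.g. add1(...)#4L) matches
# the UDF id reported in the memory profile results.
added.explain()
```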
Please note that for this functionality to work, the memory-profiler package must be installed on your cluster.
PySpark Unified Profiling, which includes performance and memory profiling for UDFs, is available in Databricks Runtime 17.0. Unified Profiling provides a streamlined method for observing important aspects such as function call frequency, execution durations, and memory consumption. It simplifies the process of pinpointing and resolving bottlenecks, paving the way for the development of faster and more resource-efficient UDFs.
Ready to explore more? Check out the PySpark API documentation for detailed guides and examples.