PySpark in Apache Spark 3.3 and Beyond
- Data Engineering
- Moscone South | Upper Mezzanine | 160
- 35 min
PySpark has rapidly evolved with the momentum of Project Zen introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, implemented visualization, etc. Most importantly, Pandas API on Spark was introduced from Apache Spark 3.2 which exposes the pandas API that runs on Apache Spark, and the Pandas API on Spark has gained a lot of popularity.
In Apache Spark 3.3, the effort of Project Zen continued and PySpark has many cool changes such as more API coverage & faster default index in Pandas API on Spark, datetime.timedelta support, new PyArrow batch interface, better autocompletion, Python & Pandas UDF profiler and new error classification.
In this talk, we will introduce what is new in PySpark at Apache Spark 3.3, and what is next beyond Apache Spark 3.3 with the current effort and roadmap in PySpark.