Data + AI Summit 2022

PySpark in Apache Spark 3.3 and Beyond

On Demand


  • Session
  • Hybrid
  • Data Engineering
  • Intermediate
  • Moscone South | Upper Mezzanine | 160
  • 35 min

Overview

PySpark has evolved rapidly since Project Zen was introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, implemented visualization, and more. Most importantly, Apache Spark 3.2 introduced the pandas API on Spark, which exposes the pandas API on top of Apache Spark and has since gained a lot of popularity.

In Apache Spark 3.3, the Project Zen effort continued, and PySpark gained many changes: more API coverage and a faster default index in the pandas API on Spark, datetime.timedelta support, a new PyArrow batch interface, better autocompletion, a Python and pandas UDF profiler, and new error classification.

In this talk, we will introduce what is new in PySpark in Apache Spark 3.3 and what comes next beyond Apache Spark 3.3, based on the current effort and roadmap for PySpark.
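As a rough illustration of the datetime.timedelta support mentioned above, the sketch below builds plain Python rows that carry timedelta values and passes them to `createDataFrame`. It is not taken from the talk: the helper names `build_sessions` and `to_dataframe`, the column names, and the second session title are hypothetical, and `spark` is assumed to be an existing SparkSession running Apache Spark 3.3 or later.

```python
from datetime import timedelta


def build_sessions():
    """Return plain Python rows; the second field is a datetime.timedelta."""
    return [
        ("PySpark in Apache Spark 3.3 and Beyond", timedelta(minutes=35)),
        ("Hypothetical follow-up session", timedelta(minutes=40)),  # made-up row
    ]


def to_dataframe(spark, rows):
    """Create a Spark DataFrame from rows that contain timedelta values.

    Assumption: `spark` is an existing SparkSession on Spark 3.3+, where
    timedelta values are expected to be inferred as a day-time interval
    column instead of failing type inference.
    """
    return spark.createDataFrame(rows, schema=["title", "duration"])


# Pure-Python part: runs without Spark and shows the values being passed in.
rows = build_sessions()
total = sum((duration for _, duration in rows), timedelta())
```

With a session available, calling `to_dataframe(spark, build_sessions()).printSchema()` would be one way to check how the `duration` column is typed on a given Spark version.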

Session Speakers

Hyukjin Kwon


Xinrong Meng

Software Engineer

Databricks, Inc.
