Data + AI Summit 2022

PySpark in Apache Spark 3.3 and Beyond

On Demand

Type

  • Session

Format

  • Hybrid

Track

  • Data Engineering

Difficulty

  • Intermediate

Room

  • Moscone South | Upper Mezzanine | 160

Duration

  • 35 min

Overview

PySpark has evolved rapidly since Project Zen was introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, and implemented visualization, among other changes. Most importantly, Apache Spark 3.2 introduced the pandas API on Spark, which exposes the pandas API on top of Apache Spark and has gained a lot of popularity.

In Apache Spark 3.3, the Project Zen effort continued, and PySpark gained many changes: broader API coverage and a faster default index in the pandas API on Spark, datetime.timedelta support, a new PyArrow batch interface, better autocompletion, a Python and pandas UDF profiler, and new error classification.

In this talk, we introduce what is new in PySpark in Apache Spark 3.3 and what comes next, based on the current efforts and roadmap for PySpark.

Session Speakers

Hyukjin Kwon

Databricks

Xinrong Meng

Software Engineer

Databricks, Inc.
