Haejoon is a software engineer at Databricks. His main interests are Koalas and PySpark. He is one of the major contributors to the Koalas project.
May 27, 2021 03:15 PM PT
The number of PySpark users has increased dramatically, and Python has become one of the most commonly used languages in data science. To cater to the growing number of Python users and improve Python usability in Apache Spark, the Apache Spark community initiated Project Zen, named after "The Zen of Python," which defines the principles of Python.
Project Zen started with newly redesigned pandas UDFs and function APIs with Python type hints in Apache Spark 3.0. Since then, the Spark community has introduced numerous improvements as part of Project Zen in Apache Spark 3.1 and the upcoming Apache Spark 3.2.
In this talk, we will present the improvements and features in Project Zen, with a demonstration showing how Project Zen makes data science easier through improved usability.
November 18, 2020 04:00 PM PT
Koalas is an open source project that provides pandas APIs on top of Apache Spark. pandas is a Python package commonly used among data scientists, but it does not scale out in a distributed manner. Koalas fills the gap by providing pandas-equivalent APIs that work on Apache Spark. Koalas is useful not only for pandas users but also for PySpark users. For example, PySpark users can visualize their data directly from a PySpark DataFrame via the Koalas plotting APIs. In addition, Koalas users can leverage PySpark-specific APIs such as higher-order functions and a rich set of SQL APIs. In this talk, we will focus on the PySpark aspect and the interaction between PySpark and Koalas, so that PySpark users can leverage their knowledge of Apache Spark in Koalas.
Speakers: Takuya Ueshin and Haejoon Lee