Hyukjin Kwon

Software Engineer, Databricks

Hyukjin is a Databricks software engineer and an Apache Spark PMC member and committer, working on many different areas of Apache Spark such as PySpark, Spark SQL, and SparkR. He is also one of the top contributors to Koalas. He mainly focuses on development, facilitating discussions, and reviewing features and changes in Apache Spark and Koalas.

Past sessions

Summit 2021 Project Zen: Making Data Science Easier in PySpark

May 27, 2021 03:15 PM PT

The number of PySpark users has increased dramatically, and Python has become one of the most commonly used languages in data science. To serve the growing number of Python users and improve Python usability in Apache Spark, the Apache Spark community initiated Project Zen, named after "The Zen of Python," which defines the guiding principles of Python.

Project Zen started with newly redesigned pandas UDFs and function APIs with Python type hints in Apache Spark 3.0. The Spark community has since introduced numerous improvements as part of Project Zen in Apache Spark 3.1 and the upcoming Apache Spark 3.2, including:

  • Python type hints
  • New documentation
  • Conda, venv and PEX
  • numpydoc docstring
  • pandas APIs on Spark
  • Visualization
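The type-hint piece of the list above can be sketched without a running Spark cluster; this is an illustrative example using plain pandas only, where with PySpark available the function would be wrapped in `pyspark.sql.functions.pandas_udf`.

```python
import pandas as pd

# Illustrative sketch (assumes only pandas, not a Spark session): since
# Spark 3.0, a pandas UDF's evaluation type is inferred from Python type
# hints. A Series -> Series signature like this one denotes a scalar
# pandas UDF; in PySpark it would be decorated with
# pyspark.sql.functions.pandas_udf("double").
def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# Because it is a plain Python function, it is easy to test locally:
print(multiply(pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])).tolist())  # [3.0, 8.0]
```

The same function body works unchanged on Spark once decorated, which is part of what makes the type-hint design easier to develop against and statically analyze.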

In this talk, we will present the improvements and features in Project Zen, with demonstrations showing how Project Zen makes data science easier through improved usability.

In this session watch:
Hyukjin Kwon, Software Engineer, Databricks
Haejoon Lee, Software Engineer, Databricks


Summit Europe 2020 Project Zen: Improving Apache Spark for Python Users

November 18, 2020 04:00 PM PT

As Apache Spark has grown, the number of PySpark users has grown rapidly, nearly tripling over the last year. The Python programming language itself has become one of the most commonly used languages in data science.

With this momentum, the Spark community started to focus more on Python and PySpark, launching an initiative we named Project Zen, after "The Zen of Python," which defines the principles of Python itself.

In Apache Spark 3.0, redesigned pandas UDFs and improved UDF error messages were introduced as part of this effort. The upcoming Apache Spark 3.1 brings many more notable improvements under Project Zen to make PySpark more Pythonic and user-friendly.

This talk will introduce the improvements, features, and roadmap of Project Zen, including:

  • Redesigning PySpark documentation
  • PySpark type hints
  • JDK, Hive and Hadoop distribution option for PyPI users
  • Standardized warnings and exceptions
  • Visualization

Speaker: Hyukjin Kwon

Summit 2020 Pandas UDF and Python Type Hint in Apache Spark 3.0

June 23, 2020 05:00 PM PT

Over the past several years, pandas UDFs have been perhaps the most important change to Apache Spark for Python data science. However, these functionalities evolved organically, leading to inconsistencies and confusion among users. In Apache Spark 3.0, pandas UDFs were redesigned by leveraging Python type hints. With type hints, you can express pandas UDFs naturally, without specifying extras such as the evaluation type. Pandas UDFs are now more 'Pythonic': the UDF itself clearly defines what it is supposed to take as input and produce as output. This also brings benefits such as easier static analysis. In this talk, I will give a technical overview of the redesigned pandas UDFs with type hints in Apache Spark 3.0.
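One variant introduced by the redesign can be sketched with plain pandas and no Spark session; this hedged example shows how an `Iterator[pd.Series] -> Iterator[pd.Series]` signature expresses the iterator form of a pandas UDF, which in PySpark would additionally be decorated with `pandas_udf`.

```python
from typing import Iterator
import pandas as pd

# Illustrative sketch (plain pandas, no Spark session): under the Spark
# 3.0 redesign the evaluation type is read from the type hints alone.
# An Iterator[pd.Series] -> Iterator[pd.Series] signature denotes the
# iterator-of-Series variant, which lets expensive setup (e.g. loading a
# model) run once before iterating over batches. With PySpark it would
# be decorated with pyspark.sql.functions.pandas_udf("long").
def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        yield batch + 1

out = [s.tolist() for s in plus_one(iter([pd.Series([1, 2]), pd.Series([10])]))]
print(out)  # [[2, 3], [11]]
```

Because the signature alone states the contract, tools and readers no longer need a separate evaluation-type argument to understand what the UDF consumes and produces.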

Summit Europe 2019 Vectorized R Execution in Apache Spark

October 15, 2019 05:00 PM PT

Apache Spark already applies vectorization in many operations, for instance its internal columnar format, vectorized Parquet/ORC reads, pandas UDFs, etc. Vectorization generally improves performance greatly. In this talk, the performance aspects of SparkR will be discussed, and vectorization in SparkR will be introduced with technical details. SparkR vectorization allows users to keep their existing code as-is while boosting performance by up to several thousand percent when executing R native functions or converting a Spark DataFrame to/from an R DataFrame.
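The SparkR vectorized path is built on Apache Arrow; a minimal configuration sketch for enabling it is below, assuming Spark 3.0+ (where this property name appears) and the R `arrow` package installed on the driver.

```
# spark-defaults.conf: enables Arrow-based (vectorized) data exchange for
# SparkR operations such as dapply/gapply and R <-> Spark DataFrame
# conversion. Assumes Spark 3.0+ and the R 'arrow' package on the driver.
spark.sql.execution.arrow.sparkr.enabled  true
```

When the property is off (the default in early 3.x releases), SparkR falls back to the slower row-by-row serialization path, which is the gap the vectorization work addresses.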