Project Zen: Improving Apache Spark for Python Users


As Apache Spark grows, the number of PySpark users has grown rapidly: it has almost tripled over the last year. The Python programming language itself has become one of the most commonly used languages in data science.

With this momentum, the Spark community started to focus more on Python and PySpark in an initiative we named Project Zen, after The Zen of Python, which defines the principles of Python itself.

In Apache Spark 3.0, the redesigned pandas UDFs and improved error messages in UDFs were introduced as part of this effort. In the upcoming Apache Spark 3.1, there are also many notable improvements as part of Project Zen to make PySpark more Pythonic and user-friendly.

This talk introduces the improvements, features and roadmap in Project Zen, which include:

  • Redesigning PySpark documentation
  • PySpark type hints
  • JDK, Hive and Hadoop distribution option for PyPI users
  • Standardized warnings and exceptions
  • Visualization

Speaker: Hyukjin Kwon


– Okay, hello, I’m Hyukjin Kwon and I’m a software engineer at Databricks. In this talk, I’m going to introduce Project Zen, which I’m driving in Apache Spark. The number of Python users has grown a lot, and this project basically aims to make PySpark more Pythonic in many ways. This is an ongoing effort, so it will continue even after Spark 3.1, and I hope you keep your eyes on it. Before we talk about Project Zen, just a little bit about me: I’m a PMC member and committer in Apache Spark and one of the top contributors to Koalas, and I’m pretty active on GitHub and in the Apache Spark community.
All right, so this is the agenda of this talk. First, I’ll briefly explain what Project Zen is and why we initiated this project. Then I’ll introduce several features and improvements made in Apache Spark 3.1: the redesigned documentation, PySpark type hints, and distribution options for PyPI users. Lastly, I will show the roadmap and the future work in this project.
Okay, so let me start by explaining what Project Zen is. Just a bit of background: the number of Python users has grown rapidly in the last few years. At Databricks, we analyzed and made some simple statistics from Databricks notebook users, and we noticed that almost 70% of Databricks notebooks were actually in Python. And PyPI download statistics show that the number of PySpark users jumped up almost three times in the last few years. Python is becoming more and more popular and more widely used. But today’s PySpark is a little bit disappointing in some ways. For example, the documentation is a bit messy and difficult to navigate because each module is documented in a few single pages. There is also not much information that users can rely on, like how to set up an IDE or how to ship a package together with a PySpark application.
The documentation should also have pages like a Quickstart or an installation guide, and the PySpark documentation currently doesn’t. Also, PySpark itself is not very IDE or notebook friendly. For example, there are some functions that are dynamically defined and cannot be statically detected in an IDE or notebook. In addition, auto-completion support does not work very well out of the box, so the IDE shows some unrelated suggestions and the notebook doesn’t even know what to suggest.
Another problem is that it’s just less Pythonic. For example, creating a DataFrame from dictionaries is pretty intuitive, and that’s also supported in other libraries like pandas. However, PySpark shows a warning that it is deprecated and recommends using Row instead. This is actually related to the insertion order in dictionaries. In old Python versions, the dictionary didn’t keep the insertion order, but new Python versions support it by default. So actually we can just un-deprecate it.
And here’s another problem, in the PyPI distribution. Apache Spark is published with multiple options. For example, you can download Apache Spark built with Hive 1.2 or 2.3, with Hadoop 3, or without Hadoop. So when you download Apache Spark from an official Apache mirror, it gives you many such options. However, PyPI users have only one single option: the default distribution with Hive 2.3 and Hadoop 2. If you just want to pip install PySpark to run on a Hadoop 3 cluster, there’s no way to do it.
And lastly, the exceptions and warnings are not classified properly. PySpark just uses the plain Exception class in many places, and the warning type is just the plain UserWarning. Because of this, users cannot properly handle and expect properly classified exceptions, and it confuses users about what actions they should take from the warnings.
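To make the classification problem concrete, here is a minimal sketch using only Python’s standard library. It is not PySpark code: `toy_explain`, `ToyModeError` and the helper functions are hypothetical names chosen for illustration. It shows that a warning raised without a category falls back to `UserWarning`, and that a specific exception subclass lets callers catch exactly the failure they care about.

```python
import warnings


def unclassified_warning():
    # No category given: warnings.warn falls back to UserWarning, so callers
    # cannot tell a deprecation notice apart from any other warning.
    warnings.warn("this behaviour is deprecated")


def classified_warning():
    # An explicit category tells callers what kind of action is expected.
    warnings.warn("this behaviour is deprecated", FutureWarning)


def collect_categories(func):
    """Run func and return the categories of the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return [w.category for w in caught]


class ToyModeError(ValueError):
    """Hypothetical, specific exception type (not a real PySpark class)."""


def toy_explain(mode):
    # Raising a specific subclass lets callers write `except ToyModeError`
    # instead of a catch-all `except Exception`.
    if mode not in ("simple", "extended"):
        raise ToyModeError(f"unknown mode: {mode!r}")
    return mode


print(collect_categories(unclassified_warning))
print(collect_categories(classified_warning))
```

With classified warnings, a user could for instance silence only `FutureWarning` through `warnings.filterwarnings` while still seeing everything else.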
So the number of PySpark users and its importance increased a lot, but PySpark did not catch up very well, as shown earlier. We had to do something to improve PySpark usability, so we named it Project Zen and started working on it. The name Project Zen actually comes from the Zen of Python, which defines a set of rules and principles to make Python more Pythonic. In Project Zen, we focus on three things. First, PySpark has to be Pythonic: it should follow the Zen of Python and be Python-friendly. Second, it has to be easier to use. It has to have a rich set of examples and documentation. Users should know what to expect from their input and output. It should have clear exceptions and warnings that provide better information, and it should also have installation options. Lastly, it should work with other libraries out of the box. For instance, it should also have some kind of visualization support, like plots.
All right, so we started to work on it, and as part of Project Zen we redesigned the PySpark documentation. The old PySpark documentation has some problems. For example, each module is documented within a single page, which makes each page very, very long. There are a few long pages that are difficult to navigate, and there’s no structure to them. As I said earlier, there are virtually no other useful pages, like how to contribute, how to debug, how to install, et cetera. To address this problem, we redesigned the whole PySpark documentation like this. It has structured API documentation, many useful pages and links, and a live notebook where you can try some examples by yourself. This is an example of one useful page. It has a search box, so you can search for any page you want. The top menu has the root, the other pages are shown on the left side, and the main contents are of course located in the middle.
And the right side shows the titles within the current page. Here, I’m going to show the new API reference page. Unlike the old documentation, it has some structure, with groups like this. For example, there are DataFrame APIs, functions, sessions, et cetera. When you click one of them, it moves to another page that has a finer classification. This is the page you see after clicking the functions. It has a table, the list of relevant APIs. And if you click one of the APIs above, it moves to another page. Each page is dedicated to describing a single API with sufficient examples and descriptions. Besides, it also has a Quickstart page that goes through basic functionality like viewing, selecting, and grouping a DataFrame, and Spark SQL.
One of the things that I want to emphasize is that the documentation now has live notebooks, so you can try PySpark right away. There are links to them on the main page, on the main instruction page, and on the Quickstart page. Once you click the links, it redirects you to the live notebook. Here you can run the Quickstart out of the box in Jupyter notebooks. You can modify the examples and see what you get by running them. Also, the new PySpark documentation has lots of other useful pages, like how to contribute to Spark, how to test, how to debug, how to set up an IDE, how to install, how to start, and how to ship a package together with an application. All these docs are available in Spark 3.1, so I hope you have time to read them once Spark 3.1 is out.
Okay, the next thing is PySpark type hints. This is also, of course, a part of Project Zen. Python type hints are a standard approach to annotating types in Python. For example, you write the type next to each argument, and then the return type. By doing this, you can easily tell which types a function expects and outputs.
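As a small illustration of the annotation syntax just described, the sketch below uses a hypothetical function, `repeat_word`, that is not part of PySpark. The type of each argument is written next to it, and the return type follows the arrow.

```python
from typing import List, get_type_hints


def repeat_word(word: str, times: int) -> List[str]:
    """Return a list containing `word` repeated `times` times."""
    return [word] * times


# The annotations are ordinary metadata that tools (IDEs, type checkers,
# documentation generators) can read without running the function body.
hints = get_type_hints(repeat_word)
print(hints)
```

The annotations are not enforced at runtime; they are there for tools and readers.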
By the way, the Python typing standard also defines some rules about stub files, which are also called pyi files. A stub file basically has the same syntax as Python type hints, and it allows you to omit the actual code. In this way, you can manage the type hints separately and optionally. Python type hints are especially useful in IDEs and notebooks, which developers and users use often in their daily jobs. Before the type hints, auto-completion did not work out of the box in an IDE, but now it can work for all APIs. So developers don’t have to check the documentation and other code when they call the APIs; they can just press tab. The notebook case is the same. Previously Jupyter Notebook could not suggest anything in many cases, but now it can suggest properly.
Python type hints are not only for IDE and notebook support. They can also automatically document what’s expected as input and output. The right side is an example of Sphinx-generated documentation, as you can see. So developers don’t have to bother about input and output documentation, and users can expect correct and consistent documentation. One of the most common mistakes Python developers make might be just typos or using wrong types. Because of duck typing, IDEs and notebooks often can’t detect the error statically. With Python type hints they can detect not only typos, but also whether the right type is provided or not. So developers and users can improve their productivity hugely, without executing every line one by one.
In the coming Spark 3.1, PySpark now has built-in Python type hint support as well. The PySpark type hints were made purely by the community, and they grew as a separate project for years. Finally, they came to PySpark as built-in support. It has type hints only for user-facing interfaces, and they are managed separately as stub pyi files in this project.
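For readers who have not seen a stub file, the following is a rough sketch of what one could contain. The module name `toy_module.pyi` and the names `ToyFrame` and `read_toy` are hypothetical and are not taken from PySpark’s actual stubs. Only signatures appear; every body and default value is replaced by `...`.

```python
# Contents of a hypothetical stub file, toy_module.pyi, which would sit
# next to the real implementation in toy_module.py.
from typing import Optional


class ToyFrame:
    # Signatures only: each body is just "...", the actual code is omitted.
    def select(self, *cols: str) -> "ToyFrame": ...
    def show(self, n: int = ..., truncate: bool = ...) -> None: ...


def read_toy(path: str, schema: Optional[str] = ...) -> ToyFrame: ...
```

Because the stub lives in a separate file, the type information can be maintained without touching the implementation.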
Okay, so now it’s about pip installation. PySpark is already available on PyPI as an official distribution, so users can easily install PySpark by simply running pip install pyspark. The problem is that there is only one single distribution available on PyPI. When you manually download Apache Spark from the site, from an Apache mirror, there are many options, for example Spark built with Hive 1.2 or Hive 2.3, and Hadoop 2 or Hadoop 3. PyPI users are forced to use the one default distribution. In the upcoming Apache Spark 3.1, PySpark provides an experimental way to pip install with other installation options. You can set the corresponding environment variable to indicate the Hadoop version you want, and optionally, you can also specify the mirror to download from. With this option enabled, pip downloads PySpark from PyPI first and then starts to download the corresponding distribution from the Apache mirror. It can take a while, so it’s recommended to use it with the debugging option in pip.
Some people might ask why this doesn’t use pip-native options, such as installation options. The problem is that installation options go through all dependencies. For example, if Py4J does not handle the specified options, then installing PySpark fails in the middle. This is currently a work in progress in PyPI. Once they provide a way to do it, the environment variables will be deprecated in PySpark, and they will be switched to the pip-native options.
Okay, so far I’ve talked about what is already done in the Spark development branch, so this will be available in Spark 3.1. And there is also a roadmap here because Project Zen, as I said, is not only for Spark 3.1; it’s an ongoing effort. One of the typical docstring styles in Python is the reST style. It provides a way of annotating the parameters and return type properly in the docstring.
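For reference, a reST-style docstring looks roughly like the sketch below. The function `clip` is a hypothetical example, not a PySpark API; the `:param:`, `:return:` and `:rtype:` fields are the standard Sphinx field names.

```python
def clip(value: float, low: float, high: float) -> float:
    """Limit a value to the closed interval [low, high].

    :param value: number to clip
    :param low: lower bound of the interval
    :param high: upper bound of the interval
    :return: ``value`` if it lies inside the interval, otherwise the nearest bound
    :rtype: float
    """
    return max(low, min(value, high))


print(clip.__doc__)
```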
However, its disadvantage is that the documentation becomes arguably less human-readable, in particular when the docstring gets long. It’s arguably good for short docstrings, but it becomes a bit messy in long ones. The NumPydoc style provides a good human-readable text format, and it also works with Sphinx out of the box. And again, arguably, it provides better classification and readability, and it fits best for detailed and long docstrings in the long run. The PySpark documentation will change to use the NumPydoc style, with a rich set of examples and scenarios in the docstrings.
Another thing we should take a look at is, as I said, warnings and exceptions. In many places, exceptions and warnings are very plain and naive, and the handling is not good enough. For example, the DataFrame.explain API above throws just the plain Exception class, so it needs better classification. We also used just the default UserWarning, and many deprecation warnings were treated like other warning cases. Python provides lots of warning categories, such as FutureWarning or DeprecationWarning, so we should replace them with those. And we should also have a good exception and warning hierarchy.
Lastly, we should take more care of the interaction with other libraries like NumPy, Koalas, and pandas. One example is NumPy universal functions: NumPy provides an API to override their behavior on NumPy function calls. For example, you could call a NumPy function with PySpark DataFrame or Column instances. This is on the roadmap. Also, we’re thinking about adding visualization support, like plots, so users can easily understand their data visually.
Okay, so to recap, I’m going to briefly go over what I’ve talked about today. Python and PySpark are becoming more and more popular and important.
The documentation is redesigned, and auto-completion and type-checking support are added in IDEs and notebooks. From PyPI, you can now install PySpark with other installation options, like Spark with built-in Hadoop 3. For the future plans, we’ll likely focus on migrating the current PySpark docstring style to the NumPydoc style with a rich set of examples. We’ll also make exceptions and warnings better classified and more Pythonic. Lastly, we’ll try to make PySpark more friendly to other libraries, with visualization support and NumPy universal function support. Okay, so that’s it for today. If you have any questions, let me know. Thank you.

About Hyukjin Kwon


Hyukjin is a Databricks software engineer, Apache Spark PMC member and committer, working on many different areas in Apache Spark such as PySpark, Spark SQL, SparkR, etc. He is also one of the top contributors in Koalas. He mainly focuses on development, helping discussions, and reviewing many features and changes in Apache Spark and Koalas.