Koalas is an open-source project that aims at bridging the gap between big data and small data for data scientists, and at simplifying Apache Spark for people who are already familiar with the pandas library in Python. Pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data. With Koalas, data scientists can use the same APIs as pandas', but at scale with PySpark. In this talk, I introduce Koalas and its updates, show some comparisons between pandas and Koalas, and then deep-dive into its internal structures and how it works with Spark.
– Hi everyone. Let me start my talk. My talk is Koalas: Making an Easy Transition from pandas to Apache Spark. I’m Takuya Ueshin, a software engineer at Databricks. I am an Apache Spark committer and a PMC member. My focus is on Spark SQL and PySpark.
Now, I mainly work on the Koalas project, and I am one of its major contributors and maintainers.
In this talk, I’ll introduce Koalas and its framework, comparing it to pandas and PySpark. Next, I’ll introduce Koalas 1.0.
I’ll show some demos ranging from the basics to Koalas 1.0-related topics. Then, I will explain the internal structure and a couple of important topics: indexes and the default index. Finally, let’s see the roadmap of Koalas 1.0. What is Koalas? Koalas was announced on April 24th at last year’s Spark + AI Summit. It is a pure Python library and aims at providing the pandas API on top of Apache Spark.
It’s trying to unify the two ecosystems with a familiar API and to help make a seamless transition between small and large data.
For pandas users, it will help scale out their pandas code and make learning PySpark much easier. And for PySpark users, it will make them more productive with its pandas-like functions.
Let’s review what pandas is. Pandas was authored by Wes McKinney in 2008.
Pandas is the standard tool for data manipulation and analysis in Python. The latest version is 1.0.4.
It’s still growing as you can see in the Stack Overflow trends.
Pandas is deeply integrated into Python data science ecosystem, such as numpy, matplotlib, scikit-learn and so on.
It can deal with a lot of different situations, including basic statistical analysis, handling missing data, time series, categorical variables, strings, and so on.
Python data scientists tend to start learning how to analyze data with pandas, and use it throughout most of their work.
Pandas works really well for small data sets on a single node, but unfortunately, it doesn’t work well with bigger data sets. Data scientists have to switch to PySpark or other libraries if they want to handle bigger data sets.
From the other end, what is Spark? As you know, Spark is the de facto standard unified analytics engine for large-scale data processing.
It supports streaming, ETL, machine learning, and so on.
Spark was originally created at UC Berkeley by Databricks’ founders.
It provides PySpark API for Python as well as API for Scala, R and SQL.
The latest version is 3.0.
While pandas and PySpark both have the concept of a DataFrame, and they sometimes look similar, they are actually significantly different. Only the way of referring to a column is the same in this table; the others are different.
You can see a lot of differences, but the two most significant differences are its mutability and execution model.
A pandas DataFrame can change its state easily, whereas a PySpark DataFrame doesn’t allow changing its state; we need to create a new DataFrame instead.
The execution model is also different. Pandas computes its data eagerly, whereas Spark computes lazily, meaning it doesn’t execute any computations until the result is actually needed, for example, when storing or showing the data.
Let’s see a short example. The left side is pandas and the right side is PySpark.
Both are doing the same thing, but they look different, and the pandas version is about half the length of the PySpark one. Data scientists have to fill the gap between pandas DataFrames and PySpark DataFrames if they want to handle bigger data sets.
With Koalas, we can write almost the same code as pandas. The only things we have to do are to change the import and use the Koalas alias instead of pandas’ one. It’s so cool, isn’t it?
The execution is lazy, the same as with a Spark DataFrame.
Basically, it won’t run any jobs until the actual data is needed.
And the optimizations of the Spark DataFrame will be applied during the execution.
Thankfully, Koalas has been growing more and more. According to the PyPI statistics, the number of downloads reaches 30,000 per day and reached 800,000 last month. And the GitHub stars have reached 2,000.
Now, we are hitting a big milestone. You know what? It’s Koalas 1.0.
It supports Spark 3.0 and optimizes some functions using Spark 3.0 features.
The DataFrame.apply and DataFrame.apply_batch functions are optimized with Spark 3.0’s pandas function APIs, which improves the performance by 20 to 25%.
Also, it supports pandas 1.0.
Actually, we started supporting it in Koalas 0.28.0, but basically, Koalas follows pandas 1.0 behavior from now on.
We removed deprecated functions which were removed in pandas 1.0, along with a couple of other deprecated pandas functions. On the other hand, we introduced the spark property to hold Spark-specific functions.
The Koalas API coverage is increasing over time.
The Series and DataFrame APIs are more than 70% covered, and the Index and GroupBy functions are in the 60% range. Also, 80% of the plotting functions are already implemented. Let me show you some simple demos here. At first, of course, we should import Koalas.
The package name is databricks.koalas and usually, we use ks as an alias name.
These short examples are from the slides. We can confirm that the examples actually work and show exactly the same results.
The next example is how to convert from or to pandas DataFrame.
To convert from a pandas DataFrame, we can use the from_pandas function. It will move the data into a Spark DataFrame and manage some metadata to behave like a pandas DataFrame.
To convert back to a pandas DataFrame, we can use the to_pandas function.
Next, we also have to know how to convert from or to Spark DataFrame.
When importing the Koalas package, it automatically attaches the to_koalas function onto Spark DataFrames so we can use it.
Koalas manages index columns internally; we can see an additional default index column here.
To go back to a Spark DataFrame, we can use the to_spark function, similar to to_pandas.
If we have columns to be used as the index, we can use the index_col parameter.
Then Koalas uses those columns as the index. If there is more than one column, we can pass a list.
When going back to a Spark DataFrame, if we want to preserve the index values, we can also use the index_col parameter. We can pass a list as well, but its length must match the number of index columns.
Let’s see more simple examples.
Here, it shows the mean functions: this is for a DataFrame and this is for a Series. We have many more statistics functions from pandas. If we want a summary, we can call the describe function.
Of course, we have GroupBy functions. This example shows the GroupBy sum function, but we also have many more GroupBy functions.
Both Series and DataFrame have plot functions. The default is a line plot. We have already implemented 80% of the plot functions.
More examples are available in the blog post “10 Minutes from pandas to Koalas on Apache Spark.” Please take a look at it.
Next, I’ll show the transform and apply functions.
From pandas, Koalas has transform and apply functions to apply a function against a Koalas DataFrame. The difference between transform and apply is that transform requires the function to return output of the same length as the input, whereas apply doesn’t.
Each function takes a pandas Series.
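The difference can be sketched with pandas itself, since Koalas follows the same calling convention (the functions here are my own examples):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# transform: the function receives each column as a pandas Series and
# must return output of the same length
same_len = df.transform(lambda s: s + 1)

# apply: the same calling convention, but the output length may differ,
# e.g. reducing each column to a single value
reduced = df.apply(lambda s: s.max())
```

In Koalas the function likewise receives pandas Series, but chunk by chunk over the distributed data, so truly global aggregations should use Koalas’ own reduction APIs instead.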
Koalas also has transform_batch and apply_batch functions.
The _batch suffix refers to each chunk of the Koalas DataFrame or Series. Koalas slices the DataFrame or Series and applies the given function to each chunk. For a DataFrame, the function takes a pandas DataFrame, and for a Series, it takes a pandas Series. Here, transform_batch has the same-length restriction whereas apply_batch does not, the same as transform and apply.
The next section is for Spark users.
Since Koalas 1.0, it provides the spark property.
The first example shows how to inspect the underlying Spark schema and data types.
The schema function returns Spark’s StructType object, and print_schema prints out the schema as a tree string. Both take the index_col parameter to include the index columns.
For Series and Index, the data_type property gives us the data type.
The next one is DataFrame.spark.apply.
It runs the given user-defined function, which takes a Spark DataFrame.
We can modify the underlying Spark DataFrame as we want.
If we want to preserve the index columns, it also takes the index_col parameter.
We can modify the index values, but the returned Spark DataFrame also has to include columns with the same names.
Similarly, for Series, it provides the transform function. The user-defined function takes a Spark Column object, and we can modify it with Spark functions.
The final example is explain. We can check the underlying Spark plans.
Well, let’s go back to the slides. Here, let me explain a bit of Koalas’ internals.
Koalas’ basic structure is like this. A Koalas DataFrame has an InternalFrame in it. The InternalFrame is an immutable structure, and it holds the current Spark DataFrame and some metadata, such as column_labels and index_map, to mimic a pandas DataFrame.
When a user calls some API, the Koalas DataFrame creates a new Spark DataFrame and updates the metadata: it creates or copies the current InternalFrame as the new state. Then it returns a new Koalas DataFrame.
Sometimes, a new Spark DataFrame is not needed, only an update of the metadata. Then the new structure will be like this.
In any case, a Koalas DataFrame never mutates the InternalFrame; it creates or copies it to keep it immutable.
Next topic: indexes and the default index. As you saw in the demo, Koalas manages a group of columns as an index, and they should behave the same as in pandas.
When you have a pandas DataFrame and convert it to Koalas with the from_pandas function, Koalas automatically picks up and preserves its metadata.
But other data sources, like Spark DataFrame, don’t always have an index.
If the index is not specified, Koalas will automatically attach a default index to the DataFrame. Koalas provides three types of default index.
Each default index has pros and cons. We need to decide which index we should use carefully.
The three index types are sequence, distributed-sequence, and distributed. Sequence is the default. It guarantees that the index increments continuously, but it uses a non-partitioned window function internally, which means all the data needs to be collected on a single node. If the node doesn’t have enough memory, an out-of-memory error will occur. The second option is distributed-sequence. When this index is attached, the performance penalty is not as significant as with the sequence type.
It computes and generates the index in a distributed manner, but it needs an extra Spark job to generate the global sequence internally.
Also note that, in some cases, it doesn’t guarantee continuously increasing numbers. The third option is distributed. It has almost no performance penalty and always creates distinct numbers, but the numbers have no continuous increment. This means this index type should never be used as an index for operations on different DataFrames.
But if the index is just needed to order rows, this index type would be the best choice. Please also see the user guide for more details.
Finally, I show the roadmap of Koalas 1.0.
In July or August of this year, Databricks Runtime 7.1 will be released, which will pre-install Koalas 1.x, the stable release at that time.
We will continue improving the pandas API coverage and supporting more visualization and machine learning libraries. Also, we will add more examples to the documentation, along with workarounds for APIs we don’t support.
Well, please get started soon. You can install Koalas with pip or conda. If you don’t have a Spark cluster yet, you can also install PySpark with pip or conda on your local computer.
Also, please take a look at the documentation for future updates on GitHub. There is a 10-minute tutorial available as a live Jupyter notebook. You can also easily import it into a Databricks notebook.
There is a very useful blog post I mentioned during the demo. It shows a lot of examples, as well as best practices.
If you have any suggestions, questions, or requests, please file an issue on the GitHub issues page. Your contributions are always welcome. Please take a look at the contribution guide.
We will have another Koalas session at 10 AM on Friday, which will show more examples.
Takuya Ueshin is a software engineer at Databricks, and an Apache Spark committer and a PMC member. His main interests are in Spark SQL internals, a.k.a. Catalyst, and also PySpark. He is one of the major contributors to the Koalas project.