Back in the old days of Apache Spark, using Python with Spark was an exercise in patience. Data moved back and forth between Python and Scala and was serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the steady improvement of the optimiser and execution engine (Catalyst and Tungsten). But since Spark 2.3, PySpark has sped up tremendously thanks to the addition of the Arrow serialisers. In this talk you will learn how the Spark Scala core communicates with the Python processes, how data is exchanged between the two sub-systems, and what development work has been completed or is under way to make it as fast as possible.
Ruben Berenguel is the lead data engineer at Hybrid Theory, as well as an occasional contributor to Spark (especially PySpark). He holds a PhD in Mathematics and moved to data engineering, where he works mostly with Scala, Python and Go, designing and implementing big data pipelines in London and Barcelona.