Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing!

PySpark is getting awesomer in Spark 2.3 with vectorized UDFs, and there are even more wonderful things on the horizon (some currently available as WIP packages). This talk will start by illustrating how to use PySpark’s new vectorized UDFs to make ML pipeline stages. Since most of us use Python in part because of its wonderful libraries, like pandas, numpy, and antigravity*, it’s important to be able to make sure that our dependencies are available on our cluster. Historically there’s been a few If there is time near the end, we will talk about how to expose your Python code to Scala so everyone can use your fancy deep learning code (if you want them to).

*OK, maybe not a real thing, but insert whatever super-specialized, domain-specific library you use instead 🙂

Session hashtag: #Py4SAIS

Holden Karau
About Holden Karau

Holden is a transgender Canadian open source developer with a focus on Apache Spark, Airflow, Kubeflow, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and Kubeflow for Machine Learning. She is a committer and PMC member on Apache Spark. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.

About Bryan Cutler

Bryan Cutler is a software engineer at IBM's Spark Technology Center, where he works on big data analytics and machine learning systems. He is a contributor to Apache Spark in the areas of ML, SQL, Core and Python and a committer for the Apache Arrow project. His interests are in pushing the boundaries of software to build high performance tools that are also a snap to use.