Best Practices for Running PySpark


PySpark, the component of Spark that lets users write their code in Python, has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don’t need to learn a new language, and you still have access to the modules you are familiar with (e.g., pandas, nltk, statsmodels), yet you can run complex computations quickly and at scale using the power of Spark. The drawbacks of using Python in a distributed environment only become apparent when you try to deploy your application and run an analysis against real-world data.

The reality of using PySpark is that:

* Managing dependencies and their installation on a cluster is crucial.
* Duck typing in Python can let bugs in your code slip by, only to be discovered when you run it against a large and inevitably messy data set.
* You must understand the underlying Spark computational model, particularly where and when various blocks of code get executed, in order to write applications that will work correctly when distributed across a cluster.

In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data to motivate the issues described above and provide a concrete example of best practices for real-world PySpark applications. We will cover:

* Python package management on a cluster using virtualenv.
* Testing PySpark applications.
* Spark’s computational model and its relationship to how you structure your code.
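To make the testing and duck-typing points concrete, here is a minimal sketch; it is not the job examined in the talk. It assumes a hypothetical record format of `timestamp,value` text lines and a hypothetical helper named `parse_reading`. The idea is to keep per-record logic in a plain Python function that validates its input, so it can be unit-tested on a laptop before it ever runs on a cluster.

```python
import math
from typing import Iterator, Tuple


def parse_reading(line: str) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, value) for a well-formed line, nothing otherwise.

    Hypothetical record format (an assumption): "<unix_timestamp>,<value>".
    """
    parts = line.strip().split(",")
    if len(parts) != 2:
        return  # wrong number of fields: skip the record
    try:
        timestamp = int(parts[0])
        value = float(parts[1])
    except ValueError:
        return  # non-numeric or empty field: skip the record
    if math.isnan(value):
        return  # "nan" parses as a float but is useless for statistics
    yield timestamp, value


def test_parse_reading() -> None:
    # Plain assertions: no SparkContext or cluster is needed here.
    assert list(parse_reading("1400000000,3.5")) == [(1400000000, 3.5)]
    assert list(parse_reading("1400000000,")) == []
    assert list(parse_reading("not,a,record")) == []
    assert list(parse_reading("1400000000,nan")) == []


test_parse_reading()

# On a cluster, the same function would execute on the executors, e.g.
# (assuming an existing SparkContext `sc` and an input location `path`):
#     readings = sc.textFile(path).flatMap(parse_reading)
```

Because the function yields zero or one tuples per line, it fits `flatMap`, which drops malformed records instead of failing the whole job partway through a large data set.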

Additional Reading:

  • Developing Custom Machine Learning Algorithms in PySpark
