Data Wrangling with PySpark for Data Scientists Who Know Pandas

Data scientists spend more time wrangling data than building models. Traditional tools like Pandas provide a powerful data manipulation toolset. Moving to big data tools like PySpark makes it possible to work with much larger datasets, but it can come at the cost of productivity.

In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics include best practices, common pitfalls, performance considerations, and debugging.

Session hashtag: #SFds12

Learn more:

  • Introducing Pandas UDF for PySpark
  • From Pandas to Apache Spark’s DataFrame
  • Getting The Best Performance With PySpark
