SESSION

Pandas on Spark: Simplicity of Pandas with Efficiency of Spark

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Data Engineering and Streaming
INDUSTRY: Enterprise Technology, Health and Life Sciences, Financial Services
TECHNOLOGIES: Apache Spark
SKILL LEVEL: Beginner
DURATION: 40 min
With Python as the go-to language for data science, pandas has become immensely popular in the data science community: it is simple to learn and use, yet powerful, expressive, and flexible. Its key drawback is that it processes everything on a single machine, so it cannot scale as data volumes grow. Pandas API on Spark addresses this by running the pandas API on top of Apache Spark, so users can handle very large datasets with scalable, distributed processing. Pandas on Spark also extends pandas with access to SQL and machine learning utilities.

In this talk, we will give an overview of Pandas on Spark: how to get started, and how to use it with your existing pandas code to scale your data science workloads.
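
As a rough illustration of what reusing existing pandas code can look like, the sketch below writes a small aggregation once and passes in the library to run it on. It is not taken from the talk: the function name `total_by_category` and the sample data are made up for illustration. The sketch runs on plain pandas so it works without a Spark installation; the commented lines show the assumed swap to `pyspark.pandas`, which requires PySpark.

```python
import pandas as pd


def total_by_category(frame_lib):
    """Build a small frame and sum amounts per category with pandas-style calls.

    frame_lib is either the pandas module or pyspark.pandas; both expose
    DataFrame, groupby, and sum under the same names.
    """
    df = frame_lib.DataFrame(
        {
            "category": ["a", "b", "a", "b"],  # made-up sample data
            "amount": [10, 20, 30, 40],
        }
    )
    return df.groupby("category")["amount"].sum()


# Runs on a single machine with plain pandas.
totals = total_by_category(pd)
print(totals.sort_index().to_dict())

# With PySpark installed, the same function can be handed the pandas API on Spark:
#   import pyspark.pandas as ps
#   spark_totals = total_by_category(ps)
#   print(spark_totals.sort_index().to_pandas().to_dict())
```

Passing the module as an argument is only a device to keep the example short; in practice the usual change is to replace `import pandas as pd` with `import pyspark.pandas as ps` and then check any calls whose behavior differs on distributed data, such as row ordering.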

SESSION SPEAKERS

Matthew Powers
Staff Developer Advocate, Databricks

Xinrong Meng
Senior Software Engineer, Databricks