R and Spark: How to Analyze Data Using RStudio’s Sparklyr and H2O’s Rsparkling Packages


Sparklyr is an R package that lets you analyze data in Spark while using familiar tools in R. Sparklyr provides a complete backend for dplyr, a popular tool for working with data frame objects both in memory and out of memory, so you can write dplyr code in R and have it translated into Spark SQL. Sparklyr also supports MLlib, so you can run classifiers, regressions, clustering, decision trees, and many more machine learning algorithms on your distributed data in Spark. With sparklyr you can analyze large amounts of data that would not fit into R's memory, then collect the results from Spark into R for further visualization and documentation. Sparklyr is also extensible: you can create R packages that depend on sparklyr to call the full Spark API. One example of such an extension is H2O’s rsparkling, an R package that works with H2O’s machine learning algorithms. With sparklyr and rsparkling you have access to all the tools in H2O for analysis with R and Spark.

In this presentation I will demonstrate how to analyze data in Spark by using sparklyr and rsparkling.
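
The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the talk. It assumes sparklyr and dplyr are installed and a local Spark installation is available, and it uses the built-in `mtcars` data set as a stand-in for real distributed data. The table name `mtcars_spark` and the model formula are arbitrary choices for the example. It covers only the dplyr and MLlib parts of the workflow, not rsparkling.

```r
# Assumes the sparklyr and dplyr packages are installed and that a local
# Spark installation is available (for example via sparklyr::spark_install()).
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance; a cluster would use a different master.
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark purely for illustration.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and evaluated lazily in Spark.
agg <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n_cars = n())

# Print the Spark SQL that dplyr generated for the query.
show_query(agg)

# Bring the (small) aggregated result back into R as a local data frame.
by_cyl <- collect(agg)

# Fit an MLlib linear regression on the Spark table.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```

Only the aggregated result is collected into R; the raw table and the model fitting stay in Spark, which is the pattern that matters when the data is too large for R's memory.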

Learn more:

  • Using sparklyr in Databricks
  • 10 Things I Wish I Knew Before Using Apache SparkR
  • Accelerating R Workflows on Databricks

  • About Nathan Stephens

    Nathan Stephens is the director of solutions engineering for RStudio. His background is in applied analytics and consulting. He has experience building data science teams, creating innovative data products, analyzing big data, and architecting analytic platforms. He was an early adopter of R and has introduced it into many organizations.