Skip to main content

What is Sparklyr?

An R package that provides dplyr-style syntax for Apache Spark, allowing R users to perform distributed data manipulation and machine learning on massive datasets

4 Personas Analytics AIBI 5a

Summary

  • Provides familiar tidyverse dplyr syntax (select, filter, mutate, group_by) that translates seamlessly to distributed Spark operations on datasets too large for local R processing
  • Integrates with Spark MLlib and H2O SparkingWater for distributed machine learning, plus supports user-defined functions through spark_apply for custom R computations at scale
  • Connects to Databricks clusters via 'databricks' method in spark_connect, working alongside SparkR and compatible with RStudio for interactive development and debugging

What is Sparklyr?

Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little latency. Sparklyr is an effective tool for interfacing with large datasets in an interactive environment. This way you can benefit from the familiar tools in R in order to analyze data in Spark., giving you the best of both worlds. Sparklyr Through Sparklyr you can use Spark as the backend for dplyr, a popular data manipulation package. Sparklyr provides a range of functions that allow us to access the Spark tools for transforming/pre-processing data, On top of that, it also provides interfaces to Spark’s distributed machine learning algorithms and much more. Sparklyr is also extensible. R packages that depend on Sparklyr to call the full Spark API can be created. One such extension is H2O’s Rsparkling, an R package compatible with H2O’s machine learning algorithm.

A 5X LEADER

Gartner®: Databricks Cloud Database Leader

Main highlights of Sparklyr:

  • Users can interactively manipulate Spark data using dplyr as well as SQL (via DBI).
  • Spark datasets can be filtered and aggregated and afterward brought into R to be analyzed.
  • You will be able to orchestrate distributed machine learning from R using either Spark MLlib or H2O SparkingWater.
  • Sparklyr users are able to generate extensions that call the full Spark API and provide interfaces to Spark packages.
  • Sparklyr tool offers an exhaustive dplyr backend useful in case of data manipulation, analysis, and visualization
  • Loads data into Spark DataFrames from various locations such as local R data frames, Hive tables, CSV, JSON, and Parquet files.
  • Sparklyr is able to connect to both local instances of Spark as well as to remote Spark clusters


 

Additional Resources

Never miss a Databricks post

Subscribe to our blog and get the latest posts delivered to your inbox