Announcing SparkR: R on Apache Spark
June 9, 2015 in Engineering Blog
I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the runtime is single-threaded and can only process data sets that fit in a single machine’s memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows us to run large scale data analysis from the R shell.
Project History
The SparkR project was initially started in the AMPLab as an effort to explore different techniques to integrate the usability of R with the scalability of Spark. Based on these efforts, an initial developer preview of SparkR was first open sourced in January 2014. The project was then developed in the AMPLab for the next year and we made many performance and usability improvements through open source contributions to SparkR. SparkR was recently merged into the Apache Spark project and will be released as an alpha component of Apache Spark in the 1.4 release.
SparkR DataFrames
The central component in the SparkR 1.4 release is the SparkR DataFrame, a distributed data frame implemented on top of Spark. Data frames are a fundamental data structure used for data processing in R and the concept of data frames has been extended to other languages with libraries like Pandas etc. Projects like dplyr have further simplified expressing complex data manipulation tasks on data frames. SparkR DataFrames present an API similar to dplyr and local R data frames but can scale to large data sets using support for distributed computation in Spark.
The following example shows some of the aspects of the DataFrame API in SparkR. (You can see the full example at https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85)
# flights is a SparkR data frame. We can first print the column
# names, types, flights
#DataFrame[year:string, month:string, day:string, dep_time:string, dep_delay:string, #arr_time:string, arr_delay:string, carrier:string, tailnum:string, flight:string, origin:string, #dest:string, air_time:string, distance:string, hour:string, minute:string]
# Print the first few rows using `head`
head(flights)
# Filter all the flights leaving from JFK
jfk_flights
For a more comprehensive introduction to DataFrames you can see the SparkR programming guide at