R has become the de facto language for statisticians. There are nearly 10,000 packages to choose from for statistical inference, visualization, and machine learning. However, the base CRAN implementation of R faces serious scalability challenges: it is single-threaded and bounded by the memory of a single node. In this talk, I will summarize some recent advancements in the R APIs for Spark and show how they can be combined with Microsoft R Server on Spark to create a scalable machine learning platform. In particular, I will show how an R user can create functional pipelines for Spark DataFrames and RevoScaleR XDFs (external data frames) to conduct Bayesian inference at scale: estimating cluster membership in Gaussian mixture models using Variational Consensus Monte Carlo, large-scale topic modeling with stochastic variational inference, and, finally, Bayesian estimation of neural networks with Stochastic Gradient Hamiltonian Monte Carlo. All examples will be developed entirely in R, and I’ll describe best practices for performance and reproducibility.
Ali is a data scientist on the language understanding team at Microsoft AI Research. He spends his days building tools that help researchers and engineers analyze large quantities of language data efficiently in the cloud and on clusters. Ali studied statistics and machine learning at the University of Toronto and Stanford University.