Recommendation systems are among the most popular applications of machine learning. MLlib implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations. We utilize Spark’s in-memory caching and a special partitioning strategy to make ALS efficient and scalable. MLlib’s ALS runs 10x faster than Apache Mahout’s implementation and it scales up to billions of ratings. In this talk, we present a more scalable implementation of ALS with scalability results on 100 billion ratings. It is based on the issues we experienced with the old implementation. We will review the ALS algorithm, and describe the internal data storage we used in the new implementation as well as techniques used to accelerate the computation and to improve JVM efficiency. We will also discuss the next steps for recommendation algorithms in MLlib.
Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.