MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library for Apache Spark

With the rapid growth of available datasets, good tools for extracting insight from big data are essential. The Spark ML library has excellent support for at-scale data processing and machine learning experiments, but data scientists often struggle with low-level data manipulation, a lack of support for image processing, text analytics, and deep learning, and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and to integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with one-tenth of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines so that they integrate seamlessly with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to use previously non-distributed libraries in a distributed manner, and how to deploy a Spark library efficiently across multiple platforms.
Session hashtag: #EUai7
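
One of the topics in the abstract, auto-generating Python wrappers from Scala Transformers and Estimators, can be illustrated with a small sketch. This is not MMLSpark's generator: the `ParamSpec` type, the `generate_wrapper` function, and the `ImageResizer` stage are all hypothetical names chosen for illustration, and the stage description is written by hand instead of being read from a compiled Scala class. The sketch only shows the general idea of turning a list of parameters into Python source with chained setters and getters.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ParamSpec:
    """Hypothetical description of one parameter of a Scala pipeline stage."""
    name: str
    default: Any
    doc: str


def generate_wrapper(class_name: str, params: List[ParamSpec]) -> str:
    """Return Python source for a wrapper class exposing the given params."""
    lines = [f"class {class_name}:"]
    lines.append(f'    """Generated wrapper for the (hypothetical) stage {class_name}."""')
    lines.append("")
    lines.append("    def __init__(self):")
    lines.append("        self._params = {")
    for p in params:
        lines.append(f"            {p.name!r}: {p.default!r},")
    lines.append("        }")
    for p in params:
        cap = p.name[0].upper() + p.name[1:]  # inputCol -> InputCol
        lines += [
            "",
            f"    def set{cap}(self, value):",
            f'        """Set {p.name}: {p.doc}"""',
            f"        self._params[{p.name!r}] = value",
            "        return self  # returning self allows chained setters",
            "",
            f"    def get{cap}(self):",
            f'        """Get {p.name}: {p.doc}"""',
            f"        return self._params[{p.name!r}]",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical stage description; a real generator would obtain this by
# inspecting the Scala class instead of listing the params by hand.
source = generate_wrapper(
    "ImageResizer",
    [
        ParamSpec("inputCol", "image", "name of the input column"),
        ParamSpec("height", 224, "target height in pixels"),
    ],
)

namespace = {}
exec(source, namespace)  # load the generated class to check it is valid Python
resizer = namespace["ImageResizer"]().setHeight(128)
print(resizer.getHeight(), resizer.getInputCol())
```

Generating source text, as opposed to building classes dynamically, keeps the wrappers readable and lets them ship as ordinary Python files. A production generator would also need to forward each call to the underlying JVM object, which this sketch leaves out.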

About Miruna Oprescu

Miruna Oprescu is a Software Engineer at Microsoft specializing in tools and infrastructure for big data and machine learning. Her goal is to make machine learning simple for both developers and end users. As an active MMLSpark (Microsoft Machine Learning for Spark) contributor, she has been working on Python/R wrapper generation for Spark pipeline stages and a robust testing framework for Spark pipelines using Jupyter Notebooks.