Since April 2016, Spark-as-a-Service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Spark applications can either be deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Our platform builds on Hops, a new distribution of Hadoop with a distributed metadata architecture. Hops includes a frontend called Hopsworks, which supports project-based multi-tenancy and first-class datasets. Spark applications run within a project on a YARN cluster, with the novel property that they are metered and charged to that project. Projects are also securely isolated from each other: each has its own storage on HDFS and its own Kafka topics, and neither can be accessed by users who are not members of the project. Researchers work in an entirely UI-driven environment on an open-source platform. In this talk, we will discuss the challenges in building a metered version of Spark-as-a-Service for YARN, our experiences with Spark-on-YARN, and some of the possibilities that Hopsworks opens up for building secure, multi-tenant Spark applications on a shared cluster. We will also discuss the experiences of our users (over 100 as of June 2016): how they manage their YARN and HDFS quotas, the patterns by which they share datasets between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
Jim Dowling is CEO of Logical Clocks and an Associate Professor at KTH Royal Institute of Technology. He is lead architect of the open-source Hopsworks platform, a horizontally scalable data platform for machine learning that includes the industry’s first Feature Store.