OASIS—Collaborative Data Analysis Platform Using Apache Spark

A new collaborative data analysis platform built on Apache Spark was incubated at LINE Corporation, a Japan-based provider of a smartphone communication application, and it is going to be open sourced this year.

OASIS is a more secure and more scalable version of Apache Zeppelin for large-scale enterprise use. Apache Zeppelin is not suited to large-scale enterprise use for two reasons: it does not work well with a Hadoop cluster whose data access permissions are strictly controlled via Apache Ranger, and it runs only on a single server, so it cannot scale. OASIS solves these issues and provides a more secure and more scalable data analysis platform for enterprise use.

In my presentation, I’m going to explain in detail what kinds of issues OASIS solves. I’m also going to talk about the actual use case in my company. We manage a 500-node Hadoop cluster that contains all of our company’s data. It holds more than 20 PB of data, more than 100 Hive databases and more than 1,000 Hive tables. My company’s employees extract data by submitting Apache Spark applications (Spark SQL, PySpark and SparkR) to the Hadoop cluster, visualize the results, and share them on OASIS with only the members of their specific projects.

I’m going to explain in detail how OASIS works, how we succeeded in analyzing and visualizing large-scale data effectively and stably using Apache Spark and YARN, and how we share the analysis results securely with other project and team members using OASIS and Apache Ranger. “How can my company’s employees (1,000+ people) analyze huge data and share the results *securely and stably*?” is the most challenging question for our company (and for other large enterprises), and I’m going to describe how we overcame it using Apache Spark and OASIS.

Session hashtag:  #SAISExp6



About Keiji Yoshida

I work as a software engineer at LINE Corporation in Japan. I maintain and enhance "OASIS," a collaborative data analysis platform built on Apache Spark. It is a more secure and more scalable version of Apache Zeppelin for large-scale enterprise use. It was incubated at LINE Corporation and will be open sourced this year.