Skip to main content
Company Blog

In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines.

Apache Spark offers analysts and engineers a powerful tool for building these pipelines, and learning to build such pipelines will soon be a lot easier. Databricks is excited to be working with professors from University of California Berkeley and University of California Los Angeles to produce two new upcoming Massive Open Online Courses (MOOCs). Both courses will be freely available on the edX MOOC platform in spring summer 2015. edX Verified Certificates are also available for a fee.

Introduction to Big Data with Apache Spark

The first course, called Introduction to Big Data with Apache Spark, will teach students about Apache Spark and performing data analysis. Students will learn how to apply data science techniques using parallel programming in Spark to explore big (and small) data. The course will include hands-on programming exercises including Log Mining, Textual Entity Recognition, Collaborative Filtering that teach students how to manipulate data sets using parallel processing with PySpark (part of Apache Spark). The course is also designed to help prepare students for taking the Spark Certified Developer exam. The course is being taught by Anthony Joseph, a professor at UC Berkeley and technical advisor at Databricks, and will start on February 23rd June 1st, 2015.

The second course, called Scalable Machine Learning, introduces the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and provides hands-on experience using PySpark. It presents an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. Students will use Spark to implement scalable algorithms for fundamental statistical models while tackling real-world problems from various domains. The course is being taught by Ameet Talwalkar, an assistant professor at UCLA and technical advisor at Databricks, and will start on April 14th June 29th, 2015.

Both courses are available for free on the edX website. https://www.edx.org/