Welcome

This self-paced guide is the “Hello World” tutorial for Apache Spark using Databricks. In the following tutorial modules, you will learn the basics of creating Spark jobs, loading data, and working with data. You’ll also get an introduction to running machine learning algorithms and working with streaming data. Databricks lets you start writing Spark queries instantly so you can focus on your data problems.

Navigating this Apache Spark Tutorial

Hover over the navigation bar above to see the six stages of getting started with Apache Spark on Databricks. This guide first provides a quick start on open source Apache Spark, then builds on that knowledge to show how to use Spark DataFrames with Spark SQL. We will also discuss how to use Datasets and how DataFrames and Datasets are now unified. The guide also has quick starts for Machine Learning and Streaming so you can easily apply them to your data problems. Each of these modules covers a standalone usage scenario (including IoT and home sales) with notebooks and datasets, so you can jump ahead if you feel comfortable.

Introduction to Apache Spark

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.

Structured Data: Spark SQL

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data. It also integrates closely with the rest of the Spark ecosystem (e.g., combining SQL query processing with machine learning).

Streaming Analytics: Spark Streaming

Many applications need to process and analyze not only batch data but also streams of new data in real time. Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics. It readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.

Machine Learning: MLlib

Machine learning has quickly emerged as a critical piece in mining Big Data for actionable insights. Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce). The library is usable in Java, Scala, and Python as part of Spark applications, so that you can include it in complete workflows.

Graph Computation: GraphX

GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale. It comes complete with a library of common algorithms.

General Execution: Spark Core

Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

R
SQL
Python
Scala
Java

“At Databricks, we’re working hard to make Spark easier to use and run than ever, through our efforts on both the Spark codebase and support materials around it. All of our work on Spark is open source and goes directly to Apache.”

Matei Zaharia, VP, Apache Spark,
Co-founder & Chief Technologist, Databricks

For more information about Spark, you can also reference:

Get Databricks

Databricks is a Unified Analytics Platform on top of Apache Spark that accelerates innovation by unifying data science, engineering, and business. With our fully managed Spark clusters in the cloud, you can provision clusters with just a few clicks. Databricks incorporates an integrated workspace for exploration and visualization so users can learn, work, and collaborate in a single, easy-to-use environment. You can schedule any existing notebook or locally developed Spark code to go from prototype to production without re-engineering.

In addition, Databricks includes:

  • Our award-winning Massive Open Online Course, “Introduction to Big Data with Apache Spark,” which has enrolled over 76,000 participants to date
  • Massive Open Online Courses (MOOCs), including Machine Learning with Apache Spark
  • Analysis pipeline samples in R and Scala

Find all of our available courses at https://academy.databricks.com

Additional Resources

Spark: Better with Delta Lake

This series of tech talk tutorials takes you through the technology foundation of Delta Lake (Apache Spark) and the capabilities Delta Lake adds to it to power cloud data lakes.
