Apache Spark 2.0: An Anthology of Technical Assets

Published: June 1, 2016


Older anthologies gathered contributions from various authors around a theme, bound then as a journal or periodical. Newer anthologies include multiple modes of expression, digitized now as an ebook or a blog. Both offer an exposition of the subject matter, whatever their form.

In this anthology, I’ve compiled a collection of videos, technical blogs, notebooks, webinars, podcasts, and news articles that focus on Apache Spark 2.0, now generally available.

You can try the Apache Spark 2.0 version from two places:

Spark Summit East Keynote: Apache Spark 2.0
Databricks’ CTO Matei Zaharia thanks the community for its contributions and previews Apache Spark 2.0’s three themes: simplicity, speed, and unification.

Structuring Spark: DataFrames, Datasets, and Streaming
Apache Spark committer and Databricks’ engineer Michael Armbrust sets the stage for why structure, as applied to data, is relevant, and how it affects the design of DataFrame and Dataset APIs and Streaming in Apache Spark 2.0.

A Deep-Dive in Structured Streaming in Apache Spark 2.0
Databricks' Spark committer Tathagata Das gives a tech talk on how Structured Streaming works under the hood.

Apache Spark 2.0: Easier, Faster & Smarter
Apache Spark committer and Chief Architect at Databricks Reynold Xin and Spark Community Evangelist Jules S. Damji preview Apache Spark 2.0 and showcase salient features in Databricks notebooks running a pre-release of Spark 2.0.

Introducing Apache Spark 2.0 Now Generally Available on Databricks
In a more in-depth version of the webinar, Matei Zaharia, Reynold Xin, and Michael Armbrust expound on the three thrusts behind Apache Spark 2.0 (speed, simplicity, and structured streaming), with notebooks running on Databricks.

Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
Databricks’ engineers Tim Hunter, Hossein Falaki, and Joseph Bradley showcase two approximation algorithms, one to estimate the number of distinct elements and one to compute quantiles in a large dataset, using a pre-release preview of Apache Spark 2.0 on Databricks.

Apache Spark as a Compiler: Joining a Billion Rows on your Laptop
Apache Spark is already pretty fast, but can we make it 10x faster? Reynold Xin, Sameer Agarwal, and Davies Liu explain how Tungsten’s whole-stage code generation makes it so.

Efficiently Compiling Efficient Query Plans for Modern Hardware
Adrian Colyer, former CTO of SpringSource, explores influential and important topics in the world of computer science in his blog, The Morning Paper.

Spark With Tungsten Burns Brighter
Paige Roberts (of Syncsort) opines that Tungsten represents a huge leap forward for Apache Spark, particularly in performance, and explains how it works and why it improves Spark performance.

Structured Streaming Comes to Apache Spark 2.0
O'Reilly’s Chief Data Scientist Ben Lorica sits down with Michael Armbrust and talks about life and structured streaming.

What Spark’s Structured Streaming Really Means
Ian Pointer (contributor for InfoWorld) argues that DataFrames are the best choice for Spark Streaming in Spark 2.0, and explains why structured streaming makes sense.

Apache Spark 2.0 Preview: Machine Learning Model Persistence
Databricks’ engineer Joseph Bradley shares the benefits of Machine Learning Model Persistence in Spark 2.0 Preview, and how you can save and load ML Pipelines across multiple languages in Spark.

How to Process IoT Data Using Datasets APIs
Databricks Community Edition notebook showcasing Apache Spark 2.0 Dataset APIs.

SQL Subqueries in Apache Spark 2.0
Databricks' engineers Davies Liu and Herman van Hövell provide hands-on examples of scalar and predicate subqueries.

A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets
Databricks' Spark Community Evangelist Jules S. Damji tells the tale of three Spark APIs: when to use them and why.

Spark 2.0 - Datasets and case classes
Daniel Pape, an analytics engineer at codecentric, explores and explains the type-safety features of the Dataset APIs through code examples using Scala case classes.

Continuous Applications: Evolving Streaming in Apache Spark 2.0
Databricks' co-founder and CTO Matei Zaharia shares his vision of end-to-end streaming applications, called continuous applications, built with the Structured Streaming APIs in Apache Spark 2.0.

Structured Streaming in Apache Spark 2.0: A new high-level API for streaming.
Messrs. Matei Zaharia, Tathagata Das, Reynold Xin, and Michael Armbrust explain the challenges of writing end-to-end streaming applications, called continuous applications, and elaborate on why and how Structured Streaming makes it simple.

How to Use SparkSessions in Apache Spark 2.0
Databricks' Spark Community Evangelist Jules S. Damji explores SparkSession functionality in Spark 2.0.

What’s Next?

In the coming weeks, we’ll publish a series of posts on Spark 2.0 and update this anthology. You might want to bookmark this page!
