Today we are happy to announce the availability of Apache Spark 2.2.0 on Databricks as part of Databricks Runtime 3.0.

This release marks a major milestone for Structured Streaming: it is now production ready, and the experimental tag has been removed. This release also adds support for arbitrary stateful operations in a stream, as well as Apache Kafka 0.10 support for both reading and writing using the streaming and batch APIs. In addition to extending new functionality to SparkR, Python, MLlib, and GraphX, the release focuses on usability, stability, and refinement, resolving over 1,100 tickets.

This blog post discusses some of the high-level changes, improvements, and bug fixes:

  • Production ready Structured Streaming
  • Expanding SQL functionalities
  • New distributed machine learning algorithms in R
  • Additional Algorithms in MLlib and GraphX

Structured Streaming

Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications. Our goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.
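To see what the API looks like, here is a minimal Structured Streaming sketch in PySpark: it treats a stream of text lines as an unbounded table and maintains a running word count. The socket source, host, and port are illustrative; any supported source works the same way.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a stream of text lines as an unbounded table (socket source,
# host, and port are illustrative assumptions).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame operations used in batch jobs apply to the stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously emit the updated counts; the console sink is for demos.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```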

As the third release in the 2.x line, Spark 2.2 declares Structured Streaming production ready and removes the experimental tag.

At Databricks, we religiously believe in dogfooding. Using a release candidate of Spark 2.2, we ported some of our internal data pipelines to Structured Streaming and worked with several of our customers to port their production pipelines as well.
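The Kafka 0.10 support mentioned above is exposed through the same DataFrame reader interface, for both streaming and batch queries. Here is a hedged sketch, assuming an illustrative broker at localhost:9092 and a topic named events:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaExample").getOrCreate()

# Streaming read from Kafka (broker address and topic name are
# illustrative assumptions).
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers binary key/value columns; cast them to strings.
events = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# The same source also works with the batch API, new in Spark 2.2.
batch = (spark.read
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load())
```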

SQL and Core APIs

Since the Spark 2.0 release, Spark has been one of the most feature-rich and standards-compliant SQL query engines in the Big Data space. It can connect to a variety of data sources and supports SQL:2003 features such as analytic functions and subqueries. Spark 2.2 adds a number of SQL functionalities:

  • API updates: Unify the CREATE TABLE syntax for data source and Hive SerDe tables, and add broadcast hints such as BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries (see the sketch after this list)
  • Overall performance and stability:
    • Cost-based optimizer cardinality estimation for filter, join, aggregate, project, and limit/sample operators
    • Cost-based join re-ordering
    • TPC-DS performance improvements using star-schema heuristics
    • File listing/IO improvements for CSV and JSON
    • Partial aggregation support for HiveUDAFFunction
    • Introduction of a JVM object-based aggregate operator
  • Other notable changes:
    • Support for parsing multi-line JSON and CSV files (also shown in the sketch below)
    • ANALYZE TABLE command on partitioned tables
    • Dropping staging directories and data files after completion of insertion/CTAS against Hive SerDe tables
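To make two of these concrete, here is a minimal PySpark sketch of the new broadcast hint syntax and multi-line JSON parsing. The table names (facts, dim) and the file path are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sql22Features").getOrCreate()

# Broadcast hint in SQL: ask the planner to broadcast the smaller
# table in the join (facts and dim are hypothetical registered tables).
result = spark.sql("""
    SELECT /*+ BROADCAST(dim) */ *
    FROM facts JOIN dim ON facts.key = dim.key
""")

# Multi-line JSON parsing, new in 2.2: a file may contain a single
# JSON document spanning many lines (the path is illustrative).
df = spark.read.option("multiLine", "true").json("/data/records.json")
```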

MLlib, SparkR, and Python

The last major set of changes in Spark 2.2 focuses on advanced analytics and Python. You can now install PySpark from PyPI with pip install pyspark. To boost advanced analytics, a few new algorithms were also added to MLlib and GraphX.

Spark 2.2 also adds support for the following distributed algorithms in SparkR:

  • ALS
  • Isotonic Regression
  • Multilayer Perceptron Classifier
  • Random Forest
  • Gaussian Mixture Model
  • LDA
  • Multiclass Logistic Regression
  • Gradient Boosted Trees

In addition, Spark 2.2 brings several other SparkR features:

  • Structured Streaming API for R
  • Column functions to_json and from_json for R
  • Multi-column approxQuantile in R

With the addition of these algorithms, SparkR has become the most comprehensive library for distributed machine learning on R.
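Many of these algorithms also have counterparts in PySpark's ml package. As a minimal sketch, here is a Random Forest classifier trained on a toy DataFrame; the column names and values are illustrative assumptions, not from this release:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("RandomForestExample").getOrCreate()

# Illustrative toy data; column names and values are assumptions.
data = spark.createDataFrame(
    [(0.0, 1.0, 0.5, 0.0),
     (1.0, 0.0, 0.3, 1.0),
     (0.5, 0.5, 0.9, 0.0)],
    ["f1", "f2", "f3", "label"])

# Assemble the feature columns into the vector column the estimator expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"],
                            outputCol="features")
train = assembler.transform(data)

# Train a small forest and score the training data.
model = RandomForestClassifier(labelCol="label", featuresCol="features",
                               numTrees=10).fit(train)
model.transform(train).select("label", "prediction").show()
```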

While this blog post covers only some of the major features in this release, you can read the official release notes for the complete list of changes.

If you want to try out these new features, you can use Spark 2.2 in Databricks Runtime 3.0. Sign up for a free trial account here.