The inaugural Spark Summit Europe will be held in Amsterdam this October. Check out the full agenda and get your ticket before it sells out!
Today we are happy to announce the availability of Apache Spark’s 1.5 release! In this post, we outline the major development themes in Spark 1.5 and some of the new features we are most excited about. In the coming weeks, our blog will feature more detailed posts on specific components of Spark 1.5. For a comprehensive list of features in Spark 1.5, you can also find the detailed Apache release notes below.
Many of the major changes in Spark 1.5 are under-the-hood changes to improve Spark’s performance, usability, and operational stability. Spark 1.5 ships major pieces of Project Tungsten, an initiative focused on increasing Spark’s performance through several low-level architectural optimizations. The release also adds operational features for the streaming component, such as backpressure support. Another major theme of this release is data science: Spark 1.5 ships several new machine learning algorithms and utilities, and extends Spark’s new R API.
One interesting tidbit is that in Spark 1.5, we have crossed the 10,000 mark for JIRA issue numbers (i.e., more than 10,000 tickets have been filed to request features or report bugs). Hopefully the added digit won’t slow down our development too much!
Earlier this year we announced Project Tungsten - a set of major changes to Spark’s internal architecture designed to improve performance and robustness. Spark 1.5 delivers the first major pieces of Project Tungsten. This includes binary processing, which circumvents the Java object model using a custom binary memory layout. Binary processing significantly reduces garbage collection pressure for data-intensive workloads. It also includes a new code generation framework, where optimized byte code is generated at runtime for evaluating expressions in user code. Spark 1.5 adds a large number of built-in functions that are code generated, for common tasks like date handling and string manipulation.
Over the next few weeks, we will be writing about Project Tungsten. To give you a teaser, the chart below compares the out-of-the-box (i.e. no configuration changes) performance of aggregation queries using Spark 1.4 and Spark 1.5, for both small aggregations and large aggregations.
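To make that concrete, here is a minimal sketch (in Scala) of the kind of aggregation query that benefits from binary processing and code generation; the Parquet path and column names are hypothetical, and an existing `sqlContext` is assumed:

```scala
// Illustrative aggregation over a hypothetical events dataset.
// Assumes a SQLContext named `sqlContext` is already available.
import org.apache.spark.sql.functions._

val events = sqlContext.read.parquet("/data/events")        // hypothetical path
val summary = events
  .groupBy("userId")                                         // hypothetical column
  .agg(sum("bytes").as("totalBytes"), avg("bytes").as("avgBytes"))
summary.show()
```

No configuration changes are needed to benefit; as in the chart above, the new code paths are used out of the box in 1.5.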
The release also includes a variety of other performance enhancements. Support for the Apache Parquet file format sees improved input/output performance, with predicate pushdown now enabled by default and a faster metadata lookup path. Spark’s joins also receive some attention, with a new broadcast outer join operator and the ability to perform sort-merge outer joins.
Spark 1.5 also focuses on usability aspects - such as providing interoperability with a wide variety of environments. After all, you can only use Spark if it connects to your data source or works on your cluster. And Spark programs need to be easy to understand if you want to debug them.
Spark 1.5 adds visualization of SQL and DataFrame query plans in the web UI, with dynamic update of operational metrics such as the selectivity of a filter operator and the runtime memory usage of aggregations and joins. Below is an example of a plan visualization from the web UI (click on the image to see details).
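The same plan can also be inspected programmatically. Continuing the hypothetical `events` DataFrame from the sketch above, `explain()` prints the logical and physical plans that the new UI page renders graphically:

```scala
// Print the query plans for a filter + aggregation; the 1.5 web UI shows a
// graphical version of the same plan, annotated with runtime metrics.
val popularUsers = events
  .filter(events("bytes") > 1024)      // hypothetical predicate
  .groupBy("userId")
  .count()
popularUsers.explain(true)             // `true` also prints the logical plans
```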
In addition, we have invested a significant amount of work to improve interoperability with other ecosystem projects. As an example, using classloader isolation techniques, a single instance of Spark (SQL and DataFrames) can now connect to multiple versions of Hive metastores, from Hive 0.12 all the way to Hive 1.2.1. Aside from being able to connect to different metastores, Spark can now read several Parquet variants generated by other systems, including parquet-avro, parquet-thrift, parquet-protobuf, Impala, and Hive. To the best of our knowledge, Spark is the only system that is capable of connecting to various versions of Hive and supporting the litany of Parquet formats that exist in the wild.
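For example, the metastore version a deployment talks to is controlled through configuration. A minimal sketch, assuming the `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` settings and an illustrative jar location:

```scala
// Connect Spark SQL to a specific Hive metastore version via configuration.
// The version string and classpath below are placeholders for illustration.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("hive-metastore-example")
  .set("spark.sql.hive.metastore.version", "0.13.1")               // target metastore version
  .set("spark.sql.hive.metastore.jars", "/opt/hive-0.13.1/lib/*")  // hypothetical classpath
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
```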
Spark Streaming adds several new features in this release, with a focus on operational stability for long-lived production streaming workloads. These features are largely based on feedback from existing streaming users. Spark 1.5 adds backpressure support, which throttles the rate of receiving when the system is in an unstable state. For example, if there is a large burst in input, or a temporary delay in writing output, the system will adjust dynamically and ensure that the streaming program remains stable. This feature was developed in collaboration with Typesafe.
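Backpressure is opt-in and enabled through a single configuration flag. A minimal sketch of turning it on for a streaming application (the application name and batch interval are illustrative):

```scala
// Enable the new backpressure mechanism so receiving rates adapt to load.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-example")
  .set("spark.streaming.backpressure.enabled", "true")   // introduced in 1.5
val ssc = new StreamingContext(conf, Seconds(2))          // illustrative batch interval
// ... define input streams and output operations here ...
ssc.start()
ssc.awaitTermination()
```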
A second operational addition is the ability to load balance and schedule data receivers across a cluster, along with better control over re-launching receivers for long-running jobs. Spark Streaming also adds several Python APIs in this release, including support for Amazon Kinesis, Apache Flume, and the MQTT protocol.
One of the primary focuses of Spark in 2015 is to empower large-scale data science. We kicked this theme off with three major additions to Spark: DataFrames, machine learning pipelines, and R language support. These three additions brought APIs similar to best-in-class single-node tools to Spark. In Spark 1.5, we have greatly expanded their capabilities.
After the initial release of DataFrames in Spark 1.3, one of the most common user requests was to support more string and date/time functions out-of-the-box. We are happy to announce that Spark 1.5 introduces over 100 built-in functions. Almost all of these built-in functions also implement code generation, so applications using them can take advantage of the changes we made as part of Project Tungsten.
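As a flavor of what is now available, here is a minimal sketch using a few of the new string and date/time functions; the DataFrame `df` and its column names are hypothetical:

```scala
// A few of the new built-in, code-generated functions from Spark 1.5.
import org.apache.spark.sql.functions._

val enriched = df.select(
  concat_ws(" ", df("firstName"), df("lastName")).as("fullName"),     // string handling
  datediff(current_date(), df("signupDate")).as("daysSinceSignup"),   // date arithmetic
  date_add(df("signupDate"), 30).as("trialEnds")
)
```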
R language support was introduced as an alpha component in Spark 1.4. Spark 1.5 improves R usability and introduces support for scalable machine learning via integration with MLlib. The R frontend now supports generalized linear models (GLMs) with R formula syntax, binomial and Gaussian families, and elastic-net regularization.
For machine learning, Spark 1.5 brings better coverage for the new pipeline API, with new pipeline modules and algorithms. New pipeline features include feature transformers such as CountVectorizer, DCT, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer; algorithms such as the multilayer perceptron classifier, enhanced tree models, k-means, and naive Bayes; and tuning tools such as train-validation split and the multiclass classification evaluator. Other new algorithms include PrefixSpan for sequential pattern mining, association rule generation, and a 1-sample Kolmogorov-Smirnov test.
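As an illustration, here is a minimal sketch that wires two of the new 1.5 transformers into a pipeline; the column names and the choice of logistic regression as the final estimator are illustrative:

```scala
// Combine new 1.5 feature transformers (StopWordsRemover, CountVectorizer)
// with an existing estimator in a single ML pipeline.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover, Tokenizer}

val tokenizer  = new Tokenizer().setInputCol("text").setOutputCol("rawWords")
val remover    = new StopWordsRemover().setInputCol("rawWords").setOutputCol("words")
val vectorizer = new CountVectorizer().setInputCol("words").setOutputCol("features")
val lr         = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, vectorizer, lr))
// val model = pipeline.fit(trainingDF)   // trainingDF: DataFrame with "text" and "label" columns
```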
The 1.5 release is also a good time to mention the growth of Spark’s package ecosystem. Today, there are more than 100 packages that can be enabled with a simple flag for any Spark program. Packages include machine learning algorithms, connectors to various data sources, experimental new features, and much more. Several packages have released updates coinciding with the Spark 1.5 release, including the spark-csv, spark-redshift, and spark-avro data source connectors.
Spark 1.5.0 featured contributions from more than 230 developers - thanks to everyone who helped make this release possible! Stay tuned to the Databricks blog to learn more about Spark 1.5’s features and get a peek at upcoming Spark development.
For your convenience, we have attached the entire release notes here. If you want to try out these new features, you can already use Spark 1.5 in Databricks. Sign up for a 14-day free trial here.
Note that column names containing dots must be escaped with backticks when referenced as strings (e.g., table.`column.with.dots`.nested).
In the spark.mllib package, there are no breaking API changes but some behavior changes:
In the experimental spark.ml package, there is one breaking API change and one behavior change:
The following issues are known in 1.5.0 and will be fixed in the 1.5.1 release.
We would like to thank the following organizations for testing the release candidates with their workloads: Tencent, Mesosphere, Typesafe, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks.
Last but not least, this release would not have been possible without the following contributors: Aaron Davidson, Adam Roberts, Ai He, Akshat Aranya, Alex Shkurenko, Alex Slusarenko, Alexander Ulanov, Alok Singh, Amey Chaugule, Andrew Or, Andrew Ray, Animesh Baranawal, Ankur Chauhan, Ankur Dave, Ben Fradet, Bimal Tandel, Brennan Ashton, Brennon York, Brian Lockwood, Bryan Cutler, Burak Yavuz, Calvin Jia, Carl Anders Duvel, Carson Wang, Chen Xu, Cheng Hao, Cheng Lian, Cheolsoo Park, Chris Freeman, Christian Kadner, Cody Koeninger, Damian Guy, Daniel Darabos, Daniel Emaasit, Daoyuan Wang, Dariusz Kobylarz, David Arroyo Cazorla, Davies Liu, DB Tsai, Dennis Huo, Deron Eriksson, Devaraj K, Dibyendu Bhattacharya, Dong Wang, Emiliano Leporati, Eric Liang, Favio Vazquez, Felix Cheung, Feynman Liang, Forest Fang, Francois Garillot, Gen Tang, George Dittmar, Guo Wei, GuoQiang Li, Han JU, Hao Zhu, Hari Shreedharan, Herman Van Hovell, Holden Karau, Hossein Falaki, Huang Zhaowei, Hyukjin Kwon, Ilya Ganelin, Imran Rashid, Iulian Dragos, Jacek Lewandowski, Jacky Li, Jan Prach, Jean Lyn, Jeff Zhang, Jiajin Zhang, Jie Huang, Jihong MA, Jonathan Alter, Jose Cambronero, Joseph Batchik, Joseph Gonzalez, Joseph K. Bradley, Josh Rosen, Judy Nash, Juhong Park, Kai Sasaki, Kai Zeng, KaiXinXiaoLei, Kan Zhang, Kashif Rasul, Kay Ousterhout, Keiji Yoshida, Kenichi Maehashi, Keuntae Park, Kevin Conor, Konstantin Shaposhnikov, Kousuke Saruta, Kun Xu, Lars Francke, Leah McGuire, lee19, Liang-Chi Hsieh, Lianhui Wang, Luca Martinetti, Luciano Resende, Manoj Kumar, Marcelo Vanzin, Mark Smith, Martin Zapletal, Matei Zaharia, Mateusz Buskiewicz, Matt Massie, Matthew Brandyberry, Meethu Mathew, Meihua Wu, Michael Allman, Michael Armbrust, Michael Davies, Michael Sannella, Michael Vogiatzis, Michel Lemay, Mike Dusenberry, Min Zhou, Mingfei Shi, mosessky, Moussa Taifi, Mridul Muralidharan, NamelessAnalyst, Namit Katariya, Nan Zhu, Nathan Howell, Navis Ryu, Neelesh Srinivas Salian, Nicholas Chammas, Nicholas Hwang, Nilanjan Raychaudhuri, Niranjan Padmanabhan, Nishkam Ravi, Nishkam Ravi, Noel Smith, Oleksiy Dyagilev, Oleksiy Dyagilev, Paavo Parkkinen, Patrick Baier, Patrick Wendell, Pawel Kozikowski, Pedro Rodriguez, Perinkulam I. Ganesh, Piotr Migdal, Prabeesh K, Pradeep Chhetri, Prayag Chandran, Punya Biswal, Qian Huang, Radek Ostrowski, Rahul Palamuttam, Ram Sriharsha, Rekha Joshi, Rekha Joshi, Rene Treffer, Reynold Xin, Roger Menezes, Rohit Agarwal, Rosstin Murphy, Rowan Chattaway, Ryan Williams, Saisai Shao, Sameer Abhyankar, Sandy Ryza, Santiago M. Mola, Scott Taylor, Sean Owen, Sephiroth Lin, Seth Hendrickson, Sheng Li, Shilei Qian, Shivaram Venkataraman, Shixiong Zhu, Shuo Bai, Shuo Xiang, Simon Hafner, Spiro Michaylov, Stan Zhai, Stefano Parmesan, Steve Lindemann, Steve Loughran, Steven She, Su Yan, Sudhakar Thota, Sun Rui, Takeshi YAMAMURO, Takuya Ueshin, Tao Li, Tarek Auel, Tathagata Das, Ted Blackman, Ted Yu, Thomas Omans, Thomas Szymanski, Tien-Dung Le, Tijo Thomas, Tim Ellison, Timothy Chen, Tom Graves, Tom White, Tomohiko K., Vincent D. Warmerdam, Vinod K C, Vinod KC, Vladimir Vladimirov, Vyacheslav Baranov, Wang Tao, Wang Wei, Weizhong Lin, Wenchen Fan, Wisely Chen, Xiangrui Meng, Xu Tingjun, Xusen Yin, Yadong Qi, Yanbo Liang, Yash Datta, Yijie Shen, Yin Huai, Yong Tang, Yu ISHIKAWA, Yuhao Yang, Yuming Wang, Yuri Saito, Yuu ISHIKAWA, Zc He, Zhang, Liye, Zhichao Li, Zhongshuai Pei, Zoltan Zvara, and a few unknown contributors (please indicate your email and name in your git commit to show up here).
You can download the release at https://spark.apache.org/downloads.html