Databricks is pleased to announce the release of Databricks Runtime 5.5. This release includes Apache Spark 2.4.3 along with several important improvements and bug fixes, as noted in the latest release notes [Azure|AWS]. We recommend that all users upgrade to take advantage of this new runtime release. This blog post gives a brief overview of some of the new high-value features that improve performance, compatibility, and manageability and simplify machine learning on Databricks.
In Databricks Runtime 5.5 we are previewing a feature called Instance Pools, which significantly reduces the time it takes to launch a Databricks cluster. Today, launching a new cluster requires acquiring virtual machines from your cloud provider, which can take up to several minutes. With Instance Pools, you can hold back a set of virtual machines so they can be used to rapidly launch new clusters. You pay only cloud provider infrastructure costs while virtual machines are not being used in a Databricks cluster, and pools can scale down to zero instances, avoiding costs entirely when there are no workloads.
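As a rough illustration of the pool settings described above, the sketch below builds a request body for creating a pool. The field names are assumed from the Instance Pools REST API and should be checked against the current API reference; the pool name, node type, and numeric values are hypothetical.

```python
import json


def build_pool_request(name, node_type_id, min_idle=0, max_capacity=10,
                       idle_autotermination_minutes=15):
    """Build a request body for creating an instance pool.

    Field names are assumed from the Instance Pools REST API; the default
    values here are illustrative only.
    """
    if min_idle < 0 or max_capacity < min_idle:
        raise ValueError("need 0 <= min_idle <= max_capacity")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        # Zero idle instances lets the pool scale down to nothing when unused.
        "min_idle_instances": min_idle,
        "max_capacity": max_capacity,
        "idle_instance_autotermination_minutes": idle_autotermination_minutes,
    }


# Hypothetical pool name and node type.
payload = build_pool_request("etl-pool", "i3.xlarge")
print(json.dumps(payload, indent=2))
```

Setting the minimum idle count to zero matches the scale-to-zero behavior described above, while a higher value trades some infrastructure cost for faster cluster starts.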
As of Databricks Runtime 5.5, you can make Delta Lake tables available for querying from Presto and Amazon Athena. These tables can be queried just like tables with data stored in formats like Parquet. This feature is implemented using manifest files. When an external table is defined in the Hive metastore using manifest files, Presto and Amazon Athena use the list of files in the manifest rather than finding the files by directory listing.
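The manifest generation step can be sketched as follows, assuming the `GENERATE symlink_format_manifest` SQL command documented for Delta Lake is available on your runtime. The helper names and the table location are hypothetical, and `spark` is expected to be an existing SparkSession such as the one provided in a notebook.

```python
def manifest_sql(table_path):
    """Return the SQL statement that writes manifest files for a Delta table."""
    if "`" in table_path:
        raise ValueError("table path must not contain backticks")
    return "GENERATE symlink_format_manifest FOR TABLE delta.`{}`".format(table_path)


def generate_manifest(spark, table_path):
    """Run the statement on an existing SparkSession."""
    spark.sql(manifest_sql(table_path))


# Hypothetical table location.
print(manifest_sql("s3://my-bucket/events"))
```

Defining the external table in the Hive metastore is a separate step and is not shown here.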
We’ve partnered with Amazon Web Services to bring AWS Glue to Databricks. Databricks Runtime can now use AWS Glue as a drop-in replacement for the Hive metastore. For further information, see Using AWS Glue Data Catalog as the Metastore for Databricks Runtime.
The Databricks Filesystem (DBFS) is a layer on top of cloud storage that abstracts away peculiarities of underlying cloud storage providers. The existing DBFS FUSE client lets processes access DBFS using local filesystem APIs. However, it was designed mainly for convenience rather than performance. We introduced high-performance FUSE storage at location file:/dbfs/ml for Azure in Databricks Runtime 5.3 and for AWS in Databricks Runtime 5.4. DBFS FUSE v2 extends the improved performance from dbfs:/ml to all DBFS locations, including mounts. The feature is in private preview; to try it, contact Databricks support.
The Databricks Secrets API [Azure|AWS] lets you inject secrets into notebooks without hardcoding them. As of Databricks Runtime 5.5, this API is available in R notebooks in addition to existing support for Python and Scala notebooks. You can use the dbutils.secrets.get function to obtain secrets. Secrets are redacted before printing to a notebook cell.
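For comparison with the new R support, here is a minimal Python sketch of the same pattern: reading credentials through `dbutils.secrets.get` rather than hardcoding them. The helper name, scope, and key names are hypothetical, and `dbutils` is the object that Databricks notebooks provide, passed in explicitly so the function can also be exercised outside a notebook.

```python
def jdbc_properties(dbutils, scope, user_key="jdbc-user", password_key="jdbc-password"):
    """Build JDBC connection properties from secrets instead of literals.

    `scope`, `user_key`, and `password_key` are placeholder names; use the
    scope and keys defined in your own workspace.
    """
    return {
        "user": dbutils.secrets.get(scope=scope, key=user_key),
        "password": dbutils.secrets.get(scope=scope, key=password_key),
    }


# In a notebook, where `dbutils` is predefined:
# props = jdbc_properties(dbutils, "my-scope")
```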
Python 2 reaches end of life in 2020. Many popular projects have announced that they will cease supporting Python 2 on or before 2020, including a recent announcement for Spark 3.0. We have considered our customer base and plan to drop Python 2 support starting with Databricks Runtime 6.0, which is due for release later in 2019.
Databricks Runtime 6.0 and newer versions will support only Python 3. Databricks Runtime 4.x and 5.x will continue to support both Python 2 and 3. In addition, we plan to offer long-term support (LTS) for the last release of Databricks Runtime 5.x. You can continue to run Python 2 code in the LTS Databricks Runtime 5.x. We will soon announce which Databricks Runtime 5.x will be LTS.
With Databricks Runtime 5.5 for Machine Learning, we have made major package upgrades including:
We enabled HorovodRunner to utilize multi-GPU driver-only clusters. Previously, to use multiple GPUs, HorovodRunner users would have to spin up a driver and at least one worker. With this change, customers can now distribute training within a single node (i.e. a multi-GPU node) and thus use compute resources more efficiently. HorovodRunner is available only in Databricks Runtime for ML.
Machine learning tasks, especially in the image and video domain, often have to operate on a large number of files. In Databricks Runtime 5.4, we made available the binary file data source to help ETL arbitrary files such as images into Spark tables. In Databricks Runtime 5.5, we have added an option, recursiveFileLookup, to load files recursively from nested input directories. See binary file data source [Azure|AWS].
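A minimal sketch of loading nested directories with the option described above might look like this. The helper name and the example path are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def load_binary_files(spark, path, recursive=True):
    """Load arbitrary files under `path` into a DataFrame, one row per file."""
    reader = spark.read.format("binaryFile")
    if recursive:
        # Option added in Databricks Runtime 5.5: descend into nested directories.
        reader = reader.option("recursiveFileLookup", "true")
    return reader.load(path)


# In a notebook, with a hypothetical mount point:
# images = load_binary_files(spark, "/mnt/images")
```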
The binary file data source enables you to run model inference tasks in parallel from Spark tables using a scalar Pandas UDF. However, you might have to initialize the model for every record batch, which introduces overhead. In Databricks Runtime 5.5, we are backporting a new Pandas UDF type called “scalar iterator” from Apache Spark master. With it, you can initialize the model only once and apply it to many input batches, which can result in a 2-3x speedup for models like ResNet50. See Scalar Iterator UDFs [Azure|AWS].
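The difference between the two UDF styles can be sketched in plain Python, without Spark. This is an illustration of the pattern, not the Spark API itself: `load_model` is a hypothetical stand-in for an expensive model load, and the batches are plain lists rather than pandas Series. In PySpark, a function shaped like `predict_iterator` (taking an iterator of batches and yielding one result per batch) is what would be declared with `PandasUDFType.SCALAR_ITER`.

```python
def load_model():
    """Hypothetical stand-in for an expensive model load (e.g. reading weights)."""
    load_model.calls += 1
    # The "model" here just returns the byte length of each input item.
    return lambda batch: [len(item) for item in batch]


load_model.calls = 0


def predict_per_batch(batches):
    """Scalar-UDF style: the model is initialized again for every batch."""
    for batch in batches:
        model = load_model()
        yield model(batch)


def predict_iterator(batches):
    """Scalar-iterator style: initialize once, then stream over all batches."""
    model = load_model()
    for batch in batches:
        yield model(batch)


batches = [[b"ab", b"cde"], [b"f"], [b"ghij", b"k"]]

load_model.calls = 0
per_batch = list(predict_per_batch(batches))
per_batch_loads = load_model.calls

load_model.calls = 0
iterated = list(predict_iterator(batches))
iterator_loads = load_model.calls

print(per_batch_loads, iterator_loads)  # prints "3 1"
```

Both styles produce the same predictions; only the number of model initializations differs, which is where the savings come from when loading the model is costly.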
