We’re excited to announce the general availability of Databricks Runtime 5.0, which includes Apache Spark 2.4. This release offers substantial performance improvements in key areas of the platform: benchmarking workloads have shown a 16% improvement in total execution time, and Databricks Delta benefits from substantial improvements to metadata caching, improving query latency by 30%. Beyond these performance improvements, we've packed this release with many new features and enhancements; we'll highlight some of them below.
Enhanced Writes with MERGE, DELETE and UPDATE for Databricks Delta
With Databricks Runtime 5.0, we've improved Databricks Delta's MERGE, DELETE, and UPDATE commands:
- Scalable MERGE command with Databricks Delta: There is no longer a limit on the number of inserts and updates a single MERGE can perform. With previous limits eliminated, MERGE now scales to billions of rows.
You can also now use MERGE for SCD Type 1 and Type 2 queries. SCD Type 2 queries track historical data by creating multiple records for a given natural key in the dimensional tables. A typical use case that Databricks Delta now supports might look like this: given a table with a list of customers and their current addresses, SCD Type 2 queries let you update a customer's current address while preserving the record of their previous address, along with its active date range, in a single query. For further information on MERGE and these new features, consult the documentation.
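A minimal sketch of the SCD Type 2 pattern described above, assuming hypothetical tables `customers` (the dimension, with `current`, `effectiveDate`, and `endDate` tracking columns) and `updates` (new addresses keyed by `customerId`):

```sql
MERGE INTO customers
USING (
  -- Rows keyed by the natural key will match and close out old records;
  -- duplicated rows with a NULL mergeKey fall through to the INSERT clause
  SELECT updates.customerId AS mergeKey, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS mergeKey, updates.*
  FROM updates JOIN customers
    ON updates.customerId = customers.customerId
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customerId = staged_updates.mergeKey
WHEN MATCHED AND customers.current = true
             AND customers.address <> staged_updates.address THEN
  -- Retire the previous address record, closing its date range
  UPDATE SET current = false, endDate = staged_updates.effectiveDate
WHEN NOT MATCHED THEN
  -- Insert the new address as the current record
  INSERT (customerId, address, current, effectiveDate, endDate)
  VALUES (staged_updates.customerId, staged_updates.address, true,
          staged_updates.effectiveDate, NULL)
```

The UNION ALL staging trick lets a single MERGE both close out the old record and insert the new one for the same customer.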
- Subqueries are now supported in the WHERE clause of DELETE and UPDATE commands. Any subquery you would normally put in a WHERE clause for DELETE or UPDATE is now supported in Databricks Delta, as in the following examples:
```sql
-- Example 1
DELETE FROM all_events
  WHERE session_time < (SELECT min(session_time) FROM good_events)

-- Example 2
DELETE FROM orders AS t1
  WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid)

-- Example 3
DELETE FROM events
  WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01')
```
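The same subquery support applies to UPDATE. A hypothetical example (the table and column names here are illustrative, not from the original post):

```sql
-- Flag events whose category appears in an approved-categories lookup table
UPDATE events
  SET status = 'approved'
  WHERE category IN (SELECT category FROM approved_categories)
```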
For further information on UPDATE and DELETE commands, please refer to the Databricks Delta Documentation.
Improved reads using the OPTIMIZE command with Databricks Delta
In addition to the new features in this release we’ve invested heavily in improvements for Databricks Delta, including work to improve performance and stability for the OPTIMIZE command:
- The OPTIMIZE command now commits batches as soon as they complete, rather than at the end of the run as in previous releases. This improves OPTIMIZE time and performance.
- We reduced the default number of threads OPTIMIZE runs in parallel. This dramatically improves OPTIMIZE performance for large tables.
- Databricks Runtime 5.0 speeds up OPTIMIZE writes by avoiding unnecessarily sorting the data when writing to a partitioned table.
- Beginning with Databricks Runtime 5.0, OPTIMIZE ZORDER is now incremental, eliminating the need for rewriting data files that were previously Z-Ordered by the same column(s).
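Putting the improvements above together, a typical invocation might look like the following (the table and column names are hypothetical). With incremental OPTIMIZE ZORDER, rerunning this command after new data arrives rewrites only the files not already Z-Ordered by these columns:

```sql
-- Compact recent partitions and Z-order by a frequently filtered column
OPTIMIZE events
  WHERE date >= '2018-11-01'
  ZORDER BY (eventType)
```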
We’ve also improved the isolation level for Databricks Delta queries: any query with multiple references to a single Databricks Delta table (e.g., self-joins) will read from the same snapshot even if there are concurrent updates to the table.
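For example, a self-join such as the following (over a hypothetical `events` table) now reads both sides from the same snapshot, even if another writer commits between the two scans:

```sql
-- Both references to `events` resolve to one consistent table snapshot
SELECT a.session_id, a.event_time, b.event_time AS later_event_time
FROM events a
JOIN events b
  ON a.session_id = b.session_id
 AND b.event_time > a.event_time
```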
Lastly, we want to point out the improved query latency for small Databricks Delta tables; see the Databricks Runtime 5.0 release notes for details.
Structured Streaming - New Features
We’ve upgraded the Kafka client for the streaming source to version 2.0.0, an important milestone. Databricks now supports the kafka.isolation.level option to read only committed records from Kafka topics that are written using a transactional producer.
We’ve also included a new file-notification-based streaming source for Azure Blob Storage. Instead of listing a directory to find new files to process, this streaming source reads file event notifications directly. This can significantly reduce listing costs for Structured Streaming queries over files in Azure Blob Storage.
To read more about the above new features and to see the full list of improvements included in Databricks Runtime 5.0, refer to the release notes in the following locations:
- Amazon Web Services: Databricks Runtime 5.0 release notes
- Microsoft Azure: Azure Databricks Runtime 5.0 release notes
We recommend all customers upgrade to Databricks Runtime 5.0 to take advantage of these new features and performance optimizations.