Engineering blog

New Built-in Functions for Databricks SQL

Daniel Tenedorio
Entong Shen
Serge Rielau

Built-in functions extend the power of SQL with specific transformations of values for common needs and use cases. For example, the LOG10 function accepts a numeric input argument and returns the logarithm with base 10 as a double-precision floating-point result, and the LOWER function accepts a string and returns the result of converting each character to lowercase.

As part of our commitment to making it easy to migrate your data warehousing workloads to the Databricks lakehouse platform, we have carefully designed and launched dozens of new built-in functions into the core ANSI-compliant standard SQL dialect over the last year. The open-source Apache Spark community has also made significant contributions to this area, which we have integrated into Databricks Runtime as well. In this blog post we highlight a useful subset of these new functions and describe, with examples, how they can prove useful in your data processing work. Please enjoy!

Process strings and search for elements

Use Databricks SQL to quickly inspect and process strings with new functions in this category. You can quickly check if a string contains a substring, inspect its length, split strings, and check for prefixes and suffixes.


> SELECT contains('SparkSQL', 'SQL'),
         contains('SparkSQL', 'Spork');
  true, false

> SELECT len('Spark SQL ');
  10

> SELECT split_part('Hello,world,!', ',', 1);
  Hello

> SELECT startswith('SparkSQL', 'Spark'),
         endswith('SparkSQL', 'dataframes');
 true, false

Use regular expression operations to match strings against patterns, or dedicated functions to convert strings to and from numbers using format strings, and to and from URL-encoded form.


> WITH w AS (SELECT
    'Steven Jones and Stephen Smith' AS target,
    'Ste(v|ph)en' AS pattern)
-- Return the first substring that matches the pattern.
SELECT regexp_substr(target, pattern) FROM w;
 Steven

-- This format string expects:
--  * an optional sign at the beginning,
--  * followed by a dollar sign,
--  * followed by a number between 3 and 6 digits long,
--  * thousands separators,
--  * up to two digits beyond the decimal point.
> SELECT to_number('-$12,345.67', 'S$999,099.99');
 -12345.67

-- This format string produces five characters before the decimal point and two after.
> SELECT '(' || to_char(123, '99999.99') || ')';
 (  123.00)

> SELECT url_decode('http%3A%2F%2Fspark.apache.org%2Fpath%3Fquery%3D1');
 http://spark.apache.org/path?query=1

Compare numbers and timestamps

Get into the details by extracting bits and performing conditional logic on integers and floating-point numbers. Round numbers up or down with an optional target scale, or compare numbers for equality with NULL-safe semantics.


> SELECT bit_get(23Y, 3),
         bit_get(23Y, 0);
 0, 1

> SELECT ceil(5.4),
         ceil(-12.345, 1);
 6, -12.3

> SELECT floor(3345.1, -2);
 3300

> SELECT equal_null(2, 2),
         equal_null(2, 1),
         equal_null(NULL, NULL),
         equal_null(NULL, 1);
 true, false, true, false

Work with temporal values using new strongly-typed conversions. Cast values to or from the INTERVAL data types, query the current date, or add to and subtract from dates and timestamps.


> SELECT cast(INTERVAL '12:04.9900' MINUTE TO SECOND AS STRING);
  INTERVAL '12:04.99' MINUTE TO SECOND

> SELECT curdate();
 2022-12-21

-- March 31, 2022 minus 1 month yields February 28, 2022.
> SELECT dateadd(MONTH, -1, TIMESTAMP'2022-03-31 00:00:00');
 2022-02-28 00:00:00.000000
-- One month has passed even though it's not the end of the month yet because
-- the day and time line up.
> SELECT datediff(MONTH, TIMESTAMP'2021-02-28 12:00:00', TIMESTAMP'2021-03-28 12:00:00');
 1

Work with arrays, structs, and maps

Make sophisticated queries for your structured and semi-structured data with the array, struct, and map types. Construct new array values with the array constructor, or inspect existing arrays to see if they contain specific values or figure out what their positions are. Check how many elements are in an array, or extract specific elements by index.


-- This creates an array of integers.
> SELECT array(1, 2, 3);
 [1,2,3]

> SELECT array_contains(array(1, 2, 3), 2),
         array_position(array(3, 2, 1, 4, 1), 1);
 true, 3

> SELECT array_size(array(1, NULL, 3, NULL));
  4

> SELECT get(arr, 0), get(arr, 2), arr[2] FROM VALUES(array(1, 2, 3)) AS T(arr);
 1, 3, 3

> SELECT element_at(array(1, 2, 3), 2),
         try_element_at(array(1, 2, 3), 5);
 2, NULL

Maps are a powerful data type that associates unique keys with values and supports efficient lookup later. Use the map constructor to create new map values, then look up values as needed. Once created, you can concatenate maps together, or extract their keys or values as arrays.


> SELECT map(1.0, '2', 3.0, '4');
 {1.0 -> 2, 3.0 -> 4}

> SELECT map_contains_key(map(1, 'a', 2, 'b'), 2);
 true

> SELECT map_concat(map(1, 'a', 2, 'b'), map(3, 'c'));
  {1 -> a, 2 -> b, 3 -> c}

> SELECT map_keys(map(1, 'a', 2, 'b')),
         map_values(map(1, 'a', 2, 'b'));
  [1,2], [a,b]
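
You can also look up an individual value by its key; element_at, shown earlier for arrays, works on maps too. A minimal sketch:

> SELECT element_at(map(1, 'a', 2, 'b'), 2);
  b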

Perform error-safe computation

Enjoy the benefits of standard SQL with ANSI mode while preventing your long-running ETL pipelines from failing with errors, thanks to new error-safe functions. Each such function returns NULL instead of raising an exception. For example, take a look at try_add, try_subtract, try_multiply, and try_divide. You can also perform casts, compute sums and averages, and safely convert values to and from numbers and timestamps using custom formatting options.


> SELECT try_divide(3, 2), try_divide(3, 0);
 1.5, NULL

> SELECT try_cast('10' AS INT);
 10

> SELECT try_cast('a' AS INT);
 NULL

> SELECT try_sum(col) FROM VALUES (5), (10), (15) AS tab(col);
 30

> SELECT try_avg(col) FROM VALUES (5e37::DECIMAL(38, 0)), (5e37::DECIMAL(38, 0)) AS tab(col);
 NULL

-- A plus sign is optional in the format string, and so are fractional digits.
> SELECT try_to_number('$345', 'S$999,099.99');
 345.00

-- The number format requires at least three digits.
> SELECT try_to_number('$45', 'S$999,099.99');
 NULL

Aggregate groups of values together in new ways

Make data-driven decisions by asking questions about groups of values using new built-in aggregate functions. For example, you can now return any value in a group, concatenate groups into arrays, and compute histograms. You can also perform statistical calculations by querying the median or mode of a group, or get specific by looking up any arbitrary percentile.


> SELECT any_value(col) FROM VALUES (10), (5), (20) AS tab(col);
 10

> SELECT array_agg(col) FROM VALUES (1), (2), (NULL), (1) AS tab(col);
 [1,2,1]

> SELECT histogram_numeric(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
 [{"x":0,"y":1.0},{"x":1,"y":1.0},{"x":2,"y":1.0},{"x":10,"y":1.0}]

> SELECT median(DISTINCT col) FROM VALUES (1), (2), (2), (3), (4), (NULL) AS tab(col);
 2.5

-- Return the median, 40%-ile and 10%-ile.
> SELECT percentile_cont(array(0.5, 0.4, 0.1)) WITHIN GROUP (ORDER BY col)
    FROM VALUES (0), (1), (2), (10) AS tab(col);
 [1.5, 1.2000000000000002, 0.30000000000000004]

The new regr_* family of functions helps you ask questions about the values of a group where both input expressions are NOT NULL.


-- Returns the intercept of the univariate linear regression line.
> SELECT regr_intercept(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
  -1.0

-- Returns the coefficient of determination from the values.
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
  1

-- Returns the sum of products of the deviations of the two input expressions from their means.
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);
  0.6666666666666666

Each of these can also be invoked as a window function using the OVER clause.
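
For example, a sketch of one such invocation over a running window (result values omitted here):

> SELECT x, regr_count(y, x) OVER (ORDER BY x)
    FROM VALUES (1, 2), (2, 3), (2, 3), (null, 4), (4, null) AS T(y, x);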

Use encryption

Protect access to your data by encrypting it at rest and decrypting it when needed. These functions use the Advanced Encryption Standard (AES) to convert values to and from their encrypted equivalents.


> SELECT base64(aes_encrypt('Spark', 'abcdefghijklmnop'));
  4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn

> SELECT cast(aes_decrypt(unbase64('4A5jOAh9FNGwoMeuJukfllrLdHEZxA2DyuSQAWz77dfn'),
                          'abcdefghijklmnop') AS STRING);
  Spark

Apply introspection

Programmatically query properties of your Databricks cluster or configuration with SQL. For example, you can ask about the current version of your Databricks SQL or Databricks Runtime environment. You can also now use SQL to return the list of secret keys within the Databricks secret service that the current user is authorized to see, and to extract specific secret values by scope and key.


> SELECT current_version().dbsql_version;
  2022.25

> SELECT current_version();
  { NULL, 2022.25, ..., ... }

> SELECT * FROM list_secrets();

  scope         key
  ------------  ---------------
  secrets.r.us  theAnswerToLife

> SELECT secret('secrets.r.us', 'theAnswerToLife');
  fortyTwo

Build yourself a geospatial lakehouse

Efficiently process and query vast geospatial datasets at scale. In this section, we describe new SQL functions now available for organizing and processing data in this way, along with examples of how to call the functions with different input data types. For a more detailed background, please refer to the separate dedicated "Processing Geospatial Data at Scale With Databricks" blog post.

This is a geospatial visualization of taxi dropoff locations in New York City with cell colors indicating aggregated counts therein.

Databricks now supports a new collection of geospatial functions operating over H3 cells. Each H3 cell represents a unique region of space on the planet at some resolution, and has its own unique cell ID represented as a BIGINT or hexadecimal STRING expression. The boundaries of these cells can be converted to open formats including GeoJSON, a standard for representing simple geographical features using JSON, and WKT, an open text-based format for expressing geospatial data using strings (along with WKB, its binary equivalent).


-- This returns the center of the input H3 cell as a point in GeoJSON format
-- (WKB and WKT variants of this function also exist).
> SELECT h3_centerasgeojson(599686042433355775)
  {"type":"Point","coordinates":[-121.97637597255,37.345793375368]}
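
The cell boundary can be exported in the same way; a sketch assuming the h3_boundaryasgeojson variant (the WKT and WKB variants follow the same naming pattern):

-- Returns a GeoJSON Polygon tracing the boundary of the input H3 cell.
> SELECT h3_boundaryasgeojson(599686042433355775)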

You can compute the grid distance between two H3 cells, or query the H3 cells that are within (grid) distance k of an origin cell. The set of these H3 cells is called the k-ring of the origin cell. You can also convert H3 cell IDs to or from their equivalent hexadecimal string representations.


> SELECT h3_distance('85283447fffffff', '8528340ffffffff')
 2

> SELECT h3_h3tostring(599686042433355775)
  85283473fffffff

-- Returns an array of H3 cells that form a hollow hexagonal ring centered at the
-- origin H3 cell and that are at grid distance k from the origin H3 cell.
> SELECT h3_hexring('85283473fffffff', 1)
  [8528340bfffffff,85283447fffffff,8528347bfffffff,85283463fffffff,85283477fffffff,8528340ffffffff]

Furthermore, you can now compute an ARRAY of H3 cell IDs (represented as BIGINTs or STRINGs) corresponding to hexagons or pentagons that are contained by the input area geography. The try_ versions return NULL instead of raising errors.


-- This is a simple example where the input is a triangle in hexadecimal WKB format.
> SELECT h3_polyfillash3(unhex('0103000000010000000400000050fc1873d79a5ec0d0d556ec2fe342404182e2c7988f5dc0f46c567dae064140aaf1d24d628052c05e4bc8073d5b444050fc1873d79a5ec0d0d556ec2fe34240'), 2)
[586146350232502271,586147449744130047,586198577034821631,586152397546455039,586199676546449407,586153497058082815,586142501941805055,586201325813891071]

You can compute the parent or child H3 cells of an input H3 cell at a specified resolution, or check whether one H3 cell is a child of another. Representing polygons as (potentially exploded) arrays of H3 cells, and points by the H3 cells that contain them, enables very efficient spatial analytics that operate on the H3 cells rather than on the original geographic objects. Also, please refer to our recent blogs that describe how to perform spatial analytics at any scale and how to supercharge spatial analytics using H3.

Finally, you can validate H3 cells by returning the input value of type BIGINT or STRING if it corresponds to a valid H3 cell ID.


> SELECT h3_toparent('85283473fffffff', 0)
  8029fffffffffff

> SELECT h3_tochildren(599686042433355775, 6)     
[604189641121202175,604189641255419903,604189641389637631,604189641523855359,604189641658073087,604189641792290815,604189641926508543]

> SELECT h3_ischildof(608693241318998015, 599686042433355775)
  true

> SELECT h3_validate(599686042433355776)
  [H3_INVALID_CELL_ID] 599686042433355776 is not a valid H3 cell ID

> SELECT h3_isvalid(599686042433355776)
  false
  
> SELECT h3_try_validate(599686042433355776)
  NULL

Databricks SQL lets you do anything

Standards compliance and easy migration came to Databricks SQL previously with the birth of ANSI mode, and it already holds the world record in performance. With the addition of this wide array of new built-in functions, SQL workloads now have significant newfound expressiveness on the lakehouse.

Now feel free to chop up strings, aggregate values, manipulate dates, analyze geographies, and more. And if some functionality is missing from these built-ins, check out Python user-defined functions and SQL user-defined functions to define your own logic that behaves the same way at call sites as the built-ins.
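
As a quick sketch of the latter, a SQL UDF can wrap one of the new built-ins (the function name here is illustrative):

> CREATE FUNCTION area_code(phone STRING)
    RETURNS STRING
    RETURN split_part(phone, '-', 1);

> SELECT area_code('555-123-4567');
  555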

Thanks for using Databricks SQL, and happy querying!
