Databricks SQL Year in Review (Part II): SQL Programming Features

The new SQL feature highlights from Databricks in 2023

Published: January 31, 2024


Welcome to the blog series covering product advancements in 2023 for Databricks SQL, the serverless data warehouse from Databricks. This is part 2, where we highlight many of the new SQL programming features delivered in the past year. Naturally, every SQL developer wants to be more productive and tackle ever more complex scenarios with ease -- adding SQL features like these helps developers and our customers get the most out of their Databricks SQL warehouse. This is all part of the Data Intelligence Platform from Databricks, built on the lakehouse architecture that combines the best of data warehousing and data lakes, which is why the best data warehouse is a lakehouse.

Without further ado, here are the highlight SQL programming features from 2023:

Lateral Column Alias Support

If coffee is not good for us, why does everyone drink it? Lateral column alias support is like that. It goes against SQL's principles, but it sure comes in handy: this feature allows you to reference the result of a SQL expression in the select list in any following expression in that same select list. You will look back and wonder how you could have been forced for so long to push down a subquery just to share an expression, all in the name of SQL purity.

Before:
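A minimal sketch of the subquery workaround, using an inline VALUES table. The `orders`, `price`, and `quantity` names are made up for illustration:

```sql
-- Without lateral column alias, `total` must be computed in a subquery
-- before another select-list expression can reuse it.
SELECT total,
       total * 0.1 AS tax
  FROM (SELECT price * quantity AS total
          FROM VALUES (10.0, 2), (5.0, 4) AS orders(price, quantity));
```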

After (with Lateral Column Alias):
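The same illustrative query, sketched without the subquery. It again uses the hypothetical `orders` data:

```sql
-- `total` is defined in the select list and reused by the next expression.
SELECT price * quantity AS total,
       total * 0.1      AS tax
  FROM VALUES (10.0, 2), (5.0, 4) AS orders(price, quantity);
```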

See Introducing Lateral Column Alias to learn more.

Error classes and SQLSTATEs

It has been a long time coming, but most error conditions you encounter in Databricks will present you with a human-readable error classification and a SQL standard-based SQLSTATE. These error messages are documented, and for Python and Scala, Databricks also provides methods that allow you to handle error conditions programmatically without building a dependency on error message text.

Example:
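As a small illustration, assuming no table or view named `does_not_exist` is visible in the current schema, a failing query reports both an error class and a SQLSTATE. The exact message text may differ between releases:

```sql
-- Assumption: `does_not_exist` is a hypothetical name that resolves to nothing.
SELECT * FROM does_not_exist;

-- The error is reported along these lines:
--   [TABLE_OR_VIEW_NOT_FOUND] The table or view `does_not_exist` cannot be found. ...
--   SQLSTATE: 42P01
```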

See Error Handling in Databricks to learn more.

General table-valued function support

2023 saw many improvements in the area of table-valued function support. We kicked things off by generalizing and standardizing the invocation of table functions so that you can now invoke all table functions in the FROM clause of a query, even generator functions such as explode(), and there is no more need for the LATERAL VIEW syntax.

Before:
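A sketch of the older LATERAL VIEW form, with an inline table whose names (`t`, `id`, `arr`, `elem`) are purely illustrative:

```sql
-- Generator function invoked through the LATERAL VIEW clause.
SELECT t.id, e.elem
  FROM VALUES (1, array(10, 20)), (2, array(30)) AS t(id, arr)
  LATERAL VIEW explode(arr) e AS elem;
```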

After:
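The same illustrative data, with explode() invoked like any other table function in the FROM clause:

```sql
-- explode() is a regular table reference; LATERAL lets it see columns of `t`.
SELECT t.id, e.elem
  FROM VALUES (1, array(10, 20)), (2, array(30)) AS t(id, arr),
       LATERAL explode(t.arr) AS e(elem);
```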

See Table Valued Function Invocation to learn more.

Python UDF and UDTF with polymorphism

SQL UDFs were introduced in Databricks 9 and were a smashing success, but the Python crowd got jealous and they upped the ante! You can now:

Create Python UDFs and put all that shiny logic into them.
Pass tables to Python Table UDFs using the SQL Standard TABLE syntax. This is called polymorphism, where the UDF can behave differently depending on the signature of the passed table.

Example:
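A minimal sketch of a scalar Python UDF defined from SQL. The function name `redact_email` and its logic are made up for illustration, and the sketch assumes a Unity Catalog schema in which you may create functions:

```sql
-- Hypothetical helper: keep the first character of the local part of an address.
CREATE OR REPLACE FUNCTION redact_email(email STRING)
  RETURNS STRING
  LANGUAGE PYTHON
  AS $$
    if email is None:
        return None
    name, _, domain = email.partition("@")
    return name[:1] + "***@" + domain
  $$;

SELECT redact_email('alice@example.com') AS redacted;
```

Passing a table to a Python table function would take the form `my_udtf(TABLE(my_table))` in the FROM clause, where `my_udtf` and `my_table` are hypothetical names for a previously registered UDTF and an existing table.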

See Introducing Python User Defined Table Functions, Function invocation | Databricks on AWS, and python_udtf.rst: Table Input Argument to learn more.

Unnamed Parameter Markers

In 2022, we introduced parameter markers that allow a SQL query to refer to placeholder variables passed into the SQL using, e.g. the spark.sql() API. The initial support consisted of named parameter markers, meaning your Python, Java, or Scala values are passed to SQL using a map where the keys line up with the name of the parameter marker. This is great and allows you to refer to the same argument repeatedly and out of order.

In 2023, we expanded this support to include unnamed parameter markers. Now, you can pass an array of values, and they are assigned to the markers in order of occurrence.

Example:
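The text above describes passing values through the spark.sql() API. As a SQL-only sketch of the same positional binding, the following uses EXECUTE IMMEDIATE, assuming a runtime that supports that statement. Each `?` is bound to the USING arguments in order:

```sql
-- The first `?` receives 5 and the second receives 6.
EXECUTE IMMEDIATE
  'SELECT sum(c1) AS total FROM VALUES (?), (?) AS t(c1)'
  USING 5, 6;
```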

See Unnamed Parameter Markers to learn more.

SQL Session Variables

Parameter markers are great. We love them. But it would be even nicer if we could avoid passing results from SQL back via dataframes, just to turn around and pass them back into SQL via parameter markers. That's where SQL Session Variables come in: a session variable is a scalar (as in: not a table) object that is private to your SQL session for both its definition and the values it holds. You can now:

Declare a session variable with a type and an initial default value.
Set one or more variables based on the result of a SQL expression or query.
Reference variables within any query or DML statement.

This makes for a great way to break up queries and pass state from one query to the next.

Example:
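A short sketch of the three steps, using a made-up variable `max_qty` and an inline `orders` table:

```sql
-- Declare a typed variable with a default value.
DECLARE OR REPLACE VARIABLE max_qty INT DEFAULT 0;

-- Set it from the result of a scalar subquery.
SET VAR max_qty = (SELECT max(quantity)
                     FROM VALUES (2), (4), (3) AS orders(quantity));

-- Reference it in a later query.
SELECT quantity
  FROM VALUES (2), (4), (3) AS orders(quantity)
 WHERE quantity = max_qty;
```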

See Variables to learn more.

IDENTIFIER clause

In the previous two highlights, we showed how to parameterize queries with values passed in from your application or notebook, or even using session variables looked up in a table. But don't you also want to parameterize identifiers, say, table names, function names, and such, without becoming the butt of an XKCD joke on SQL injection? The IDENTIFIER clause allows you to do just that. It magically turns string values in session variables or provided using parameter markers into SQL names to be used as function, table, or column references.

Example:
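A sketch that turns two session variables into a function name and a column name. The variable names and the inline `places` table are illustrative only:

```sql
DECLARE OR REPLACE VARIABLE fn_name  STRING DEFAULT 'upper';
DECLARE OR REPLACE VARIABLE col_name STRING DEFAULT 'city';

-- The strings are interpreted as identifiers rather than spliced into the SQL text.
SELECT IDENTIFIER(fn_name)(IDENTIFIER(col_name)) AS result
  FROM VALUES ('berlin'), ('oslo') AS places(city);
```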

See IDENTIFIER clause to learn more.

INSERT BY NAME

INSERT BY NAME is a nice usability feature that makes you wonder why SQL wasn't born that way to handle wide tables (i.e. tables with many columns). When you deal with many columns, raise your hand if you enjoy looking up the order in which you must provide the columns in the select list feeding that INSERT. Or do you prefer spelling out the lengthy column list of the insert target? Nobody does.

Now, instead of providing that column list and checking and double-checking the select list order, you can tell Databricks to do it for you. Just INSERT BY NAME, and Databricks will line your select list up with your table columns.

Example:
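A minimal sketch with a hypothetical `products` table. The select list deliberately lists the columns in a different order than the table:

```sql
CREATE OR REPLACE TABLE products (id INT, name STRING, price DOUBLE);

-- Columns are matched to the target by name, not by position.
INSERT INTO products BY NAME
  SELECT 9.99 AS price, 'widget' AS name, 1 AS id;

SELECT id, name, price FROM products;
```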

See INSERT INTO to learn more.

Named Parameter invocation

Imagine you wrote a function that takes 30 arguments and most of them have a sensible default. But now you must invoke it with that last argument, which is not the default. Just "skip ahead" and set only that one parameter and don't worry about the order of arguments! Just tell the argument which parameter it's meant for.

Example:
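To keep the sketch short, the hypothetical function below has three parameters instead of 30. The last one is set by name while the middle one keeps its default:

```sql
-- `greet` and its parameters are made up for illustration.
CREATE OR REPLACE TEMPORARY FUNCTION greet(
    name     STRING,
    greeting STRING DEFAULT 'Hello',
    suffix   STRING DEFAULT '!')
  RETURNS STRING
  RETURN concat(greeting, ', ', name, suffix);

-- Skip `greeting` and set only `suffix`.
SELECT greet('Ada', suffix => '?') AS message;  -- Hello, Ada?
```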

See Named Parameter Invocation to learn more.

TIMESTAMP without timezone

By default, Databricks timestamps are "with local timezone". When you provide a timestamp, Databricks will assume it is in your local timezone and store it normalized to UTC. When you read it back, this translation is undone and the value looks just as you entered it. If, however, another user reads the timestamp back from another timezone, they will see the normalized timestamp translated to their timezone.

This is a great feature unless you want to just store a timestamp "as is". TIMESTAMP_NTZ is a new type that takes time at face value. You give it 2 pm on Jan 4, 2024, and it will store that.

Example:
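A sketch contrasting the two types with a hypothetical `ts_demo` table. It assumes a runtime where new Delta tables accept TIMESTAMP_NTZ columns, and the two time zones are arbitrary choices:

```sql
SET TIME ZONE 'America/Los_Angeles';

CREATE OR REPLACE TABLE ts_demo (ts_ltz TIMESTAMP, ts_ntz TIMESTAMP_NTZ);
INSERT INTO ts_demo VALUES
  (TIMESTAMP'2024-01-04 14:00:00', TIMESTAMP_NTZ'2024-01-04 14:00:00');

-- Read the row back as a user in a different time zone.
SET TIME ZONE 'Europe/Berlin';
SELECT ts_ltz, ts_ntz FROM ts_demo;
-- ts_ltz is translated to the reader's zone (nine hours later in January),
-- while ts_ntz is returned at face value: 2024-01-04 14:00:00.
```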

See Introducing TIMESTAMP_NTZ to learn more.

Federated query support

Of course we know that all your data is already in the lakehouse. But if you have friends who still have some data elsewhere, tell them not to fret. They can still access this data from Databricks by registering those foreign tables with Databricks Unity Catalog and running all their SQL queries against them without having to leave Databricks. Simply register a connection to the remote system, link a remote catalog (aka database), and query the content. Of course, you can mix and match local and foreign tables in the same query.

Example:
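A sketch of the three steps against a PostgreSQL source. Every name here is a placeholder assumption (connection, host, secret scope, catalogs, tables), so it will only run once you substitute values from your own environment:

```sql
-- 1. Register a connection to the remote system (hypothetical host and secrets).
CREATE CONNECTION pg_conn TYPE postgresql
  OPTIONS (
    host     'pg.example.com',
    port     '5432',
    user     secret('demo_scope', 'pg_user'),
    password secret('demo_scope', 'pg_password'));

-- 2. Link a remote database as a foreign catalog.
CREATE FOREIGN CATALOG pg_sales
  USING CONNECTION pg_conn
  OPTIONS (database 'sales');

-- 3. Mix foreign and local tables in one query.
SELECT c.name, sum(o.amount) AS total
  FROM pg_sales.public.orders AS o
  JOIN main.default.customers AS c ON c.id = o.customer_id
 GROUP BY c.name;
```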

See Federated Queries to learn more.

Row-level Security and Column Masking

Feeling secretive? Do you need to give some users access to your table, but would prefer not to show all its secrets? Row-level Security and column masking are what you need. You can give other users and groups access to a table, but establish rules tailored to them on what rows they can see. You can even blank out or otherwise obfuscate PII (Personally Identifiable Information) such as substituting stars for all but the last three digits of the credit card number.

To add a row filter, create a UDF that determines whether the user can see a row based on the function arguments. Then add the row filter to your table using ALTER TABLE or do so when you CREATE TABLE.

Example:
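A minimal sketch, assuming a hypothetical `sales` table and an account group named `admins`. Members of that group see every row, and everyone else sees only US rows:

```sql
-- The filter function returns TRUE for rows the current user may see.
CREATE OR REPLACE FUNCTION us_only(region STRING)
  RETURNS BOOLEAN
  RETURN is_account_group_member('admins') OR region = 'US';

CREATE OR REPLACE TABLE sales (region STRING, amount INT);

-- Attach the filter, passing the `region` column as its argument.
ALTER TABLE sales SET ROW FILTER us_only ON (region);
```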

To add a column mask, create a UDF that takes data of a certain type, modifies it based on the user, and returns the result. Then attach the mask to the column when you create the table or using ALTER TABLE.

Example:
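A sketch along the lines of the credit card example above. The `payments` table, the `mask_card` function, and the `finance` group are all hypothetical:

```sql
-- Members of `finance` see the full value; everyone else sees stars
-- for all but the last three characters.
CREATE OR REPLACE FUNCTION mask_card(card STRING)
  RETURNS STRING
  RETURN CASE
           WHEN is_account_group_member('finance') THEN card
           ELSE concat(repeat('*', greatest(length(card) - 3, 0)), right(card, 3))
         END;

-- Attach the mask when the table is created.
CREATE OR REPLACE TABLE payments (id INT, card STRING MASK mask_card);

-- Alternatively, for an existing table:
-- ALTER TABLE payments ALTER COLUMN card SET MASK mask_card;
```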

See Row Filters and Column Masks to learn more.

GROUP BY ALL and ORDER BY ALL

Here you are. You have crafted a beautiful reporting query, and you got a "MISSING_AGGREGATION" error because SQL made you list all the grouping columns that you have already listed up front again in the GROUP BY clause.

"Make a list! Check it twice!" is great advice for some. For others, not so much.

To that end you can now tell Databricks to do the work for you and collect all the columns to group by.

And, while we're at it, you can also just order the result set by all returned columns if you like.

Example:
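A small sketch over an inline, made-up `sales` table. The grouping columns are inferred from the non-aggregate expressions in the select list:

```sql
SELECT region, product, sum(amount) AS total
  FROM VALUES ('EU', 'tea', 10), ('EU', 'tea', 5), ('US', 'boba', 7)
       AS sales(region, product, amount)
 GROUP BY ALL   -- equivalent here to GROUP BY region, product
 ORDER BY ALL;  -- orders by region, product, total
```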

See GROUP BY, ORDER BY to learn more.

More SQL built-in functions

There are two certainties in a developer's life: there is never enough boba tea, and there are never enough built-in functions. In addition to various functions to enhance compatibility with other products, such as to_char and to_varchar on datetime types, we focused on greatly extending the set of array manipulation functions as well as libraries of bitmap and hll_sketch functions. The bitmap functions can speed up count-distinct style queries over integers, whereas datasketches enable a wide variety of probabilistic counting capabilities.

Example:
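A sketch that touches a few of these functions, using literal inputs and an inline table chosen only for illustration:

```sql
-- Formatting, array manipulation, and masking.
SELECT to_char(DATE'2024-01-31', 'yyyy/MM/dd') AS formatted,
       array_append(array(1, 2), 3)            AS appended,
       mask('AbCd-1234')                       AS masked;

-- Two ways to count distinct integers: a probabilistic sketch and a bitmap.
SELECT hll_sketch_estimate(hll_sketch_agg(val))                     AS approx_distinct,
       bitmap_count(bitmap_construct_agg(bitmap_bit_position(val))) AS bitmap_distinct
  FROM VALUES (1), (2), (2), (3) AS t(val);
```

The bitmap expression in this form assumes all values fall into a single bitmap bucket, which holds for the small integers used here.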

See Mask function, bitmap_count function, to_varchar function, sketch based approximate distinct counting to learn more.

Databricks ❤️ SQL

At Databricks, we love SQL so much we named our data warehouse after it! And, since the best data warehouse is a lakehouse, SQL and Python both have a first-class experience throughout the entire Databricks Data Intelligence Platform. We are excited to add new features like the ones above to help our customers use SQL for their projects, and we are already back working on more.

If you want to migrate your SQL workloads to a high-performance, serverless data warehouse with a great environment for SQL developers, then Databricks SQL is the solution -- try it for free.
