Databricks SQL Year in Review (Part II): SQL Programming Features
Welcome to the blog series covering product advancements in 2023 for Databricks SQL, the serverless data warehouse from Databricks. This is part 2 where we highlight many of the new SQL programming features delivered in the past year. Naturally, every SQL developer wants to be more productive and tackle ever more complex scenarios with ease -- adding SQL features like these helps developers and our customers get the most out of their Databricks SQL warehouse. This is all part of the Data Intelligence Platform from Databricks, built on the lakehouse architecture that combines the best of data warehousing and data lakes, and why the best data warehouse is a lakehouse.
Without further ado, here are the highlight SQL programming features from 2023:
Lateral Column Alias Support
If coffee is not good for us, why does everyone drink it? Lateral column support is like that. It goes against SQL's principles, but it sure comes in handy because this feature allows you to reference the result of a SQL expression in the select list in any following expression in that same select list. You will look back and wonder how you could have been forced to push a subquery just to share an expression for so long in the name of SQL purity.
Before:
SELECT fullname,
upper(fullname),
lower(fullname)
FROM (SELECT name || firstname
FROM persons) AS T(fullname);
After (with Lateral Column Alias):
SELECT name || firstname AS fullname,
upper(fullname),
lower(fullname)
FROM persons;
See Introducing Lateral Column Alias to learn more.
Error classes and SQLSTATEs
It has been a long time coming, but most error conditions you encounter in Databricks will present you with a human-readable error classification and a SQL standard-based SQLSTATE. These error messages are documented, and for Python and Scala, Databricks also provides methods that allow you to handle error conditions programmatically without building a dependency on error message text.
Example:
from pyspark.errors import PySparkException
try:
spark.sql("SELECT * FROM does_not_exist").show()
except PySparkException as ex:
if (ex.getErrorClass() == "TABLE_OR_VIEW_NOT_FOUND"):
print("I'm so sorry, but I cannot find: " + ex.getMessageParameters()['relationName'])
else:
raise
See Error Handling in Databricks to learn more.
General table-valued function support
2023 saw many improvements in the area of table-valued function support. We kicked things off by generalizing and standardizing the invocation of table functions so that you can now invoke all table functions in the FROM clause of a query, even generator functions such as explode(), and there is no more need for the LATERAL VIEW syntax.
Before:
SELECT *,
explode(ARRAY('x' || Y.y1, 'y' || Y.y1)) AS (z1)
FROM VALUES('a'), ('b') AS X(c1)
LATERAL VIEW explode(ARRAY(1, 2)) Y AS y1;
a 1 x1
a 1 y1
a 2 x2
a 2 y2
b 1 x1
b 1 y1
b 2 x2
b 2 y2
After:
SELECT *
FROM VALUES('a'), ('b') AS X(c1),
explode(ARRAY(1, 2)) AS Y(y1),
LATERAL explode(ARRAY('x' || Y.y1, 'y' || Y.y1)) AS Z(z1);
a 1 x1
a 1 y1
a 2 x2
a 2 y2
b 1 x1
b 1 y1
b 2 x2
b 2 y2
See Table Valued Function Invocation to learn more.
Python UDF and UDTF with polymorphism
SQL UDFs were introduced in Databricks 9 and were a smashing success, but the Python crowd got jealous and they upped the ante! You can now:
- Create Python UDFs and put all that shiny logic into it.
- Pass tables to Python Table UDFs using the SQL Standard TABLE syntax. This is called polymorphism, where the UDF can behave differently depending on the signature of the passed table.
Example:
from pyspark.sql.functions import udtf
from pyspark.sql.types import Row
@udtf(returnType="id: int")
class FilterUDTF:
def eval(self, row: Row):
if row["id"] > 5:
yield row["id"],
spark.udtf.register("filter_udtf", FilterUDTF)
SELECT * FROM filter_udtf(TABLE(SELECT * FROM range(10)));
6
7
8
9
See Introducing Python User Defined Table Functions, Function invocation | Databricks on AWS, and python_udtf.rst: Table Input Argument to learn more.
Unnamed Parameter Markers
In 2022, we introduced parameter markers that allow a SQL query to refer to placeholder variables passed into the SQL using, e.g. the spark.sql() API. The initial support consisted of named parameter markers, meaning your Python, Java, or Scala values are passed to SQL using a map where the keys line up with the name of the parameter marker. This is great and allows you to refer to the same argument repeatedly and out of order.
In 2023, we expanded support for unnamed parameter markers. Now, you can pass an array of values, and they are assigned in order of occurrence.
Example:
spark.sql("SELECT ? * ? * ? AS volume", args = { 3, 4, 5 }).show()
+------+
|volume|
+------+
| 60|
+------+
See Unnamed Parameter Markers to learn more.
SQL Session Variables
Parameter markers are great. We love them. But, it would be even nicer if we could avoid passing results from SQL back via dataframes, just to turn around and pass them back into SQL via parameter markers. That's where SQL Session Variables come in — a session variable is a scalar (as in : not a table) object that is private to your SQL session for both its definition and the values it holds. You can now:
- Declare a session variable with a type and an initial default value.
- Set one or more variables based on the result of a SQL expression or query.
- Reference variables within any query, or DML statement.
This makes for a great way to break up queries and pass state from one query to the next.
Example:
DECLARE var INTEGER DEFAULT 5;
SELECT var;
5
SET VAR var = (SELECT max(c1) * var FROM VALUES(1), (2), (3) AS T(c1));
SELECT var;
15
See Variables to learn more.
IDENTIFIER clause
In the previous two highlights, we showed how to parameterize queries with values passed in from your application or notebook, or even using session variables looked up in a table. But don't you also want to parameterize identifiers, say, table names, function names, and such, without becoming the butt of an XKCD joke on SQL injection? The IDENTIFIER clause allows you to do just that. It magically turns string values in session variables or provided using parameter markers into SQL names to be used as function, table, or column references.
Example:
DECLARE agg = 'max';
DECLARE col = 'c1';
DECLARE tab = 'T';
CREATE TEMPORARY VIEW IDENTIFIER(tab)(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'));
SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);
c
See IDENTIFIER clause to learn more.
INSERT BY NAME
INSERT BY NAME is a nice usability feature that makes you wonder why SQL wasn't born that way to handle wide tables (i.e. tables with many columns). When you deal with many columns, raise your hand if you enjoy looking up the order in which you must provide the columns in the select list feeding that INSERT. Or do you prefer spelling out the lengthy column list of the insert target? Nobody does.
Now, instead of providing that column list and checking and double-checking the select list order, you can tell Databricks to do it for you. Just INSERT BY NAME, and Databricks will line your select list up with your table columns.
Example:
CREATE TABLE T(c1 INT, c2 INT);
INSERT INTO T BY NAME SELECT 1 AS c2, 2 AS c1;
SELECT * FROM T;
2 1
See INSERT INTO to learn more.
Named Parameter invocation
Imagine you wrote a function that takes 30 arguments and most of them have a sensible default. But now you must invoke it with that last argument, which is not the default. Just "skip ahead" and set only that one parameter and don't worry about the order of arguments! Just tell the argument which parameter it's meant for.
Example:
CREATE FUNCTION my_tan(sin FLOAT, cos FLOAT) RETURN sin / cos;
SELECT my_tan(cos => 0.4, sin => 0.1);
0.25
See Named Parameter Invocation to learn more.
TIMESTAMP without timezone
By default, Databricks timestamps are "with local timezone". When you provide a timestamp, Databricks will assume it is in your locale timezone and store it normalized to UTC. When you read it back, this translation is undone and looks fine. If, however, another user reads the timestamp back from another timezone, they will see the normalized timestamp translated to their timezone.
This is a great feature unless you want to just store a timestamp "as is". TIMESTAMP_NTZ is a new type that takes time at face value. You give it 2 pm on Jan 4, 2024, and it will store that.
Example:
SET TIME ZONE 'America/Los_Angeles';
DECLARE local_time TIMESTAMP_LTZ = TIMESTAMP'2023-12-01 12:13:14';
DECLARE any_time TIMESTAMP_NTZ = TIMESTAMP'2023-12-01 12:13:14';
SELECT local_time, any_time;
2023-12-01 12:13:14 2023-12-01 12:13:14
SET TIME ZONE 'America/New_York';
SELECT local_time, any_time;
2023-12-01 15:13:14 2023-12-01 12:13:14
See Introducing TIMESTAMP_NTZ to learn more.
Federated query support
Of course we know that all your data is already in the lakehouse. But if you have friends who still have some data elsewhere, tell them not to fret. They can still access this data from Databricks by registering those foreign tables with Databricks Unity Catalog and running all their SQL queries against it without having to leave Databricks. Simply register a connection to the remote system, link a remote catalog (aka database) and query the content. Of course, you can mix and match local and foreign tables in the same query.
Example:
CREATE CONNECTION postgresql_connection
TYPE POSTGRESQL
OPTIONS (
host 'qf-postgresql-demo.xxxxxx.us-west-2.rds.amazonaws.com',
port '5432',
user 'postgresql_user',
password 'password123');
CREATE FOREIGN CATALOG pg
USING CONNECTION postgresql_connection
OPTIONS (database 'postgresdb');
SELECT * FROM pg.myschema.t;
See Federated Queries to learn more.
Row-level Security and Column Masking
Feeling secretive? Do you need to give some users access to your table, but would prefer not to show all its secrets? Row-level Security and column masking are what you need. You can give other users and groups access to a table, but establish rules tailored to them on what rows they can see. You can even blank out or otherwise obfuscate PII (Personally Identifiable Information) such as substituting stars for all but the last three digits of the credit card number.
To add a row filter, create a UDF that determines whether the user can see a row based on the function arguments. Then add the row filter to your table using ALTER TABLE or do so when you CREATE TABLE.
Example:
CREATE FUNCTION us_filter(region STRING)
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admin'), true, region='US');
CREATE TABLE sales (region STRING, id INT);
ALTER TABLE sales SET ROW FILTER us_filter ON (region);
To add a column mask:
Create a UDF that takes data of a certain type, modifies it based on the user and returns the result. Then attach the mask to the column when you create the table or using ALTER TABLE.
Example:
CREATE FUNCTION ssn_mask(ssn STRING)
RETURN CASE WHEN is_member('HumanResourceDept') THEN ssn ELSE '***-**-****' END;
CREATE TABLE users (
name STRING,
ssn STRING MASK ssn_mask);
SELECT * FROM users;
Jack ***-**-***
See Row Filters and Column Masks to learn more.
GROUP BY ALL and ORDER BY ALL
Here you are. You have crafted a beautiful reporting query, and you got a "MISSING_AGGREGATION" error because SQL made you list all the grouping columns that you have already listed up front again in the GROUP BY clause.
"Make a list! Check it twice!" is great advise for some. For others - not so much.
To that end you can now tell Databricks to do the work for you and collect all the columns to group by.
And, while we're at it, also just order the resultset by all returned columns if you like.
Example:
SELECT name, firstname, level, sum(comp) as totalcomp
FROM VALUES('The Cricket', 'Jimmy' , 'Principal Conscience', 2),
('Geppetto' , 'Signore', 'Woodcarver' , 1)
AS emp(name, firstname, level, empid)
NATURAL JOIN VALUES(1, 200, 'salary'),
(1, 100, 'spot' ),
(2, 250, 'salary'),
(2, 120, 'spot' )
AS pay(empid, comp, reason)
GROUP BY ALL
ORDER BY ALL;
Geppetto Signore Woodcarver 300
The Cricket Jimmy Principal Conscience 370
See GROUP BY, ORDER BY to learn more.
More SQL built-in functions
There are two certainties in a Developer's life: There is never enough boba tea, and there are never enough built-in functions. In addition to various functions to enhance compatibility with other products, such as to_char and to_varchar on datetime types, we focused on greatly extending the set of array manipulation functions as well as libraries of bitmap and hll_sketch functions. The bitmap functions can each speed up count distinct style queries over integers. Whereas datasketches enable a wide variety of probabilistic counting capabilities.
Example:
SELECT mask('AaBb123-&^ % 서울 Ä', lowerchar => 'z', otherchar => 'X');
AzBz123XXXXXXXXXÄ
SELECT sum(num_distinct) AS num_distinct
FROM(SELECT bitmap_bucket_number(val),
bitmap_count(bitmap_construct_agg(bitmap_bit_position(val))) FROM VALUES(1), (2), (1), (-1), (5), (0), (5) AS t(val) GROUP BY ALL) AS distinct_vals_by_bucket(bucket, num_distinct);
5
SELECT hll_sketch_estimate(
hll_sketch_agg(col))
FROM VALUES('abc'), ('def'), ('abc'), ('ghi'), ('abc') AS tab(col);
3
See Mask function, bitmap_count function, to_varchar function, sketch based approximate distinct counting to learn more.
Databricks ❤️ SQL
At Databricks, we love SQL so much we named our data warehouse after it! And, since the best data warehouse is a lakehouse, SQL and Python both have a first-class experience throughout the entire Databricks Intelligent Data Platform. We are excited to add new features like the ones above to help our customers use SQL for their projects, and we are already back working on more.
If you want to migrate your SQL workloads to a high-performance, serverless data warehouse with a great environment for SQL developers, then Databricks SQL is the solution -- try it for free.