In part two of the SQL Scripting announcement blog series, we will examine the administrative task we discussed in part one—how to apply a case-insensitive rule to every STRING column in a table. We’ll walk through that example step by step, explain the features used, and expand it beyond a single table to cover an entire schema.
You can also follow along in this notebook.
Databricks supports a wide range of language-aware, case-insensitive, and accent-insensitive collations. It's easy to use this feature for new tables and columns. But what if you have an existing system using upper() or lower() in predicates everywhere and want to pick up the performance improvements associated with a native case-insensitive collation while simplifying your queries? That will require some programming; now you can do it all in SQL.
Let’s use the following test schema:
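The original setup is not reproduced here, so below is a minimal, hypothetical stand-in; the schema, table, and column names (collation_demo, names, name) are illustrative only:

```sql
-- Hypothetical test schema; all names are illustrative.
CREATE SCHEMA IF NOT EXISTS collation_demo;

CREATE TABLE collation_demo.names (
  id   BIGINT,
  name STRING
);

INSERT INTO collation_demo.names VALUES
  (1, 'alice'), (2, 'Bob'), (3, 'carol'), (4, 'Dave');

-- With the default UTF8_BINARY collation this returns:
-- Bob, Dave, alice, carol
SELECT name FROM collation_demo.names ORDER BY name;
```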
The order is based on the ASCII codepoints, where all uppercase letters precede all lowercase letters. Can you fix this without adding upper() or lower()?
Our first step is to tell the table to change its default collation for newly added columns. You can feed your local variables with parameter markers, which the notebook automatically detects and turns into widgets. You can also use EXECUTE IMMEDIATE to run a dynamically composed ALTER TABLE statement.
Every SQL script consists of a BEGIN .. END (compound) statement. Local variables are defined first within a compound statement, followed by the logic.
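The exact statement from the article is not shown here; the following is a minimal sketch. It assumes UTF8_LCASE as the target collation and notebook widgets behind the :catalog, :schema, and :table parameter markers:

```sql
BEGIN
  -- Local variables come first; the parameter markers are surfaced as notebook widgets.
  DECLARE tgt_catalog   STRING DEFAULT :catalog;
  DECLARE tgt_schema    STRING DEFAULT :schema;
  DECLARE tgt_table     STRING DEFAULT :table;
  DECLARE tgt_collation STRING DEFAULT 'UTF8_LCASE';

  -- STRING columns added to the table in the future will default to the new collation.
  EXECUTE IMMEDIATE
    'ALTER TABLE `' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tgt_table ||
    '` DEFAULT COLLATION ' || tgt_collation;
END;
```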
This is all just a set of linear statements. So far, you could write all of this with SQL session variables and no compound statement. You also have not achieved much: after all, you wanted to change the collation for existing columns. To do this, you need to:

- Find every existing STRING column in the table
- Alter each of those columns to use the new collation
In short, you need to loop over the INFORMATION_SCHEMA.COLUMNS table.
SQL Scripting offers four ways of looping (LOOP, WHILE, REPEAT, and FOR), plus the ITERATE and LEAVE statements to control loop iterations.
Now, apply the FOR loop to our collation script. The query gets the column names of all string columns of the table. The loop body alters each column collation in turn:
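Here is a sketch of that loop, again assuming UTF8_LCASE and the :catalog, :schema, and :table parameter markers; the INFORMATION_SCHEMA lookup resolves against the current catalog:

```sql
BEGIN
  DECLARE tgt_catalog   STRING DEFAULT :catalog;
  DECLARE tgt_schema    STRING DEFAULT :schema;
  DECLARE tgt_table     STRING DEFAULT :table;
  DECLARE tgt_collation STRING DEFAULT 'UTF8_LCASE';

  -- One iteration per STRING column of the target table.
  FOR str_col AS
    SELECT column_name
      FROM INFORMATION_SCHEMA.COLUMNS
     WHERE table_catalog = tgt_catalog
       AND table_schema  = tgt_schema
       AND table_name    = tgt_table
       AND data_type     = 'STRING'
  DO
    EXECUTE IMMEDIATE
      'ALTER TABLE `' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tgt_table ||
      '` ALTER COLUMN `' || str_col.column_name || '` TYPE STRING COLLATE ' || tgt_collation;
  END FOR;
END;
```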
Let's verify that the table has been properly updated:
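Against the hypothetical test table from above, the check can be as simple as re-running the ordering query:

```sql
-- With a case-insensitive collation the sort now interleaves the cases:
-- alice, Bob, carol, Dave
SELECT name FROM collation_demo.names ORDER BY name;
```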
So far, so good. Our code is functionally complete, but you should tell Delta to analyze the columns you modified so queries can benefit from file skipping. You don't want to do this per column; instead, gather all the altered columns together and do the work only if there was, in fact, a string column whose collation changed. Decisions, decisions ….
SQL Scripting offers three ways to perform conditional execution of SQL statements: the IF ... THEN ... END IF statement and the simple and searched forms of the CASE statement.
For our collation script, a simple IF THEN END IF will suffice. You also need to collect the set of columns to apply ANALYZE to, plus some higher-order function magic to produce the column list:
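Below is one way to wire this together, a sketch rather than the article's exact script: the altered column names are collected into an array variable, and transform plus array_join (the "higher-order function magic") turn it into the comma-separated list that ANALYZE needs.

```sql
BEGIN
  DECLARE tgt_catalog     STRING DEFAULT :catalog;
  DECLARE tgt_schema      STRING DEFAULT :schema;
  DECLARE tgt_table       STRING DEFAULT :table;
  DECLARE tgt_collation   STRING DEFAULT 'UTF8_LCASE';
  DECLARE tgt             STRING;
  DECLARE altered_columns ARRAY<STRING> DEFAULT array();

  -- Fully qualified, back-quoted table name for the dynamic statements.
  SET tgt = '`' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tgt_table || '`';

  -- Newly added columns get the new collation by default.
  EXECUTE IMMEDIATE 'ALTER TABLE ' || tgt || ' DEFAULT COLLATION ' || tgt_collation;

  -- Existing STRING columns are altered one by one, and remembered.
  FOR str_col AS
    SELECT column_name
      FROM INFORMATION_SCHEMA.COLUMNS
     WHERE table_catalog = tgt_catalog
       AND table_schema  = tgt_schema
       AND table_name    = tgt_table
       AND data_type     = 'STRING'
  DO
    EXECUTE IMMEDIATE
      'ALTER TABLE ' || tgt || ' ALTER COLUMN `' || str_col.column_name ||
      '` TYPE STRING COLLATE ' || tgt_collation;
    SET altered_columns = array_append(altered_columns, str_col.column_name);
  END FOR;

  -- Re-analyze only if at least one column actually changed.
  IF array_size(altered_columns) > 0 THEN
    EXECUTE IMMEDIATE
      'ANALYZE TABLE ' || tgt || ' COMPUTE STATISTICS FOR COLUMNS ' ||
      array_join(transform(altered_columns, c -> '`' || c || '`'), ', ');
  END IF;
END;
```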
What you have written so far works for individual tables. What if you want to operate on all tables in a schema? SQL Scripting is fully composable. You can nest compound statements, conditional statements, and loops within other SQL scripting statements.
So what you will do here is twofold, as sketched below:

- Add an outer FOR loop that reads the table names of the schema from INFORMATION_SCHEMA.TABLES
- Nest the per-table logic you have already written inside that loop
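Here is a compact sketch of the nested version, with the ANALYZE step omitted for brevity; :catalog and :schema are now the only parameters, and UTF8_LCASE is still the assumed target collation:

```sql
BEGIN
  DECLARE tgt_catalog   STRING DEFAULT :catalog;
  DECLARE tgt_schema    STRING DEFAULT :schema;
  DECLARE tgt_collation STRING DEFAULT 'UTF8_LCASE';

  -- Outer loop: every table in the schema.
  FOR tab AS
    SELECT table_name
      FROM INFORMATION_SCHEMA.TABLES
     WHERE table_catalog = tgt_catalog
       AND table_schema  = tgt_schema
  DO
    EXECUTE IMMEDIATE
      'ALTER TABLE `' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tab.table_name ||
      '` DEFAULT COLLATION ' || tgt_collation;

    -- Inner loop: every STRING column of that table.
    FOR str_col AS
      SELECT column_name
        FROM INFORMATION_SCHEMA.COLUMNS
       WHERE table_catalog = tgt_catalog
         AND table_schema  = tgt_schema
         AND table_name    = tab.table_name
         AND data_type     = 'STRING'
    DO
      EXECUTE IMMEDIATE
        'ALTER TABLE `' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tab.table_name ||
        '` ALTER COLUMN `' || str_col.column_name || '` TYPE STRING COLLATE ' || tgt_collation;
    END FOR;
  END FOR;
END;
```

Run against a schema that also contains views, or other objects the ALTER TABLE statement cannot modify, the script stops at the first such object, because INFORMATION_SCHEMA.TABLES lists them too.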
This error makes sense. You have multiple ways to proceed:

- Tighten the driving query so it only returns objects the ALTER TABLE statement can handle
- Intercept the error, log the offending object, and carry on with the next one
A key capability of SQL Scripting is the ability to intercept and handle exceptions. Condition handlers are defined in the declaration section of a compound statement, and they apply to any statement within that compound, including nested statements. You can handle specific error conditions by name, specific SQLSTATEs (each of which may cover several error conditions), or all error conditions. Within the body of the condition handler, you can use the GET DIAGNOSTICS statement to retrieve information about the exception being handled and execute any SQL scripting you deem appropriate, such as recording the error in a log or running alternative logic in place of the statement that failed. You can then SIGNAL a new error condition, RESIGNAL the original condition, or simply exit the compound statement where the handler is defined and continue with the following statement.
In our script, you want to skip any object for which the ALTER TABLE DEFAULT COLLATION statement does not apply and log that object's name.
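One way to do that, sketched under the same assumptions as before: wrap each table's work in a nested compound that declares an EXIT handler for SQLEXCEPTION, so a failure abandons only that table's iteration, records the name, and lets the outer loop continue.

```sql
BEGIN
  DECLARE tgt_catalog   STRING DEFAULT :catalog;
  DECLARE tgt_schema    STRING DEFAULT :schema;
  DECLARE tgt_collation STRING DEFAULT 'UTF8_LCASE';
  DECLARE skipped       ARRAY<STRING> DEFAULT array();

  FOR tab AS
    SELECT table_name
      FROM INFORMATION_SCHEMA.TABLES
     WHERE table_catalog = tgt_catalog
       AND table_schema  = tgt_schema
  DO
    -- A nested compound, so the handler exits only this table's work.
    BEGIN
      DECLARE EXIT HANDLER FOR SQLEXCEPTION
      BEGIN
        -- Remember the object that could not be altered and move on.
        SET skipped = array_append(skipped, tab.table_name);
      END;

      EXECUTE IMMEDIATE
        'ALTER TABLE `' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tab.table_name ||
        '` DEFAULT COLLATION ' || tgt_collation;

      FOR str_col AS
        SELECT column_name
          FROM INFORMATION_SCHEMA.COLUMNS
         WHERE table_catalog = tgt_catalog
           AND table_schema  = tgt_schema
           AND table_name    = tab.table_name
           AND data_type     = 'STRING'
      DO
        EXECUTE IMMEDIATE
          'ALTER TABLE `' || tgt_catalog || '`.`' || tgt_schema || '`.`' || tab.table_name ||
          '` ALTER COLUMN `' || str_col.column_name || '` TYPE STRING COLLATE ' || tgt_collation;
      END FOR;
    END;
  END FOR;

  -- Inspect which objects had to be skipped.
  SELECT skipped AS skipped_objects;
END;
```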
Above, you have developed an administrative script purely in SQL. You can also write ELT scripts and turn them into Jobs. SQL Scripting is a truly powerful tool you should exploit.
Whether you are an existing Databricks user or migrating from another product, SQL Scripting is a capability you should use. SQL Scripting follows the ANSI standard and is fully compatible with OSS Apache Spark™. SQL Scripting is described in detail in SQL Scripting | Databricks Documentation.
You can also use this notebook to see for yourself.