Skip to main content

Introducing SQL Scripting in Databricks, Part 2

A deep dive into the SQL Scripting constructs and how to use them

SQL Scripting Deep Dive OG

Summary

  • Introduce and explain SQL Scripting features
  • Conditionally execute and control flow within SQL Scripts
  • Intercept and handle specific error conditions

In part two of the SQL Scripting announcement blog series, we will examine the administrative task we discussed in part one—how to apply a case-insensitive rule to every STRING column in a table. We’ll walk through that example step by step, explain the features used, and expand it beyond a single table to cover an entire schema.

You can also follow along in this notebook

Changing the collation of all text fields in all tables in a schema

Databricks supports a wide range of language-aware, case-insensitive, and accent-insensitive collations. It's easy to use this feature for new tables and columns. But what if you have an existing system using upper() or lower() in predicates everywhere and want to pick up the performance improvements associated with a native case-insensitive collation while simplifying your queries? That will require some programming; now you can do it all in SQL. 

Let’s use the following test schema: 

The order is based on the ASCII codepoints, where all uppercase letters precede all lowercase letters. Can you fix this without adding upper() or lower()?

Dynamic SQL statements and setting variables

Our first step is to tell the table to change its default collation for newly added columns. You can feed your local variables with parameter markers, which the notebook will automatically detect and add widgets. You can also use EXECUTE IMMEDIATE to run a dynamically composed ALTER TABLE statement.

Every SQL script consists of a BEGIN .. END (compound) statement. Local variables are defined first within a compound statement, followed by the logic.

This is all just a set of linear statements. So far, you could write all this with SQL Session variables without the compound statement. You also have not achieved much. After all, you wanted to change the collation for existing columns. To do this, you need to:

  • Discover all existing string columns in the table
  • Change the collation for each column

In short, you need to loop over the INFORMATION_SCHEMA.COLUMNS table.

Loops

SQL Scripting offers four ways of looping and ways to control loop iterations.

  1. LOOP … END LOOP;
    This is a "forever" loop.
    This loop will continue until an exception or an explicit ITERATE or LEAVE command breaks out of the loop.
    We will discuss exception handling later and point to the ITERATE and LEAVE documentation explaining how to control loops.
  2. WHILE predicate DO … END WHILE;
    This loop will be entered and re-entered as long as the predicate expression evaluates to true or the loop is broken out by an exception, ITERATE or LEAVE.
  3. REPEAT … UNTIL predicate END REPEAT;
    Unlike WHILE, this loop is entered at least once and re-executes until the predicate expression evaluates to false or the loop is broken by an exception, LEAVE, or ITERATE command.
  4. FOR query DO …. END FOR;
    This loop executes once per row the query returns unless it is left early with an exception, LEAVE, or ITERATE statement.

Now, apply the FOR loop to our collation script. The query gets the column names of all string columns of the table. The loop body alters each column collation in turn:


Let's verify that the table has been properly updated:

So far, so good. Our code is functionally complete, but you should tell Delta to analyze the columns you modified to benefit from file skipping. You don't want to do this per column. But gather all of them together and do the work only if there was, in fact, a string column for which the collation was altered. Decisions, decisions …. 

Conditional logic

SQL Scripting offers three ways to perform conditional execution of SQL statements.

  1. If-then-else logic. The syntax for this is straightforward:
    IF predicate THEN … ELSEIF predicate THEN … ELSE …. END IF;
    Naturally, you can have any number of optional ELSEIF blocks, and the final ELSE is also optional.
  2. A simple CASE statement
    This statement is the SQL Scripting version of the simple case expression.
    CASE expression WHEN option THEN … ELSE … END CASE;
    A single execution of an expression is compared to several options, and the first match decides which set of SQL statements should be executed. If none match, the optional ELSE block will be executed.
  3. A searched CASE statement
    This statement is the SQL Scripting version of the searched case expression.
    CASE WHEN predicate THEN …. ELSE … END CASE;
    The THEN block is executed for the first of any predicates that evaluate to true. If none match, the optional  ELSE block is executed.

For our collation script, a simple IF THEN END IF will suffice. You also need to collect the set of columns to apply ANALYZE to and some higher-order function magic to produce the column list:

Nesting

What you have written so far works for individual tables. What if you want to operate on all tables in a schema? SQL Scripting is fully composable. You can nest compound statements, conditional statements, and loops within other SQL scripting statements.

So what you will do here is twofold:

  1. Add an outer FOR loop to find all tables within a schema using INFORMATION_SCHEMA.TABLES. As part of this, you need to replace the references to the table name variable with references to the results of the FOR loop query. 
  2. Add a nested compound to move the column list variable down into the outer FOR loop. You cannot declare a variable directly in the FOR loop body; it does not add a new scope. This is mainly a decision related to coding style, but you will have a more serious reason for a new scope..

This error makes sense. You have multiple ways to proceed:

  1. Filter out unsupported table types, such as views, in the information schema query. The problem is that there are numerous table types, and new ones are occasionally added.
  2. Handle views. That's a great idea. Let's call that your homework assignment.
  3. Tolerating the error condition

Exception handling

A key capability of SQL Scripting is the ability to intercept and handle exceptions. Condition handlers are defined in the declaration section of a compound statement, and they apply to any statement within that compound, including nested statements. You can handle specific error conditions by name, specific SQLSTATEs handling several error conditions, or all error conditions. Within the body of the condition handler, you can use the GET DIAGNOSTICS statement to retrieve information about the exception being handled and execute any SQL scripting you deem appropriate, such as recording the error in a log or running an alternative logic to the one that failed. You can then SIGNAL a new error condition, RESIGNAL the original condition, or simply exit the compound statement where the handler is defined and continue with the following statement.

In our script, you want to skip any statement for which the ALTER TABLE DEFAULT COLLATION statement did not apply and log the object's name.

Above, you have developed an administrative script purely in SQL. You can also write ELT scripts and turn them into Jobs. SQL Scripting is a truly powerful tool you should exploit.  

What to do next

Whether you are an existing Databricks user or migrating from another product, SQL Scripting is a capability you should use. SQL Scripting follows the ANSI standard and is fully compatible with OSS Apache Spark™. SQL Scripting is described in detail in SQL Scripting | Databricks Documentation

You can also use this notebook to see for yourself. 

Never miss a Databricks post

Subscribe to the categories you care about and get the latest posts delivered to your inbox