
Introducing SQL Scripting Support in Databricks, Part 1

An overview of procedural logic in ANSI SQL


Summary

  • Automate database schema administration with SQL scripts
  • Combine AI and SQL scripting for cleaning messy text data
  • Split and manage incoming transactions across multiple tables

Today, Databricks announces support for the ANSI SQL/PSM scripting language!


SQL Scripting is now available in Databricks, bringing procedural logic such as loops and control flow directly into the SQL you already know. Scripting in Databricks is based on open standards and is fully compatible with Apache Spark™.

For SQL-first users, this makes it easier to work directly on the Lakehouse while taking advantage of Databricks’ scalability and AI capabilities.

If you already use Databricks, you’ll find SQL scripting especially useful for building administrative logic and ELT tasks. Key features include (a minimal syntax sketch follows the list):

  • Scoped local variables
  • Native exception handling based on symbolic error conditions
  • IF-THEN-ELSE and CASE support
  • Multiple loop constructs, including FOR loops over queries
  • Loop control with ITERATE and LEAVE
  • Dynamic SQL execution through EXECUTE IMMEDIATE
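
To give a flavor of the syntax, here is a minimal sketch of a compound statement that exercises a few of these features. The table t and column c are placeholders, not part of the examples that follow; treat this as an illustration rather than a canonical pattern.

  BEGIN
    -- Scoped local variable
    DECLARE row_count INT DEFAULT 0;

    -- FOR loop over a query, with early exit via LEAVE
    lbl: FOR r AS SELECT c FROM t ORDER BY c DO
      SET VAR row_count = row_count + 1;
      IF row_count >= 10 THEN
        LEAVE lbl;
      END IF;
    END FOR lbl;

    -- Dynamic SQL through EXECUTE IMMEDIATE
    EXECUTE IMMEDIATE 'SELECT ' || CAST(row_count AS STRING) || ' AS rows_seen';
  END;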

Enough with the feature list — let's walk through some real examples. You can use this notebook to follow along.

Data management

Administrative tasks and data cleanup are a constant in enterprise data management — necessary, routine, and impossible to avoid. You’ll need to clean up historical records, standardize mixed formats, apply new naming conventions, rename columns, widen data types, and add column masks.  The more you can automate these tasks, the more reliable and manageable your systems will be over time. One common example: enforcing case-insensitive behavior for all STRING columns in a table.

Let’s walk through how SQL scripting can make this kind of schema administration repeatable and straightforward.

Schema administration: make all STRING columns in a table case-insensitive

In this example, we want to apply a new policy for string sorting and comparison to every applicable column in a table called employees. We will use a standard collation, UTF8_LCASE, to ensure that sorting and comparing the values in this table is always case-insensitive. Applying this standard lets users benefit from the performance advantages of collations and simplifies their code, since they no longer have to wrap values in LOWER() in their queries.

We will use widgets to specify which table and collation type to alter. Using the information schema, we will then find all existing columns of type STRING in that table and alter their collation, collecting the column names into an array as we go. Finally, we will compute fresh statistics for the altered columns, all in one script.
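
Here is a sketch of such a script. For brevity, the widget values are hard-coded as local variables (employees and UTF8_LCASE); in a notebook you would read them from widgets instead. It also assumes the information schema reports string columns with a data_type of STRING.

  BEGIN
    -- In the notebook these would come from widgets; hard-coded here for brevity.
    DECLARE tbl STRING DEFAULT 'employees';
    DECLARE new_collation STRING DEFAULT 'UTF8_LCASE';
    DECLARE altered_cols ARRAY<STRING> DEFAULT ARRAY();

    -- Find every STRING column of the target table and change its collation.
    FOR col AS
      SELECT column_name
      FROM information_schema.columns
      WHERE table_name = tbl AND data_type = 'STRING'
    DO
      EXECUTE IMMEDIATE
        'ALTER TABLE ' || tbl || ' ALTER COLUMN ' || col.column_name ||
        ' TYPE STRING COLLATE ' || new_collation;
      SET VAR altered_cols = array_append(altered_cols, col.column_name);
    END FOR;

    -- Collect fresh statistics for the altered columns.
    IF array_size(altered_cols) > 0 THEN
      EXECUTE IMMEDIATE
        'ANALYZE TABLE ' || tbl ||
        ' COMPUTE STATISTICS FOR COLUMNS ' || array_join(altered_cols, ', ');
    END IF;
  END;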

A natural next step is to apply the script to all tables in a schema and to refresh views so they pick up the collation change.

Data cleansing: fix grammar in free-form text fields

Is there any issue more common in the world of data than ‘dirty data’? Data from different systems, devices, and humans will inevitably contain differences or mistakes that must be corrected. If the data is not cleaned up, you may get wrong results and miss an important insight. And if you feed garbage into an LLM, you can expect a garbage response.

Let’s look at an example that involves the bane of every publication, including this blog: typos. We have a table with free-text entries in a column called description. The issues in the text, which include spelling and grammar mistakes, would be apparent to anyone who knows English. Leaving the data in this state will undoubtedly cause problems later when analyzing or inspecting the text. Let’s fix it with SQL Scripting! First, we find the tables holding this column in the information schema. Then we correct spelling and grammar errors using ai_fix_grammar(). Because this function is non-deterministic, we use MERGE to apply the corrections.
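
A sketch of the approach is below. It assumes each affected table has an id key column to join on; adjust the join key to match your schema.

  BEGIN
    -- Find every table that has a free-text 'description' column.
    FOR t AS
      SELECT table_name
      FROM information_schema.columns
      WHERE column_name = 'description' AND data_type = 'STRING'
    DO
      -- ai_fix_grammar() is non-deterministic, so the corrected text is
      -- computed once in the MERGE source rather than in an UPDATE expression.
      EXECUTE IMMEDIATE
        'MERGE INTO ' || t.table_name || ' AS dst ' ||
        'USING (SELECT id, ai_fix_grammar(description) AS fixed ' ||
        '       FROM ' || t.table_name || ') AS src ' ||
        'ON dst.id = src.id ' ||
        'WHEN MATCHED AND dst.description <> src.fixed ' ||
        'THEN UPDATE SET dst.description = src.fixed';
    END FOR;
  END;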

An interesting improvement could be to let ai_classify() deduce whether a column contains free-form text from the column name or sample data. SQL Scripting makes administrative tasks and cleaning up messy data efficient and straightforward.

ETL

Customers use SQL for ETL today. Why? Because SQL supports a robust set of data transformation capabilities, including joins, aggregations, and filtering, with intuitive syntax that makes pipeline code easy for any data engineer to extend, update, and maintain. Now, with SQL Scripting, customers can simplify previously complex approaches or handle more complex logic in pure SQL.

Updating multiple tables

Anyone who sells physical products will have a process for monitoring sales and tracking shipments. A typical data management pattern is to model multiple tables to track transactions, shipments, deliveries, and returns. Transaction monitoring is business critical, and like any critical process, it requires handling unexpected values. With SQL Scripting, it’s easy to use a conditional CASE statement to route transactions into the appropriate table and to catch an exception if an error is encountered.

In this example, we consider a raw transactions table for which rows must be routed into a known set of target tables based on the event type. If the script encounters an unknown event, a user-defined exception is raised. A session variable tracks how far the script got before it finished or encountered an exception.
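
The sketch below uses made-up table and column names: a raw_transactions source with txn_id, event_type, and payload columns, and shipments, deliveries, and returns targets with matching schemas.

  -- Session variable tracking how far the script got.
  DECLARE OR REPLACE VARIABLE rows_routed INT DEFAULT 0;

  BEGIN
    -- User-defined condition for unrecognized events, plus a handler for it.
    DECLARE unknown_event CONDITION FOR SQLSTATE '45000';
    DECLARE EXIT HANDLER FOR unknown_event
      BEGIN
        -- Report how far we got before stopping.
        VALUES ('Stopped after routing ' || CAST(rows_routed AS STRING) || ' rows');
      END;

    FOR txn AS SELECT * FROM raw_transactions DO
      CASE txn.event_type
        WHEN 'shipment' THEN
          INSERT INTO shipments VALUES (txn.txn_id, txn.payload);
        WHEN 'delivery' THEN
          INSERT INTO deliveries VALUES (txn.txn_id, txn.payload);
        WHEN 'return' THEN
          INSERT INTO returns VALUES (txn.txn_id, txn.payload);
        ELSE
          SIGNAL unknown_event SET MESSAGE_TEXT = 'Unexpected event type';
      END CASE;
      SET VAR rows_routed = rows_routed + 1;
    END FOR;
  END;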

This example script could be extended with an outer loop that keeps polling for more data. With SQL Scripting, you have the power and flexibility to manage and update data across your data estate, tackle any data management task, and efficiently control the flow of data processing.

Stay tuned to the Databricks blog and the SQL sessions at the upcoming Data + AI Summit, as we prepare to launch support for Temp Tables, SQL Stored Procedures, and more! 

What to do next

Whether you are an existing Databricks user doing routine maintenance or orchestrating a large-scale migration, SQL Scripting is a capability you should exploit. SQL Scripting is described in detail in SQL Scripting | Databricks Documentation.

You can try these examples directly in this SQL Scripting Notebook. For more details, stay tuned for Part 2 of this series, which dives into SQL Scripting constructs and how to use them. 

 
