Power to the SQL People: Introducing Python UDFs in Databricks SQL
July 22, 2022 in Platform Blog
We were thrilled to announce the preview for Python User-Defined Functions (UDFs) in Databricks SQL (DBSQL) at last month's Data and AI Summit. This blog post gives an overview of the new capability and walks you through an example showcasing its features and use-cases.
Python UDFs allow users to write Python code and invoke it through a SQL function in an easy secure and fully governed way, bringing the power of Python to Databricks SQL.
Introducing Python UDFs to Databricks SQL
In Databricks and Apache Spark™ in general, UDFs are means to extend Spark: as a user, you can define your business logic as reusable functions that extend the vocabulary of Spark, e.g. for transforming or masking data and reuse it across their applications. With Python UDFs for Databricks SQL, we will expand our current support for SQL UDFs.
Let's look at a Python UDF example. Below the function redacts email and phone information from a JSON string, and returns the redacted string, e.g., to prevent unauthorized access to sensitive data:
CREATE FUNCTION redact(a STRING) RETURNS STRING LANGUAGE PYTHON AS $$ import json keys = ["email", "phone"] obj = json.loads(a) for k in obj: if k in keys: obj[k] = "REDACTED" return json.dumps(obj) $$;
To define the Python UDF, all you have to do is a
CREATE FUNCTION SQL statement. This statement defines a function name, input parameters and types, specifies the language as
PYTHON, and provides the function body between $$.
The function body of a Python UDF in Databricks SQL is equivalent to a regular Python function, with the UDF itself returning the computation's final value. Dependencies from the Python standard library and Databricks Runtime 10.4, such as the json package in the above example, can be imported and used in your code. You can also define nested functions inside your UDF to encapsulate code to build or reuse complex logic.
From that point on, all users with appropriate permissions can call this function as you do for any other built-in function, e.g., in the
WHERE part of a query.
Features of Python UDFs in Databricks SQL
Now that we described how easy it is to define Python UDFs in Databricks SQL, let's look at how it can be managed and used within Databricks SQL and across the lakehouse.
Manage and govern Python UDFs across all workspaces
Python UDFs are defined and managed as part of Unity Catalog, providing strong and fine-grained management and governance means:
- Python UDFs permissions can be controlled on a group (recommended) or user level across all workspaces using GRANT and REVOKE statements.
- To create a Python UDF, users need USAGE and CREATE permission on the schema and USAGE permission on the catalog. To run a UDF, users need EXECUTE on the UDF. For instance, to grant the finance-analysts group permissions to use the above
redactPython UDF in their SQL expressions, issue the following statement:
GRANT EXECUTE ON silver.finance_db.redact TO finance-analysts
- Members of the finance-analyst group can use the redact UDF in their SQL expressions, as shown below, where the contact_info column will contain no phone or email addresses.
SELECT account_nr, redact(contact_info) FROM silver.finance_db.customer_data
Enterprise-grade security and multi-tenancy
With the great power of Python comes great responsibility. To ensure Databricks SQL and Python UDFs meet the strict requirements for enterprise security and scale, we took extra precautions to ensure it meets your needs.
To this end, compute and data are fully protected from the execution of Python code within your Databricks SQL warehouse. Python code is executed in a secure environment preventing:
- Access to data not provided as parameters to the UDF, including file system or memory outside of the Python execution environment
- Communication with external services, including the network, disk or inter-process communication
This execution model is built from the ground up to support the concurrent execution of queries from multiple users leveraging additional computation in Python without sacrificing any security requirements.
Do more with less using Python UDFs
Serving as an extensibility mechanism there are plenty of use-cases for implementing custom business logic with Python UDFs.
Python is a great fit for writing complex parsing and data transformation logic which requires customization beyond what's available in SQL. This can be the case if you are looking at very specific or proprietary ways to protect data. Using Python UDFs, you can implement custom tokenization, data masking, data redaction, or encryption mechanisms.
Python UDFs are also great if you want to extend your data with advanced computations or even ML model predictions. Examples include advanced geo-spatial functionality not available out-of-the-box and numerical or statistical computations, e.g., by building upon NumPy or pandas.
Re-use existing code and powerful libraries
If you have already written Python functions across your data and analytics stack you can now easily bring this code into Databricks SQL with Python UDFs. This allows you to double-dip on your investments and onboard new workloads faster in Databricks SQL.
Similarly, having access to all packages of Python's standard library and the Databricks Runtime allows you to build your functionality on top of those libraries, supporting high quality of your code while at the same time making more efficient use of your time.
Get started with Python UDFs on Databricks SQL and the Lakehouse
If you already are a Databricks customer, sign up for the private preview today. We'll provide you with all the necessary information and documentation to get you started as part of the private preview.
If you want to learn more about Unity Catalog, check out this website. If you are not a Databricks customer, sign up for a free trial and start exploring the endless possibilities of Python UDFs, Databricks SQL and the Databricks Lakehouse Platform.
Join the conversation and share your ideas and use-cases for Python UDFs in the Databricks Community where data-obsessed peers are chatting about Data + AI Summit 2022 announcements and updates. Learn. Network. Celebrate.