
How To Migrate Your Oracle PL/SQL Code to Databricks Lakehouse Platform

Laurent Léturgez
Leo Mao
Soham Bhatt

Oracle is a well-known technology for hosting Enterprise Data Warehouse solutions. However, many customers like Optum and the U.S. Citizenship and Immigration Services chose to migrate to the Databricks Lakehouse Platform to leverage the power of data, analytics, and AI on a single platform at scale and to deliver business value faster. For example, Optum's on-premises Oracle-based data warehouse system struggled to quickly process and analyze the data. With Azure Databricks, they have improved data pipeline performance by 2x, enabling faster delivery of results to hospitals and saving them millions of dollars in potentially lost revenue.

Migrating from Oracle to Databricks involves multiple steps, spanning data, code, and report modernization, and converting the proprietary code is one of the most critical.

In this blog post, we'll focus on converting proprietary PL/SQL code to open, standard Python code, taking advantage of PySpark for ETL workloads and Databricks SQL for data analytics workloads.

Challenge of converting PL/SQL to PySpark

As the need for data pipelines and ETL grew, every database vendor needed a programming-language wrapper to pass parameters and handle datasets programmatically. Instead of adopting open source standards like Python, most vendors created their own proprietary languages. PL/SQL is Oracle's set of procedural extensions to SQL: it augments the SQL language with procedural elements (parameters, variables, datasets as cursors, conditional statements, loops, exception blocks, etc.). These proprietary extensions were developed over the years and have their own specificities, which can make them tricky to convert to a standard, widely used, full-blown open source programming language like Python.

An example of this is the supplied PL/SQL packages (DBMS_ or UTL_ packages, etc.) and the user-defined types that can be used as column types (objects or collections defined as a column type), which make migration quite complex. These Oracle-specific features, and numerous others, need to be considered during code conversion to Apache Spark™.

Many organizations have created ETL data processing jobs by writing PL/SQL procedures and functions wrapped into packages that run against an Oracle database. You can convert these PL/SQL jobs to open source Python and Spark and run them in Databricks notebooks or Delta Live Tables, without any of PL/SQL's complexity, on modern Databricks on-demand serverless compute.

Migrate PL/SQL code to PySpark for your ETL pipelines

An ETL process is mostly used for:

  • Ingesting data from multiple sources
  • Validating, cleaning and transforming data
  • Loading data into various data layers (bronze, silver, gold / Operational Data Store, Data Warehouse, Data Marts … depending on the data architecture)

In Oracle databases, PL/SQL is usually used to validate and transform the data in place. Depending on the Oracle database architecture, data moves between various containers, which can be user schemas and/or pluggable databases.
Here's an example of a typical Oracle database implementation that supports a Data Warehouse using PL/SQL for ETL.

Typical Oracle database implementation that supports a Data Warehouse using PL/SQL for ETL.

Moving from Oracle and PL/SQL to the Databricks Lakehouse leverages several key capabilities:

  • PySpark provides a standard Python API with the ability to process various data sources at scale and load them directly into the ODS without having to materialize a table in the staging area. This can be done with a Python notebook scheduled on a regular basis with Databricks Workflows.
  • Delta Live Tables delivers the capability to implement the whole ETL pipeline in either a Python or SQL notebook, with data quality controls (Manage data quality with Delta Live Tables), and can process data either in batch or in real time (Process streaming data with Delta Live Tables); a minimal sketch is shown after the pipeline figure below.

Delta Live Tables pipeline example
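To illustrate, here is a minimal Delta Live Tables sketch in Python, assuming a JSON source landed in cloud storage. The table names, path, columns, and expectation (raw_orders, clean_orders, /mnt/raw/orders, order_ts, valid_order_id) are hypothetical placeholders, not part of an actual pipeline.

import dlt
from pyspark.sql import functions as F

# Bronze: ingest raw files incrementally with Auto Loader (path is a placeholder)
@dlt.table(comment="Raw orders ingested from cloud storage")
def raw_orders():
  return (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load("/mnt/raw/orders"))

# Silver: apply a data quality expectation and a simple transformation
@dlt.table(comment="Validated and cleaned orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def clean_orders():
  return (dlt.read_stream("raw_orders")
          .withColumn("order_date", F.to_date("order_ts")))

In this sketch, the expectation takes the place of the row-by-row validation a PL/SQL loop would typically perform, and the same pipeline can serve batch or streaming sources.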

Regardless of the feature used, PL/SQL logic is migrated into Python code or SQL. For example, PL/SQL functions can be translated into PySpark and called directly, or wrapped in a Python user-defined function (see the Delta Live Tables cookbook for how to use Python UDFs in Delta Live Tables).
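For instance, a simple PL/SQL-style scalar function can be rewritten as a plain Python function and registered as a Spark UDF so that it can be called from both the DataFrame API and SQL. This is only a sketch; the function name and logic (normalize_name) are illustrative and not taken from any original PL/SQL code, and the products table is the sample table used in the next section.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Illustrative scalar function, analogous to a small PL/SQL function
def normalize_name(name: str) -> str:
  return name.strip().upper() if name is not None else None

# Register it for use from the DataFrame API and from SQL
normalize_name_udf = udf(normalize_name, StringType())
spark.udf.register("normalize_name", normalize_name, StringType())

# DataFrame API call
df = spark.table("products").withColumn("prod_name_norm", normalize_name_udf("prod_name"))

# SQL call
spark.sql("select normalize_name(prod_name) as prod_name_norm from products")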

Migrate PL/SQL code to Databricks SQL or Python UDF

Databricks SQL is used to run many SQL workloads, one of which is running analytics queries against data hosted on the lakehouse. Those analytics queries may require functions to be executed on the underlying tables (data redaction, etc.).
In Oracle, these functions are usually implemented as PL/SQL functions or package functions, and they must be migrated during the process.

Python UDFs on Databricks SQL extend traditional SQL workloads with the capabilities of the Python language.

PL/SQL Code migration samples

This section is dedicated to examples of code migration from Oracle PL/SQL to Databricks. Based on best practices and our recommendations, each example depends on the implementation choice (PySpark in an ETL process, or a Python UDF in a Databricks SQL analytics workload).

Dynamic cursors using DBMS_SQL supplied Oracle package

In Oracle, a cursor is a pointer to a private SQL area that stores information about processing a specific SELECT or DML statement. A cursor that is constructed and managed by the Oracle kernel through PL/SQL is an implicit cursor. A cursor that you construct and manage is an explicit cursor.

In Oracle, cursors can be parameterized by using dynamic strings for the SQL statement, but this technique can lead to SQL injection issues, which is why it's better to use the DBMS_SQL supplied PL/SQL package or EXECUTE IMMEDIATE statements with bind variables to build dynamic statements. A cursor can easily be converted to a Spark DataFrame.

The following example shows how to transform a dynamic SQL statement built with an Oracle-supplied PL/SQL package into PySpark.

Here is the PL/SQL code in Oracle.

create or replace function get_prod_name(in_pid in number) return VARCHAR2
as
    sql_stmt VARCHAR2(256);
    l_i_cursor_id   INTEGER;
    l_n_rowcount    NUMBER;
    l_vc_prod_name VARCHAR2(30);
BEGIN
    sql_stmt := 'select prod_name from products where prod_id=:pid FETCH NEXT 1 ROWS ONLY'; 
    l_i_cursor_id:=dbms_sql.open_cursor;
    
    dbms_sql.parse(l_i_cursor_id,sql_stmt, dbms_sql.native);

    dbms_sql.bind_variable(l_i_cursor_id,'pid', in_pid);
    
    dbms_sql.define_column(l_i_cursor_id,1, l_vc_prod_name,30);
    
    l_n_rowcount:= dbms_sql.execute_and_fetch(l_i_cursor_id);
    
    dbms_sql.column_value(l_i_cursor_id,1,l_vc_prod_name);
    dbms_sql.close_cursor(l_i_cursor_id);
    
    return l_vc_prod_name;
END;
/

Here is the code that performs the same functionality in PySpark.

def get_prod_name(in_pid: int) -> str:
  # Build the statement, then fetch the first matching product name
  sql_stmt = """select PROD_NAME
     from products
     where PROD_ID={}
     LIMIT 1""".format(in_pid)
  ret = spark.sql(sql_stmt).collect()[0]["PROD_NAME"]
  return ret

# Function call
print(get_prod_name(20))
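Note that building the statement with string formatting reintroduces the SQL injection risk that the DBMS_SQL bind variable avoids. On recent runtimes (Apache Spark 3.4+ / a recent Databricks Runtime, an assumption about your environment), spark.sql also accepts named parameters, which preserves the bind-variable behavior of the original PL/SQL. Here is a minimal sketch:

def get_prod_name(in_pid: int) -> str:
  # Named parameters behave like PL/SQL bind variables (requires Spark 3.4+)
  sql_stmt = "select PROD_NAME from products where PROD_ID = :pid LIMIT 1"
  return spark.sql(sql_stmt, args={"pid": in_pid}).collect()[0]["PROD_NAME"]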

Migrating collections to Python

In Oracle PL/SQL, many collection and record types exist.

Collection Type                       | Number of Elements | Index Type        | Dense or Sparse
--------------------------------------|--------------------|-------------------|--------------------------------
Associative array (or index-by table) | Unspecified        | String or Integer | Either
VARRAY (variable-size array)          | Specified          | Integer           | Always dense
Nested table                          | Unspecified        | Integer           | Starts dense, can become sparse

These can be migrated to Python structures regardless of whether they are executed as tasks using PySpark for ETL purposes or as Python UDFs in Databricks SQL.

An associative array is a set of key-value pairs. Each key is a unique index used to locate the associated value with the syntax variable_name(index).

The data type of the index can be either a string type (VARCHAR2, VARCHAR, STRING, or LONG) or PLS_INTEGER. Indexes are stored in sorted order, not creation order. The best way to migrate an associative array to Python (or PySpark) is to use a dictionary.

Here is what the code looks like in PL/SQL:

DECLARE
  -- Associative array indexed by string:
  
  TYPE population IS TABLE OF NUMBER  
    INDEX BY VARCHAR2(64);           
  
  city_population  population;   
  i  VARCHAR2(64);
BEGIN
  -- Add elements (key-value pairs) to associative array:
 
  city_population('Smallville')  := 2000;
  city_population('Midland')     := 750000;
  city_population('Megalopolis') := 1000000;
 
  -- Change value associated with key 'Smallville':
 
  city_population('Smallville') := 2001;
 
  -- Print associative array:
 
  i := city_population.FIRST;  -- Get first element of array
 
  WHILE i IS NOT NULL LOOP
    DBMS_Output.PUT_LINE
      ('Population of ' || i || ' is ' || city_population(i));
    i := city_population.NEXT(i);  -- Get next element of array
  END LOOP;
END;
/

Below is an example of how to convert the associative array from PL/SQL into Python:

# Declare the dictionary and add elements (key-value pairs)
city_population = {'Smallville': 2000, 'Midland': 750000, 'Megalopolis': 1000000}

# Change the value associated with key 'Smallville'
city_population['Smallville'] = 2001

# Get the first key of the dictionary
# (PL/SQL associative arrays iterate in key order; use sorted(city_population) to match)
first_key = list(city_population.keys())[0]

# Print the dictionary content
for k, v in city_population.items():
  print(f"Population of {k} is {v}")

Data redaction

In the semantic layer of a data warehouse, it is sometimes necessary to redact sensitive data, and functions are very often used to implement the redaction process.

In an Oracle Database, you can use the Advanced Security Option for Data Redaction or a PL/SQL function that implements the redaction. Our migration teams can handle both techniques, but if the source database uses PL/SQL, the best solution is a Python UDF in Databricks SQL.

Python UDFs allow users to write Python code and invoke it through a SQL function in an easy, secure and fully governed way, bringing the power of Python to Databricks SQL.

In the following example, we translate a PL/SQL function that redacts product names when the list price is greater than 100, using the Python UDF feature.

Here is the PL/SQL code:

CREATE OR REPLACE FUNCTION simple_redaction(
  input VARCHAR2,
  price NUMBER)
return varchar2
as
BEGIN
  IF (price > 100) THEN
    RETURN SUBSTR(input,0,1)||'*****'||SUBSTR(input,-1);
  ELSE
    RETURN input;
  END IF;
END;
/

SQL> select simple_redaction(prod_name, list_price) as  
     r_prod_name, list_price 
     from product;

R_PROD_NAME                    LIST_PRICE
------------------------------ ----------
product 1                            10.5
p*****2                             103.1
product 3                            5.99
product 4                           12.35
product 5                              35
e*****t                              1400

The equivalent Python UDF is as below:

create or replace function simple_redaction(_input string, 
                 _price float)
returns STRING
language python
as $$
if _price >100:
  return (_input[0]+"*****"+_input[-1])
else:
  return _input
$$;



select simple_redaction(prod_name, list_price) as r_prod_name, list_price from product;

--------------------------------
| r_prod_name	| list_price   |
--------------------------------
| product 1	| 10.5         |
| p*****2	| 103.1        |
| product 3	| 5.99         | 
| product 4	| 12.35        |
| product 5	| 35.0         |
| e*****t	| 1400.0       |
--------------------------------

Planning your PL/SQL migration

Databricks and our SI/consulting partners can help you with a detailed technical migration assessment, which includes your target architecture, a technical assessment of your existing code (such as the number of objects to be migrated and their overall complexity classification), and technical approaches to data, code, and report modernization. Our customers can execute the migration in-house manually or accelerate it by using automated code conversion from PL/SQL to PySpark.

Automated migration approach

The Data Warehouse Migration practice at Databricks is thriving, and we have several ISV and consulting/SI partners who can assist with EDW migrations. Data ingestion partners like Fivetran, Qlik, and Arcion can help migrate the data in real time from Oracle to Databricks using CDC, while low-code/code-optional ETL partners like Matillion and Prophecy can help if stored procedures need to be converted to visual ETL mappings. See the full list of our ISV partners here.

With the aid of legacy platform assessment tools and automatic code conversion accelerators, Databricks Professional Services and several of our authorized Migrations Brickbuilder SI partners can also migrate PL/SQL code quickly and effectively to native Databricks Notebooks.

Here is one example of an automated code conversion demo from PL/SQL to PySpark by BladeBridge, our ISV conversion partner.

LeapLogic is another partner that also has automated assessment and code converters from various EDWs to Databricks. Here is a demo of their Oracle conversion tool to Databricks.

Most consulting/SI partners use similar automated conversion tools unless it is a full modernization and redesign.

Whether you choose to modernize your Legacy Oracle EDW platform in-house or with the help of a consulting partner, Databricks migration specialists and professional services team are here to help you along the way.
Please see this EDW Migration page for more information and partner migration offerings.

Feel free to reach out to the Databricks team for a customized Oracle Migration assessment.

Get started on migrating your first pieces of code to Databricks

Try Databricks free for 14 days.

