Recursive Common Table Expressions (CTEs) are now supported in Databricks. This brings a native way to express loops and traversals in SQL, useful for working with hierarchical and graph-structured data. These capabilities are aligned with the SQL standard and follow familiar patterns used in platforms like Teradata. Recursive CTEs have long been part of the SQL standard, so they will be familiar to customers migrating from legacy data warehouses. Databricks has also contributed Recursive CTE support to Apache Spark™, making it fully open source.
Databricks uses standard ANSI SQL syntax for recursive CTEs, including the RECURSIVE keyword.
This seemingly small feature significantly enhances SQL's expressive capabilities, theoretically making it Turing complete—meaning it can perform any computation a computer can. Recursive CTEs enable composable solutions that previously required procedural code, such as Python or external tools.
Recursive CTEs are now available in Public Preview DBSQL 2025.20 and Databricks Runtime 17.0 (coming soon to Lakeflow Declarative Pipelines). In this blog, we’ll explore how recursive CTEs work—and how they can help you solve real-world problems using pure SQL.
Databricks’ recursive CTE support includes:
Recursive CTEs work well with both traditional systems that store hierarchical data in normalized tables as well as data coming from modern applications that generate flexible JSON/XML hierarchies. See examples below of each including RCTEs leveraging the Variant data type for JSON hierarchies.
Plus, support for recursive CTEs simplifies migrations from legacy database systems. Teradata and Postgres are two examples of systems whose syntax is identical, while systems like Oracle, which use CONNECT BY syntax, are easily converted.
Recursive CTEs are common table expressions defined with the RECURSIVE keyword. They consist of two parts combined by using UNION ALL:
Execution starts with the base query. Then, on each iteration, the recursive step is run using the output of the previous step. This continues until no new rows are produced.
To prevent infinite recursion from consuming excessive resources, Databricks enforces two safety limits: a maximum recursion depth of 100 steps and a row limit of 1 million. If either threshold is exceeded, the query fails with an error.
If you’re confident in your recursion requiring more than 100 steps to produce all the results, you may override the max level by using the MAX RECURSION LEVEL hint:
For more details, refer to the CTE documentation.
“At bp Supply Trading and Shipping – Market Risk, understanding portfolio hierarchy reporting across business units is critical for our business to operate efficiently. By replacing our legacy code with recursive CTEs in Databricks SQL, we reduced a hierarchical data preparation step from ~6 minutes to ~30 seconds, which is a 12× improvement.” — Dharmik Prajapati, bp Staff Software Engineer
In the manufacturing industry, every manufactured part requires a set of components to build. Each component could be broken down into a smaller set of individual parts. The complete set of all parts is called a Bill of Materials (BOM).
A BOM often forms a tree-like structure—or more generally, a directed acyclic graph (DAG). In this example, we look at the parts of a bicycle, which we’ll simplify by assuming a tree structure, where each component is used in exactly one parent.
Suppose we want to compute how many raw materials are needed to build a bicycle. Consider the following BOM:
Each row describes a component, the larger part it belongs to, and how many components are needed to assemble one unit of the parent.
The recursive CTE begins with one objective: building one bike. That’s the base case. In each recursive step, we break down components into their subcomponents. For instance, a bicycle includes a frame, a drivetrain, and two wheels. Each wheel, in turn, consists of a tire and 32 spokes. The recursive structure becomes clear as we break down parts into smaller pieces.
Once we've fully expanded the hierarchy, we filter out intermediate components (parents) to retain only the raw materials required for assembly.
This query computes the total quantity of each base material needed to build one bike:
Let’s examine a problem using a graph data structure. A graph consists of a set of nodes connected by edges. It is used to represent relationships or connections between pairs of elements. Solving a graph problem used to require Python, complicated scripting logic, or an external library. Now, recursive queries make it simple.
A typical graph structure problem is airplane travel: which airports can I reach using a series of flights? Suppose we have the following set of flights that exist on a day:
Each flight is given the IATA codes of its source and destination, along with the departure and arrival times.
Suppose a person arrives at the airport BEG at 8 AM, and wants to find all the possible travel routes they can make on that day.
This is naturally posed as an iterative problem. Each time we discover a new city we can reach, we find all flights that depart from there after our arrival time. For this reason, in the recursive CTE, we keep track of the arrival time at every airport.
This produces the set of all reachable airports, along with the required number and set of flights.
This query can help users explore all reachable destinations given schedule constraints, supporting applications like travel planning, package routing, or transportation logistics.
In the previous example, we defined column names in the WITH RECURSIVE ... AS (...) clause. Here, we define them in the anchor query instead. Both approaches are valid in recursive CTEs on Databricks.
Traditional systems often store hierarchical data in rigid, normalized tables. Meanwhile, modern applications frequently generate flexible JSON/XML hierarchies. Databricks’ combination of recursive CTEs with the VARIANT type lets you migrate these data patterns seamlessly, allowing you to query both traditional normalized data and flexible JSON/XML structures in a single system.
In this example, we’re given a (relatively small) company hierarchy. But instead of a fully structured table, we’re given it in the form of a JSON:
Suppose we want a list of all employees and their titles in a table. The fields of the people in the company don’t follow the same schema: some have direct subordinates, while others don’t; some have their location, some don’t! With the use of the VARIANT datatype in Databricks, all of their needed commonalities can be used inside a recursive CTE to fully explore the nested structure of the JSON, while their differences can be ignored.
The base case of the recursion is the full JSON data of the root employee, which includes a list of their subordinates. In each recursive step, the query processes each subordinate's data and repeats the process until it reaches an employee with no subordinates.
Here’s the recursive query for this example:
Even though all CTEs here are under a WITH RECURSIVE block, only the one with actual recursion is treated as recursive. Databricks is smart enough to detect which ones need recursion—even if you mark them all!
The output of the query:
Get started with recursive CTEs by reading the Databricks documentation.
To learn more about Databricks SQL, visit our website or read the documentation. You can also check out the product tour for Databricks SQL. If you want to migrate your existing warehouse to a high-performance, serverless data warehouse with a great user experience and lower total cost, then Databricks SQL is the solution — try it for free.