Over the years, organizations have amassed a vast amount of unstructured text data—documents, reports, and emails—but extracting meaningful insights has remained a challenge. Large Language Models (LLMs) now offer a scalable way to analyze this data, with batch inference as the most efficient solution. However, many tools still focus on online inference, leaving a gap for better batch processing capabilities.
Today, we’re excited to announce a simpler, faster, and more scalable way to apply LLMs to large documents. No more exporting data as CSV files to unmanaged locations—now you can run batch inference directly within your workflows, with full governance through Unity Catalog. Simply write the SQL query below and execute it in a notebook or workflow.
Using ai_query, you can now run at high scale with unmatched speed, ensuring fast processing of even the largest datasets. The interface supports all AI models, allowing you to securely apply LLMs, traditional AI models, or compound AI systems to analyze your data at scale.
“With Databricks, we processed over 400 billion tokens by running a multi-modal batch pipeline for document metadata extraction and post-processing. Working directly where our data resides with familiar tools, we ran the unified workflow without exporting data or managing massive GPU infrastructure, quickly bringing generative AI value directly to our data. We are excited to use batch inference for even more opportunities to add value for our customers at Scribd, Inc. “ - Steve Neola, Senior Director at Scribd
Batch inference enables businesses to apply LLMs to large datasets all at once, rather than one at a time, as with real-time inference. Processing data in bulk provides cost efficiency, faster processing, and scalability. Some common ways businesses are using batch inference include:
Existing batch inference approaches present several challenges, such as:
"With Databricks, we could automate tedious manual tasks by using LLMs to process one million+ files daily for extracting transaction and entity data from property records. We exceeded our accuracy goals by fine-tuning Meta Llama3 8b and, using Mosaic AI Model Serving, we scaled this operation massively without the need to manage a large and expensive GPU fleet." - Prabhu Narsina, VP Data and AI, First American

Mosaic AI allows you to perform batch LLM inference directly where your governed data resides with no data movement or preparation needed. Applying batch LLM inference is as simple as creating an endpoint with any AI model and running an SQL query (as shown in the figure). You can deploy any AI models—base, fine-tuned, or traditional—and execute SQL functions from any development environment on Databricks, whether interactively in the SQL editor or notebook or scheduled through Workflows and Delta Live Tables (DLT).

This release introduces multiple infrastructure improvements, enabling you to process millions of rows quickly and cost-effectively. The infrastructure scales automatically, adjusting resources to handle even the largest workloads efficiently. Additionally, built-in fault tolerance with automatic retries allows you to run large workflows confidently, seamlessly handling any errors along the way.
Real-world use cases require preprocessing and post-processing, with LLM inference often just one part of a broader workflow. Instead of piecing together multiple tools and APIs, Databricks enables you to execute the entire workflow on a single platform, reducing complexity and saving valuable time. Below is an example of how to run an end-to-end workflow with the new solution.

Or, if you’d prefer, you can leverage SQL’s advanced nesting features to directly blend these into a single query.
