Finding the right place to put an ad is a persistent challenge: traditional, keyword-based contextual placement often falls short, missing nuances such as sarcasm or non-obvious connections. This blog shows how an AI Agent built on Databricks moves beyond these limitations to achieve highly nuanced, deeply contextual content placement.
We'll explore how this can be done in the context of movie and television scripts, identifying the specific scenes and moments where content will have the most impact. While we focus on this specific example, the concept generalizes to a broader catalog of media data, including TV scripts, audio scripts (e.g., podcasts), news articles, and blogs. Alternatively, the same approach could be repositioned for programmatic advertising: the input data would be the corpus of ad content with its associated metadata and placements, and the agent would generate the tagging needed for optimized placement, whether through direct programmatic buys or an ad server.
This solution leverages Databricks’ latest advancements in AI Agent tooling, including Agent Framework, Vector Search, Unity Catalog, and Agent Evaluation with MLflow 3.0. The diagram below provides a high-level overview of the architecture.
From a practical standpoint, this solution lets ad sellers ask, in natural language, where within a content corpus an advertisement should be slotted based on its description. In this example, our dataset contains a large volume of movie transcripts, so if we ask the agent, “Where can I place an advertisement for pet food? The ad is an image of a beagle eating from a bowl”, we would expect it to return specific scenes from well-known dog movies, such as Air Bud or Marley & Me.
Below is a real example from our agent:
Now that we have a high-level understanding of the solution, let's dive into how we prepare the data to build the agent.
Preprocessing Movie Data for Contextual Placement
When adding a retrieval tool to an agent – a technique called Retrieval Augmented Generation (RAG) – the data processing pipeline is a critical step to achieving high quality. In this example, we follow best practices for building a robust unstructured data pipeline, which generally includes four steps: parsing the raw files, extracting and enriching metadata, chunking the content, and loading the chunks into a vector index.
The dataset we use for this solution includes 1200 full movie scripts, which we store as individual text files. To slot ad content in the most contextually relevant way, we preprocess the data so the agent can recommend a specific scene within a movie, instead of the movie itself.
First, we parse the raw transcripts to split each script file into individual scenes, using standard screenplay scene headings (e.g., “INT.”, “EXT.”) as our scene delimiters. Along the way, we extract relevant metadata to enrich the dataset (e.g., title, scene number, scene location) and store it alongside the raw transcript in a Delta table.
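As a rough sketch of that parsing step (the regex, field names, and helper function are illustrative rather than the exact production code), a script can be split on scene headings like so:

```python
import re

# Standard screenplay scene headings ("INT." / "EXT.") act as scene delimiters.
SCENE_HEADING = re.compile(r"^(?:INT\.|EXT\.|INT/EXT\.).*$", re.MULTILINE)

def split_into_scenes(script_text: str, title: str) -> list[dict]:
    """Split one raw script into scene-level records with basic metadata."""
    headings = list(SCENE_HEADING.finditer(script_text))
    scenes = []
    for i, match in enumerate(headings):
        start = match.start()
        end = headings[i + 1].start() if i + 1 < len(headings) else len(script_text)
        scenes.append({
            "title": title,
            "scene_number": i + 1,
            "scene_location": match.group(0).strip(),   # e.g. "INT. KITCHEN - NIGHT"
            "scene_text": script_text[start:end].strip(),
        })
    return scenes
```

Each resulting record can then be written to the scene-level Delta table alongside the raw transcript.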
Next, we apply our chunking strategy to the cleansed scene data, treating each scene as a chunk and filtering out very short scenes, as retrieving those would not provide much value in this use case.
Note: While we initially considered fixed-length chunks (which would have likely been better than full scripts), splitting at scene delimiters offered a significant boost in the relevance of our responses.
Next, we load the scene-level data into a Vector Search index, taking advantage of built-in Delta Sync and Databricks-managed embeddings for ease of deployment and use. This means that if our script table is updated, the corresponding Vector Search index updates as well to reflect the data refresh. The image below shows a single movie (10 Things I Hate About You) broken up by scenes. Using vector search allows our agent to find scenes that are semantically similar to the ad content’s description, even when there are no exact keyword matches.
Creating the highly available and governed Vector Search index is simple, requiring only a few lines of code to define the endpoint, source table, embedding model, and Unity Catalog location. See the code below for the creation of the index in this example.
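A minimal sketch of that setup using the Vector Search Python SDK is shown below; the endpoint, catalog, table, and column names are placeholders for illustration:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create a Vector Search endpoint to host the index.
client.create_endpoint(name="ad_placement_endpoint", endpoint_type="STANDARD")

# Delta Sync index with Databricks-managed embeddings: when the source Delta
# table changes, the index is kept in sync automatically.
index = client.create_delta_sync_index(
    endpoint_name="ad_placement_endpoint",
    index_name="main.ad_placement.movie_scenes_index",   # Unity Catalog location
    source_table_name="main.ad_placement.movie_scenes",  # scene-level Delta table
    pipeline_type="TRIGGERED",
    primary_key="scene_id",
    embedding_source_column="scene_text",                # column to embed
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```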
Now that our data is in order, we can progress to building out our content placement agent.
A core principle of Agentic AI at Databricks is equipping an LLM with the requisite tools to effectively reason on enterprise data, unlocking data intelligence. Rather than asking the LLM to perform an entire end-to-end process, we offload certain tasks to tools and functions, making the LLM an intelligent process orchestrator. This enables us to use it exclusively for its strengths: understanding user semantic intent and reasoning about how to solve a problem.
For our application, we use a vector search index as a means to efficiently search for relevant scenes based on a user request. While an LLM's own knowledge base could theoretically be used to retrieve relevant scenes, using the Vector Search index approach is more practical, efficient, and secure because it guarantees retrieval from our governed enterprise data in Unity Catalog.
Note that the agent uses the comments in the function definition to determine when and how to call the function for a given user inquiry. The code below demonstrates how to wrap a Vector Search index in a standard Unity Catalog SQL function, making it an accessible tool for the agent's reasoning process.
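A hedged sketch of what that wrapper can look like, issued from a notebook via spark.sql, is below. The function, index, and column names are illustrative; note the COMMENT clauses, which the agent relies on to decide when to invoke the tool:

```python
spark.sql("""
CREATE OR REPLACE FUNCTION main.ad_placement.find_relevant_scenes(
  ad_description STRING COMMENT 'Natural-language description of the ad creative'
)
RETURNS TABLE
COMMENT 'Returns movie scenes that are semantically relevant to the provided ad description. Use this tool to recommend contextual ad placements.'
RETURN
  SELECT *
  FROM vector_search(
    index => 'main.ad_placement.movie_scenes_index',
    query_text => ad_description,
    num_results => 5
  )
""")
```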
Now that we have an agent defined, what is next?
One of the biggest obstacles preventing teams from getting agentic applications into production is measuring the quality and effectiveness of the agent. Subjective, 'vibes'-based evaluation is not acceptable for a production deployment: teams need a quantitative way to ensure the application performs as expected and to guide iterative improvements, and these questions keep product and development teams up at night. Enter Agent Evaluation with MLflow 3.0 from Databricks. MLflow 3.0 provides a robust suite of tools, including model tracing, evaluation, monitoring, and a prompt registry, to manage the end-to-end agent development lifecycle.
The evaluation functionality enables us to leverage built-in LLM judges to measure quality against pre-defined metrics. However, for specialized scenarios like ours, customized evaluation is often required. Databricks supports several levels of customization: natural-language “guidelines”, where the user provides judge criteria in plain language and Databricks manages the judge infrastructure; prompt-based judges, where the user supplies a prompt and custom evaluation criteria; and custom scorers, which can be simple heuristics or LLM judges defined entirely by the user.
In this use case, we use both a custom guideline for response format and a prompt-based custom judge to assess scene relevance, offering a powerful balance of control and scalability.
Another common challenge in agent evaluation is the lack of a ground-truth set of user requests to evaluate against while building the agent. In our case, we do not have a robust set of possible customer requests, so we also needed to generate synthetic data to measure the effectiveness of the agent we built. We leverage the built-in `generate_evals_df` function for this task, providing instructions to generate examples that we expect to match real customer requests. We use this synthetically generated data as the input to an evaluation job, bootstrapping a dataset and enabling a clear quantitative understanding of agent performance before delivering to customers.
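A sketch of that synthetic generation step is below, assuming the scene-level Delta table from earlier; the table name, guideline text, and number of evaluations are illustrative:

```python
from databricks.agents.evals import generate_evals_df

# Source documents: the scene-level chunks, with a URI for traceability.
docs = spark.table("main.ad_placement.movie_scenes").selectExpr(
    "scene_text AS content",
    "concat(title, '#', scene_number) AS doc_uri",
)

evals = generate_evals_df(
    docs,
    num_evals=100,
    agent_description=(
        "An agent that recommends the most contextually relevant movie scenes "
        "for placing a described advertisement."
    ),
    question_guidelines=(
        "Questions should read like ad-seller requests, e.g. 'Where can I place "
        "an advertisement for pet food? The ad shows a beagle eating from a bowl.'"
    ),
)
```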
With the dataset in place, we can run an evaluation job to quantify the quality of our agent. In this case, we use a mix of built-in judges (Relevance and Safety), a custom guideline that checks whether the agent returned data in the right format, and a prompt-based custom judge that rates the quality of the returned scene relative to the user query on a 1-5 scale. Luckily for us, our agent performs well according to the LLM judge feedback!
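Under those assumptions, the evaluation run might look roughly like the sketch below using MLflow 3's GenAI evaluation APIs. The judge model endpoint and the `query_agent` wrapper are hypothetical, and the prompt-based judge is sketched here as a custom scorer around an LLM call:

```python
import mlflow
from mlflow.genai.scorers import Guidelines, RelevanceToQuery, Safety, scorer
from mlflow.deployments import get_deploy_client

# Custom guideline: the format check is expressed in natural language.
response_format = Guidelines(
    name="response_format",
    guidelines="The response must name the movie title and the specific scene recommended for the placement.",
)

# Prompt-based judge, sketched as a custom scorer: an LLM rates the returned
# scene against the ad request on a 1-5 scale.
@scorer
def scene_relevance(inputs, outputs) -> int:
    judge = get_deploy_client("databricks")
    prompt = (
        "On a scale of 1 to 5, how relevant is the recommended scene to the ad request? "
        f"Answer with a single digit.\n\nAd request: {inputs}\n\nAgent response: {outputs}"
    )
    completion = judge.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct",  # any chat model endpoint works
        inputs={"messages": [{"role": "user", "content": prompt}], "max_tokens": 5},
    )
    return int(completion["choices"][0]["message"]["content"].strip()[0])

results = mlflow.genai.evaluate(
    data=evals,              # synthetic dataset generated above
    predict_fn=query_agent,  # hypothetical wrapper that calls our agent
    scorers=[RelevanceToQuery(), Safety(), response_format, scene_relevance],
)
```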
Within MLflow 3, we can also dive deeper into the traces to see how our model is performing and to understand the judge’s rationale behind every response. These observation-level details are extremely useful for digging into edge cases, making corresponding changes to the agent definition, and seeing how those changes affect performance. This rapid iteration loop is extremely powerful for building high-quality agents: we are no longer flying blind, and we now have a clear quantitative view into the performance of our application.
While LLM judges are extremely useful and often necessary for scalability, subject-matter expert feedback is frequently needed to build the confidence to move to production and to improve the agent's overall performance. Subject matter experts are usually not the AI engineers developing the agentic process, so we need a way to gather their feedback and integrate it back into our product and judges.
The Review App that comes with deployed agents via the Agent Framework provides this functionality out of the box. Subject Matter Experts can either interact in free-form with the agent, or engineers can create custom labeling sessions that ask subject matter experts to evaluate specific examples. This can be extremely useful for observing how the agent performs on challenging cases, or even as “unit-testing” on a suite of test cases that might be highly representative of end-user requests. This feedback - positive or negative - is directly integrated into the evaluation dataset, creating a “gold-standard” that can be used for downstream fine-tuning, as well as improving automated judges.
Agentic evaluation is certainly challenging and can be time-consuming, requiring coordination and investment across partner teams, including subject matter expert time, which may be perceived as outside the scope of normal role requirements. At Databricks, we view evaluations as the foundation of agentic application building, and it is critical that organizations recognize the importance of evaluation as a core component of the agentic development process.
Building agents on Databricks provides flexible deployment options for both batch and real-time use cases. In this scenario, we leverage Databricks Model Serving to create a scalable, secure, real-time endpoint that integrates downstream via the REST API. As a simple example, we expose this via a Databricks app that also functions as a custom Model Context Protocol (MCP) server, which lets us use this agent as a tool outside of Databricks.
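For reference, deploying a logged and Unity Catalog-registered agent to Model Serving is a short step with the Agent Framework; a minimal sketch is below, with the model name and version as placeholders:

```python
from databricks import agents

# Deploys the registered agent to a secure, scalable Model Serving endpoint
# and provisions the Review App for stakeholder feedback.
deployment = agents.deploy(
    model_name="main.ad_placement.content_placement_agent",  # UC-registered model
    model_version=1,
    scale_to_zero=True,
)
print(deployment.query_endpoint)  # REST endpoint for downstream integration
```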
As an extension of the core functionality, we can integrate image-to-text capabilities into the Databricks app. Below is an example where an LLM parses the inbound image, generates a text caption, and submits a custom request to the content placement agent, including a desired target audience. In this case, we leverage a multi-agent architecture to personalize an ad image using the Pet Ad Image Generator and ask for a placement:
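A rough sketch of that image-to-text handoff is shown below, assuming a multimodal chat endpoint for captioning and our deployed placement agent endpoint; both endpoint names, the response parsing, and the helper function are illustrative:

```python
import base64
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def place_ad_from_image(image_path: str, target_audience: str) -> str:
    # Step 1: a multimodal LLM captions the inbound ad image.
    image_b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    caption = client.predict(
        endpoint="databricks-claude-3-7-sonnet",  # illustrative multimodal endpoint
        inputs={"messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this advertisement image in one sentence."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }]},
    )["choices"][0]["message"]["content"]

    # Step 2: pass the caption and target audience to the placement agent.
    request = (
        f"Where should we place this ad for a {target_audience} audience? "
        f"Ad description: {caption}"
    )
    response = client.predict(
        endpoint="content_placement_agent",  # our deployed agent endpoint (illustrative)
        inputs={"messages": [{"role": "user", "content": request}]},
    )
    return response["choices"][0]["message"]["content"]
```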
Wrapping this agent in a custom MCP server extends the integration options for advertisers, publishers, and media planners into the existing adtech ecosystem.
By providing a scalable, real-time, and deeply contextual placement engine, this AI Agent moves beyond simple keywords to deliver significantly higher ad relevance, directly improving campaign performance and reducing ad waste for advertisers and publishers alike.
Learn More About AI Agents on Databricks: Explore our dedicated resources on building and deploying Large Language Models and AI Agents on the Databricks Lakehouse Platform.
Talk to an Expert: Ready to apply this to your business? Contact our team to discuss how Databricks can help you build and scale your next-generation advertising solution.