Today’s advertising demands more than eye-catching images. It needs creative that actually fits the tastes, segments, and expectations of a target audience. This is the natural next step after understanding your audience: using what you know about a segment’s preferences to create imagery that truly resonates.
Multimodal Retrieval-Augmented Generation (RAG) provides a practical way to do this at scale. It works by combining a text-based understanding of a target segment (e.g., “outdoor-loving dog owner”) with fast search and retrieval of semantically relevant real images. These retrieved images then serve as context for generating a new creative. This bridges the gap between customer data and high-quality content, making sure the output connects with the target audience.
This blog post will demonstrate how modern AI, through image retrieval and multimodal RAG, can make ad creative more grounded and relevant, all powered end-to-end with Databricks. We’ll show how this works in practice for a hypothetical pet food brand called “Bricks” running pet-themed campaigns, but the same technique can be applied to any industry where personalization and visual quality matter.
This solution turns audience understanding into relevant, on-brand visuals by leveraging Unity Catalog, Agent Framework, Model Serving, Batch Inference, Vector Search, and Apps. The diagram below provides a high-level overview of the architecture.
Because everything flows through UC, the system is secure by default, observable (MLflow tracing on retrieval and generation), and easy to evolve (swap models, tune prompts). Let’s dive deeper.
All of the pet image assets (seed pet images, brand image, and final ads) live in a UC Volume.
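A layout like the following keeps the asset types separate (the catalog, schema, and volume names here are illustrative placeholders, not the actual paths):

```
/Volumes/main/pet_ads/creative_assets/
├── seed_images/      # raw pet photos used as retrieval candidates
├── brand/            # brand assets, e.g. logo or packaging shots
└── generated_ads/    # final composited ad creatives written back by the pipeline
```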
UC Volumes give us efficient storage for image data behind a FUSE mount point and, more importantly, a single governance plane: the same ACLs apply whether you access the files from notebooks, serving endpoints, or the Databricks app. It also means there are no blob-storage keys in our code; we pass paths, not bytes, across services.
For fast indexing and governance, we then mirror the Volume into a Delta table that loads the raw image bytes (as a BINARY column) alongside metadata. We also convert the bytes into a base64-encoded string, which the model serving endpoint requires. This makes downstream batch jobs (like embedding) simple and efficient.
Now there is one queryable table and we have also preserved the original Volume paths for human readability and app use.
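A minimal sketch of that mirroring step, assuming hypothetical catalog/schema/table names and PNG assets:

```python
from pyspark.sql import functions as F

VOLUME_PATH = "/Volumes/main/pet_ads/creative_assets/seed_images"   # illustrative
TARGET_TABLE = "main.pet_ads.pet_images"                            # illustrative

images_df = (
    spark.read.format("binaryFile")              # yields path, modificationTime, length, content
    .option("pathGlobFilter", "*.png")
    .load(VOLUME_PATH)
    .select(
        F.col("path").alias("volume_path"),              # keep the human-readable UC path
        F.col("content").alias("image_bytes"),           # raw bytes (BINARY)
        F.base64("content").alias("image_base64"),       # base64 string for the serving endpoint
    )
)

images_df.write.mode("overwrite").saveAsTable(TARGET_TABLE)
```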
CLIP is a lightweight model that pairs an image encoder with a transformer text encoder, trained via contrastive learning to produce aligned image-text embeddings for zero-shot classification and retrieval. To keep retrieval fast and reproducible, we expose the CLIP encoder behind a Databricks Model Serving endpoint, which we use for both online lookups and offline batch inference. We package the encoder as an MLflow pyfunc, log and register it as a UC model, serve it on a custom model serving endpoint, and then call it from SQL with ai_query to fill the embeddings column in the Delta table.
We wrap CLIP ViT-L/14 as a small pyfunc that accepts a base64 image string and returns a normalized vector. After logging, we register the model for serving.
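A minimal sketch of that wrapper, assuming the Hugging Face transformers implementation of CLIP ViT-L/14 and an input DataFrame with an image_base64 column (the column name is our own choice here):

```python
import base64
import io

import mlflow.pyfunc


class CLIPImageEncoder(mlflow.pyfunc.PythonModel):
    """Sketch of a pyfunc wrapper around CLIP ViT-L/14 (Hugging Face transformers)."""

    def load_context(self, context):
        from transformers import CLIPModel, CLIPProcessor

        self.model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    def predict(self, context, model_input):
        import torch
        from PIL import Image

        embeddings = []
        for b64 in model_input["image_base64"]:
            image = Image.open(io.BytesIO(base64.b64decode(b64))).convert("RGB")
            inputs = self.processor(images=image, return_tensors="pt")
            with torch.no_grad():
                features = self.model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)  # L2-normalize
            embeddings.append(features.squeeze(0).tolist())
        return embeddings
```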
We log the model with MLflow and register it in the Unity Catalog model registry.
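A sketch of the logging and registration step (the three-level UC model name is illustrative; Unity Catalog registration requires a model signature, so we infer one from an example):

```python
import pandas as pd
import mlflow
from mlflow.models import infer_signature

mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog

example = pd.DataFrame({"image_base64": ["<base64-encoded image>"]})
signature = infer_signature(example, [[0.0] * 768])  # 768-dim CLIP ViT-L/14 embedding

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        artifact_path="clip_image_encoder",
        python_model=CLIPImageEncoder(),            # the pyfunc class sketched above
        signature=signature,
        input_example=example,
        registered_model_name="main.pet_ads.clip_image_encoder",  # illustrative UC name
        pip_requirements=["torch", "transformers", "pillow"],
    )
```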
We create a GPU serving endpoint for the registered model. This gives a low-latency, versioned API we can call.
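A sketch using the Databricks Python SDK (endpoint name, model version, and workload sizing are illustrative):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

w.serving_endpoints.create(
    name="clip-image-encoder",                       # illustrative endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.pet_ads.clip_image_encoder",  # UC model registered above
                entity_version="1",
                workload_type="GPU_SMALL",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```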
With images landed in a Delta table, we compute embeddings in place using SQL. The result is a new table with an image_embeddings column, ready for Vector Search.
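A sketch of that batch step, run from a notebook (table and endpoint names are illustrative, and the request passed to ai_query must match the pyfunc signature defined earlier):

```python
# Change Data Feed is enabled so the table can back a Delta Sync Vector Search index later.
spark.sql("""
  CREATE OR REPLACE TABLE main.pet_ads.pet_image_embeddings
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
  AS SELECT
    volume_path,
    image_base64,
    ai_query(
      'clip-image-encoder',             -- the CLIP serving endpoint created above
      image_base64,
      returnType => 'ARRAY<FLOAT>'
    ) AS image_embeddings
  FROM main.pet_ads.pet_images
""")
```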
After we’ve materialized the image embeddings in Delta, we make them searchable with Databricks Vector Search. The pattern is to create a Delta Sync index with self-managed embeddings, then at runtime embed the text prompt using CLIP and run a top-K similarity search. The service returns small, structured results (paths + optional metadata), and we pass UC Volume paths forward (not raw bytes) so the rest of the pipeline stays light and governed.
We create the vector index once; it continuously syncs from our embeddings table and serves low-latency queries.
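A sketch of the index creation, assuming a Vector Search endpoint already exists and reusing the illustrative names from earlier steps:

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_delta_sync_index(
    endpoint_name="pet_ads_vs_endpoint",                       # existing VS endpoint (illustrative)
    index_name="main.pet_ads.pet_image_embeddings_index",
    source_table_name="main.pet_ads.pet_image_embeddings",
    pipeline_type="CONTINUOUS",                                # keep the index in sync automatically
    primary_key="volume_path",
    embedding_dimension=768,                                   # CLIP ViT-L/14 embedding size
    embedding_vector_column="image_embeddings",                # self-managed embeddings
)
```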
Because the index is Delta synced, any new table rows of embeddings (or updated rows of embeddings) will automatically be indexed. Access to both the table and the index inherits Unity Catalog ACLs, so you don’t need separate permissions.
At inference we embed the text prompt and query the top three results with similarity search.
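A sketch of that lookup, assuming a CLIP text-encoder endpoint is available alongside the image encoder and that the payload shape matches its pyfunc signature:

```python
import mlflow.deployments
from databricks.vector_search.client import VectorSearchClient

client = mlflow.deployments.get_deploy_client("databricks")

# Embed the text prompt; endpoint name and payload shape are illustrative.
response = client.predict(
    endpoint="clip-text-encoder",
    inputs={"dataframe_records": [{"text": "a French Bulldog in a city apartment"}]},
)
query_vector = response["predictions"][0]

index = VectorSearchClient().get_index(
    endpoint_name="pet_ads_vs_endpoint",
    index_name="main.pet_ads.pet_image_embeddings_index",
)
results = index.similarity_search(
    query_vector=query_vector,
    columns=["volume_path"],
    num_results=3,
)
# Each row of data_array holds the requested columns plus the similarity score.
ranked_paths = [row[0] for row in results["result"]["data_array"]]
```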
When we pass the results to the agent later, we will keep the response minimal and actionable. We return just the ranked UC paths which the agent will cycle through (if a user rejects the initial seed image) without re-querying the index.
Here we create an endpoint which accepts a short text prompt and orchestrates two steps:

1. Retrieval: embed the prompt with CLIP and pull the top-K seed image candidates from Vector Search.
2. Generation: combine the chosen seed image with the brand image and call the external image generator.
This second part is optional and controlled by a toggle depending on whether we want to run the image generation or just the retrieval.
For the generation step, we call out to Replicate using the Kontext multi-image max model. This choice is pragmatic: the model accepts multiple conditioning images (our pet seed plus the brand asset), and a hosted API means we avoid standing up and operating our own image-generation infrastructure.
Generation call:
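A sketch of the call via the Replicate Python client; the model reference and input field names below are illustrative and should be checked against the Kontext multi-image model’s schema on Replicate:

```python
import io

import replicate  # assumes REPLICATE_API_TOKEN is set in the endpoint's environment
import requests


def _replicate_image_generation(seed_bytes: bytes, brand_bytes: bytes, prompt: str) -> bytes:
    output = replicate.run(
        "flux-kontext-apps/multi-image-kontext-max",   # illustrative model reference
        input={
            "prompt": prompt,                          # ad-specific instruction
            "input_image_1": io.BytesIO(seed_bytes),   # retrieved pet seed image
            "input_image_2": io.BytesIO(brand_bytes),  # brand asset
        },
    )
    # Depending on the client version, the output is a file-like object or a URL.
    if hasattr(output, "read"):
        return output.read()
    return requests.get(str(output)).content
```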
Write-back to UC Volume - decode base64 from the generator and write via the Files API:
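A sketch of the write-back, assuming a hypothetical target Volume path and a base64 string from the generator:

```python
import base64
import io
import uuid

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()


def write_generated_ad(image_b64: str) -> str:
    """Decode the generator output and upload it to a UC Volume; return the UC path."""
    uc_path = f"/Volumes/main/pet_ads/creative_assets/generated_ads/{uuid.uuid4().hex}.png"
    w.files.upload(uc_path, io.BytesIO(base64.b64decode(image_b64)), overwrite=True)
    return uc_path
```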
If desired, it is easy to replace Kontext/Replicate with any external image API (e.g. OpenAI) or an internal model served on Databricks without changing the rest of the pipeline. Simply replace the internals of the _replicate_image_generation method and keep the input contract (pet seed bytes + brand bytes) and output (PNG bytes → UC upload) identical. The chat agent, retrieval, and app stay the same because they operate on UC paths, not image payloads.
The chat agent is a final serving endpoint that holds the conversation policy and calls the image-gen endpoint as a tool. It proposes a pet type, retrieves the seed image candidates, and only generates the final ad image on the user’s confirmation.
The tool schema is kept minimal:
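A sketch of what that schema can look like in the OpenAI-style function-calling format (the tool name, field names, and descriptions are our own choices here):

```python
IMAGE_GEN_TOOL = {
    "type": "function",
    "function": {
        "name": "generate_pet_ad",
        "description": "Retrieve seed pet images and optionally generate the final ad creative.",
        "parameters": {
            "type": "object",
            "properties": {
                "prompt": {
                    "type": "string",
                    "description": "Short description of the pet / audience segment.",
                },
                "seed_index": {
                    "type": "integer",
                    "description": "Which of the top-3 retrieved seeds to use (0-2).",
                },
                "replicate_toggle": {
                    "type": "boolean",
                    "description": "If true, run image generation; otherwise retrieval only.",
                },
            },
            "required": ["prompt"],
        },
    },
}
```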
We then create the tool execution function where the agent can call the image-gen endpoint which returns a structured JSON response.
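A sketch of that execution function, calling the image-gen endpoint through the MLflow deployments client (endpoint name and payload shape are illustrative) and traced as a tool span:

```python
import mlflow
import mlflow.deployments

deploy_client = mlflow.deployments.get_deploy_client("databricks")


@mlflow.trace(span_type="TOOL")
def call_image_gen(prompt: str, seed_index: int = 0, replicate_toggle: bool = False) -> dict:
    """Call the image-gen serving endpoint and return its structured JSON response."""
    return deploy_client.predict(
        endpoint="pet-ad-image-gen",                   # illustrative endpoint name
        inputs={
            "inputs": {
                "prompt": prompt,
                "seed_index": seed_index,
                "replicate_toggle": replicate_toggle,  # retrieval only unless true
            }
        },
    )
```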
The replicate_toggle param controls whether image generation runs.
Why split endpoints?
We separate the image-gen and chat agent endpoints for a couple of reasons: the generation service can be scaled, versioned, and swapped independently (as noted above) without touching the conversation policy, and the chat agent stays a thin orchestration layer that simply calls the image-gen endpoint as a tool.
The above is all built on the Databricks Agent Framework, which gives us the tools to turn retrieval + generation into a reliable, governed agent. First, the framework lets us register the image-gen service (including Vector Search) as tools and invoke them deterministically from the policy/prompt. It also provides us with production scaffolding out of the box. We use the ResponsesAgent interface for the chat loop, with MLflow tracing on every step. This gives full observability for debugging and operating a multi-step workflow in production. The MLflow tracing UI lets us explore this visually.
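As a rough sketch of the ResponsesAgent shape (the class name, reply text, and elided LLM/tool-dispatch logic are placeholders; helper and type names should be checked against the installed MLflow version):

```python
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import ResponsesAgentRequest, ResponsesAgentResponse


class PetAdAgent(ResponsesAgent):
    """Skeleton only: the conversation policy, LLM call, and tool dispatch are elided."""

    def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse:
        # 1. Read the conversation so far from request.input.
        # 2. Ask the LLM for the next action (propose a pet, call the image-gen tool, confirm).
        # 3. Execute tool calls such as call_image_gen(...) and fold the results back in.
        reply = "For this segment I suggest a French Bulldog. Shall I fetch seed images?"
        return ResponsesAgentResponse(
            output=[self.create_text_output_item(text=reply, id="msg-0")]
        )
```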
The Streamlit app presents a single conversational experience that lets the user talk to the chat agent and renders the images. It combines all the pieces we have built into a single, governed product experience.
How it comes together: the user chats with the agent directly from the app, the agent calls the image-gen endpoint as a tool (retrieval first, then generation on confirmation), and the app renders whatever UC Volume paths come back.
Below are the two key touchpoints in the app code. First, calling the agent endpoint:
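A sketch of that call from Streamlit, assuming an illustrative agent endpoint name and a Responses-style input payload:

```python
import mlflow.deployments
import streamlit as st

client = mlflow.deployments.get_deploy_client("databricks")

if "history" not in st.session_state:
    st.session_state.history = []

if user_msg := st.chat_input("Describe your target segment"):
    st.session_state.history.append({"role": "user", "content": user_msg})
    response = client.predict(
        endpoint="pet-ad-chat-agent",                 # illustrative endpoint name
        inputs={"input": st.session_state.history},   # payload shape depends on the agent interface
    )
```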
And render by UC Volume path:
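A sketch of the rendering helper, downloading the bytes from the UC Volume via the Files API:

```python
import streamlit as st
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()


def render_uc_image(uc_path: str) -> None:
    """Fetch an image by its UC Volume path and render it inline in the chat."""
    download = w.files.download(uc_path)
    st.image(download.contents.read(), caption=uc_path)
```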
Let’s walk through a single flow that ties the whole system together. Imagine a marketer wants to generate a customized pet ad image for “young urban professionals”.
The chat agent interprets the segment and proposes a type of pet, e.g., a French Bulldog.
On “yes”, the agent calls the image-gen endpoint in retrieval mode and returns the top three images (Volume paths) from Vector Search ranked by similarity. The app displays candidate #0 as the seed image.
If the user says something to the effect of “looks right”, we proceed. If they say “not quite”, the agent increments seed_index to 1 (then 2) and reuses the same top-3 set (no extra vector queries) to show the next option. If the user is still not happy after three images, the agent suggests a new pet description and triggers another Vector Search call. This keeps the UX snappy and deterministic.
On confirmation, the agent calls the endpoint again with replicate_toggle=true and the same seed_index. The endpoint reads the chosen seed image from UC, combines it with the brand image, runs the Kontext multi-image-max generator on Replicate, and then uploads the final PNG back to a UC Volume. Only the UC path is returned. The app then downloads the image and renders it back to the user.
In this blog we have demonstrated how multimodal RAG unlocks advanced ad personalization. Grounding generation with real image retrieval is the difference between generic visuals and creative that resonates with a specific audience: it unlocks segment-level personalization, on-brand visual quality, and faster creative iteration at scale.
Databricks keeps the application scalable, governed, and performant, and the architecture stays production-ready with every step running end-to-end inside the platform: Unity Catalog governs the assets and models, Model Serving hosts the CLIP encoder, the image-gen service, and the chat agent, ai_query handles batch embedding, Vector Search powers retrieval, MLflow traces every step, and a Databricks App delivers the experience.
The result is a modular, governed RAG agent that turns audience insights into high-quality, on-brand creative at scale.