Multimodal retrieval represents a significant challenge in modern AI systems. Traditional retrieval systems struggle to effectively search across different data types without extensive metadata or tagging. This is particularly problematic for healthcare companies that manage large volumes of diverse content, including text, images, audio, and more, much of which lives in unstructured data sources.
Anyone working in healthcare understands the difficulty of merging unstructured data with structured data. A common example is clinical documentation, where handwritten clinical notes or discharge summaries are often submitted as PDFs, images, and similar formats. These documents must be either transcribed manually or processed with Optical Character Recognition (OCR) to extract the necessary information. Even after this step, you must map the extracted data to your existing structured data to utilize it effectively.
For this blog, we will review the following:
By the end of this blog, you will see how multi-modal embeddings enable the following for healthcare:
An embedding space (AWS | Azure | GCP) is an n-dimensional mathematical representation of records that allows one or more data modalities to be stored as vectors of floating-point numbers. What makes this useful is that in a well-constructed embedding space, records with similar meanings occupy nearby regions of the space. For example, imagine we had a picture of a horse, the word “truck”, and an audio recording of a dog barking. We pass these three completely different data points into our multimodal embedding model and get back the following:
Here is a visual representation of where the numbers would exist in an embedding space:
In practice, embedding space dimensions will be in the hundreds or thousands, but for illustration, let’s use 3-space. We can imagine the first position in these vectors represents “animalness,” the second is “transportation-ness,” and the third is “loudness.” That would make sense given the embeddings, but typically, we do not know what each dimension represents. The important thing is that they represent the semantic meaning of the records.
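To make that concrete, here is a toy sketch in Python. The vector values are hypothetical, chosen only to match the three illustrative dimensions above, and cosine similarity is used to compare them:

```python
import numpy as np

# Toy 3-dimensional embeddings; positions loosely read as
# "animalness", "transportation-ness", and "loudness".
horse_image    = np.array([0.9, 0.1, 0.3])  # picture of a horse
truck_text     = np.array([0.1, 0.9, 0.4])  # the word "truck"
dog_bark_audio = np.array([0.9, 0.1, 0.8])  # audio of a dog barking

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(horse_image, dog_bark_audio))  # high: both records are about animals
print(cosine(horse_image, truck_text))      # lower: very different meanings
```

Even though the horse comes from an image and the barking from audio, their vectors land close together, which is exactly the property cross-modal search relies on.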
There are several ways to create a multimodal embedding space, including training multiple encoders simultaneously (such as CLIP), using cross-attention mechanisms (such as DALL-E), or using various post-training alignment methods. These methods allow the record's meaning to transcend the original modality and occupy a shared space with other disparate records or formats.
This shared semantic space is what enables powerful cross-modal search capabilities. When a text query and an image share similar vector representations, they likely share similar semantic meanings, allowing us to find relevant images based on textual descriptions without explicit tags or metadata.
To effectively implement multimodal search, we need models that can generate embeddings for different data types within a shared vector space. These models are specifically designed to understand the relationships between different modalities and represent them in a unified mathematical space.
Several powerful multimodal embedding models are available as of June 2025:
At Databricks, we provide the infrastructure and tools to host, evaluate, and develop an end-to-end solution, customizable to your use case. Consider the following scenarios as you begin deploying this use case:
For the full implementation of this solution, please visit the repo: Github Link
This example uses synthetic patient information as our structured data and sample explanations of benefits in PDF format as our unstructured data. First, synthetic data is generated for use with a Genie Space. Then Nomic's state-of-the-art open source multimodal embedding model is loaded onto Databricks Model Serving to generate embeddings for sample explanations of benefits found online.
This process sounds complicated, but Databricks provides built-in tools that enable a complete, end-to-end solution. At a high level, the process looks like the following:
This Genie Space will be used as a tool that converts natural language into SQL queries against our structured data.
In this example, the Faker library will be used to generate random patient information. We will create two tables to diversify our data: Patient Visits and Practice Locations, with columns such as reasons for visit, insurance providers, and insurance types.
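As a hedged illustration of that step (column names and table locations are placeholders; the repo defines the full schema), the generation for Patient Visits might look like this:

```python
import random
from faker import Faker

fake = Faker()

# Illustrative value pools; providers here are fictional.
visit_reasons = ["Annual physical", "Flu symptoms", "Back pain", "Follow-up"]
insurance_providers = ["Acme Health", "Sunrise Insurance", "Bluebird Care"]
insurance_types = ["HMO", "PPO", "EPO"]

patient_visits = [
    {
        "patient_name": fake.name(),
        "visit_date": fake.date_between(start_date="-1y", end_date="today"),
        "reason_for_visit": random.choice(visit_reasons),
        "insurance_provider": random.choice(insurance_providers),
        "insurance_type": random.choice(insurance_types),
    }
    for _ in range(1000)
]

# `spark` is available by default in Databricks notebooks.
# Save as a Delta table so the Genie Space can query it.
spark.createDataFrame(patient_visits).write.mode("overwrite").saveAsTable(
    "main.healthcare.patient_visits"  # placeholder catalog.schema
)
```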
To query data using natural language, we can utilize a Databricks Genie Space (AWS | Azure | GCP) to convert our question into SQL and retrieve relevant patient data. In the Databricks UI, simply click the Genie tab in the left bar → New → select the patient_visits and practice_locations tables.
We need the Genie Space ID, which is the value that comes after rooms in the Genie Space URL. You can see an example below:
Since we are using DSPy, all we need to do is define a Python function.
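For example, a minimal sketch of such a function, assuming the Databricks SDK's Genie conversation API and a placeholder space ID, could look like the following; the exact response parsing will depend on your SDK version:

```python
from databricks.sdk import WorkspaceClient

GENIE_SPACE_ID = "<value-after-rooms-in-the-url>"  # placeholder

def query_patient_data(question: str) -> str:
    """Ask the Genie Space a natural-language question about the patient tables."""
    w = WorkspaceClient()
    # Start a Genie conversation and wait for the response.
    message = w.genie.start_conversation_and_wait(
        space_id=GENIE_SPACE_ID, content=question
    )
    # The returned message carries the generated SQL and query results;
    # extract the pieces your agent needs before handing them to the LLM.
    return str(message)
```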
That’s it! Let’s set up the Multi-Modal Generation workflow now.
For this step, we will use the fully open colNomic-embed-multimodal-7b model on HuggingFace to generate embeddings for our unstructured data, in this case, PDFs. We selected Nomic’s model due to its Apache 2.0 license and high performance on benchmarks.
The method for generating your embeddings will vary depending on your use case and modality. Review the Databricks Vector Search Best Practices (AWS | Azure | GCP) to understand what is best for your use case.
We need this model to be available within Databricks Unity Catalog (UC), so we will use MLflow to load it from Huggingface and register it. Then, we can deploy the model to a model-serving endpoint.
The Python model includes additional logic to handle image inputs, which can be found in the complete repository.
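As a rough sketch of that flow (catalog, schema, and model names below are placeholders, and the model-loading details are omitted), the MLflow registration might look like:

```python
import mlflow

# Register to Unity Catalog rather than the workspace registry.
mlflow.set_registry_uri("databricks-uc")

class NomicEmbedder(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Load colNomic-embed-multimodal-7b from Hugging Face here
        # (e.g., with the transformers library) and keep it on self.model.
        ...

    def predict(self, context, model_input):
        # Return one embedding per input record (text or base64-encoded image).
        ...

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="multimodal_embedder",
        python_model=NomicEmbedder(),
        registered_model_name="main.healthcare.colnomic_embed_multimodal",  # placeholder UC name
    )
```

Once registered, the UC model can be deployed to a Model Serving endpoint from the UI or the serving API.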
UC Volumes are designed like file systems to host any file and are where we store our unstructured data. You can use them in the future to store other files, such as images, and repeat the process as needed. This includes the model above. In the repository, you will see that the cache refers to a volume.
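For instance, a Volume for the raw documents could be created and inspected like this (names are placeholders):

```python
# Create a Volume under an existing catalog/schema, then list its contents.
spark.sql("CREATE VOLUME IF NOT EXISTS main.healthcare.raw_docs")
display(dbutils.fs.ls("/Volumes/main/healthcare/raw_docs/"))
```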
You will have a folder called sample_pdf_sbc containing some example summaries of benefits and coverage. We need to prepare these PDFs to embed them.
The colNomic-embed-multimodal-7b model is specifically trained to understand text and visual elements together within a single image, a common input when working with PDFs. This allows the model to perform exceptionally well at retrieving these pages.
This approach lets you use all of the content within a PDF without needing a text chunking strategy to make retrieval work effectively; the model embeds the page images directly into the shared embedding space.
We will use pdf2image to convert each page of the PDF into an image, preparing it for embedding.
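A minimal sketch of that conversion, with placeholder Volume paths, might look like the following (pdf2image requires the poppler utilities to be installed on the cluster):

```python
from pdf2image import convert_from_path

pdf_path = "/Volumes/main/healthcare/raw_docs/sample_pdf_sbc/example_sbc.pdf"  # placeholder
pages = convert_from_path(pdf_path, dpi=200)  # one PIL.Image per PDF page

# Optionally persist the rendered pages back to the Volume so the agent can
# display the exact page it retrieved later.
for i, page in enumerate(pages):
    page.save(f"/Volumes/main/healthcare/raw_docs/page_images/example_sbc_{i}.png")
```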
Now that we have the PDF images, we can generate the embeddings. At the same time, we can save the embeddings to a Delta table with additional columns that we will retrieve alongside our Vector Search, like the file path to the Volume location.
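Here is a hedged sketch building on the pages rendered above; the endpoint and table names are placeholders, and the exact request/response shape depends on how the pyfunc model was defined:

```python
import base64, io
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

rows = []
for i, page in enumerate(pages):
    # Encode each page image so it can be sent to the serving endpoint.
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

    response = client.predict(
        endpoint="colnomic-embed-multimodal",   # placeholder serving endpoint
        inputs={"inputs": [image_b64]},
    )
    rows.append(
        {
            "image_path": f"/Volumes/main/healthcare/raw_docs/page_images/example_sbc_{i}.png",
            "embedding": response["predictions"][0],
        }
    )

# Persist embeddings plus the Volume path we will retrieve alongside the search results.
spark.createDataFrame(rows).write.mode("append").saveAsTable(
    "main.healthcare.sbc_page_embeddings"  # placeholder Delta table
)
```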
Creating a Vector Search index can be done via UI or API. The API method is shown below.
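A sketch of the API call for a Delta Sync index over the embeddings table, using self-managed embeddings and placeholder names, might look like this:

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_delta_sync_index(
    endpoint_name="healthcare_vs_endpoint",                   # placeholder Vector Search endpoint
    index_name="main.healthcare.sbc_page_embeddings_index",   # placeholder
    source_table_name="main.healthcare.sbc_page_embeddings",  # Delta table from the previous step
    pipeline_type="TRIGGERED",
    primary_key="image_path",
    embedding_vector_column="embedding",
    embedding_dimension=128,  # set this to your embedding model's output dimension
)
```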
Now we just need to tie it all together with an Agent.
We use DSPy for this because of its declarative, pure Python design. It allows us to iterate and develop quickly, testing various models to see which ones will work best for our use case. Most importantly, the declarative nature allows us to modularize our Agent so that we can isolate the Agent’s logic from the tools and focus on defining HOW the agent should accomplish its task.
And the best part? No manual prompt engineering!
This signature specifies and enforces the inputs and outputs, while also describing how the step should behave.
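For illustration, a signature for the final answering step might look like the following; the field names are assumptions rather than the repo's exact definitions:

```python
import dspy

class AnswerPatientInsuranceQuestion(dspy.Signature):
    """Answer a patient coverage question using structured visit data and
    retrieved explanation-of-benefits pages."""

    question: str = dspy.InputField()
    patient_records: str = dspy.InputField(desc="Rows returned by the Genie Space tool")
    benefit_pages: str = dspy.InputField(desc="Content retrieved from the Vector Search index")
    answer: str = dspy.OutputField(desc="A grounded answer that cites the retrieved context")
```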
The module will take the instructions from the signature and create an optimal prompt to send to the LLM. For this particular use case, we will build a custom module called `MultiModalPatientInsuranceAnalyzer()`.
This custom module breaks the signatures out into steps inside its forward method, effectively “chaining” the calls together. We follow this process:
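At a high level, the forward method asks the Genie Space for structured patient data, retrieves the most relevant benefit pages from the Vector Search index, and then synthesizes an answer. A rough sketch, building on the earlier snippets (the class layout and the search_benefit_pages helper are assumptions; see the repo for the full implementation):

```python
import dspy

class MultiModalPatientInsuranceAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought(AnswerPatientInsuranceQuestion)

    def forward(self, question: str):
        # Step 1: pull structured patient data through the Genie Space tool.
        patient_records = query_patient_data(question)
        # Step 2: retrieve the most relevant benefit pages from the Vector Search
        # index (search_benefit_pages is a hypothetical retrieval helper).
        benefit_pages = search_benefit_pages(question)
        # Step 3: chain both results into the answering signature.
        return self.answer(
            question=question,
            patient_records=patient_records,
            benefit_pages=str(benefit_pages),
        )
```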
Review what tools the Agent used and the reasoning the Agent went through to answer the question.
Once you have a working Agent, we recommend the following:
The evaluation framework will be crucial in understanding how effectively the Vector Search index retrieves relevant information for your RAG agent. By tracking these metrics, you will know where to make adjustments, from changing the embedding model to adjusting the prompts sent to the LLM.
You should also monitor whether the Foundation Model API (AWS | Azure | GCP) is sufficient for your use case. At a certain point, you will reach the API limits for the Foundation Model APIs, and you will need to transition to Provisioned Throughput (AWS | Azure | GCP) for a more reliable endpoint for your LLM.
Furthermore, keep a close eye on your serverless model serving costs (AWS | Azure | GCP). Most of this solution's costs will come from the serverless model serving SKU and may grow as you scale up.
Check out these blogs to understand how to do this on Databricks.
In addition, Databricks Delivery Solutions Architects (DSAs) help accelerate Data and AI initiatives across organizations. DSAs provide architectural leadership, optimize platforms for cost and performance, enhance developer experience, and drive successful project execution. They bridge the gap between initial deployment and production-grade solutions, working closely with various teams, including data engineering, technical leads, executives, and other stakeholders to ensure tailored solutions and faster time to value. Contact your Databricks Account Team to learn more.
Get started by building your own GenAI app! Check out the documentation to begin.
At Databricks, you have all the tools you need to develop this end-to-end solution. Check out the blogs below to learn about managing and working with your new Agent with the Mosaic AI Agent Framework.