At Ibotta, our mission is to Make Every Purchase Rewarding. Helping our users (whom we call Savers) find and activate relevant offers through our direct-to-consumer (D2C) app, browser extension, and website is a critical part of this mission. Our D2C platform helps millions of shoppers earn cashback from their everyday purchases—whether they’re unlocking grocery deals, earning bonus rewards, or planning their next trip. Through the Ibotta Performance Network (IPN), we also power white-label cashback programs for some of the biggest names in retail, including Walmart and Dollar General, helping over 2,600 brands reach more than 200 million consumers with digital offers across partner ecosystems.
Behind the scenes, our Data and Machine Learning teams power critical experiences like fraud detection, offer recommendation engines, and search relevance to make the Saver journey personalized and secure. As we continue to scale, we need data-driven, intelligent systems that support every interaction at every touchpoint.
Across D2C and the IPN, search plays a pivotal role in engagement and needs to keep pace with our business scale, evolving offer content, and changing Saver expectations.
In this post, we’ll walk through how we significantly refined our D2C search experience: from an ambitious hackathon project to a robust production feature now benefiting millions of Savers.
User search behavior has evolved from simple keywords to natural language, misspellings, and conversational phrases. Modern search systems must bridge the gap between what users type and what they actually mean, interpreting context and relationships to deliver relevant results even when query terms don’t exactly match the content.
At Ibotta, our original homegrown search system at times struggled to keep pace with the evolving expectations of our Savers, and we recognized an opportunity to refine it.
We saw two key areas of opportunity: helping the system keep better pace with changing offer content, search behaviors, and evolving Saver expectations, and increasing the value search delivers for both our Savers and our brand partners.
Addressing the limitations of our legacy search system required a focused effort. This initiative gained significant momentum during an internal hackathon where a cross-functional team, including members from Data, Engineering, Marketing Analytics, and Machine Learning, came together with the idea to build a modern, alternative search system using Databricks Vector Search, which some members had learned about at the Databricks Data + AI Summit.
In just three days, our team developed a working proof-of-concept, built around Databricks Vector Search, that delivered semantically relevant search results.
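As a rough illustration of the core flow (not our exact code), the Vector Search SDK makes a proof-of-concept like this quite compact. All catalog, endpoint, and model names below are hypothetical, assuming a Delta table of offers with an `offer_id` key and a `description` text column:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create a Delta Sync index that embeds offer text with a hosted embedding
# model and stays in sync with the source Delta table.
index = client.create_delta_sync_index(
    endpoint_name="offer-search-endpoint",          # hypothetical
    index_name="catalog.search.offer_index",        # hypothetical
    source_table_name="catalog.search.offers",      # hypothetical
    pipeline_type="TRIGGERED",
    primary_key="offer_id",
    embedding_source_column="description",
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# Query semantically: results are ranked by vector similarity, so a query
# like "snacks for kids" can match offers that never contain those words.
results = index.similarity_search(
    query_text="snacks for kids",
    columns=["offer_id", "description"],
    num_results=10,
)
```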
The hackathon project won first place and generated strong internal buy-in and momentum to transition the prototype into a production system. Over the course of a few months, and with close collaboration from the Databricks team, we transformed our prototype into a robust, full-fledged production search system.
Moving the hackathon proof-of-concept to a production-ready system required careful iteration and testing. This phase was critical not only for technical integration and performance tuning, but also for evaluating whether our anticipated system improvements would translate into positive changes in Saver behavior and engagement. Given search's essential role and deep integration across internal systems, we opted for the following approach: we modified a key internal service that called our original search system, replacing those calls with requests directed to the Databricks Vector Search endpoint, while building in robust, graceful fallbacks to the legacy system.
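In simplified form, that routing shim looked something like the sketch below; the endpoint, index, and `legacy_search` names are hypothetical stand-ins, and real error handling was more involved:

```python
import logging
from databricks.vector_search.client import VectorSearchClient

logger = logging.getLogger("search")

# Hypothetical handle to the Vector Search index from the earlier sketch.
index = VectorSearchClient().get_index(
    endpoint_name="offer-search-endpoint",
    index_name="catalog.search.offer_index",
)

def search_offers(query: str, num_results: int = 20) -> list:
    """Route a query to Vector Search, falling back to the legacy engine."""
    try:
        response = index.similarity_search(
            query_text=query,
            columns=["offer_id", "description"],
            num_results=num_results,
        )
        return response["result"]["data_array"]
    except Exception:
        # Any failure (timeout, endpoint error) degrades gracefully so that
        # Savers always get results.
        logger.exception("Vector Search failed; falling back to legacy")
        return legacy_search(query, num_results)  # hypothetical legacy client
```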
Most of our early work focused on understanding how the new system performed under real traffic and how Savers responded to its results.
In the first month, we ran a test with a small percentage of our Savers that did not achieve the engagement results we had hoped for: engagement decreased, particularly among our most active Savers, as indicated by a drop in clicks, unlocks (when Savers express interest in an offer), and activations.
However, the Vector Search solution offered significant benefits: the system’s underlying technical performance was strong, and its greater flexibility gave us the lever we needed to iteratively improve search result quality and overcome the disappointing engagement results.
Our initial test results made it clear that relying solely on A/B testing for search iterations was inefficient and impractical. The number of variables influencing search quality was immense, including embedding models, text combinations, hybrid search settings, Approximate Nearest Neighbors (ANN) thresholds, reranking options, and many more.
To navigate this complexity and accelerate our progress, we decided to establish a robust evaluation framework. This framework needed to be uniquely tailored to our specific business needs and capable of predicting real-world user engagement from offline performance metrics.
Our framework was designed around a synthetic evaluation environment that tracked over 50 online and offline metrics. Offline, we monitored standard information retrieval metrics like Mean Reciprocal Rank (MRR) and precision@k to measure relevance. Crucially, these were paired with real-world online engagement signals such as offer unlocks and click-through rates. A key decision was implementing an LLM-as-a-judge, which let us label data and assign quality scores to both online query-result pairs and offline outputs. This proved critical for rapid iteration on reliable metrics and for collecting the labeled data necessary for future model fine-tuning.
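To make the offline side concrete, here is a minimal sketch of the two retrieval metrics named above. The data shapes are illustrative, with the relevance labels assumed to come from the LLM judge:

```python
def reciprocal_rank(ranked_ids: list, relevant_ids: set) -> float:
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, offer_id in enumerate(ranked_ids, start=1):
        if offer_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    return sum(1 for offer_id in ranked_ids[:k] if offer_id in relevant_ids) / k

def mean_reciprocal_rank(runs: list) -> float:
    """MRR over an evaluation set of (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Example: one query whose second result was judged relevant.
runs = [(["offer_2", "offer_7", "offer_4"], {"offer_7"})]
assert mean_reciprocal_rank(runs) == 0.5
```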
Along the way, we leaned into multiple parts of the Databricks Data Intelligence Platform, including Vector Search itself and AI Gateway for secure access to third-party models.
This robust framework dramatically increased our iteration speed and confidence. We conducted over 30 distinct iterations, systematically testing major variable changes in our Vector Search solution, from embedding models and text combinations to hybrid search settings, ANN thresholds, and reranking options.
The evaluation framework transformed our development process, allowing us to make data-driven decisions rapidly and validate potential improvements with high confidence before exposing them to users.
Following the initial broad test that showed disappointing engagement results, we shifted our focus to exploring the performance of specific models identified as promising during our offline evaluation. We selected two third-party embedding models for production testing, accessed securely through AI Gateway. We conducted short-term, iterative tests in production (lasting a few days) with these models.
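Under the hood, a call to one of these gateway-governed endpoints looks roughly like the sketch below; the endpoint name is hypothetical, and the response shape assumes the OpenAI-style embeddings contract used by Databricks model serving:

```python
from mlflow.deployments import get_deploy_client

# Client for endpoints served on Databricks and governed by AI Gateway.
client = get_deploy_client("databricks")

response = client.predict(
    endpoint="third-party-embeddings",   # hypothetical endpoint name
    inputs={"input": ["organic greek yogurt"]},
)

# First (and only) embedding vector for the query text.
query_vector = response["data"][0]["embedding"]
```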
Pleased with the initial results, we proceeded to run a longer, more comprehensive production test comparing our leading third-party model and its optimized configuration against the legacy system. This test yielded mixed results. While we observed overall improvements in engagement metrics and successfully eliminated the negative impacts seen previously, these gains were modest—mostly single-digit percentage increases. These incremental benefits were not compelling enough to fully justify a complete replacement of our existing search experience.
More troubling, however, was the insight from our granular analysis: while performance improved significantly for certain search queries, others saw worse results than with our legacy solution. This inconsistency presented a significant architectural dilemma. We faced an unappealing choice: build a complex traffic-splitting system to route queries based on predicted performance, which would mean maintaining two distinct search experiences and a new, complex layer of rule-based routing, or simply accept the limitations.
This was a critical juncture. While we had seen enough promise to keep going, we needed more significant improvements to justify fully replacing our homegrown search system. This led us to begin fine-tuning.
While the third-party embedding models explored previously showed technical promise and modest improvements in engagement, they also presented critical limitations that were unacceptable for a long-term solution at Ibotta, chief among them their limited grasp of Ibotta-specific offer content and vocabulary.
The clear path forward was to fine-tune a model specifically tailored to Ibotta's data and the needs of our Savers. This was made possible thanks to the millions of labeled search interactions we had accumulated from real users via our LLM-as-a-judge process within our custom evaluation framework. This high-quality production data became our training gold.
We then embarked on a methodical fine-tuning process, leaning heavily on our offline evaluation framework at every step; a simplified sketch of what a training run looked like follows.
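This sketch uses the open-source sentence-transformers library as one plausible way to run such a job; the base model, example pairs, and hyperparameters are illustrative, not our production values:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Illustrative (query, offer_text) positives mined from LLM-judged traffic.
train_examples = [
    InputExample(texts=["greek yogurt", "Acme Non-Fat Greek Yogurt, 32 oz"]),
    InputExample(texts=["diapers", "ComfyFit Diapers, Size 4, 120 ct"]),
]

# Hypothetical base model; any strong open embedding model could serve here.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss treats the other offers in a batch as
# negatives, a good fit for positive-only labeled pairs.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("ibotta-offer-embedder")  # hypothetical output path
```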
After numerous iterations and evaluations within the framework, our top-performing fine-tuned model beat our best third-party baseline by 20% in synthetic evaluation. These compelling offline results provided the confidence needed to accelerate our next production test.
The technical rigor and iterative process paid off. We engineered a search solution specifically optimized for Ibotta's unique offer catalog and user behavior patterns, delivering results that exceeded our expectations and offered the flexibility needed to evolve alongside our business. Based on these strong results, we accelerated migration onto Databricks Vector Search as the foundation for our production search system.
In our final production test, using our own fine-tuned embedding model, we observed meaningful improvements in Saver engagement.
Beyond user-facing gains, the new system delivered on performance: search latency dropped by 60%, attributable to Vector Search query performance and the fine-tuned model’s lower overhead.
Leveraging the flexibility of this new foundation, we also built powerful enhancements like Query Transformation (enriching vague queries) and Multi-Search (fanning out generic terms). The combination of a highly relevant core model, improved system performance, and intelligent query enhancements has resulted in a search experience that is smarter, faster, and ultimately more rewarding.
One challenge with embedding models is their limited understanding of niche keywords, such as emerging brands. To address this, we built a query transformation layer that dynamically enriches search terms in-flight based on predefined rules.
For example, if a user searches for an emerging yogurt brand the embedding model might not recognize, we can transform the query to add "Greek yogurt" alongside the brand name before sending it to Vector Search. This provides the embedding model with necessary product context while preserving the original text for hybrid search.
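In spirit, the transformation layer behaves like this sketch; the rule entries are hypothetical stand-ins for the real, managed rule set:

```python
# Maps niche terms to category context the embedding model understands.
ENRICHMENT_RULES = {
    "acme yogurt": "Greek yogurt",       # hypothetical emerging brand
    "acme seltzer": "sparkling water",
}

def transform_query(query: str) -> str:
    """Append category context for niche terms while preserving the original
    text, so hybrid (keyword + vector) search can still match it exactly."""
    context = ENRICHMENT_RULES.get(query.strip().lower())
    return f"{query} {context}" if context else query

# transform_query("Acme Yogurt") -> "Acme Yogurt Greek yogurt"
```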
This capability also works hand-in-hand with our fine-tuning process. Successful transformations can be used to generate training data; for instance, including the original brand name as a query and the relevant yogurt products as positive results in a future training run helps the model learn these specific associations.
For broad, generic searches like "baby," Vector Search might initially return a limited number of candidates, potentially filtered down further by targeting and budget management. To address this and increase result diversity, we built a multi-search capability that fans out a single search term into multiple related searches.
Instead of just searching for "baby," our system automatically runs parallel searches for terms like "baby food," "baby clothing," "baby medicine," "baby diapers," and so on. Because of the low latency of Vector Search, we can execute several searches in parallel without increasing the overall response time to the user. This provides a much broader and more diverse set of relevant results for wide-ranging category searches.
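A condensed sketch of the fan-out logic, reusing the hypothetical index handle from the earlier sketches; the expansion table and merging logic are simplified illustrations:

```python
from concurrent.futures import ThreadPoolExecutor
from databricks.vector_search.client import VectorSearchClient

# Hypothetical names, as in the earlier sketches.
index = VectorSearchClient().get_index(
    endpoint_name="offer-search-endpoint",
    index_name="catalog.search.offer_index",
)

# Illustrative expansion table: a generic term fans out into related searches.
FANOUT = {
    "baby": ["baby food", "baby clothing", "baby medicine", "baby diapers"],
}

def run_search(query: str, num_results: int = 10) -> list:
    response = index.similarity_search(
        query_text=query,
        columns=["offer_id", "description"],
        num_results=num_results,
    )
    return response["result"]["data_array"]

def multi_search(query: str) -> list:
    sub_queries = FANOUT.get(query, [query])
    # Vector Search is fast enough that parallel sub-queries add little
    # latency over a single call.
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        batches = pool.map(run_search, sub_queries)
    # Merge and de-duplicate by offer_id (the first requested column).
    seen, merged = set(), []
    for batch in batches:
        for row in batch:
            if row[0] not in seen:
                seen.add(row[0])
                merged.append(row)
    return merged
```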
Following the successful final production test, we fully rolled out Databricks Vector Search to our user base, delivering positive engagement results, increased flexibility, and powerful search tools like Query Transformation and Multi-Search. The journey also yielded several valuable lessons: start with a small, focused prototype to prove value quickly; invest early in an evaluation framework that can predict online engagement from offline metrics; and when generic models fall short, fine-tune on your own data.
With our fine-tuned embedding model now live across all D2C channels, we next plan to explore scaling this solution to the IPN, bringing improved offer discovery to millions more shoppers across our publisher network. As we continue to collect labeled data and refine our models through Databricks, we believe we are well positioned to evolve the search experience alongside the needs of our partners and the expectations of their customers.
This journey from a hackathon project to a production system proved that reimagining a core product experience rapidly is achievable with the right tools and support. Databricks was instrumental in helping us move fast, fine-tune effectively, and ultimately, make every search more rewarding for our Savers.