Databricks is proud to be a platinum sponsor of NeurIPS 2025. The conference runs from December 2 to 7 in San Diego, California.
Stop by booth #1619 in the Expo Hall from December 2 to 5 to meet members of our research, engineering, and recruiting teams and learn about our latest work and open roles.
Poster Session
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
Sam Havens, Michael Carbin, Andrew Drozdov, Nandan Thakur
FreshStack is a new, end-to-end framework for automatically generating modern, realistic information retrieval benchmarks. It builds evaluation datasets by collecting up-to-date technical corpora, extracting fine-grained nuggets from real community Q&A, and testing retrieval quality using a fusion of retrieval methods. Across five fast-moving technical domains, baseline retrieval models perform far below oracle systems, revealing substantial headroom for improving IR and RAG pipelines. FreshStack also uncovers cases where reranking provides no lift and where oracle context dramatically boosts LLM answer quality.
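The summary above mentions fusing multiple retrieval methods to build a strong reference ranking. As an illustration only (not the paper's implementation), here is a minimal sketch of reciprocal rank fusion, one common way to combine ranked lists from different retrievers; the retriever outputs and document IDs below are hypothetical.

```python
# Illustrative sketch: reciprocal rank fusion (RRF), one standard way to
# combine rankings from multiple retrieval methods. FreshStack's exact
# fusion procedure is described in the paper; this is not its implementation.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first by one retriever
    k: smoothing constant from the original RRF formulation
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: rankings from a lexical and a dense retriever
bm25_ranking = ["doc_3", "doc_1", "doc_7"]
dense_ranking = ["doc_1", "doc_5", "doc_3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```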
Workshop
Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs
Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin
This work examines how retrieval quality improves as LLMs scale in size, training duration, and total pretraining FLOPs. Evaluating models from 125M to 7B parameters trained on 1B–2T tokens, the study shows that zero-shot BEIR retrieval performance follows predictable scaling trends: larger and longer-trained models retrieve better. The results also reveal a strong correlation between retrieval accuracy and in-context learning ability, suggesting shared underlying mechanisms, and offer practical guidance for designing and training the next generation of LLM-based retrievers.
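To make the idea of a scaling trend concrete, here is a minimal sketch of fitting a power-law relationship between pretraining compute and a retrieval metric. The FLOP counts and nDCG values are made up for demonstration and are not results from the paper.

```python
# Illustrative only: fitting a simple power-law trend, score ~ a * FLOPs^b,
# the kind of scaling relationship the paper studies. All numbers below are
# hypothetical and do not come from the paper.
import numpy as np

flops = np.array([1e19, 1e20, 1e21, 1e22])   # hypothetical pretraining FLOPs
ndcg = np.array([0.18, 0.24, 0.31, 0.37])    # hypothetical BEIR nDCG@10 scores

# Fit log(score) = log(a) + b * log(FLOPs)
b, log_a = np.polyfit(np.log(flops), np.log(ndcg), 1)
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3e}")

# Extrapolate to a larger (hypothetical) compute budget
print("predicted nDCG@10 at 1e23 FLOPs:", np.exp(log_a) * (1e23 ** b))
```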
Sponsor Talk
Databricks presents a new IDP Benchmark
Erich Elsen
Exhibit Hall A/B, Tue 2 Dec, 5:15–5:27 p.m. PST
Most business and enterprise documents still exist for humans first and machines second. One of our goals at Databricks is to make this human-centered data "legible" to AI and Agents, so that we can extract insights and even take actions based on them. But AI can still struggle to understand the full range of messy, unstructured documents we produce for each other. We've created a benchmark, PARQA, that probes the limits of current AI systems in analyzing a large (100,000-page) public dataset. A single, non-expert human can answer the questions with ~100% accuracy, while the best non-Databricks systems hover around 30%. We present both the benchmark and our Agent system, which significantly outperforms other Agents.
Join us for an evening of connections, conversations, and community during NeurIPS 2025. Over drinks and appetizers, connect with fellow attendees while meeting our Research and Engineering teams. Register here!
Please note that, given limited capacity, guest registrations will be placed on a waitlist and approved on a rolling basis. Thank you for your patience and understanding.
Are you interested in working with us? We’re hiring! Check out our open jobs and join our growing team.
