Databricks at NeurIPS 2025

Explore Databricks’ presence at NeurIPS 2025

Published: December 1, 2025

Mosaic Research · 3 min read

Summary

  • Databricks is a platinum sponsor at NeurIPS 2025
  • Visit our booth to meet members of the research and engineering team
  • A review of our accepted publications and presentations

Databricks is proud to be a platinum sponsor of NeurIPS 2025. The conference runs from December 2 to 7 in San Diego, California.

Visit our Booth 

Stop by booth #1619 in the Expo Hall from December 2 to 5 to meet members of our research, engineering, and recruiting teams and learn about our latest work and open roles. 

Databricks at NeurIPS

Poster Session

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Sam Havens, Michael Carbin, Andrew Drozdov, Nandan Thakur

FreshStack is a new, end-to-end framework for automatically generating modern, realistic information retrieval benchmarks. It builds evaluation datasets by collecting up-to-date technical corpora, extracting fine-grained nuggets from real community Q&A, and testing retrieval quality using a fusion of retrieval methods. Across five fast-moving technical domains, baseline retrieval models perform far below oracle systems—revealing substantial headroom for improving IR and RAG pipelines. FreshStack also uncovers cases where reranking provides no lift and where oracle context dramatically boosts LLM answer quality.

Workshop

Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin

This work examines how retrieval quality improves as LLMs scale in size, training duration, and total pretraining FLOPs. Evaluating models from 125M to 7B parameters trained on 1B–2T tokens, the study shows that zero-shot BEIR retrieval performance follows predictable scaling trends: larger and longer-trained models retrieve better. The results also reveal a strong correlation between retrieval accuracy and in-context learning capabilities, suggesting shared underlying mechanisms. These findings offer practical guidance for designing and training next-generation LLM-based retrievers.

Sponsor Talk

Databricks presents a new IDP Benchmark 

Erich Elsen

Exhibit Hall A/B, Tuesday, December 2, 5:15–5:27 p.m. PST

Most business and enterprise documents still exist for humans first and machines second. One of our goals at Databricks is to make this human-centered data "legible" to AI and agents, so they can surface insights and even act on them. But AI still struggles to understand the full range of messy, unstructured documents we produce for each other. We've created a benchmark, PARQA, that probes the limits of current AI systems on a large (100,000-page) public dataset. A single, non-expert human can answer the questions with ~100% accuracy, while the best non-Databricks systems hover around 30%. We present both the benchmark and our agent system, which significantly outperforms other agents.

Networking Event 

Join us for an evening of connections, conversations, and community during NeurIPS 2025. Connect with fellow attendees over drinks and appetizers, and meet members of our research and engineering teams. Register here!

Please note that, given limited capacity, guest registrations will be placed on a waitlist and approved on a rolling basis. Thank you for your patience and understanding.

Join our Team

Are you interested in working with us? We’re hiring! Check out our open jobs and join our growing team.
