
Denny’s top session picks for Data + AI Summit 2025

Choose from 700+ technical sessions by data and AI experts, open source contributors and researchers


Published: May 19, 2025

Events · 5 min read

Summary

  • Choose from 700+ technical sessions at Data + AI Summit 2025, including a huge selection on open source.
  • Explore the latest advances in Delta Lake, Apache Iceberg™, agentic systems, MLflow, Apache Spark™, Unity Catalog, DLT, DSPy, LangChain, PyTorch, dbt, Trino and Databricks.
  • Flash Sale — one week only, with 50% off training through May 23. Courses are now just $212.50 each. Use code TRNGTM0Q at checkout.

Data + AI Summit 2025 is just a few weeks away! This year, we’re offering our largest selection of sessions ever, with more than 700 to choose from. Register to join us in person in San Francisco or virtually.

With a career rooted in open source, I’ve seen firsthand how open technologies and formats are increasingly central to enterprise strategy. As a long-time contributor to Apache Spark™ and MLflow, a maintainer and committer for Delta Lake and Unity Catalog, and most recently a contributor to Apache Iceberg™, I’ve had the privilege of working alongside some of the brightest minds in the industry.

For this year’s sessions, I’m focusing on the intersection of open source and AI, with a particular interest in multimodal AI. Specifically, how open table formats like Delta Lake and Iceberg, combined with unified governance through Unity Catalog, are powering the next wave of real-time, trustworthy AI and analytics.

My Top Picks

The upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark™ has long been recognized as the leading open source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both massive cluster deployments and local laptop development. Come listen and bring your questions for:

  • Xiao Li, an Engineering Director at Databricks, Apache Spark committer and PMC member.
  • DB Tsai, an engineering leader on the Databricks Spark team, Apache Spark committer and PMC member.

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

Geospatial data is becoming increasingly important for lakehouse formats. Learn from Jia Yu, Co-founder and Chief Architect of Wherobots Inc., and Szehon Ho, Software Engineer at Databricks, about the latest developments around geospatial data types in Apache Iceberg™.
 

Let's Save Tons of Money with Cloud-native Data Ingestion!

R. Tyler Croy from Scribd, Delta Lake maintainer, and shepherd of delta-rs since its inception, will dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose, and more. By using off-the-shelf open source tools like kafka-delta-ingest, oxbow, and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly: cheaper. No jobs needed!

This session will dig into the value props of a lakehouse architecture and the cost efficiencies of the Rust/Arrow/Python ecosystems.
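The event-driven pattern at the heart of this architecture can be sketched in a few lines: rather than running scheduled jobs, you append small batches to the table as messages arrive. Here is a minimal stdlib-only sketch; the in-memory queue and list are illustrative stand-ins for an SQS/Kinesis subscription and a Delta table (which tools like kafka-delta-ingest or the deltalake package would target), not Scribd’s actual code:

```python
import queue

# Stand-in for an SQS/Kinesis subscription: events arrive on a queue.
events = queue.Queue()
for i in range(7):
    events.put({"user_id": i, "action": "read"})

def drain_batch(q, max_batch=3):
    """Pull up to max_batch events without blocking -- the event-driven
    core: we write whenever messages arrive, no scheduled job required."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

# Stand-in for appending each batch to a Delta table as one commit.
table = []
while True:
    batch = drain_batch(events)
    if not batch:
        break
    table.append(batch)

print(len(table), sum(len(b) for b in table))  # 3 commits, 7 rows
```

The cost savings follow from the same shape: small, frequent, cheap appends driven by events replace large always-on batch jobs.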

 

Daft and Unity Catalog: a multimodal/AI-native lakehouse

Multimodal AI will fundamentally change the landscape as data is more than just tables. Workflows now often involve documents, images, audio, video, embeddings, URLs and more.

This session from Jay Chia, Co-founder of Eventual, will show how Daft, a popular multimodal data framework, works with Unity Catalog to unify authentication, authorization and data lineage, providing a holistic view of governance.
 

Bridging Big Data and AI: Empowering PySpark with Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, but the rise of multimodal AI and vector search introduces challenges beyond its traditional capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multimodal Lance format.

This session will dive into how the Lance format works and why it is an important component of multimodal AI data pipelines. Allison Wang, Apache Spark™ committer, and Li Qiu, LanceDB Database Engineer and Alluxio PMC member, will show how combining Apache Spark (PySpark) and LanceDB lets you advance multimodal AI data pipelines.
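To make the vector-search side concrete, here is a brute-force cosine-similarity search in plain Python over a handful of toy multimodal rows. This is the kind of query a format like Lance is built to serve efficiently at scale; the tiny in-memory "index" and row schema here are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy multimodal rows: an id, a modality tag, and an embedding vector.
rows = [
    ("img_001", "image", [0.9, 0.1, 0.0]),
    ("doc_017", "text",  [0.1, 0.9, 0.2]),
    ("aud_042", "audio", [0.0, 0.2, 0.9]),
]

def nearest(query, rows, k=1):
    """Brute-force top-k by cosine similarity -- O(n) per query; real
    vector indexes (as in Lance/LanceDB) avoid scanning every row."""
    scored = sorted(rows, key=lambda r: cosine(query, r[2]), reverse=True)
    return [r[0] for r in scored[:k]]

print(nearest([1.0, 0.0, 0.0], rows))  # ['img_001'] -- the image row
```

The point of pairing this with Spark is distribution: the scan-and-score step parallelizes naturally across partitions of a large table.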
 

Streamlining DSPy Development: Track, Debug and Deploy with MLflow

Chen Qian, Senior Software Engineer at Databricks, will show how to integrate MLflow with DSPy to bring full observability to your DSPy development.

You’ll see how to track DSPy module calls, evaluations and optimizers using MLflow’s tracing and autologging capabilities. Combining these two tools makes it easier to debug, iterate on and understand your DSPy workflows, and then deploy your DSPy program end to end.
 

From Code Completion to Autonomous Software Engineering Agents

Kilian Lieret, Research Software Engineer at Princeton University, was recently a guest on the Data Brew videocast for a fascinating discussion on new tools for evaluation and enhancing AI in software engineering.

This session extends that conversation: Kilian will dig into SWE-bench (a benchmarking tool) and SWE-agent (an agent framework), the current frontier of agentic AI for developers, and how to experiment with AI agents.
 

Composing high-accuracy AI systems with SLMs and mini-agents

The always-amazing Sharon Zhou, CEO and Founder of Lamini, discusses how to use small language models (SLMs) and mini-agents to reduce hallucinations using Mixture of Memory Experts (i.e., MoME knows best)!

Learn more about MoME in this fun Data Brew by Databricks episode featuring Sharon: Mixture of Memory Experts.
 

Beyond the Tradeoff: Differential Privacy in Tabular Data Synthesis

Differential privacy is an important tool for providing mathematical guarantees around protecting the privacy of the individuals behind the data. This talk by Lipika Ramaswamy of Gretel.ai (now part of NVIDIA) explores using Gretel Navigator to generate differentially private synthetic data that maintains high fidelity to the source data and high utility on downstream tasks across heterogeneous datasets.
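Those mathematical guarantees come from mechanisms like Laplace noise calibrated to a query’s sensitivity. A minimal sketch for an epsilon-DP counting query, where the toy ages, epsilon value and fixed seed are purely illustrative (Gretel Navigator’s actual DP synthesis trains generative models under a privacy budget, which is far more involved):

```python
import math
import random

def dp_count(values, predicate, epsilon, rng):
    """Differentially private count. A counting query has sensitivity 1
    (adding or removing one person changes the count by at most 1), so
    adding Laplace(1/epsilon) noise yields an epsilon-DP answer."""
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace(0, 1/epsilon) via the inverse CDF of a uniform draw.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

rng = random.Random(0)  # fixed seed so the sketch is reproducible
ages = [23, 35, 41, 29, 62, 55, 38, 47]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(round(noisy, 2))  # close to the true count of 4, but noisy
```

Smaller epsilon means larger noise and stronger privacy; the talk’s "beyond the tradeoff" framing is about keeping utility high even as that budget tightens.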

Building Knowledge Agents to Automate Document Workflows

One of the biggest promises of LLM agents is automating knowledge work over unstructured data, what we call "knowledge agents." Jerry Liu, Founder of LlamaIndex, dives into how to create knowledge agents that automate document workflows, showing how a process that can be complex to implement becomes a simple, streamlined flow for a fundamental business process.

Honorable Mentions!

Building AI Models In Health Care using Semi-Synthetic Data: Holden Karau, Co-founder of Fight Health Insurance Inc., on how to fight the healthcare paperwork deluge using AI.

The Hitchhiker's Guide to Delta Lake Streaming in an Agentic Universe: Scott Haines, Distinguished Software Engineer at Nike, on how a strong foundation around Delta Lake (and Lakehouses in general) with streaming is fundamental for the push into agentic systems.

AMAs

Simon + Denny - Unfiltered & Unscripted: Simon Whiteley and I are back together, so come with your questions; we hope to have answers!

Apache Spark AMA: Come with your questions around Apache Spark™ - we’ve got answers!

Rust and Lakehouse Format AMAs: As a Rustacean, I’d love to dive into lakehouse formats like Apache Iceberg™ and Delta Lake and how they are helping create the next wave of data processing engines.

I hope to see you in San Francisco. Register now and don’t miss these sessions, plus many more!
