Session

From Streaming to Search: How Exa Uses Lance and Apache Spark for High-Throughput AI Workloads

Overview

Experience: In Person
Track: Data Engineering & Streaming
Industry: Enterprise Technology
Technologies: Lakeflow
Skill Level: Advanced
AI-native applications require data systems that combine large-scale distributed processing with fast multimodal access. This talk explores how Exa uses Lance and Spark Structured Streaming to power search and AI workloads.

We introduce Lance, an open lakehouse format optimized for vectors, multimodal data, and fast random access, and show how its Spark connector enables scalable ETL, streaming ingestion, analytics, and native vector and full-text search.

We then present Exa’s streaming architecture for processing large volumes of crawled web data. Using Spark Structured Streaming with Delta and writing enriched outputs into Lance, the pipeline performs local and global deduplication, generates embeddings, and sustains ~10k rows per second into Lance tables that power downstream vector search databases.

We conclude with lessons learned integrating Lance and Spark to unify analytics, training, and semantic retrieval within an open architecture.
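
The abstract does not define "local and global deduplication", so the following is only an illustrative sketch under one plausible reading: "local" means dropping duplicates within a single micro-batch, and "global" means dropping rows whose content was already seen in earlier batches. The function names (`content_key`, `dedupe_batch`), the row shape, and the in-memory `seen_keys` set are all hypothetical; a production pipeline would keep the global state in a durable store rather than a Python set, and nothing here reflects Exa's actual implementation.

```python
import hashlib


def content_key(text: str) -> str:
    """Hash of whitespace- and case-normalized text (hypothetical dedup key)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_batch(rows, seen_keys):
    """Return rows that are new both within this batch and across batches.

    rows: list of dicts with a "text" field (assumed shape).
    seen_keys: mutable set of keys from earlier batches (assumed global state).
    """
    local_keys = set()
    kept = []
    for row in rows:
        key = content_key(row["text"])
        if key in local_keys:
            # Local dedup: duplicate inside the current micro-batch.
            continue
        local_keys.add(key)
        if key in seen_keys:
            # Global dedup: content already emitted by an earlier batch.
            continue
        kept.append(row)
    # Record this batch's keys so later batches can be checked against them.
    seen_keys.update(local_keys)
    return kept


seen = set()
batch_1 = [
    {"url": "a", "text": "Hello  World"},
    {"url": "b", "text": "hello world"},
    {"url": "c", "text": "other page"},
]
batch_2 = [
    {"url": "d", "text": "HELLO world"},
    {"url": "e", "text": "new page"},
]
out_1 = dedupe_batch(batch_1, seen)
out_2 = dedupe_batch(batch_2, seen)
print([r["url"] for r in out_1], [r["url"] for r in out_2])
```

In this reading, the local pass is cheap because it only touches the current batch, while the global pass is the part that needs shared state across the stream.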

Session Speakers

Jack Ye

Software Engineer
LanceDB

Jan van der Vegt

ML Engineer
Exa AI