Session
From Streaming to Search: How Exa Uses Lance and Apache Spark for High-Throughput AI Workloads
Overview
| Experience | In Person |
|---|---|
| Track | Data Engineering & Streaming |
| Industry | Enterprise Technology |
| Technologies | Lakeflow |
| Skill Level | Advanced |
AI-native applications require data systems that combine large-scale distributed processing with fast multimodal access. This talk explores how Exa uses Lance and Spark Structured Streaming to power search and AI workloads.

We introduce Lance, an open lakehouse format optimized for vectors, multimodal data, and fast random access, and show how its Spark connector enables scalable ETL, streaming ingestion, analytics, and native vector and full-text search.

We then present Exa’s streaming architecture for processing large volumes of crawled web data. Using Spark Structured Streaming with Delta and writing enriched outputs into Lance, the pipeline performs local and global deduplication, generates embeddings, and sustains ~10k rows per second into Lance tables that power downstream vector search databases.

We conclude with lessons learned integrating Lance and Spark to unify analytics, training, and semantic retrieval within an open architecture.
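
The abstract does not say how the local and global deduplication steps are implemented. As a rough illustration of the distinction only, the sketch below uses plain Python rather than Spark or Lance: "local" removes repeats within one micro-batch, and "global" removes rows already emitted by an earlier batch. The row shape, the content-hash key, and every function name here are assumptions made for the example, not Exa's actual design.

```python
import hashlib
from typing import Dict, Iterable, List, Set

Row = Dict[str, str]  # hypothetical row shape: {"url": ..., "content": ...}


def content_key(row: Row) -> str:
    """Hypothetical dedup key: a SHA-256 hash of the page content."""
    return hashlib.sha256(row["content"].encode("utf-8")).hexdigest()


def dedupe_local(batch: Iterable[Row]) -> List[Row]:
    """Local dedup: keep the first row per key within a single micro-batch."""
    seen: Set[str] = set()
    kept: List[Row] = []
    for row in batch:
        key = content_key(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def dedupe_global(batch: Iterable[Row], seen_keys: Set[str]) -> List[Row]:
    """Global dedup: drop rows whose key was emitted by any earlier batch.

    `seen_keys` stands in for persistent state shared across batches.
    """
    kept: List[Row] = []
    for row in batch:
        key = content_key(row)
        if key not in seen_keys:
            seen_keys.add(key)
            kept.append(row)
    return kept


def process_stream(batches: Iterable[List[Row]]) -> List[Row]:
    """Run local then global dedup over a sequence of micro-batches."""
    seen_keys: Set[str] = set()
    emitted: List[Row] = []
    for batch in batches:
        emitted.extend(dedupe_global(dedupe_local(batch), seen_keys))
    return emitted


if __name__ == "__main__":
    batches = [
        [{"url": "a", "content": "x"}, {"url": "b", "content": "x"}],
        [{"url": "c", "content": "x"}, {"url": "d", "content": "y"}],
    ]
    print([row["url"] for row in process_stream(batches)])
```

Running the local pass first keeps the number of lookups against the shared state small. In a production pipeline that state would need to be durable and bounded, which this in-memory set is not.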
Session Speakers
Jack Ye, Software Engineer, LanceDB
Jan van der Vegt, ML Engineer, Exa AI