1000× Faster Retrieval: Indexed and Full-Text Search in the Lakehouse
Overview
| Field | Details |
|---|---|
| Experience | In Person |
| Track | Cybersecurity |
| Industry | Enterprise Technology, Consulting & Services, Financial Services |
| Technologies | Unity Catalog |
| Skill Level | Intermediate |
Delta Lake and Iceberg excel at large-scale analytics, but they are not optimized for sub-second point lookups or full-text search. In operational workloads such as security analytics, log investigation, and telemetry, retrieval queries that require full scans or high-cardinality filtering can take minutes to hours when the use case demands seconds or milliseconds.
I built IndexTables, an embedded, Tantivy-based indexing layer that runs directly inside Spark executors, enabling up to 1000× faster query performance while preserving the Lakehouse model. This session explores the architecture: object-storage-hosted indexes with ACID transactions, millisecond-latency aggregations over billions of rows, native time-series bucketing for efficient GROUP BY analytics, and NVMe-backed caching with proactive pre-warming.
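To make the contrast between a full scan and an indexed lookup concrete, here is a minimal toy sketch in plain Python. It is not the IndexTables or Tantivy API, and it does not reflect how the session's implementation works internally. The sample records and the names `LOGS`, `full_scan`, `build_inverted_index`, `search`, and `bucket_counts` are all hypothetical and exist only for illustration. The sketch assumes whitespace tokenization and an in-memory index, and shows the general idea of an inverted index feeding a time-bucketed count.

```python
from collections import defaultdict

# Hypothetical log records: (epoch_seconds, message). Illustrative only.
LOGS = [
    (0, "login failed for user alice"),
    (30, "login ok for user bob"),
    (65, "disk quota warning on host db1"),
    (130, "login failed for user carol"),
]

def full_scan(logs, term):
    """Baseline: inspect every row, so cost grows with table size."""
    return [i for i, (_, msg) in enumerate(logs) if term in msg.split()]

def build_inverted_index(logs):
    """Map each token to the list of row ids whose message contains it."""
    index = defaultdict(list)
    for row_id, (_, msg) in enumerate(logs):
        for token in set(msg.split()):
            index[token].append(row_id)
    return index

def search(index, *terms):
    """AND query: intersect posting lists instead of scanning rows."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def bucket_counts(logs, row_ids, bucket_seconds):
    """Count matching rows per fixed-width time bucket (GROUP BY style)."""
    counts = defaultdict(int)
    for row_id in row_ids:
        ts = logs[row_id][0]
        counts[ts - ts % bucket_seconds] += 1
    return dict(sorted(counts.items()))

index = build_inverted_index(LOGS)
hits = search(index, "login", "failed")
print(hits)
print(bucket_counts(LOGS, hits, 60))
```

The indexed query touches only the posting lists for the requested terms, while the scan reads every message. A production system would also persist the index, handle updates transactionally, and cache hot segments, none of which this sketch attempts.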
Attendees will learn how Capital One is using IndexTables to complement Delta and Iceberg, creating Lakehouse architectures that support both analytical and retrieval-heavy workloads.
Session Speakers
Scott Schenkein
VP, Distinguished Engineer
Capital One Financial