From "Hidden Costs" to "High Efficiency": Scaling DV's Lakehouse Observability
Overview
| Experience | In Person |
|---|---|
| Track | Governance & Security |
| Industry | Communications - Media & Entertainment |
| Technologies | Unity Catalog |
| Skill Level | Intermediate |
A practical guide to the internal monitoring tools we built on Databricks System Tables to make our data platform more efficient and performant. Hear directly from our Data Platform Engineering team about what worked for us and what we learned along the way.
We’ll dive into the backend strategies that transformed our platform, including:
- Dashboards We Actually Look At: Tracking real costs and table/column sizes.
- Practical Alerts for Real Problems: Identifying costly unused tables, tables with missing or excessive retention policies, long-running queries, and constantly failing jobs.
- The "Orphan" Hunt: Programmatically identifying unreferenced files and un-vacuumed data by auditing Delta logs against cloud storage.
- The Cost/Read Metric: Using recursive lineage queries to expose expensive, high-maintenance tables with little or no downstream value.
- Footer Analysis: Scanning Parquet footers at scale to isolate metadata bloat from actual data, even in complex nested types.
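The "orphan" hunt above boils down to a set comparison: replay the Delta transaction log to learn which files are live and which are tombstoned, then diff that against what the storage bucket actually contains. A minimal sketch of the idea, with illustrative log entries and file names (in practice the entries would come from `_delta_log/*.json` and the listing from the cloud storage API):

```python
# Hypothetical sketch: audit a Delta log against a storage listing.
# Log entries and paths below are made up for illustration.
delta_log_entries = [
    {"add": {"path": "part-0001.parquet"}},
    {"add": {"path": "part-0002.parquet"}},
    {"remove": {"path": "part-0001.parquet"}},  # logically deleted, awaiting VACUUM
]
storage_listing = {"part-0001.parquet", "part-0002.parquet", "part-0003.parquet"}

live, removed = set(), set()
for entry in delta_log_entries:
    if "add" in entry:
        live.add(entry["add"]["path"])
        removed.discard(entry["add"]["path"])
    if "remove" in entry:
        removed.add(entry["remove"]["path"])
        live.discard(entry["remove"]["path"])

un_vacuumed = storage_listing & removed     # tombstoned but still on disk
orphans = storage_listing - live - removed  # never referenced by the log at all
print("un-vacuumed:", sorted(un_vacuumed))
print("orphans:", sorted(orphans))
```

The split matters operationally: un-vacuumed files are fixed by running VACUUM with an appropriate retention window, while true orphans (e.g. leftovers from failed writes) need a separate cleanup path.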
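The cost/read metric can be pictured as a graph walk: a table's value is not just its own reads but the reads of everything downstream of it in the lineage graph. A minimal sketch with made-up table names, costs, and read counts (in practice these would come from Databricks system tables such as lineage and billing logs):

```python
from collections import defaultdict, deque

# Hypothetical sketch: cost per downstream read over a lineage graph.
# Edges, costs, and read counts below are illustrative only.
lineage = [("raw.events", "silver.sessions"), ("silver.sessions", "gold.kpis")]
cost_usd = {"raw.events": 120.0, "silver.sessions": 45.0, "gold.kpis": 5.0}
reads = {"raw.events": 0, "silver.sessions": 2, "gold.kpis": 300}

downstream = defaultdict(list)
for src, dst in lineage:
    downstream[src].append(dst)

def total_reads(table):
    """Reads of the table plus all tables reachable downstream of it."""
    seen, queue, total = {table}, deque([table]), 0
    while queue:
        t = queue.popleft()
        total += reads.get(t, 0)
        for d in downstream[t]:
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return total

for t in cost_usd:
    r = total_reads(t)
    ratio = cost_usd[t] / r if r else float("inf")
    print(f"{t}: cost=${cost_usd[t]:.0f} downstream_reads={r} cost/read={ratio:.2f}")
```

A table with high cost and few transitive reads surfaces to the top of this ranking, which is exactly the "expensive, little downstream value" signal the metric is after.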
Session Speakers
Saul Tawil
Senior Data Engineer
DoubleVerify