Streaming Schema Drift Discovery and Controlled Mitigation
Overview
When building streaming workloads on Databricks, it can be difficult to capture and understand the current structure of your source data. For example, what happens when you are ingesting JSON events from a vendor and the keys are sparsely populated or contain dynamic content? Ideally, data engineers want to "lock in" a target schema to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from the schema of your established Delta table. But how?
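A minimal sketch of one way to quantify that drift is shown below, assuming a bronze Delta table `events_bronze` that keeps each raw payload in a string column `value` and a curated table `events_silver` (all table and column names here are hypothetical, not taken from the session):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Top-level keys actually observed in the raw JSON payloads, with frequencies.
observed_keys = (
    spark.table("events_bronze")
    .select(F.explode(F.expr("json_object_keys(value)")).alias("key"))
    .groupBy("key")
    .count()
)

# Columns already "locked in" on the curated Delta table.
target_columns = [f.name for f in spark.table("events_silver").schema.fields]

# Drift: observed keys that have no corresponding column yet, most frequent first.
drift = observed_keys.filter(~F.col("key").isin(target_columns))
drift.orderBy(F.desc("count")).show(truncate=False)
```

The key frequencies give a rough signal for which missing keys are common enough to be worth promoting and which are one-off noise.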
This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is, "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to perform that schema migration with minimal job downtime.
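As a rough illustration only, and not necessarily the approach demonstrated in the session, promoting a newly discovered key into a first-class column could look like the sketch below, reusing the hypothetical `events_bronze`/`events_silver` tables and a hypothetical `device_type` key:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Evolve the target table's schema up front so the stream restart is the only downtime.
spark.sql("ALTER TABLE events_silver ADD COLUMNS (device_type STRING)")

# Restart the stream with the promoted key added to the projection.
(
    spark.readStream.table("events_bronze")
    .select(
        F.col("event_id"),    # hypothetical existing columns
        F.col("event_time"),
        F.get_json_object("value", "$.device_type").alias("device_type"),
    )
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events_silver")  # hypothetical path
    .option("mergeSchema", "true")  # tolerate further additive schema changes
    .toTable("events_silver")
)
```

Because the column is added to the Delta table before the stream is restarted, existing rows simply carry NULL for the new column and the only downtime is the restart itself.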
Type
- Breakout
Experience
- In Person
Track
- Data Streaming, Databricks Experience (DBX)
Industry
- Professional Services
Difficulty
- Intermediate
Duration
- 40 min
Session Speakers
Alexander Vanadio
Principal Consultant
Optiv