Connecting the Dots with DataHub: Lakehouse and Beyond
- Data Engineering
- Moscone South | Level 2 | 202
- 35 min
You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka, or data ingestion systems push bad data into the lakehouse? Can you proactively prevent bad data from affecting your business? How can you use automation to turn raw data assets into well-maintained data products (with clear ownership, documentation, and sensitivity classification) without requiring people to do redundant work across operational, ingestion, and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features, and dashboards) within and across your production services, ingestion pipelines, and data lakehouse? Data engineers struggle with data quality and data governance issues that constantly interrupt their day and limit their impact on the business.
In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models, and features. The talk will cover how DataHub automatically extracts lineage from Spark and schema and statistics from Delta Lake, as well as shift-left strategies for developer-led governance.