How Robinhood Built a Streaming Lakehouse to Bring Data Freshness from 24h to Less Than 15 Mins
- Data Lakes, Data Warehouses and Data Lakehouses
- Financial Services
- Moscone South | Level 2 | 215
- 35 min
Robinhood’s mission is to democratize finance for all. Continuous data analysis and data-driven decision-making are fundamental to achieving this. The data required for analysis comes from varied sources: OLTP databases, event streams, and third-party sources. A reliable lakehouse with an interoperable data ecosystem and a fast data ingestion service is needed to power reporting and business-critical pipelines and dashboards.
In this talk, we will describe the evolution of the big data ecosystem at Robinhood, not only in terms of the scale of data stored and queries served, but also the use cases it supports. We go in depth into the lakehouse and the data ingestion services we built with open source tools to reduce data freshness latency for our core datasets from one day to under 15 minutes. We will also describe the limitations of the big-batch ingestion model and the lessons we learned operating incremental ingestion pipelines at massive scale.