As a data-driven company, we use machine learning algorithms and A/B tests to drive all of the content recommendations for our members. To improve the quality of our personalized recommendations, we first try an idea offline using historical data. Ideas that improve our offline metrics are then pushed as A/B tests, which are evaluated on statistically significant improvements in core metrics such as member engagement, satisfaction, and retention. At the heart of such offline analyses is the historical fact data used to generate the features required by the machine learning model. Examples of facts include a member's viewing history and the videos in their My List.
Building a fact store at ever-evolving Netflix scale is nontrivial. Two important requirements are capturing enough fact data to cover the stratification needs of various experiments and guaranteeing that the data we serve is temporally accurate. In this talk, we will present the key requirements, the evolution of our fact store design, its implementation, its scale, and our learnings.
We will also take a deep dive into fact vs. feature logging, design tradeoffs, infrastructure performance, reliability, and the query API for the store. We use Spark and Scala extensively, along with a variety of compression techniques, to store and retrieve data efficiently.
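To illustrate fact logging versus feature logging and the temporal accuracy requirement mentioned above, here is a minimal Python sketch. It is not the Netflix implementation (the system described in the talk is built on Spark and Scala); the `FactStore` class, its `log` and `as_of` methods, and the `viewing_count_feature` function are hypothetical names invented for illustration. The idea: raw facts are logged with timestamps, and features are derived later from the fact values as they stood at a chosen time, so a new feature idea can be tried against historical data without using information from the future.

```python
from bisect import bisect_right
from collections import defaultdict


class FactStore:
    """Toy in-memory store of timestamped fact snapshots (hypothetical)."""

    def __init__(self):
        # (member_id, fact_name) -> sorted timestamps and parallel values
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def log(self, member_id, fact_name, ts, value):
        """Record the value of a fact as observed at time ts."""
        key = (member_id, fact_name)
        idx = bisect_right(self._times[key], ts)
        self._times[key].insert(idx, ts)
        self._values[key].insert(idx, value)

    def as_of(self, member_id, fact_name, ts):
        """Return the latest value logged at or before ts, or None."""
        key = (member_id, fact_name)
        idx = bisect_right(self._times.get(key, []), ts)
        return self._values[key][idx - 1] if idx else None


def viewing_count_feature(store, member_id, ts):
    """A feature computed offline from facts as they stood at time ts."""
    history = store.as_of(member_id, "viewing_history", ts) or []
    return len(history)


store = FactStore()
store.log("m1", "viewing_history", 100, ["v1"])
store.log("m1", "viewing_history", 200, ["v1", "v2", "v3"])

count_at_50 = viewing_count_feature(store, "m1", 50)    # before any fact
count_at_150 = viewing_count_feature(store, "m1", 150)  # sees only ts=100
count_at_250 = viewing_count_feature(store, "m1", 250)  # sees ts=200
```

Because the store keeps raw facts rather than precomputed features, a different feature (for example, one based on only the most recent title) could be computed from the same logged data without having logged it in advance.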
Session hashtag: #DevSAIS11
Nitin is a Senior Software Engineer on the Personalization Infrastructure team at Netflix. His primary focus is building ML infrastructure components with Apache Spark that help Netflix research engineers innovate faster and improve personalized recommendations. He is passionate about large-scale distributed systems, search platforms, and performance optimization. He is an active open source contributor to Apache Solr and a few other Apache projects.
Senior software engineer on the Personalization Infrastructure team at Netflix, which builds scalable, distributed computing systems for the algorithmic engineers who help improve member personalization.