Using Apache Spark to Predict Installer Retention from Messy Clickstream Data


Clickstream data is messy. A single user session in a Zynga game can generate thousands of events, and each game, client version and OS has its own event schema. Unfortunately, most ML models require their training data to be formatted as a uniform matrix, with every user having exactly the same columns. Developing feature sets that capture all the nuanced trends and interactions of event streams is a time-consuming challenge.

At Zynga we’ve developed a technique to represent user game actions with temporal heatmap feature sets. Utilizing the power of PySpark, our generic data pipeline can generate thousands of features without the need to manually interpret the events of each game. The graphical structure of the heatmaps allows us to take advantage of established image classification techniques to make personalized user-level predictions. Within 30 minutes of a user installing one of our games, Zynga is able to make accurate predictions on whether that new installer will churn or become a payer.
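
To make the heatmap idea concrete, here is a minimal sketch in plain Python. It is not Zynga's pipeline: the abstract does not describe the heatmap layout, so the choice of one row per event type and one column per time bucket, the 5-minute bucket width, the event names, and the function name build_heatmaps are all assumptions made for illustration. Only the 30-minute window comes from the abstract.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # first 30 minutes after install (from the abstract)
BUCKET_SECONDS = 5 * 60   # assumed bucket width, giving 6 time columns


def build_heatmaps(events, event_types):
    """Return {user_id: matrix}, where matrix[i][j] counts events of
    event_types[i] that fell into time bucket j after install.

    `events` is an iterable of (user_id, event_type, seconds_since_install).
    Every user gets a matrix of the same shape, whatever events they fired.
    """
    n_buckets = WINDOW_SECONDS // BUCKET_SECONDS
    row_of = {name: i for i, name in enumerate(event_types)}
    heatmaps = defaultdict(lambda: [[0] * n_buckets for _ in event_types])

    for user_id, event_type, seconds_since_install in events:
        if event_type not in row_of:
            continue  # event type outside the chosen vocabulary
        if not 0 <= seconds_since_install < WINDOW_SECONDS:
            continue  # outside the prediction window
        bucket = int(seconds_since_install // BUCKET_SECONDS)
        heatmaps[user_id][row_of[event_type]][bucket] += 1

    return dict(heatmaps)


# Hypothetical events; the names are illustrative only.
events = [
    ("u1", "level_start", 12),
    ("u1", "level_start", 40),
    ("u1", "purchase_view", 700),
    ("u2", "level_start", 1750),
    ("u2", "unknown_event", 5),
    ("u1", "level_start", 4000),
]
event_types = ["level_start", "purchase_view"]
heatmaps = build_heatmaps(events, event_types)
```

Because every user ends up with a fixed-shape grid of counts, the result satisfies the uniform-matrix requirement and can be treated like a small single-channel image. At production scale the same per-user, per-event-type, per-bucket counting would presumably be expressed as a grouped aggregation in PySpark rather than a Python loop.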

Session hashtag: #DSSAIS15

About Patrick Halina

Patrick Halina leads the ML Engineering team at Zynga, where he works on productionalizing ML workflows and developing personalization technology. Prior to Zynga, he worked on the ML Marketing platform at Amazon. He received his undergraduate degree in Computer Engineering and his Master's in Statistics from the University of Toronto. He lives in Toronto, Canada.