At Zynga we’ve opened up our PySpark environment to our full analytics organization, which includes game analytics, data science, and engineering teams across the globe. The result of democratizing Spark is that more of our teams are able to perform analyses at scale and our data scientists are now responsible for productizing predictive modeling pipelines. The biggest impact that opening up our data platform has had is teams identifying novel applications of PySpark including large-scale experimentation, player segmentations, recommendation systems, and anomaly detection. PySpark is the latest step in the transformation of our analytics organization, which has migrated from SQL to Python to Spark. We’ve focused on three key areas to make Spark accessible at Zynga: infrastructure, onboarding, and features.
One of the prerequisites we had for scaling our usage of Spark was building connections between Databricks and the rest of our data platform. To accomplish this, we authored a set of libraries that enable our Spark environment to work seamlessly with our data lake, data warehouse, and application databases. To onboard our teams onto PySpark, we created templated notebooks, held training sessions during our onsite conferences, and provided sandbox environments for learning. In order to ease the transition from Python to PySpark, we've been leveraging newer features in Spark including Pandas UDFs and Koalas to provide familiar interfaces. The result of this effort is that the majority of our teams are now using PySpark for large-scale analyses and our data science teams are responsible for multiple data products in production. This session will discuss our approach for enabling our full analytics organization to leverage PySpark, the growing pains that we encountered, as well as successes from democratizing Spark.
– Good morning everyone, thank you for attending my session. My name is Ben Weber and I’m a distinguished data scientist at Zynga. Today I’m gonna be talking about how we opened up our PySpark platform to our entire analytics organization and some of the novel applications that have resulted from using this approach.
So, aligned with the theme of the conference, which is really democratizing tools and platforms, that's what we've done at Zynga. We did have some challenges with scaling and growing, but we'll talk through how we've overcome some of those hurdles and provide some best practices for scaling things like this at your organization. Before digging into how we're using PySpark at Zynga, I did want to talk about one of the initiatives that we've been a part of recently with COVID-19, which is that we partnered with the World Health Organization as well as a few dozen other game studios to spread messages around social distancing through the games, which we have as a communication channel.
So we led this initiative called Play Apart Together, which was really about amplifying the signal of how to stay safe during this time and how to keep connected with people, while also being safe in this unprecedented time. This aligned with our mission of connecting the world through games, and it's something that I'm proud to be able to mention that Zynga has been a part of.
So today we'll talk about how we've opened up the PySpark platform to our full analytics organization. The way it works is that everyone within the analytics organization at Zynga, which is over 50 people now, can get hands-on with PySpark through the Databricks platform that we're leveraging. We'll talk about how we've scaled up, some of the challenges we've had with adoption and getting more people hands-on, the tooling and materials that we had to put in place to really get this adoption scaled up, and some of the challenges we faced as we did that as well.
To make it manageable, we'll talk about some of the trainings we've put in place, as well as policies, so that it's not a wild West in terms of this platform but is something where teams can effectively collaborate and partner across teams to build novel applications. I've mentioned that by opening up the platform to the broader team, we're actually seeing people using it in ways that we hadn't expected with just the initial smaller team that we started with. That's one of the key takeaways: opening up the platform has really resulted in interesting uses of it, because more teams are using it for different types of problems. So, it's been great to see what teams have been able to accomplish with it.
So the agenda for today is first to talk about mobile game publishing and the function of an analytics group within this type of organization. We'll talk about how we opened up the platform and some of the challenges we faced. We'll then talk about some of the lessons we learned and how we've scaled while keeping costs in check and maintaining functionality. And finally, we'll dig into some example applications we've built out through the platform.
So to start, we'll talk about mobile game publishing. If you haven't heard of Zynga, we're a mobile game publisher with studios located in over half a dozen countries, and the headquarters is based out of San Francisco in terms of the publishing function, which centralizes functions around marketing, legal, and corporate development, and supports our studios. So our studios are embedded throughout the world, and then we have a strong publishing function that helps those studios operate.
In terms of the portfolio of games that we have, we have six, and more recently eight, as we announced yesterday that we are acquiring Peak Games, which will bring Toy Blast and Toon Blast into our collection of forever franchises, which are games that make over $100 million annually and games I would expect players to engage with for several years. One of the interesting challenges we face is that we have these embedded analytics teams that have different types of data sources that they're working with, and we're collecting different types of events for all these different games. We've centralized the tooling for collecting the data, but the analyses and the findings that teams have vary quite a bit based on the telemetry that teams are collecting. We have casino games, social slots games, Words With Friends, as well as racing games and a collection of different merge-and-match mechanic games. So, it's great working across a broad portfolio of titles in order to find insights for our game teams.
In terms of how Zynga analytics is structured, it's actually a combination of two disciplines, where the analytics discipline and the engineering discipline sit within different parts of the company, but we're essentially one organization, which is a mix of the analytics, data science, and analytics engineering teams.
So we have embedded game analytics teams that are co-located with the game teams, working with product managers and studio leadership on understanding game launches, running the business metrics, as well as any ongoing experimentation, tuning features, and improving the life cycle of games. There's also a central analytics function, which I'm part of, that supplements our publishing platform by doing things like improving our marketing efforts, being more efficient with our spending on user acquisition, and supporting initiatives that span the portfolio of games rather than being specific to one game. And then we have our analytics engineering team, which is responsible for building and maintaining our data platform, which ingests data, provides data lakes and data warehouses, as well as endpoints for game services to call to provide real-time personalization.
So in terms of some of the functions that the team actually performs, there are a few that we've done for quite a while and some newer capabilities that are becoming more prevalent as we've built up our skill sets. The first is really insights for our game teams. This is analytics and reporting, the standard business intelligence pipeline that you need to understand run-the-business metrics as well as different parts of the gameplay experience. This would be looking at the funnel of new users and understanding where we can tune the first-time user experience, or understanding the impact of features that we've launched and working with product managers to determine how a feature rolls out. One of the key functions that Zynga has, which we talk about a lot, is this focus around experimentation. We have a standardized platform that allows product managers and analysts to set up experiments that can go live in our games, so we've already democratized the experimentation side of Zynga, and we're opening up other tools as well now. There's a lot of practice around experimentation that we've used to tune our games over time, but what we're doing more recently is building out personalization services, which use machine learning to optimize gameplay or other features and tunings within games based on the feedback that we're collecting from players. This is things like using reinforcement learning to understand when best to send out notifications to players. And this is where we're really leveling up our tool stack and allowing our analytics team to work more closely with our engineering team on these really interesting platforms.
Another key function of the publishing platform is marketing optimization, where user acquisition is a big part of any mobile game publisher, and this is something where we want to make sure that we're leveraging tools to be sophisticated in building out these automated workflows.
So our platform has evolved over the years since Zynga launched its first game in 2007. There's been a big focus on using a data warehouse to track metrics in games and find the results of A/B testing. That was really the bread and butter of Zynga for over a decade and still continues to be a core foundation of what our analytics function performs. This has mostly been working with SQL clients and running queries directly against our data warehouse. We've done a little bit of scripting and regression modeling with other programming languages, but we didn't really have standardization. It wasn't until about 2017 that we really changed to a new era at Zynga, which I'm phrasing as the notebook era, because we started aligning on Python as the standard language that we wanted to use. Most analysts now spend most of their time in JupyterLab rather than SQL clients or working in flat files. It's also the time when we started using GitHub for collaboration, and we use that JupyterLab environment as a standard ecosystem where all the libraries are the same across the environments our teams are using, so there are fewer problems in sharing notebooks and running notebooks on different machines. It's been a great way of standardizing on the tools so it's easy to collaborate and share. And then where we're at now is what I'm calling the production era, which is really scaling up to large datasets using Apache Spark, but it's also putting data products into production. This is standing up web tools that other teams at Zynga might use, and it's also partnering with our engineering team to set up things like model endpoints, where we're using machine learning and reinforcement learning to have these personalization services that game teams can leverage for their products.
So that's an overview of the Zynga publishing function. What we've changed over the past two years, which has really made it democratized versus restricted, is that our platform was initially restricted to a small set of data scientists, and it was more of an opt-in approach in terms of getting access to the platform so that you could build out tools. About two years ago we changed how that worked: it's no longer opt-in, basically everyone has access to the platform, and then we work to make sure that people use it in a reasonable way. This is where we wanted every team to be able to build data products, but as a result, you have teams that need to maintain and have more ownership of these data products. So, it's a trade-off: it was a great way of having standardization, because everyone has a shared platform, and with Databricks there are these great collaborative notebooks. But it also means that we need to keep an eye on what's going on so that there aren't too many clusters being provisioned and costs don't get out of control.
In terms of why we wanted to open things up, there were a few motivations for doing so. One of the key ones was really leveling up the skill set of our team. We wanted a path for career progression so that analysts could transition to the next levels and take on more of what we had traditionally aligned with the data science discipline. We've merged those in a way where a lot of issues that were data science specific are now being encountered by our analysts as well, things like doing some regression modeling or working with significance testing for results, so it's a way to allow our team members to get hands-on with that. One of the big motivations was also standardizing the tooling. We had JupyterLab as an exploratory notebook environment, and we wanted a production environment, so we aligned on PySpark as the place to do that, and we wanted to evolve our data platform so that we could scale to large datasets. We have use cases for ad hoc large-dataset processing as well as model pipelines, whether those are batch pipelines or pipelines that produce coefficients or other model outputs that are used in real time, and we wanted to make sure that we had a platform that could scale to all those different use cases. I already mentioned large-scale data processing, where we often have to turn around terabytes of data quickly that might be in flat files, and we need a way of just spinning up clusters and processing that. We also have tens of millions of daily active users, and if we want to build machine learning models based on that, we need to be able to do so in an efficient way. And then the final point is that we also wanted to be able to distribute ownership of some of our data products.
So, we previously had an approach where there was a handoff: an analytics team would hand off a prototype model to an engineering team to take it to production. We now have a platform where, for a lot of use cases, the analytics team can own that model end to end in a self-service way.
So the first challenge we faced was really around adoption of the platform, where we didn't really have many materials around how to ramp up on PySpark or the fundamental differences between PySpark and working with Pandas DataFrames. We first started by setting up some training notebooks that showed how to do things like load a dataframe from the data lake or data warehouse, save a dataframe to different outputs in our data platform, as well as standard operations like working with columns and dropping columns, a lot of the standard operations within Pandas.
We also set up some documentation on the platform itself and how to get started on a wiki, and that formed a set of onboarding materials. We've also done a bit of mentoring, where we started with office hours: we set up weekly sessions for people to come in and learn about the platform. This was great at the start of trying to ramp up adoption, but we found less interest in these over time and then switched to more of a one-on-one or team mentoring process. Related to mentoring, we also have an annual analytics event where we do a lot of trainings, and ramping up on PySpark has been a big focus there. But more recently, where we've seen a lot of uptake in adoption is really when we have cross-team projects and more collaboration. The partnership between analytics and engineering has been great, where we've seen things like reinforcement learning prototyped as a Python library and then our engineering team being able to leverage a production system they've built, and it's helped accelerate the learning curve for some of the analysts there, which has been great. In addition to that, we have more eyes on the notebooks that are being used for critical processes, so it's a way of giving feedback to team members that's been really useful.
Also to accelerate kind of this adoption we’ve looked at using some of the features of the platform which has helped people feel comfortable in an environment and be able to kind of leverage existing tools more easily. So, one thing’s I’ve used quite a bit is pandas UDFs which allow you to basically do a divide and conquer approach. Now where your dataframe gets split up, sent to different worker nodes and then you basically write a function that operates on it as if it was a Pandas DataFrame. I mean, we’ve used this to do things like automated feature engineering on large-scale data sets. I mean we use tools like XGBoosts in a distributed way even though the koala library is not developed for PySpark specifically. And then there’s a session on this from Zynga last year where we talked about the system we built that uses this approach at scale. And then there’s some features in spark three that I’m looking to see in terms of making these more efficient. We’ve also explored the koalas library to try to provide an intermediate step between Pandas and PySpark DataFrames. It’s been useful for kind of getting hands on with some of the basic operations but we haven’t seen kind of immediate success we’re just tryna port existing code from pandas to PySpark. But as more of those functions get implemented I think it will be a nicer way of kind of having an intermediate step between pandas and PySpark.
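To make the Pandas UDF pattern concrete, here is a sketch of the grouped-map idea: the per-group logic is a plain Pandas function, which is exactly why the pattern feels familiar to Python analysts. The column names are hypothetical, and to keep the example runnable without a cluster it applies the function locally by iterating Pandas groups; the comment shows how the same function would be applied to a Spark DataFrame.

```python
import pandas as pd

def featurize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Per-player feature engineering: in Spark, each player's rows are
    # shipped to a worker and this function is called once per group.
    return pd.DataFrame({
        "user_id": [pdf["user_id"].iloc[0]],
        "total_spend": [pdf["spend"].sum()],
        "n_events": [len(pdf)],
    })

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "spend": [1.0, 2.0, 5.0],
})

# Local equivalent of the distributed call, for illustration:
features = pd.concat(
    [featurize(group) for _, group in events.groupby("user_id")],
    ignore_index=True,
)

# With a Spark DataFrame, the same function would be applied as, e.g.
# (Spark 3 grouped-map API):
#   spark_df.groupBy("user_id").applyInPandas(
#       featurize, schema="user_id string, total_spend double, n_events long")
```

Because the function body is ordinary Pandas code, an analyst can develop and test it locally before scaling it out.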
The other thing I've mentioned is ownership of data products, and this has really been useful because, as we've scaled up to more products, we have more teams that are now responsible for owning them. It's not just one engineering org that's responsible for all these diverse projects, where people who don't know the intricacies of them are responsible for the continued maintenance and monitoring of these platforms. We generally have two paths to production. There are model pipelines, which tend to be batch models, where we have Databricks jobs that run, source data from the data lake or data warehouse, and then publish it to a data sink such as Couchbase for real-time applications. This is something where the results can be served through our experimentation platform and be used immediately by game teams, so this is where analysts and data scientists can own the model end to end. The other way we work with models is through model endpoints, where we'll partner with our engineering team to use AWS SageMaker for machine learning models, and more recently we're using an in-house library built on TF-Agents to serve reinforcement learning models. In this case it's more of a shared development pattern, but we've seen a lot of useful collaboration in this space.
So I've mentioned some of the challenges we faced, but we'll dig into some of the learnings we've had as well, and how we kept things in check so that as we scaled up the number of users, we didn't have uncontrolled costs and other issues.
So, one of the things we did, similar to the JupyterLab ecosystem, is we created a library with standard functions for common tasks that our analysts or data scientists want to perform, so that it's straightforward to load a dataframe, push a dataframe to our application database, or work with some of the other tooling in our ecosystem.
At the time, we had some functionality that was similar to MLflow for doing things like monitoring models, and over time we're actually deprecating a lot of that functionality and switching to MLflow itself. But it was something that was really useful for getting up and running with the platform. We have some other functions for orchestrating through Airflow and for sending notifications from jobs as well. So it's been a great way of getting the adoption curve ramped up.
One of the issues that we ran into as we scaled the number of users was the need for more and more libraries, and we found conflicting library versions. We hit some issues with the Tornado library that were really challenging to debug during cluster startup.
As a result, we came up with an approach where we have a fixed set of development clusters with a set of pinned library versions, and then we have any jobs that run on a regular schedule run on ephemeral clusters, so there are no conflicts between different library versions. We then roll out new development clusters as major software releases become available. There's also the ability to spin up clusters if a set of specific libraries is needed for a team or project, such as the Featuretools library that we were using for the AutoModel project.
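As a rough sketch of what a scheduled job with an ephemeral cluster looks like, here is a hypothetical Databricks jobs payload; the names, versions, and instance type are illustrative, not our actual configuration. Pinning library versions in the job definition is what keeps one team's dependencies from conflicting with another's.

```json
{
  "name": "nightly-model-pipeline",
  "new_cluster": {
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "custom_tags": { "team": "central-analytics", "project": "propensity" }
  },
  "libraries": [
    { "pypi": { "package": "featuretools==0.13.4" } },
    { "pypi": { "package": "mlflow==1.8.0" } }
  ],
  "notebook_task": { "notebook_path": "/jobs/propensity_model" },
  "schedule": { "quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC" }
}
```

The `custom_tags` block is also what makes the cost tracking discussed below possible, since the tags flow through to the underlying AWS instances.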
In terms of job ownership, we've had issues with people leaving teams or the organization, and then you have these jobs running that don't have clear owners. So we had to set up a set of processes to make sure that doesn't result in an overwhelming number of jobs over time. The first thing was to have backup owners for jobs, and to map these to teams rather than individuals. We also monitor the outputs of these jobs to see if they're actually being used, and if there's a large window of inactivity we'll flag a job to be sunset. Jobs that run regularly and have proven successful we'll actually migrate to Airflow, where we have more eyes on the DAGs that are in production. And we've also built data products that work across multiple games, rather than building a lot of game-specific models, which means that we have a smaller number of pipelines that we need running each day.
The final lesson was really around cost tracking, where we scaled up a lot of clusters and people wanted to do things like ramp up GPU instances. In order to keep costs in check, the main thing we had to do was make sure that we had good tagging, so that in the AWS ecosystem we could have good visibility into which projects were running up costs, and make sure that we were able to work with teams to optimize their jobs and use the correct instance types. This is something where we just needed the tooling in place to do it effectively.
There were a lot of challenges in just ramping up usage, where people were going wild on the platform and trying to learn what to do. We set up a Slack channel for general questions, and this worked well for the majority of users, but we sometimes had people asking an overwhelming number of questions, and we wanted to avoid that so the team managing the Slack channel wasn't overwhelmed. It was useful, however, for some of the non-trivial issues we found, like cluster startup problems, where we identified common issues that were worth addressing more broadly.
To approach this, we did set up some SLAs around responding to questions and cluster issues, with the team being distributed around the planet; we want to make sure that people aren't waiting over a day for responses to things. But generally the way this has been resolved over time is to have more people familiar and hands-on with PySpark who can answer questions, so it's not just one single Slack channel; you can have team-specific PySpark channels as well. We found that as more people ramped up on the platform, they could ask and answer questions for more of the team, rather than going through one single source for answers. This is where we've really been trying to pair up more cross-team projects to develop knowledge and ramp up the skill sets of our individual teams at Zynga.
So as I mentioned, we've opened it up to a broader organization of over 50 people, and what we found is that people had specific use cases, but there were a lot of findings that we were able to generalize and use across multiple games and across the central functions. We'll talk through a few examples here. The first one is a system called AutoModel, which I presented at Spark Summit last year, where we're building hundreds of propensity models daily: trying to predict which users are likely to churn or which users are likely to purchase.
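To give a flavor of what a propensity model is, here is a toy sketch: a logistic regression on synthetic behavior features predicting churn. This is not the AutoModel system, which automates feature generation and trains hundreds of models at scale; the features, coefficients, and data here are all made up for illustration, using plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic features: [days_since_last_session, sessions_last_week].
# In this toy world, a longer absence raises churn risk and more recent
# sessions lower it.
n = 2000
X = np.column_stack([rng.exponential(3.0, n), rng.poisson(5, n).astype(float)])
true_logits = 0.8 * X[:, 0] - 0.6 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-true_logits))).astype(float)

# Fit logistic regression by gradient descent on the log loss.
Xb = np.column_stack([np.ones(n), X])   # add intercept column
w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-Xb @ w))
    w -= 0.01 * Xb.T @ (p - y) / n

# Per-player churn propensity scores, the kind of output published to
# Couchbase for game teams to target.
churn_prob = 1 / (1 + np.exp(-Xb @ w))
```

The fitted weights recover the direction of the synthetic signal: positive on absence, negative on session frequency.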
This is published to Couchbase, where our game teams can set up experiments around it and try to tune different features. This was a great way of ramping up an end-to-end self-service pipeline, and we found that it could also be used for ad targeting as a potential use case. So that was more novel on the application side, but some of the other applications that we found were quite interesting as well.
One of the things we've talked about quite a bit at Zynga is this idea of player archetypes, where you have people with different types of gameplay behavior, and you can define archetypes by doing a clustering exercise or some sort of user segmentation exercise. We found that both our team in India and our team in SF were looking at similar problems across different games, and we actually aligned on a standard approach, where we use k-means clustering as a way of coming up with different player archetypes. We also had a novel application, which was around marketing, where we could do different targeting based on different player archetypes. That wasn't initially intended with this project, but it's something where we found useful results from being able to scale up archetypes to more than just one game.
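As a toy sketch of the archetype idea, here is k-means on synthetic per-player behavior vectors, implemented in plain NumPy so it runs anywhere. The features and segments are made up; at scale this would run with Spark ML's KMeans on the full player base rather than a hand-rolled loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-player features: [sessions_per_week, avg_spend], with two
# hypothetical segments ("casual" and high-spend players).
casual = rng.normal([3.0, 0.5], 0.5, size=(50, 2))
whales = rng.normal([20.0, 30.0], 2.0, size=(50, 2))
players = np.vstack([casual, whales])

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct data points.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each player to the nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels, centers

labels, centers = kmeans(players, k=2)
```

Each cluster center then gets interpreted as an archetype, and the per-player labels feed downstream uses like the marketing targeting mentioned above.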
Another project, which actually started as part of a hackathon we had at Zynga, was this idea around cheater detection, where we wanted to use an autoencoder to compress the gameplay behavior of a user and then decompress it, to see how well we could come up with a model describing activity in a latent space. The general idea is that players are represented as a one-dimensional image, you use an autoencoder to compress and decompress it, and where you see large discrepancies between the input and output, we can evaluate those data points and see if there's some sort of activity that looks suspect. This was something where our team was using TensorFlow, and they were able to spin up large instances to use TensorFlow on the platform, where we previously wouldn't have had access to those types of compute resources. So it's a great way of spinning up large-scale Python workflows as well, and one of the great ways that opening up this platform has benefited some of the work that we're exploring.
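To illustrate the reconstruction-error idea behind that project, here is a toy NumPy sketch rather than the TensorFlow system: a linear autoencoder (equivalent to projecting onto principal components) compresses and decompresses synthetic behavior vectors, and the point with the largest reconstruction error is flagged as suspect. All data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Normal" players live near a 2-D subspace of a 10-D behavior space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(200, 2)) @ basis
cheater = rng.normal(size=(1, 10)) * 5.0   # off-subspace outlier
X = np.vstack([normal, cheater])

# Linear "encoder/decoder": project onto the top-2 principal components
# and back. A trained autoencoder generalizes this to nonlinear codes.
mean = X.mean(0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:2]
decoded = (X - mean) @ components.T @ components + mean

# Reconstruction error per player; the outlier should stand out.
errors = ((X - decoded) ** 2).sum(1)
suspect = int(np.argmax(errors))
```

Points the model can't reconstruct are the ones whose behavior doesn't fit the learned latent space, which is exactly the property used to surface suspicious activity.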
Economy simulations are also a big part of mobile game tuning, where we want to understand how a change in our gameplay, adding a new feature, or having some sort of different daily bonus will impact the economy. This is something where our team based out of Austin was able to represent Markov chains as dataframes, where you basically have a dataframe with a state for each user and you run it through different transition states, and by doing this in an iterative way we were able to simulate adding new features or tuning the game.
This was a general approach we were able to apply to other games as well. So it's another of these great approaches where we wrote it once and it scaled to other titles.
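A toy sketch of the Markov-chain idea: each player has an economy state, and a transition matrix advances the whole population one step at a time. The states and probabilities here are made up; in the dataframe version, each row carries a player's state and a join against the transition table advances it, which is what lets the same approach run at Spark scale.

```python
import numpy as np

states = ["broke", "saving", "spending"]
# P[i, j] = probability of moving from state i to state j in one day
# (illustrative numbers only).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

# Start everyone in "broke" and simulate 30 days of the economy.
dist = np.array([1.0, 0.0, 0.0])
for _ in range(30):
    dist = dist @ P
```

After enough steps the distribution approaches a steady state; comparing steady states under the current transition matrix and a proposed one is how a feature change gets evaluated before it ships.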
Experimentation is also a big piece of the analytics function at Zynga, and we've been able to use PySpark to scale up to the larger datasets that we're doing things like significance testing with. This is really where we're leveraging Pandas UDFs for divide-and-conquer approaches.
This has allowed us to do distributed work with SciPy, NumPy, and statsmodels.
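As a sketch of the kind of per-metric test an A/B analysis runs: with a Pandas UDF, a SciPy call like this executes once per experiment cell on the workers. The metric, lift, and sample sizes here are synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(10.0, 2.0, size=5000)   # e.g. minutes played per day
variant = rng.normal(10.3, 2.0, size=5000)   # synthetic +0.3 lift

# Welch's t-test (does not assume equal variances between groups).
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)
significant = p_value < 0.05
```

The divide-and-conquer version simply groups the experiment data by cell and runs this same function on each group in parallel.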
And then one other novel application was really our work around reinforcement learning, where there is a talk about this at the summit as well. We now have a real-time model-serving pipeline for reinforcement learning, where we can do things like understand the best timing for sending out notifications to players. We've used this in both Words With Friends 2 and CSR Racing 2, so it's in production with large user bases, and it's a great way of showcasing reinforcement learning being used at scale. We've actually open sourced part of this pipeline; it's called RL Bakery and it's available on GitHub.
And then if you’re interested in more details please do check out the other talk at the summit.
So I've covered some of the novel applications of PySpark at Zynga, and it's been great seeing some of the applications that our teams have produced since we opened up access to this platform. We've talked about some of the ways we've had to control costs and make sure that library versions aren't conflicting, but the result has been great, and we've seen really useful outputs from opening up the platform. To summarize the findings from the session today, I would say the first thing is that it was great to open this up to our analytics organization for large-scale adoption.
It's something where you need to have some policies in place so that your team is able to adopt it and ramp up on the platform, but it's a great way of having a standardized platform that allows for more collaboration and sharing of work. To get to a critical mass of adoption, we needed to have some training materials and hands-on support, but once we had a number of experts within different teams, it was much quicker to ramp up. It's been a great way of leveling up the skill set of the analysts at Zynga. Other than that, the main takeaway is that you'll find teams building things that wouldn't have otherwise been discovered, so it's been great to open up the platform at Zynga. And with that, thank you for attending the session.
You can find me at bgweber on GitHub, Medium, and Twitter, and the Zynga jobs site is available if you're interested in learning more about analytics at Zynga.
Ben Weber is a distinguished data scientist at Zynga with past experience at Twitch, Electronic Arts, Daybreak Games, and Microsoft Studios. He received his PhD in computer science from UC Santa Cruz.