Productionizing Deep Reinforcement Learning with Spark and MLflow


Deep Reinforcement Learning has driven exciting AI breakthroughs like self-driving cars, beating the best Go players in the world and even winning at StarCraft. How can businesses harness this power for real world applications? Zynga has over 70 million monthly active users for our mobile games. We successfully use RL to personalize our games and increase engagement. This talk covers the lessons we’ve learned productionizing Deep RL applications for millions of players per day using tools like Spark, MLflow and TensorFlow. Hear about what works and what doesn’t work when applying cutting edge AI techniques to real world users.


  • How to apply Deep Reinforcement Learning to solve business problems
  • Understand challenges that arise in applying RL to industry
  • How we use Databricks, Spark and MLflow to productionize RL
  • Tips and Tricks on training RL Agents for real world applications

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi, my name is Patrick Halina. I lead the ML Engineering Team at Zynga. And my colleague Curren and I are gonna speak about how we use Reinforcement Learning for personalization.

Reinforcement Learning in Production at Zynga

So first, we’re gonna give a quick intro to what Reinforcement Learning is and how we use it in personalization. Then Curren is gonna talk about the Tech Stack we use to run this in production using off the shelf technology. And finally, I’m gonna speak about some of the lessons we learned from launching RL Applications into production.

So first, a little bit about Zynga. We’re one of the world’s largest mobile game developers. We have over 60 million monthly active users and I invite everyone to try out some of our hit games after the summit.

Game Design is Hard

Designing games is a difficult art. There’s a lot of decisions to be made and these decisions completely shape the user experience. So how hard should we make this boss? What level should we show next? What type of game mode should we recommend? These are all tough to know. Not only that, but we’d like to personalize our games so that each of our 60 million users gets the best experience possible.

So let me formalize what I mean by personalization. Essentially, given a user state, we wanna select an action in the game that will maximize a long term reward. So in our case, the state we use is information about the user. How long have they played our game for? What level are they at? How many times did the last boss beat them? The action we pick controls the game. So the difficulty of the next level is an example. And the reward that we try to maximize is typically engagement, so that we know that our players are getting the most out of our games.

Personalization Method 1: Rules Based Segments

One of the methodologies we’ve used historically with a lot of success is rule-based segmentation. So this is where our game PM will create rules to split our population up into different segments. And it ends up looking a bit like this decision tree you see on the slide. With each segment, we assign behavior for the game, then we do A/B testing to measure whether these segmentation strategies are affecting our KPIs and what type of strategies work best. We’ve had success with this, but there’s some challenges too. It’s a lot of manual work to create these segments, and then assign the behaviors for them. And it takes a lot of trial and error. Not only that, but once you go through all this effort, eventually our audiences get more mature and the game features change. And now you have to repeat the whole process to make sure that your segments are up to date. You’re also limited in terms of how much personalization you can do. There’s a limited amount of data that you can create these segments on. And you can only really end up with a handful of different options for our players. So it’s not really personal.

A more advanced methodology is to use prediction models. So for each of your available behaviors or actions, you can predict what the effect on this user will be if we apply it to them.

However, this has its own set of challenges. For one, you have to train a lot of models. And training these models requires a lot of data. That means randomly assigning users to these various behaviors, and then waiting long enough to measure the long term results. And that can lead to less than optimal user experiences in the meantime, while you’re gathering this data. You’re also limited to controlling simple outputs. So it gets hard to use prediction models to optimize a continuous value since you have to search over the prediction space.

Personalization Wishlist

So when Zynga was looking at building the next generation of personalization, we came up with a wish list. We’d like to automate some of this manual work for personalization, and this automation will allow us to continuously explore and improve these algorithms over time so that we can keep pace with our users. And finally, we wanted to be able to personalize more complex outputs, things like continuous values, or controlling multiple sets of discrete actions at the same time.

The solution that we found was Reinforcement Learning. And this is a branch of machine learning, just like supervised learning and unsupervised learning. It’s basically used for making sequences of decisions. To give a quick introduction, in reinforcement learning you train an agent, which looks at the state of the world and selects an action that maximizes some type of long term reward. It takes the feedback in terms of that reward, and then improves its policy so that it makes better decisions in the future.

Solution: Reinforcement Learning (RL)

So it’s a perfect fit for personalization. It automatically learns from past experiences and explores new options, while balancing that exploration with what it’s learned in the past.

You’ve probably heard about some of the successes with reinforcement learning.

Things like beating the world’s best Go players, successes in beating StarCraft and Dota players, all the way to self-driving cars.

If RL can beat the world’s best Go player, can we use it to make our games better?

So we wanted to see: if reinforcement learning can achieve all of this, can we use it to make our games better as well?

Application: WWF Daily Message Timing

One of our first applications was controlling what time of day to send a daily message. So many of our games send out messages to remind our users that their friends are waiting for them to make a move, or there’s some challenge waiting for them. But we have a worldwide audience and everyone has their own personal schedule, so it’s hard to know what time of day works best for each user. We trained a Reinforcement Learning agent to find the best hour to send the message based on each user’s historic activity. The result is a significant increase in click through rate compared to the hand-tuned system we used previously. We’re now running this in production for millions of our users every single day.
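As a rough illustration of the setup described above, the sketch below treats each of the 24 hours as a discrete action and picks one with an epsilon-greedy rule, which is one common way to balance exploration against what has been learned. The function name, the value estimates and the epsilon value are assumptions for illustration; the talk does not describe the actual policy used in production.

```python
import random

HOURS = list(range(24))  # one discrete action per hour of the day


def choose_send_hour(estimated_values, epsilon=0.1, rng=random):
    """Pick an hour: explore with probability epsilon, otherwise exploit.

    estimated_values: 24 floats, the agent's current estimate of the
    click-through reward for sending at each hour (hypothetical input).
    """
    if rng.random() < epsilon:
        return rng.choice(HOURS)  # explore: try a random hour
    # exploit: the hour with the highest estimated reward
    return max(HOURS, key=lambda h: estimated_values[h])


# Example: a user whose history suggests evenings work best.
values = [0.01] * 24
values[19] = 0.08
print(choose_send_hour(values, epsilon=0.0))  # never explores, so prints 19
```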

And now Curren is gonna talk about the Tech Stack that we use to run these reinforcement learning applications in production.

– Thanks, Patrick. In the next few slides, I’m gonna talk about how we structured our Tech Stack to make these RL applications possible.

RL Model Training

So as Patrick mentioned, an RL agent learns by interacting with an environment, doing trial and error actions in that environment and optimizing a long-term reward. Unlike supervised models, we don’t always have labeled training data beforehand, because sometimes the optimal action is not very obvious. For example, if you think of a chess game, you might need to sacrifice a pawn in a move, and that looks like an immediate loss. But over the long term, it pays off and you win the game. And so, RL is really good at these long sequences of interactions that optimize long-term rewards such as engagement or user retention. How that works in practice is you have the environment log these experiences. So when an agent takes an action, an experience is logged. At training time those experiences are aggregated to create an experience replay buffer, which is a set of experiences that the agent performed, and it also tells you how those actions resulted in rewards. Those experiences are used to train an agent (mumbles), which hopefully improves. To add some detail, an experience looks as follows. It has the state of the environment before an action was taken, the action that was taken, and any reward that resulted from taking that action, so things like winning a game or increasing a score in a game. It also includes the state of the environment after the action, and in some algorithms, you also include the next most likely action.
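To make the shape of an experience concrete, here is a minimal sketch of the record described above together with a simple replay buffer. The class and field names are assumptions chosen for illustration; they are not the TF-Agents or RL-Bakery types.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Experience:
    state: Sequence[float]        # environment state before the action
    action: int                   # action the agent took
    reward: float                 # reward that resulted from the action
    next_state: Sequence[float]   # environment state after the action
    next_action: Optional[int] = None  # only included by some algorithms


class ReplayBuffer:
    """Fixed-capacity store of logged experiences, sampled at training time."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # oldest entries drop off first

    def add(self, experience):
        self._items.append(experience)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the buffer size.
        return rng.sample(list(self._items), min(batch_size, len(self._items)))

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=1000)
buffer.add(Experience(state=[3.0, 0.0], action=1, reward=0.0, next_state=[3.0, 1.0]))
buffer.add(Experience(state=[3.0, 1.0], action=0, reward=1.0, next_state=[4.0, 0.0]))
batch = buffer.sample(batch_size=2)
print(len(buffer), len(batch))
```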

Academic RL Applications

And this kind of process has been used very successfully in academic RL applications, usually in games: things like Atari, Go, StarCraft, chess. In these applications, an RL agent is made to interact with an environment simulator. It plays multiple games from beginning to end, many times, and learns how to play that game well based on whether it’s winning or losing.

Production RL Applications for Personalization

However, we don’t really have the luxury of doing that in production, and that makes production applications much more complicated. For one, we don’t really know how to simulate a user well. Users are unpredictable, and they’re also different from each other. So there’s no one simulator we can use to optimize this RL agent. Secondly, because we can’t simulate the user, we have to use a semi-tuned agent in production and let it learn from the actual users. And lastly, since production applications are deployed to many different users all at the same time, you have to do this learning in batch and in balance. The analogy we often use for a batch RL application is this: whereas an academic RL application would play one game of chess from beginning to end, and do that a million times to learn to play chess, a batch RL application is playing a million games of chess, making one or two or three moves in each, and using all those experiences to become a chess master.
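The batch idea can be sketched in a few lines: many users each contribute a handful of logged steps, and those are stitched into transitions for training. The log layout below (user id, step index, state, action, reward) is a hypothetical format used only for this example.

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, step, state, action, reward).
log_rows = [
    ("u1", 0, [1.0], 0, 0.0),
    ("u2", 0, [5.0], 1, 1.0),
    ("u1", 1, [2.0], 1, 1.0),
    ("u2", 1, [6.0], 0, 0.0),
    ("u1", 2, [3.0], 0, 0.0),
]


def build_transitions(rows):
    """Turn per-user logs into (state, action, reward, next_state) tuples."""
    by_user = defaultdict(list)
    for user_id, step, state, action, reward in rows:
        by_user[user_id].append((step, state, action, reward))

    transitions = []
    for steps in by_user.values():
        steps.sort(key=lambda s: s[0])  # order each user's steps in time
        # Pair each step with the one that follows it. A user's latest step
        # has no next_state yet, so it waits for the next batch of logs.
        for (_, state, action, reward), (_, next_state, _, _) in zip(steps, steps[1:]):
            transitions.append((state, action, reward, next_state))
    return transitions


transitions = build_transitions(log_rows)
print(len(transitions))  # 2 transitions from u1 plus 1 from u2
```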

RL Model Training

And so when we started trying to build our batch RL system at Zynga, we came up with a wish list. We wanted something that used off-the-shelf components, so we ended up doing the (mumbles). We wanted something that was scalable, reliable and robust, so that we could put it in prod and affect real players but not affect them negatively or hurt revenue. We also wanted something that was state of the art, but also extendable. We wanted something that used the latest algorithms that are available today, but could also incorporate new libraries, new algorithms and new best practices in the future.

And after searching for a while, we settled on using TensorFlow Agents, also known as TF-Agents. This is an RL library built on TensorFlow that implements new and innovative RL algorithms and makes them really easy to use. And the reason why we chose this was because it had some key advantages. For one, we found that the library was very well architected and had a very modular design, so we could pick and choose the components that fit into our system rather than having to take it wholesale. Secondly, it was well written: the code was good, it was well documented and we could modify it if needed. Also, because it’s written by the TensorFlow team and used by other customers, we trusted its implementation and its accuracy. And lastly, it includes a lot of the new algorithms that we wanted to try. So things like DQN, PPO, TD3, Rainbow. Most of the ones we needed were already in the library, and they keep adding new ones, so as a result we’re pretty confident we can stay close to the bleeding edge.

Production RL Challenges

However, picking an RL library doesn’t solve all your issues. That solves the RL algorithm part of it, but you still have a lot of messy production challenges that you have to deal with. Things like orchestration, saving trajectories and models and persisting them between days, and converting log data that is really messy and stored in different places into TensorFlow trajectories that you can use for training. How you do all of this at scale is also a challenge. And finally, we want to do all of this in a way that’s robust, but also repeatable and friendly for our data scientists, so that they can use it and focus only on the RL and the model and not on the orchestration challenges or the productionization challenges.

And we looked at a lot of off the shelf systems, and we couldn’t find any, so we built RL-Bakery. RL-Bakery is Zynga’s open source library to help you build batch RL applications in production and at scale. I have included a GitHub link at the bottom of the slide. I hope you guys go check it out, download it, use it in applications, provide feedback and also contribute.

In the next couple of slides I’m gonna talk a little bit more about what an RL-Bakery application looks like. Firstly, RL-Bakery is a wrapper around an RL algorithm library. It’s not a replacement for one. It’s written in Python. It uses Spark and TensorFlow extensively to do training and model-related things, and it also uses Spark for data processing in a distributed and scalable way to create the trajectories needed by an RL application.

So an RL-Bakery application has three components: the application layer, the library layer and then the core RL model system. The outermost layer is the application. This is application-specific, so the data scientists would build a new one for each RL application at Zynga. It’s made to be extremely data scientist friendly. It’s written in Databricks, which is something they love and use on a daily basis, and it’s written in Python. So the way it works is RL-Bakery makes an interface available and the data scientist is responsible for filling out that interface. This includes information such as: what kind of model do you want to use? What are the hyperparameters? What is the network architecture? But it also has more business-specific information, such as which data warehouse or data system you fetch observations, actions and rewards from. You collect those all as Spark DataFrames and pass them to the next layer, which is RL-Bakery.
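The kind of interface described above could look roughly like the sketch below. This is a hypothetical shape and not the actual RL-Bakery API: every class and method name here is an assumption, and plain lists of dicts stand in for the Spark DataFrames so the example runs without a cluster.

```python
from abc import ABC, abstractmethod


class RLApplication(ABC):
    """Hypothetical application interface; NOT the actual RL-Bakery API."""

    @abstractmethod
    def agent_config(self):
        """Return the model choice, hyperparameters and network architecture."""

    @abstractmethod
    def fetch_observations(self, day):
        """Return one row per (user, step) with the observed state."""

    @abstractmethod
    def fetch_actions(self, day):
        """Return one row per (user, step) with the action taken."""

    @abstractmethod
    def fetch_rewards(self, day):
        """Return one row per (user, step) with the resulting reward."""


class MessageTimingApp(RLApplication):
    """Toy application: hard-coded rows stand in for warehouse queries."""

    def agent_config(self):
        return {"algorithm": "dqn", "learning_rate": 1e-3, "layers": [64, 64]}

    def fetch_observations(self, day):
        return [{"user": "u1", "step": 0, "state": [0.2, 0.7]}]

    def fetch_actions(self, day):
        return [{"user": "u1", "step": 0, "action": 19}]

    def fetch_rewards(self, day):
        return [{"user": "u1", "step": 0, "reward": 1.0}]


app = MessageTimingApp()
print(app.agent_config()["algorithm"], len(app.fetch_observations("2020-06-01")))
```

The point of the split is that only the subclass changes per application, while everything that consumes these rows can be shared.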

And once an application is built, all the other components of the system are shared across all the RL applications at Zynga. So it’s very reusable. We think of the RL-Bakery layer as an orchestration layer. It’s responsible for things like taking those Spark DataFrames that the RL application layer provided and massaging them, processing them at scale in a distributed manner using Spark to create TensorFlow trajectories. It’s also responsible for saving these trajectories so that you can reuse them on future days without having to recompute them. And once you do the training of the model, it’s also responsible for things like saving the model so it can be used again and evolved in the future, and also deploying it to our model serving system.

And the final, innermost layer is the RL library. Today that’s only TF-Agents, but in the future we could extend it to use new and cool RL frameworks that come out, as this is an ever evolving field. I talked about a lot of the advantages of TF-Agents and why we chose it, so I’m not gonna go over them all again.

Real Time Model Serving

And finally, like I said, RL-Bakery deploys the model to our Real Time Model Serving system. At Zynga, we call it Zynga Personalize. If you look at it, it’s very similar to a lot of the ML model serving systems that you’re used to. That’s because serving an RL model at inference time looks very similar to serving a supervised model at inference time. For example, with a DQN network, it’s just a forward pass through a neural network, which is very similar to traditional deep neural networks. So the system can be shared with your other ML serving systems, and it includes the same common features you would expect: things like feature hydration to add features, pre-processing with normalization, model inference, which we do using SageMaker because that abstracts away the complexities of different frameworks for us, and post-processing. And finally, the part of Zynga Personalize that’s relevant for RL is that it also logs all this experience information for each request.
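To show why serving a DQN looks like ordinary model inference, here is a minimal sketch of a forward pass followed by picking the highest-valued action. It is written in plain Python with hand-picked toy weights (assumed values, not a trained model) and is not the serving code described in the talk.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def q_values(state, params):
    """Forward pass through a tiny two-layer Q-network."""
    hidden = relu(dense(state, params["w1"], params["b1"]))
    return dense(hidden, params["w2"], params["b2"])


def greedy_action(state, params):
    """At serving time, return the action with the highest Q-value."""
    q = q_values(state, params)
    return max(range(len(q)), key=lambda a: q[a])


# Toy network: 2 state features -> 2 hidden units -> 3 actions.
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], "b2": [0.0, 0.0, 0.0],
}
print(q_values([2.0, 1.0], params))       # [2.0, 1.0, 1.5]
print(greedy_action([2.0, 1.0], params))  # 0
```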


And so this is a high level slide that kind of brings everything together. On the left is the model serving component that I just talked about. This is where a game client would call Zynga Personalize, which is our real-time serving system, which in turn would call SageMaker to run an RL agent, get a recommendation, post-process that and then return it back to the game client. At the same time, it also logs experiences to S3. And then at a regular cadence, maybe once a day or once a week, there’ll be the training phase, which is the right hand side of the slide. This is an RL-Bakery application that runs in a Databricks notebook on a Databricks cluster. It uses Spark to take these experiences that were logged to S3, process them in a distributed manner and create an experience replay buffer that is then used to train an agent, and all this is done by RL-Bakery. Finally, if the agent looks good, we deploy it to SageMaker, where it’ll serve new recommendations, and hopefully better recommendations and better personalization, to our game players. And then the cycle continues. And with that, I’m gonna hand back control to Patrick, who’s gonna tell you a little bit more about the challenges we faced designing RL applications in the real world and how to overcome them.

– Thank you, Curren. So we’ve some experience now launching a few RL applications in production. And we’ve learned a few lessons I’m gonna share with you now.

Choose the Right Application

So the first is that it’s challenging to get reinforcement learning applications to work and get good results. It’s more challenging than training good prediction models, in our experience. So one of the things to keep in mind is to make sure you’re applying RL to the right types of applications, because it doesn’t work everywhere. So the first thing is to make sure your problem is best modeled as a sequence of decisions. So for example, maybe you’re trying to pick just the difficulty of the first level of our game and that’s it. That’s not really a sequence of decisions, and you could use simpler methods like contextual multi-armed bandits or maybe predictive models.

If your current action doesn’t really affect future actions, then you don’t need to use RL for that. But if you’re trying to, say, optimize the difficulty of every single level, now you can see how that’s a sequence, and the difficulty of one level could affect the actions to take in the future. Something else to consider is whether your reward is really learnable, and there’s two parts to this. One is that you have to make sure the action you’re taking, or the way you’re changing your application, actually affects that KPI, and sometimes these can be kind of disconnected. Maybe you’re personalizing some small aspect of an application, but you’re looking for results in terms of really high level KPIs. There might just not be that link, and reinforcement learning isn’t gonna be a magic bullet to fix that. The other thing to consider is the sparsity of your rewards. If the rewards that you’re getting are very infrequent and not really connected to the actions, it can be hard to learn that as well. In those cases, you have to figure out some type of intermediate rewards so that you can signal to the agent that it’s on the right track.

The next step in designing an RL application is choosing your states. And something we’ve heard and we’ve experienced is that RL agents are sensitive to having too many inputs. So you don’t wanna just throw the kitchen sink at these agents and expect them to learn. It’s best to start with a very simple state space. And we’re also starting to incorporate auto-encoding to compress the state space even further.

Designing Actions

When you’re designing the actions, it’s best to start small. So we typically start with a small set of discrete actions. And that allows us to use simpler algorithms as well, like DQN. If you move on to larger action spaces or continuous spaces, you have to use different types of algorithms and explore what’s available there. Policy gradient types of RL algorithms are usually what’s used for continuous action spaces. And if you have a situation where you’re trying to recommend an item out of a very large catalog, so for example, you’re recommending the best song or the best book or something like that, that kind of falls under more classic recommendation systems. Some of the cutting edge techniques do use reinforcement learning as part of that, but that’s not really the type of situation we’re talking about applying reinforcement learning to.

Choosing RL Algorithms

You also have to pick your reinforcement learning algorithm. Curren mentioned how we use TF-Agents for our implementation. There’s a lot of new algorithms that are coming out, and it seems like there’s a hot new algorithm every half year. They get benchmarked against each other on common testing frameworks, but it’s hard to tell which algorithms work best on your production problem. So it’s good to be able to test them out. It’s an active area of research, and for us, we don’t research these algorithms, we use what’s available. It can also be difficult to implement these algorithms, because subtle changes in implementation can lead to drastically different results from what you see in the papers. So that’s why we use off the shelf libraries like TF-Agents.

Hyperparameter Tuning

So I’ve gone over some of the design choices you have to make in your RL application. At the heart of deep reinforcement learning applications, there’s also a deep learning model. And anyone who’s worked with these knows that there’s a lot of hyperparameters around there. So tuning a reinforcement learning application combines all the difficulties of tuning a typical deep learning network with all these new choices about reinforcement learning. And there’s the added challenge that unlike predictive models or supervised learning, you can’t just run through a static set of labeled data. In our situation, when you’re dealing with people, you only truly learn by actually launching into production and interacting with people. So this brings up a pretty big challenge: how can we start off with the best agent possible before we actually interact with our users?

There are a couple of options available. So one of them is to mimic some type of existing behavior. I brought up how historically we’ve used rules-based segmentation. So if we’re applying reinforcement learning to an application that already has some type of personalization strategy available, you can start off by training the agent to mimic the existing behavior. That way when we launch, we’re hoping that it won’t do any harm, and it’ll slowly learn how to deviate from the existing behavior and add in more personalization. Another way to perform hyperparameter tuning before going live is by simulating simple scenarios. So if you can simulate your state space and actions in a few different scenarios in which you know what the best outcome would be, you can measure whether your agents will learn those simple scenarios, and you can also see which hyperparameters work best for learning quickly. That way when you launch live, you hope that the hyperparameters you’ve chosen will also be the ones that work best in the real world, where you don’t really know the true mechanics.
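A toy version of the "simulate a simple scenario" idea is sketched below. It assumes a made-up one-step scenario where the best action for each state is known, trains a small tabular agent, and reports how many steps each exploration rate needs before the greedy policy matches the known optimum. The scenario, the agent and the parameter values are all assumptions for illustration, and this is far simpler than a deep RL agent on a sequential problem.

```python
import random

BEST_ACTION = {0: 0, 1: 2, 2: 1}  # known optimum for each toy state
N_ACTIONS = 3


def steps_to_optimal(epsilon, learning_rate=0.5, max_steps=5000, seed=0):
    """Train a tabular agent; return the step at which its greedy policy
    first matches BEST_ACTION in every state, or None if it never does."""
    rng = random.Random(seed)
    q = {s: [0.0] * N_ACTIONS for s in BEST_ACTION}

    def greedy(state):
        return max(range(N_ACTIONS), key=lambda a: q[state][a])

    for step in range(1, max_steps + 1):
        state = rng.choice(list(BEST_ACTION))
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)   # explore
        else:
            action = greedy(state)              # exploit
        reward = 1.0 if action == BEST_ACTION[state] else 0.0
        # One-step update toward the observed reward.
        q[state][action] += learning_rate * (reward - q[state][action])
        if all(greedy(s) == BEST_ACTION[s] for s in BEST_ACTION):
            return step
    return None


for eps in (0.0, 0.1, 0.5):
    print(eps, steps_to_optimal(epsilon=eps))
```

With no exploration (epsilon of 0.0) the agent in this scenario never discovers the best action for two of the three states, so the function returns None, which is the kind of failure a simulation like this is meant to surface before launch.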

Hyperparameter Tuning Automation

Something we’ve done for hyperparameter tuning is to automate it with some of the best in class tools. At Zynga, we use Databricks, which has MLflow built into it. We use MLflow to keep track of results, with Hyperopt to do the actual optimization, the actual selection of the hyperparameters. So these two tools together have served us really well. And it’s made it a lot easier than manually changing parameters and waiting for results.
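A minimal sketch of how these two tools can be combined is shown below, assuming the `mlflow` and `hyperopt` packages are installed. The search space, the run name and the `evaluate_agent` scoring function are hypothetical stand-ins; in a real setup the scoring function would train an agent and return a measure of how well it learned. This is not Zynga's actual tuning code.

```python
import math


def evaluate_agent(params):
    """Stand-in for training an agent on a simulated scenario.

    Hypothetical scoring function: it simply prefers a learning rate near
    1e-3 so the example runs instantly. Lower is better.
    """
    return (math.log10(params["learning_rate"]) + 3.0) ** 2


def tune(max_evals=20):
    # Imported here so the scoring function above is usable without them.
    import mlflow
    from hyperopt import STATUS_OK, fmin, hp, tpe

    space = {
        "learning_rate": hp.loguniform("learning_rate", math.log(1e-5), math.log(1e-1)),
        "batch_size": hp.choice("batch_size", [32, 64, 128]),
    }

    def objective(params):
        # One nested MLflow run per trial keeps parameters and results together.
        with mlflow.start_run(nested=True):
            loss = evaluate_agent(params)
            mlflow.log_params(params)
            mlflow.log_metric("loss", loss)
        return {"loss": loss, "status": STATUS_OK}

    with mlflow.start_run(run_name="rl_hyperparameter_search"):
        # fmin returns the best point found in the search space.
        return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals)


print(evaluate_agent({"learning_rate": 1e-3, "batch_size": 64}))
```

Calling `tune()` runs the search and records each trial as a nested MLflow run, so the trials can be compared afterwards in the MLflow UI.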

Key Takeaways

So I’d like to leave you with a few key takeaways. The first is that reinforcement learning is a great methodology to personalize applications. It’s ready for production with off the shelf technology. So you don’t need a research team of PhDs to get this going, and we’ve had good results with this. However, it is more challenging than some of the other machine learning techniques out there. And it’s an emerging field with new technologies and best practices being discovered all the time. So it is worth it for the results, but it takes a bit of work.

Thank You!

And finally, I’d like to say thank you from the ML Engineering team at Zynga.

About Patrick Halina


Patrick Halina leads the ML Engineering team at Zynga, where he works on productionalizing ML workflows and developing personalization technology. Prior to Zynga, he worked on the ML Marketing platform at Amazon. He received his undergrad in Computer Engineering and Master's in Statistics at the University of Toronto. He lives in Toronto, Canada.

About Curren Pangler


Curren Pangler is a Principal Engineer on Zynga’s Machine Learning Engineering team. He currently builds ML personalization systems that automatically tailor games to individual players. Curren received his Bachelor’s in Engineering Science and a Master’s in Applied Computing from the University of Toronto. He lives in Toronto, and loves snowboarding, board games, sports, and stand-up comedy.