Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King


For fast food recommendation use cases, user behavior sequences and context features (such as time, weather, and location) are both important factors to take into consideration. At Burger King, we have developed a new state-of-the-art recommendation model called Transformer Cross Transformer (TxT). It applies Transformer encoders to capture both user behavior sequences and complicated context features, and combines the two transformers through latent cross for joint context-aware fast food recommendations. Online A/B testing shows not only the superiority of TxT compared to existing methods, but also that TxT can be successfully applied to other fast food recommendation use cases outside of Burger King.

In addition, we have built an end-to-end recommendation system leveraging Ray, Apache Spark and Apache MXNet, which integrates data processing (with Spark) and distributed training (with MXNet and Ray) into a unified data analytics and AI pipeline, running on the same cluster where our big data is stored and processed. Such a unified system has proven to be efficient, scalable, and easy to maintain in the production environment.

In this session, we will elaborate on our model topology and the architecture of our end-to-end recommendation system in detail. We will also share our practical experience in successfully building such a recommendation system on big data platforms.

Speakers: Kai Huang and Luyang Wang


– Alright, hello everyone. Thanks for listening to our session today. My name is Lu, and I'm a senior manager on the Data Science and Machine Learning team at Burger King. I'm excited to be presenting our work at the AI Summit. Today, we are going to talk about the context-aware fast food recommendation services we built at Burger King. I'll start with the use case introduction and then talk about our newly developed TxT model in detail. And then Kai from Intel will talk about AI on big data, and how we built our distributed training pipeline using RayOnSpark. Our food recommendation use case takes place in the drive-through model. For some of you who don't know, a drive-through is a type of ordering service that allows customers to order their food without leaving their car. Drive-through becomes especially important during the current global pandemic, because it's very easy to enforce social distancing without asking customers to leave their car. So this is a typical drive-through process: the process starts when our guest arrives at the drive-through. Typically they first check the outdoor digital menu board to see if there's something they like, and then they start the ordering process by telling the cashier their selections. Then they normally check the menu board again to make sure the order is correct and see if there's something else they want to order. This process typically iterates a few times, until the guest has found all the items they want to order, and then the process concludes. So how can we add recommendation into this process? This is what we see: when the guest is first looking at the menu board, there's an opportunity for us to use machine-learning-based services to show some recommended items to the guest.
And as the guest continues with the ordering process, our recommendation service needs to react to what the guest has already ordered and continually refresh the recommended items to make sure it captures what the guest would like. So that's our food recommendation use case. How is that different from a traditional e-commerce recommendation use case? What are some unique challenges here? The first challenge is that, since the drive-through is an offline environment, we have very limited user identifiers. The guest stays in the car. There are certain technologies, like license plate recognition or app scanning, that could work to identify the guest, but they all have their own limitations. So in our case, we need to build a solution that can work even without a user identifier. Another unique challenge is related to the food items themselves. In our recommendation space, we need to consider same-session food compatibility. To take an example: if you buy a game on an e-commerce website, it's natural for the recommendation system to recommend some other games. But because this is food, and you only have a limited appetite, if you have already ordered a soft drink, you're not likely to order another soft drink in the same order. So there are some unique challenges in our use case. We also have to consider other variables, such as location, weather, and time. They all play an important role when a guest is ordering food, so our model needs to be able to consider all of these. There are also deployment challenges. We need to be able to deploy the solution across all the working restaurants around the world, not only the locations with good network connections, so our deployment solution also needs to be very portable. So we break our solution into two parts: the model part and the deployment part.
For the model part, given our unique use case, we decided to develop a session-based recommendation model, because a session-based model is able to react to same-session behavior. It doesn't rely on a user identifier; it can work even without one. Our model should also be able to take complex context features into consideration: without a user identifier, the only things the model has are the same-session behavior information and the context information, so it needs to be able to capture both. From the deployment architecture perspective, our model needs to be deployable anywhere, both at the edge and in the cloud. So we leverage a Docker container registry to allow the same recommendation service to run in any environment, even without a stable network connection. Now let's talk about our newly developed Transformer Cross Transformer. We built this new recommendation model, called Transformer Cross Transformer, or TxT, to solve our recommendation use case. It contains three main components: the sequence transformer part, which takes the item order sequence as input and feeds it into a Transformer encoder to learn the order sequence representation; the context transformer part, which takes all the context features as input and feeds them into a Transformer block to learn the context feature representation; and the latent cross part, where we conduct an element-wise product of both transformers' outputs to jointly train our recommendation model. The inspiration for this model came from Google's latent cross model. Google discovered that context features matter when doing recommendation, and that the traditional ways of adding context features to a recommendation model have some limitations.
Google then developed the RNN latent cross model, which contains an RNN part and latent-crosses the context features into the RNN output. The differences between RNN latent cross and our TxT model: first, we replace the RNN with a Transformer, so it can achieve faster training speed on big data. And secondly, we use a separate Transformer block to capture the complicated interactions among different context features, so the model can learn a better context representation when there are multiple context features as input. During the offline evaluation process, we benchmarked the TxT model against a few other models. Aside from RNN latent cross, we also compared with an RNN model that doesn't have the context feature input, and with a variation where the context features are fed in as direct input, so it doesn't have the latent cross. The evaluation shows the TxT model achieves the lowest cross entropy loss and also the highest Top1 and Top3 accuracy among all the models. During the online A/B testing, due to its more complex model architecture, the TxT model shows slightly slower average inference speed compared to RNN latent cross, but only by a little bit, which doesn't really affect the user experience. So that part is fine. On the other side, the results also show the TxT model has a much better conversion rate and add-on sales compared to the RNN latent cross model. The last part I want to talk about is the model training architecture. Previously, we were using a separate YARN cluster and a GPU cluster to train our model, where the YARN cluster was mainly running Spark jobs to do the data processing, and the GPU cluster was running the MXNet code to train our model. The limitation with this setup is that there are two different clusters, and two separate data pipelines need to be properly maintained in production.
Also, it takes a very long time, as the data size grows every day, for the data to be transferred between HDFS on the YARN cluster and the GPU cluster. So that's a real limitation. By working with Ray, we developed a unified model training pipeline where everything happens inside one cluster: the same cluster does the Spark data processing, and also does the distributed model training with MXNet and Ray. Kai will talk about the RayOnSpark part in a little more detail in the following section.
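To make the latent cross idea concrete, here is a tiny toy sketch (hypothetical illustrative code, not the production model): the sequence transformer and the context transformer each produce a latent vector for the guest session, and TxT joins them with an element-wise product before the final prediction layer.

```python
# Toy illustration of the TxT "latent cross" combination step.
# The two lists below stand in for the latent outputs of the sequence
# transformer (order history) and the context transformer (time,
# weather, location); real outputs would be learned tensors.

def latent_cross(seq_latent, ctx_latent):
    """Element-wise product of the two transformer outputs."""
    assert len(seq_latent) == len(ctx_latent)
    return [s * c for s, c in zip(seq_latent, ctx_latent)]

seq_out = [0.5, -1.0, 2.0]   # from the sequence transformer (toy values)
ctx_out = [2.0, 0.5, 1.0]    # from the context transformer (toy values)

joint = latent_cross(seq_out, ctx_out)
print(joint)  # [1.0, -0.5, 2.0] -> fed into the final prediction layer
```

The element-wise product lets context features gate the behavior representation, rather than just being concatenated alongside it.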

– Okay. Thanks Lu, for sharing the recommendation use case at Burger King and the TxT model architecture for context-aware recommendation. Since Burger King processes an enormous amount of guest transaction data on their big data clusters, it is essential to build a unified system for the entire recommendation pipeline. So in the remaining part of the session, I will elaborate on how we use Ray on Apache Spark to implement the distributed training pipeline for the TxT model. First of all, I will briefly introduce the work Intel has done in recent years enabling the latest AI technologies on big data. In 2016, we open sourced our first project named BigDL, which is a distributed deep learning framework built on top of Apache Spark. Deep learning applications written in BigDL are standard Spark applications, and therefore can directly run on existing Hadoop or Spark clusters without modifications to the cluster. We have done lots of optimizations so that BigDL can achieve both outstanding performance and good scalability on CPU clusters. Later, in 2018, we further open sourced Analytics Zoo, a unified data analytics and AI platform for distributed TensorFlow, Keras and PyTorch. With Analytics Zoo, TensorFlow or PyTorch users can seamlessly scale their AI workloads on big data clusters for their production data. We provide high-level machine learning workflows in Analytics Zoo, including cluster serving and AutoML, to automate the process of building large-scale AI applications. There are a bunch of built-in models and out-of-the-box solutions in Analytics Zoo for many common fields, including recommendation, time series forecasting, computer vision, natural language processing, etc. Recently, we developed a new project in Analytics Zoo called Project Orca. Project Orca is designed to easily scale out single-node Python AI applications from notebooks to large clusters, by providing data-parallel pre-processing for common Python libraries
and scikit-learn-style APIs for distributed training and inference. So we used the distributed MXNet training support in Project Orca for Burger King's recommendation system, which was implemented based on Ray and Spark. Also, in case some of you may not be familiar with Ray, I will talk about Ray a little bit before I proceed. Ray is a fast and simple framework open sourced by UC Berkeley RISELab to easily build and run emerging AI applications in a distributed fashion. Ray Core provides simple primitives and a friendly interface to help users easily achieve parallelism with remote functions and actors. Ray is packaged with several high-level libraries built on top of Ray Core to accelerate machine learning workloads. For example, Ray Tune can be used for large-scale experiment execution and hyperparameter tuning. RLlib provides a unified interface for a variety of scalable reinforcement learning applications. The third one, RaySGD, is our focus today. RaySGD implements wrappers around TensorFlow and PyTorch to ease the deployment of data-parallel distributed training. If users want to directly use the native distributed modules of TensorFlow or PyTorch, they might need to go through complicated setup steps in the production environment, so RaySGD can help relieve the deployment effort to some extent. We adopted the design idea of RaySGD for MXNet, and we have done some further work to make our implementation fit the Spark data processing pipeline. Okay, so let's go into the implementation details of the distributed training pipeline on big data. We developed RayOnSpark to seamlessly integrate Ray applications into Spark data processing pipelines. As the name indicates, RayOnSpark runs Ray right on top of PySpark on Burger King's YARN cluster.
First of all, for the environment preparation, we leverage conda-pack and the YARN distributed cache to automatically package and distribute all the Python dependencies on the driver node across the cluster at runtime, so that users don't need to install them on all nodes beforehand, and the cluster environment remains clean after the programs finish. In the Spark setting, as we all know, a SparkContext is created on the driver node, and Spark is used by Burger King to load data and perform data cleaning, ETL and pre-processing steps. Afterwards, RayOnSpark creates a RayContext object on the Spark driver as well. The RayContext utilizes the existing SparkContext to automatically launch Ray processes alongside the Spark executors in the same cluster. Additionally, we can see in the figure that a Ray manager is created within each Spark executor to automatically shut down the Ray processes and release the corresponding resources after the Ray application exits. Similar to RaySGD, we implement a lightweight wrapper layer around the native MXNet modules to help handle the distributed setup of MXNet training on the YARN cluster. In our implementation, MXNet workers and parameter servers all run as Ray actors, and they communicate with each other through the distributed key-value store natively provided by MXNet. As you can see in the figure, we have Spark and Ray in the same cluster, and therefore Spark in-memory RDDs or DataFrames can be streamed into Ray's Plasma object store, and each MXNet worker can take the data partitions of the Spark RDD from its local Plasma object store for model training. For the coding part, Project Orca provides a scikit-learn-style Estimator API for distributed MXNet training based on RayOnSpark, but users actually do not need to know much about the details of RayOnSpark.
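The environment preparation and RayContext setup just described might look roughly like this sketch (parameter names are approximations of the Analytics Zoo RayOnSpark API of that era; paths, environment names and resource sizes are placeholders, and it needs a live YARN cluster, so it is shown for illustration only):

```python
# Illustrative RayOnSpark setup (hypothetical values; requires a YARN
# cluster, with the conda environment packaged via conda-pack).
from zoo import init_spark_on_yarn
from zoo.ray import RayContext

# Launch Spark on YARN; the named conda env is packaged and shipped to
# the executors through the YARN distributed cache.
sc = init_spark_on_yarn(
    hadoop_conf="/path/to/hadoop/conf",
    conda_name="my_ray_env",
    num_executors=4,
    executor_cores=8,
    executor_memory="10g")

# Start Ray processes alongside the Spark executors in the same cluster.
ray_ctx = RayContext(sc=sc, object_store_memory="4g")
ray_ctx.init()

# ... Spark data processing and Ray-based training happen here ...

ray_ctx.stop()  # shut down Ray and release the resources
```

This is the mechanism by which one cluster serves both the Spark ETL stages and the Ray-based training stages.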
They just need to import the corresponding packages in Project Orca, and create an OrcaContext for the cluster setup. When you create the OrcaContext, you specify the cluster mode to be YARN, and you can also specify the amount of resources to be allocated for your application: for example, the number of nodes to use in the cluster, and the number of cores and the amount of memory per node, etc. Within the OrcaContext, we will have prepared the runtime Python environment and launched the Spark and Ray clusters on YARN for you. After creating the SparkContext, you can use it to do data processing with Spark, and finally you get the Spark RDDs for training and validation. The third step is to create the Estimator for MXNet in Project Orca. When creating an MXNet Estimator, in the train config you can specify the number of MXNet workers and parameter servers to launch. The code modifications and learning efforts should be minimal if you use Project Orca to scale out a single-node MXNet training script to a large cluster, since, as you can see in the code, you can directly input the model, loss and metrics defined in pure MXNet when initiating the MXNet Estimator. Here in our use case, we input the TxT model defined in MXNet. Since our recommendation use case is a next-item prediction problem, we use the softmax cross entropy loss, and we choose Top1 and Top3 accuracy as the metrics. Invoking fit then launches the distributed MXNet training across the underlying YARN cluster, given the Spark-processed RDDs and the number of epochs and batch size for training. So that's pretty much the code you need to write for the distributed MXNet training pipeline. It should be straightforward, and such distributed training runs on exactly the same cluster where the data is stored and processed, so there is no extra data transfer needed.
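Putting the three steps together, the training code described above might look roughly like this sketch (API and parameter names are approximations of the Project Orca MXNet Estimator of that era; the model, loss and metric creators are placeholders for user-defined MXNet code, and it requires a YARN cluster to actually run):

```python
# Illustrative Project Orca training pipeline (hypothetical names/values).
from zoo.orca import init_orca_context
from zoo.orca.learn.mxnet import Estimator, create_config

# Step 1: set up Spark and Ray on the YARN cluster in one call.
init_orca_context(cluster_mode="yarn", num_nodes=4, cores=8, memory="10g")

# Step 2: Spark data processing (details omitted) produces the RDDs.
train_rdd = ...  # placeholder: Spark-processed training data
val_rdd = ...    # placeholder: Spark-processed validation data

# Step 3: create the MXNet Estimator; model, loss and metrics are
# defined in pure MXNet (get_txt_model / get_loss / get_metrics are
# placeholders for that user code).
config = create_config(optimizer="adam", num_workers=4, num_servers=2)
est = Estimator(config, get_txt_model, get_loss, get_metrics)

# fit launches distributed MXNet training on the same YARN cluster.
est.fit(train_rdd, validation_data=val_rdd, epochs=2, batch_size=16000)
```

The key point is that the same script, with the model defined exactly as in single-node MXNet, scales out across the cluster where the data already lives.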
Previously, as Lu mentioned, when they used a separate GPU cluster for model training, nearly 20% of the total time was spent on the data transfer between the two separate clusters, which is quite expensive. After switching to the solution provided by Project Orca in Analytics Zoo, the entire pipeline becomes more efficient, scalable, and easier to maintain, since it only needs a single cluster and there's no extra labor needed to maintain a separate one. Burger King has successfully deployed this pipeline into their production environment to serve their customers. Okay, so here comes the end of this session. As a wrap-up, in this session we talked about the joint work of Burger King and Intel to build end-to-end context-aware fast food recommendation using RayOnSpark. We have released a paper and a blog about our cooperation, with the links shown here, and you may take a look at them later. If you want to know more about the technical details of RayOnSpark, we gave a previous session at the Spark + AI Summit this June, which talks specifically about the implementation details, and you may take a look. If you want more information about Analytics Zoo, you can visit our GitHub page; I think the other functionalities of Analytics Zoo could be useful to your work as well. If you have a GitHub account, don't hesitate to star our project Analytics Zoo on GitHub, so that you can find us whenever you're in need and we can give you timely help and support. Also, we are now actively working with other industrial customers on more use cases with RayOnSpark, and we'll be glad to share our progress in the future. Okay, that's pretty much what I wanted to talk about today. Feel free to raise questions if you have any, and thank you for attending this session. We hope that what we have talked about could be of interest to you. Thank you so much, and have a good day.

About Kai Huang

Intel Corporation

Kai Huang is a software engineer at Intel. His work mainly focuses on developing and supporting deep learning frameworks on Apache Spark. He has successfully helped many enterprise customers work out optimized end-to-end data analytics and AI solutions on big data platforms. He is a main contributor to the open source big data + AI projects Analytics Zoo and BigDL.

About Luyang Wang

Burger King Corporation

Luyang Wang is a Sr. Manager on the Burger King data science team at Restaurant Brands International, where he works on developing large scale recommendation systems and machine learning services. Previously, Luyang Wang was working at the AI lab at Philips and Office Depot.