As customer acquisition costs rise steadily, organizations are looking for ways to optimize their end-to-end customer experience in order to convert prospects into customers quickly and to retain them for a longer period of time. In today’s omnichannel environment, where non-linear events and micro-moments drive customer engagement with brands, the traditional one-size-fits-all customer journey cannot deliver true value to the customer or to the organization.
COSMOS customer intelligence platform helps organizations to address this challenge by offering a set of comprehensive and scalable Marketing Machine Learning (MML) Models for recommending the ‘next-best-action’ based on the customer journey. Trained on one of the largest customer datasets available in the United States, COSMOS MML Models leverage Spark, Databricks, and Delta Lake to stitch and analyze profile-based, behavioral, transactional, financial, and operational data to deliver customer journey orchestration at scale. In this session, we will discuss the business benefits of the dynamic customer journey orchestration, limitations of the classic customer journey models, and demonstrate how COSMOS MML models overcome these limitations. We will also review the global customer journey decision system that is built on top of ensemble machine learning techniques, leveraging Customer Lifetime Value (CLV) as the foundation.
– Hey everyone, welcome to the session on Delivering Dynamic Customer Journey Orchestration at Scale. My name is Krish Kuruppath, senior vice president of technology at Epsilon.
And I have here with me, Sharad. – Hello everyone, this is Sharad Varshney. I’m a vice president of Data Science at Publicis Epsilon, and I’ll be co-presenting with Krish on delivering dynamic customer journey orchestration at scale. – Let’s take a quick look at the agenda. We’ll start off with who we are and what we do, then quickly move on to the core topic, customer journey orchestration itself. What is it, why is it important, and how do we enable it through big data technologies like Spark and Databricks? Then Sharad will walk you through model building and training, some of the foundational components we have developed, and our decisioning process. In the end, we will wrap up with model performance and key business results.
So are we ready to go? Let’s go on. We are Epsilon. Epsilon is a leader in data-driven marketing. It’s part of Publicis Groupe, the world’s third largest communications company. We offer customer acquisition and retention products and services to over 2,500 clients worldwide. The first three boxes that you see on top are digital-media-centric products for customer acquisition. The next three are the customer retention and loyalty solutions that we offer. Epsilon has a unique advantage in the marketplace in that we own over 200 million unique IDs in the United States with over 7,000 person-level attributes. We also have the transaction data of over 56% of all non-cash transactions, and we manage over 600 million loyalty accounts on behalf of our clients. And how do we leverage this? Our solution is based on three core pillars, data, activation, and measurement, built around a unique anonymous identifier called Core ID. This enables us to track the customer journey with the brand and target them effectively. Enough about us, let’s now dive into journey orchestration. As we all know, the marketing industry is at a key inflection point.
Consumer expectations are rising, and the industry is full of shaky promises, fragmentation, and complexity. This is making the lives of brand managers and marketers extremely challenging. And of course COVID-19 has posed unprecedented challenges when it comes to resources and budget cuts.
So how do we solve it?
A key study by PwC found that one in three consumers say they will walk away from a brand after just one bad experience. Just one bad experience is sufficient to lose your customer, that’s pretty bad. How do we solve it? We might think that data-driven marketing can do everything. No, data has its limitations. We need to make every customer interaction personal and purposeful to retain them.
And we can do that through delivering best customer experience.
To deliver the best customer experience, you need to know who the consumers are, where they are, what they want, and when they want it.
And then, deliver a personalized experience. It is quite challenging, especially when the customer needs and brand needs are different. As you can see on the left-hand side, the customer’s needs are distinct. They want brands to know who they are, respect their time, make the interaction easy and fun, anticipate their needs, and give them the best and most cost-effective options to choose from.
And on the other hand, business goals are different. Businesses want to conquest customers, retain brand affinity, increase customer lifetime value, reduce churn, and increase overall operational efficiency while reducing costs. How can we achieve both? We can do that only through data-driven insights and compelling creative messages, delivered through the right channel with empathy. And customer journey orchestration is all about that. Our solution optimizes the journey by detecting the micro-moments of customer interactions and delivering the right call to action at the right time.
As you can see, this is a quick high level summary
of a marketing funnel. We start the customers at the awareness stage, take them through the consideration stage, then to purchase, then to service and loyalty. Of course it’s not linear. Every touchpoint the customer makes with the brand could be through a different channel or through a different message. So we need to continuously nurture the customer interactions and gently guide them along this funnel. Now let’s take a look at a realistic customer journey. Here a mom wants to buy cleats for her son for baseball, walks into a store, and makes an impulse purchase of a pair of cleats. At the same time she sees some in-store advertisements and brochures, and she is also offered an opportunity to open a credit card.
She first thinks about it, then goes online and decides to actually apply for the credit card. Then she thinks, “Oh, I’m not so sure about the security of this webpage, where I have to give my social security number and additional information, I don’t want to do that.” So she walks into the store, gives the details, and gets the credit card. And then she decides to make a purchase. And then she sees that she’s getting a discount and that loyalty points are being added to her loyalty account. This shows how a customer gets acquired by a brand through an offline campaign and then in-store interactions. But we can make this a lot more seamless and efficient through multichannel targeting and activation.
The models that we are gonna discuss later show how we nurture this customer journey to target the customer at the right point, in the right channel, with the right message.
Now, let’s take a look at this from a data flow perspective. When a customer makes a purchase online or offline, you get a lot of data about the customer.
It could come from the CRM systems, the e-commerce platform, or if they visit your website or mobile app, you get the clickstream data. And then if they interact with you, with your brand on any social media channels, you get the social media, social buzz as well.
Unfortunately, most organizations don’t have the ability to stitch all this together. That’s where we come into the picture. Our platform allows our clients to bring in their customer data from various sources and of different types. Whether it’s in batch mode or real time, we can bring it all together and stitch the customer data to create a single view of the customer, as you can see on the right-hand side. This single unique ID is then used for tracking the customer and engaging with the customer across multiple channels and devices. How do we enable this?
So we have built a platform on Databricks and leveraging some of the Azure services. So as you can see on the bottom, there are some cross-cutting services built on Azure and ML and AI technology stack, including TensorFlow and Keras. Then the core orchestration of bringing the data in, processing it, generating the machine learning insights and delivering the outbound results to the specific activation channels, it’s all done through Databricks. So we built both real time and batch pipelines and leveraged both Lambda and Kappa architecture to stitch the data together, and to deliver the right message to the customer through the right channel using Databricks pipelines.
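The stitching idea behind these pipelines can be sketched in miniature. The following toy Python sketch is not the actual Databricks/Delta Lake implementation, and all names (`core_id`, `stitch_profiles`, the record shapes) are hypothetical, but it shows the Lambda-style merge being described: a batch layer carrying authoritative profile attributes and a speed layer appending fresh behavioral events, keyed by one anonymous ID.

```python
from collections import defaultdict

def stitch_profiles(batch_records, realtime_events):
    """Merge batch (CRM/transactional) records and real-time
    (clickstream) events into one profile per anonymous ID.
    Toy stand-in for a Lambda-style batch + streaming merge."""
    profiles = defaultdict(lambda: {"attributes": {}, "events": []})
    # Batch layer: authoritative profile attributes.
    for rec in batch_records:
        profiles[rec["core_id"]]["attributes"].update(rec["attributes"])
    # Speed layer: append fresh behavioral events as they arrive.
    for ev in realtime_events:
        profiles[ev["core_id"]]["events"].append(ev["type"])
    return dict(profiles)

batch = [{"core_id": "c1", "attributes": {"segment": "loyal"}}]
stream = [{"core_id": "c1", "type": "page_view"},
          {"core_id": "c1", "type": "add_to_cart"}]
unified = stitch_profiles(batch, stream)
```

In the real platform, the batch side would be Delta Lake tables and the speed side a Structured Streaming job; the point is only that both sides resolve to the same key before activation.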
And Sharad is gonna walk you through the actual machine learning models, how we are processing this information to derive insights, and how we generate key actionable messages that can be delivered to the customer. Thank you, Sharad, take it over. – That was really great, thanks Krish. Now I’ll walk you through journey orchestration enablement and how models play a role in this particular journey.
This particular representation shows customer journey attribution using some of the predictive models we have in our platform. In this one scenario, what I want to walk you through is a customer churn model combined with channel affinity. What we see is that if a customer has not been engaging with email for, let’s say, a certain timeframe, we should first identify their channel affinity; if push notifications or SMS might work better for a certain segment of customers, that is the channel where we should activate those segments.
From this point on, let’s look further into how our customer churn model is built and how we train it. The very first phase is our machine learning pipelines, which bring in lots of different sources of data to aggregate in our data generation and preparation phase.
So we look into multiple different sources: demographic data, transactions, returns, and product information. We aggregate all this data in addition to user behavior; for example, if a user browses a website and spends a lot of time on one product versus less time on other products, we can extract those behaviors. We do this using our entity resolution APIs, which support multiple levels of stitching: deterministic, or, in cases where users are anonymous, probabilistic stitching. We bring all this data together in aggregated form and feed it in a flattened view to our MML automation pipelines.
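The deterministic-versus-probabilistic distinction can be illustrated with a minimal sketch. This is not Epsilon's entity resolution API; the match keys, weights, and threshold here are hypothetical, chosen only to show the two stitching levels: an exact join key for known users, and a weighted fuzzy score over softer signals for anonymous ones.

```python
from difflib import SequenceMatcher

def deterministic_match(a, b):
    # Exact shared key (e.g. a hashed email) -> same identity.
    return a.get("email_hash") is not None and a["email_hash"] == b["email_hash"]

def probabilistic_match(a, b, threshold=0.85):
    # Fuzzy score on softer signals (name similarity + zip code)
    # for records that lack a deterministic key.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    zip_match = 1.0 if a["zip"] == b["zip"] else 0.0
    score = 0.7 * name_sim + 0.3 * zip_match  # illustrative weights
    return score >= threshold

known = {"email_hash": "abc123", "name": "Jane Doe", "zip": "94105"}
same  = {"email_hash": "abc123", "name": "J. Doe",   "zip": "94105"}
anon  = {"email_hash": None,     "name": "Jane Doe", "zip": "94105"}
```

A production resolver would use many more signals and calibrated weights, but the cascade is the same: try the deterministic key first, fall back to a probabilistic score.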
In the next slide, we’re going to look into the feature generation phase.
Generally, when we think about marketing models, we think of RFMT as our feature generation baseline. We do go beyond RFMT, but RFMT is the overall building block. What I really mean by RFMT is recency, frequency, monetary value, and the tenure of the customer with that particular business.
If you look at this time series timeline, the different timelines show how user behavior evolves over a period of time. The circles on this chart encode monetary value: a bigger circle means a bigger purchase. The distance between two circles shows how far apart in time they are, that is, when a customer came back and made another transaction.

At the first level, the regular customer keeps coming back, either every month or every quarter, and makes different purchases. It could be a bigger purchase, it could be a very similar purchase; it just keeps happening. These are the behaviors exhibited by a regular customer. The next level is a more highly frequent customer compared to the regular one. They may not have made bigger purchases, but they come in more frequently. In our CLTV model, this turns out to be the best-valued customer. The third time series shows a customer who was regular for about two quarters, starting about a year back. Seen from today, that customer had been active, but at some point in the past, two quarters ago, they stopped making purchases. Either this customer has already churned or is heading toward churn behavior.

These are the insights we want to extract as our features. So we look at different time windows of these variables: we generate a time series, look into a certain time window, and say, “Okay, these are my features.” For the target, we look a bit forward in the future and derive the label from there. Once this model is trained, we apply it to the current timeline and use it to predict the future.
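The RFMT building block described above can be sketched in a few lines. This is a simplified stand-in (function name and record shape are our own, and the talk's real pipeline computes these per weekly window at Spark scale), but the four quantities are exactly as defined: recency, frequency, monetary value, tenure.

```python
from datetime import date

def rfmt_features(transactions, as_of):
    """Derive RFMT features from a customer's (date, amount)
    transaction history, relative to an observation date."""
    dates = sorted(d for d, _ in transactions)
    return {
        "recency_days": (as_of - dates[-1]).days,  # days since last purchase
        "frequency":    len(transactions),         # purchases in the window
        "monetary":     sum(a for _, a in transactions),
        "tenure_days":  (as_of - dates[0]).days,   # days since first purchase
    }

history = [(date(2020, 1, 5), 40.0), (date(2020, 4, 2), 55.0),
           (date(2020, 7, 1), 60.0)]
feats = rfmt_features(history, as_of=date(2020, 10, 1))
```

A growing `recency_days` against a previously steady purchase cadence is precisely the "regular customer gone quiet" pattern the third timeline illustrates; the churn target would be derived by checking a window forward of `as_of`.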
In the next slide, we look at this particular churn model, its parameters, and its architecture. This is a sequence-based LSTM model. We use an embedding layer to extract customer behavior into a high-dimensional embedding space, feed those embeddings along with the transactional data into an LSTM, and then feed the result through dense layers. This is our model summary, which shows the different layers. What I really want us to concentrate on is the fourth layer from the top, the customer behavior embedding: the parameter count in that layer was about 342 million. And this is just from a training run on about 3.5 million customers. The data traverses through the LSTM and dense layers to produce the prediction.

On the next slide, we go through the model architecture, how we used this model for training, the problems we faced during inference, because we are looking at data at scale, and the approach we used to identify and fix those scalability issues. On the left side is the model architecture we talked about, with its training parameters. We feed features from users, products, and multiple transactions into the user latent vector embedding layer, which is flattened and fed into an LSTM along with the transactions. That goes through multiple dense layers, ending in a softmax for multi-class classification or, in the binary setting, a zero-versus-one classification. So what scalability issues did we see? We just talked about one: having 344 million parameters means our model is measured in gigabytes. We used PySpark on a Databricks cluster to optimize the inference, but once you are using multiple nodes, the model needs to be broadcast.
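The ~342 million figure is consistent with one embedding row per customer. A quick back-of-envelope check (the exact vocabulary size is our assumption; the 3.5 million customers and the 100-dimensional embedding space are both mentioned in the talk):

```python
# An embedding layer holds vocab_size x embedding_dim weights.
n_customers = 3_500_000   # customers in the training run (from the talk)
embedding_dim = 100       # behavior embedding width (from the talk)

embedding_params = n_customers * embedding_dim  # 350,000,000
# Close to the ~342M reported; the actual customer vocabulary is
# presumably slightly smaller than a round 3.5M.

# At float32 (4 bytes per weight), this one layer alone is ~1.4 GB,
# which is why broadcasting the full model becomes painful.
size_gb = embedding_params * 4 / 1e9
```

This arithmetic also explains why the embedding layer dominates the model size: the LSTM and dense layers contribute only a few million parameters by comparison.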
One of the problems we saw when broadcasting the model to all the nodes was pickling issues during serialization, and during deserialization when it was read back. So rather than sending the whole model, we wanted to send only the weights, so we broadcast only the weights. But then, for every batch during inference, the executor would try to deserialize the model, and with the model being gigabytes in size, it took time to load it back into memory before doing the inference. So we really needed to step back at that point and think through how to optimize this.

What we did was de-stitch the TensorFlow graph: we removed the embedding layers to reduce the model size, and we cached those embeddings in an in-memory database, so that inference goes through very quickly with no disk I/O at that point. If we go back and look at the de-stitched embedding layer: the very first layer, the user latent vector embedding, gets de-stitched from the graph and captured in our customer behavior embedding store. Now at inference time we feed the customer’s ID, that ID is looked up in the embedding store, and the embedding is combined with the transactions to feed into the model. Once we de-stitched, our scalability issues got resolved. One particular time, we were running this inference on about 20 million customers: we started it on Friday and it was still running over the weekend, multiple Spark executors had failed, and it was not even close to being done. After the resolution we just talked about, we tested it at multiple different volumes, and all of this inference finishes in less than an hour.
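The de-stitching idea can be shown in a small NumPy sketch. This is illustrative only (the real system caches TensorFlow embeddings in an in-memory database and runs the LSTM head on Spark executors; here a single dense layer with made-up weights stands in for the whole head): the giant embedding matrix stays out of the model, and inference is a cheap lookup plus a small forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Embedding store: customer_id -> 100-dim behavior vector, precomputed
# and cached outside the model (stand-in for the in-memory database).
embedding_store = {cid: rng.normal(size=100) for cid in ("c1", "c2")}

# Only the small "head" weights get broadcast to workers, not the
# multi-GB embedding matrix (one dense layer stands in for LSTM+dense).
W = rng.normal(size=(100, 1))
b = np.zeros(1)

def score(customer_id):
    # Look up the precomputed embedding -- no giant embedding matrix
    # travels with the model, and no disk I/O at inference time.
    emb = embedding_store[customer_id]
    logit = emb @ W + b
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid churn probability

p = float(score("c1")[0])
```

On Spark, the head weights would go out via a broadcast variable and the lookup would run inside `mapPartitions`, so each executor loads megabytes instead of gigabytes.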
So those are the optimizations we gained from this. Now, what really is that user embedding layer? It is basically an encapsulation of all the behavior a user exhibits. If, let’s say, a hypothetical user comes in on the first of every month to some retail business and makes certain purchases, the assumption is that if they keep coming back on the first and then change their behavior in one of the months, the model can identify that and say, okay, this might be anomalous behavior. So, having talked about the embeddings: this customer-behavior-to-vector embedding is basically a representation of 100-dimensional embeddings in a plot. Let’s watch a video that describes how the 3D plane shows us which users are closer together, which are farther apart, what kind of hyperplane they fall on, and gives some intuition on how we could use this in our modeling.
– [Instructor] So far, we have talked about the customer journey and its orchestrated components. Churn is one of the attributions we infer at a regular interval so we can optimize across multiple campaigns: making sure customers with a high-risk churn profile are not activated in campaigns with a different business objective, such as cross-sell and resell, but are instead added to campaigns with a customer-retention objective. Those would give them promotional offers through a series of campaigns: they could start with 5% discount offers, or 10% or 20%, and depending on a high CLV value, we could go up to a certain discount that could be 50% or more, depending on the business. This particular model is our churn model, which is based on sequence-generated datasets covering the last two years at a weekly aggregated feature level. So we have about 104 sequences that we feed into an LSTM model, followed by dense layers, and then we process them through a sigmoid to identify whether the customer churned or not. Before we feed half of the input to the LSTM, we feed it through an embedding layer, which creates the user latent vector representation I talked about in the architecture slide before.
This user latent vector embedding is basically our customer-behavior-to-vec representation, and this notebook here, in the Microsoft Azure Databricks environment, showcases a 3D plot of customer-behavior-to-vec using a t-SNE plot. Most of us have looked at a t-SNE plot at some point when running data science models. t-SNE, t-distributed stochastic neighbor embedding, is a non-linear dimensionality reduction algorithm for exploring high-dimensional data. I am taking 2,000 to 5,000 customer samples of these user embeddings, and represented in a 3D plot they form a cluster. We have reduced from 100 dimensions to three dimensions just to visualize. What we see is that most of these customers sit together in this particular shape, let’s say a cube. But customers 663 and 802 sit a little outside this whole plane. At a very simple level, if we compute the distance between customer ID 23 and customer ID 104, they would be very, very close together; whereas the same computation against 663 or 802 may even look like a separate hyperdimensional space. And if one of the dimensions here is somehow related to time, it would show that they may not have made a purchase for a long time.
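The distance intuition can be made concrete with toy vectors. The embeddings below are entirely made up for illustration (the real ones are learned by the model), but they reproduce the geometry being described: 23 and 104 are near-duplicates in behavior, while 663 has drifted far along one axis, which we can imagine as "time since last purchase".

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy 100-dim behavior embeddings (made up for illustration):
base = rng.normal(size=100)
emb = {
    23:  base,
    104: base + 0.05 * rng.normal(size=100),  # nearly identical behavior
    663: base + np.eye(100)[0] * 10.0,        # far out along one dimension
}

def dist(a, b):
    # Euclidean distance in the embedding space.
    return float(np.linalg.norm(emb[a] - emb[b]))

close, far = dist(23, 104), dist(23, 663)
```

t-SNE performs a non-linear projection rather than preserving these distances exactly, but points that are close in the 100-dimensional space, like 23 and 104 here, generally land close in the 3D plot, and outliers like 663 land apart.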
So 663 is at risk of churn, and 802 probably has already churned. That’s the whole intuition I wanted to get out of this plot. Thanks for watching. – Okay, let’s look into our model performance and, finally, the business results, to see what we have achieved and what our key business takeaways are. For our churn model, let’s look at the data processing volumes. I gave a hint a little earlier, but let’s take a few seconds to talk about it. We have used this model to train on, and infer on, 25 million customers. And with the weekly aggregated transactions, those 25 million customers translated into 2.5 billion transactions. That is correct: billion, with a B.
And how many data sources did we have to go through to get to this point? About 70-plus data sources. So there is a lot of data aggregation happening in our customer data platform as well. Then we bring this data in using the data generation and ML pipelines we looked at earlier. And we activate through our omnichannel: email, push notifications, SMS, and on-site personalization.
So the model we were talking about had about 88% accuracy and about 91% precision. This model has gone through a long journey: in the very first phase, it was at about 75% accuracy, and we made multiple revisions to get to 88% accuracy. At that point we said it is not just about precision; we also wanted to optimize recall. The metric we really wanted to tune was the F1 score, which is the harmonic mean of precision and recall. In our final version, which is in production right now, it is at about 90%. Once this model was done and we had looked at the confusion matrix, we also wanted to look back into the past to see what the hit rate was, and that is above 67%. With that discussion, let’s go into our achievements from the business perspective.
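The F1 score mentioned here is a simple formula worth writing down. Note that the recall value below is our back-calculation from the reported ~91% precision and ~90% F1, not a number given in the talk:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# With ~91% precision, an F1 of ~0.90 implies recall of roughly 0.89
# (recall is our inference, not a figure stated in the session):
f1 = f1_score(precision=0.91, recall=0.89)
```

Because the harmonic mean is dominated by the smaller of the two, tuning for F1 forces both precision and recall up together, which is why it was preferred over accuracy alone.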
So what did we really achieve with these models enabling the customer journey? We wanted to optimize multiple campaigns. The business result here is an improved customer retention rate: we were able to increase revenue for one of our retail clients by $2.3 million in one year. And that one year was not even a full 12 months; it was about nine to ten months. So given the scale and the optimizations, these models in their predictive form can really boost campaigns and bring incremental revenue to clients. We also optimized marketing campaign dollars that would otherwise be wasted. And there is cost optimization through our on-demand auto-scaled clusters: we use Databricks clusters, which are auto-scaling in nature, and that brought huge cost savings to us and our clients.
On the operational excellence side, about 2.5 billion transactions were processed in less than 25% of the time compared to the clusters we were using before Databricks. And we had full-scale automation and a faster time to market.
So this is good: we have business results and we have demonstrated operational excellence. But what does it do for the customers? The benefits to the customer are personalized product recommendations, better promotional offers, and higher customer satisfaction. One of our recommendation engine models used to have a huge inference time; now we can run it on a Databricks cluster to deliver those product recommendations.
From this point, I would like to give this to Krish so he can talk about a few key takeaways.
– Yeah, thank you, Sharad. That was an awesome presentation. Thank you for walking the audience through some of the amazing work we have done in terms of modeling this customer data and to come up with insights that are actionable. I think that’s critical for every client, every business. We need to actually generate actionable insights to deliver customer value. So what are some of the key takeaways from this?
As you might’ve heard from Sharad,
we had to deal with a large volume of data. And the data is essential. The more data you have about your customer, the better you know their behavior, and the better you can target them with the right message.
So we need to get not only the core customer data, that’s like customer profile and preferences, but you need to also get their behavioral and transactional data, and to be able to process it effectively.
The second component of any solution, any big data solution for either real time or batch based marketing is the processing speed itself. Ability to handle large volumes of data, process it and generate insights, and send those messages to activation platforms. That speed is critical in today’s world.
And finally, the automation for handling all these data flows, as well as the activations is essential for scalability.
Databricks enabled us to deliver customer journey orchestration at scale at multiple levels: data processing, pipeline automation, and, by leveraging the Delta Lake platform and the core service we have for storing data and stitching the real-time data with the batch data, we are able to create consumer insights at an unprecedented scale. I hope you can leverage some of the techniques and concepts we have shared with you today in your own data platform and customer engagement platform build-outs. Thank you for watching this presentation.
Krish is a leader in the digital marketing technology space who has extensive experience with building and delivering digital transformation products and services for organizations ranging from startups to Fortune 500 companies. Krish currently leads the Publicis COSMOS Customer Intelligence Platform development, sales, and delivery within the Publicis Media organization. Prior to this, Krish has played senior technology leadership roles within Razorfish, and Sapient as a technology practice leader within the West Region managing the delivery of digital marketing technology solutions for some of Publicis’ largest client accounts including Hewlett Packard Enterprise, Honda, Microsoft, and Sephora.
Sharad is a Vice President, Head of Data Science for Publicis-COSMOS, based out of San Francisco, and has more than 18 years of unique cross-domain data science experience across different industry verticals leveraging big data. Sharad has researched and designed various marketing-based machine learning models ranging from CLV, Churn, Product Propensity, Affinity, and Next Arrival to Next Best Action models, and has been exceptional in delivering and productionalizing MML models. Prior to joining Publicis, Sharad was a founding member and Chief Data Scientist of the Palo Alto-based startup Peritus AI, and was instrumental in designing the product offering.