A/B testing, i.e., measuring the impact of proposed variants of e.g. e-commerce websites, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get results that are more interpretable than mere statistical significance: probabilities that a change has a positive effect, how much revenue is at risk, and so on.
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which however does not provide standard errors of coefficients. We, therefore, use bootstrapping to estimate the distributions.
In this talk, I’ll describe how we’ve implemented parallelization of an already parallelized GLM computation to be able to scale this computation horizontally over a large cluster in Databricks and describe various tweaks and how they’ve improved the performance.
Ondřej Havíček: Hi, everyone. I’m happy to be here and be able to tell you a bit about what I’ve been working on for the last two years, and that’s A/B testing. I really like this because it’s probably the most impactful project I have worked on. Why? Well, it’s at the core of innovation in every large company we have worked with. If you do not A/B test then you don’t really know if the machine learning models that you have created or your user experience tweaks really work. And so I feel that everyone, or almost everyone, should be doing A/B testing. So if you don’t know what it is, don’t worry. I will explain soon. I also like this topic because it involves a lot of interesting problems and one of them is bootstrapping. So, obviously, the topic of this talk is mainly bootstrapping: what it is, why it is useful, and how to do it in Spark to improve A/B testing.
But first, let me tell you a few words about myself. So I’m a data scientist now but the path that brought me here was through an interesting detour which is now actually paying off. So I studied computer science but then I went into academia and I got a PhD in neuroscience slash psychology. I was doing experiments on human cognition, putting people into scanners, all that kind of stuff. And in the process, I learned a lot of statistics, how to analyze data from our experiments and also how to design experiments. And A/B testing is experiments, in some ways simpler and in some ways more complicated than what I was doing back then. So after some [inaudible], I left academia and went back to industry, where this combined background is now quite useful. So now my main focus is on inferential statistics, machine learning pipelines, ETLs. I use Apache Spark, Python, R, and the types of problems I have worked on range from A/B testing to recommendation systems, search engines and so on, in various industries like e-commerce, social media, insurance and so on.
And I come from a company called DataSentics. As you can read in the middle, we are making data science and machine learning have a real impact on organizations. We have many scientists and engineers [inaudible], which you can see on the left. And they focus on various industries and use cases. For example, personalization in banking and insurance, computer vision apps for analysis of shelves in stores, recommendation systems in e-commerce. But in all of these solutions and products we need to do A/B testing to know that they really work and can bring benefits to our customers. So we use A/B testing internally for our solutions, but we also implement it for clients who want to bring their A/B testing to the next level.
So now, the agenda of this talk: I will finally tell you what A/B testing is and, in particular, what factorial A/B testing is. Then we will briefly go into the analysis of the results. And the main portion of the analysis that I will be talking about is this mysterious word, bootstrapping. So I will explain what bootstrapping is, how to do it in Spark and, finally, we will look at some performance considerations.
So, finally, what is A/B testing? Wikipedia says it’s, “A user experience research methodology which consists of a randomized experiment with two variants, A and B. It includes application of statistical hypothesis testing and determining which of the two variants is more effective.” Well, to put it simpler, let’s say that you have a website and you want to change something, for example, color of a button or something in backend, and you only show this change to a portion of your visitors and measure some KPI like conversion rate and see in which of these two groups the metrics are better. So you have, typically, some control version and some experimental version that you expose your visitors to.
Why would you want to do A/B testing? Well, it’s the only way to improve your KPIs consistently without relying on guessing or intuition or the highest paid person opinion, the HiPPO. The evidence is always better than anyone’s opinions because it turns out that most of the tested ideas are actually incorrect. They don’t bring value. But you don’t know it if you don’t test it.
And how do you do the testing? Usually you do isolated tests. That means you divide your traffic into parallel lanes so that each user is exposed to just one variant, and you can run such tests in parallel or one after another.
Yeah, that’s all fine, but these tests prove to be limiting if you really want to do it seriously, if you really want to test all ideas as you should, because you can only run a few such concurrent experiments or they take a very long time. On the right you can see a chart from the Microsoft Experimentation Platform, from Bing specifically, and in 2013, which is where the red text points to, something happened. So this chart shows the number of A/B tests that they were able to perform per week, I think. And until 2013, there was a slight upward trend; the number of tests that they were doing was rising. But then something happened and the number of A/B tests they were doing skyrocketed. And what happened was that they introduced an experimentation platform that used so-called factorial design of experiments.
In detail, it’s a bit more complicated than what I will describe, but, in essence, it’s this: let’s say you have two tests, test 1 and test 2. Test 1 tests whether you want to display some text in the original black or change it to a blue font, and test 2 tests the old, current recommendation engine against a new one. And, basically, you expose your users to both of these tests. So when a user comes, you first flip a coin for the first test, whether they will see the black or the blue text, and then for the same user, you flip a coin again and decide whether the user will be exposed to the old or the new recommendation engine. And you can do this for dozens of tests. I have skipped some caveats, but in essence, that’s it.
And this is called a factorial design of an experiment. So you have to cross these tests orthogonally; there must really be a random allocation, and, therefore, each visitor gets assigned to all the tests. And this will allow running dozens of simultaneous tests, if done correctly, because you will be testing all of these tests on all the traffic that you get, and, therefore, you will get faster results.
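To make the coin-flipping concrete, here is a minimal sketch of factorial allocation in plain Python. The test names and variants are hypothetical, just illustrating how each visitor gets an independent random allocation in every running test:

```python
import random

def assign_visitor(tests, seed=None):
    """Independently randomize one visitor into a variant of every running test.

    `tests` maps a test name to its list of variants. Because each
    allocation is an independent coin flip, the tests are crossed
    orthogonally, so every test effectively runs on all of the traffic.
    """
    rng = random.Random(seed)
    return {test: rng.choice(variants) for test, variants in tests.items()}

# Hypothetical running tests, mirroring the slide's example:
tests = {
    "font_color": ["black", "blue"],              # test 1
    "recommender": ["old_engine", "new_engine"],  # test 2
}
allocation = assign_visitor(tests, seed=42)
```

Each incoming visitor gets one such dictionary, so a visitor can simultaneously be in the "blue" arm of test 1 and the "old_engine" arm of test 2.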
Okay, so let’s say you have collected some data from the experiments and now you want to analyze them. How do you do that? What you often get in some products for A/B testing that you can buy, some services, is results of this kind: version B has a statistically significant effect on conversion rate, the p-value is 0.04. Well, what does this mean? Studies show that actually not even statisticians can always interpret this correctly, what this mysterious [inaudible] means, what statistical significance is.
And business users, they don’t get it. They think that it’s something important, but they don’t know what exactly. So what you ideally want is something along these lines: variation B increased the conversion rate with some probability, most likely by this much, with some confidence interval. So you quantify both the most likely effect size, how much it probably helped, and also the uncertainty: the true value of this improvement can be anywhere in this confidence interval, and it can even be negative with some probability. And then you can also show nice density charts like that, or box plots, et cetera. So here on the left, we can see the distributions of the possible values of the conversion rate in the control and the B group, and on the right you can see the possible differences, the uplifts.
So it’s somewhere around 1.8%, but it can even be negative; most likely it will be positive. So in my opinion, even though this might not be statistically significant, it might still be a good business decision to implement this. If you consistently implement tests with such a probability, then you will, in the long run, benefit from it. But it depends on your policies, on your decision making under uncertainty, which is quantified here. So you need an estimate of not just the effect size but also the uncertainty.
So how do you get it? The effect size, we estimate using Apache Spark, because we typically have big data, millions of impressions, conversions, and so on. So we use GLMs, generalized linear models, in Spark. And the model specification is: whether there was a conversion, that’s the target, versus the allocations to the different tests, and the product there means that we can also look at the interactions between those tests. And we use a binomial GLM with the logit link, which basically means logistic regression. But you can also use different GLMs for different types of metrics.
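For intuition, the binomial GLM with a logit link can be written out in a few lines of plain Python. The coefficient values below are made up purely to show the mechanics of a `conversion ~ test1 * test2` specification, main effects plus their interaction:

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def conversion_prob(test1, test2, b0, b1, b2, b12):
    """conversion ~ test1 * test2: two main effects plus their interaction.

    test1 and test2 are 0/1 allocation indicators; the b's are GLM
    coefficients on the log-odds scale.
    """
    eta = b0 + b1 * test1 + b2 * test2 + b12 * test1 * test2
    return inv_logit(eta)

# Hypothetical coefficients, chosen only for illustration:
coefs = dict(b0=-3.5, b1=0.10, b2=0.20, b12=-0.05)
p_control = conversion_prob(0, 0, **coefs)  # baseline conversion rate
p_both = conversion_prob(1, 1, **coefs)     # both treatments applied
```

Fitting finds the `b` coefficients from the data; the uplift of a variant is then a difference of such probabilities.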
And how do you quantify the uncertainty? Well, unfortunately, Spark doesn’t always produce standard errors, which are a measure of uncertainty. But you can get it using this process called bootstrapping. Bootstrapping is a way to estimate the distribution of some statistic, for example, these coefficients, the effect sizes. It’s also said that bootstrapping is a “poor man’s Bayes.” Bayesian inference is often talked about in A/B testing, and it really is a good approach, but it’s often infeasible: if you have really big data, then it’s hard to do Bayesian inference. So this is a “poor man’s Bayes,” with a noninformative prior, which is often desirable anyway. And then the distribution you get using bootstrapping looks like this.
So what is bootstrapping? It’s very simple, actually. You just iterate many times, hundreds of times, ideally thousands of times if you can afford it, this simple process, these two steps: you randomly re-sample the data with replacement and then you estimate some statistic. That means you fit the model and get the GLM coefficients. In Spark the code can look like this: for one iteration, you just sample with replacement with fraction 1.0, that means all the data. Then you fit the model on that sample, extract the statistics, and save the coefficients. And you do this many times and you always keep the stats. Here the model fit will execute a Spark action, right? So that can look like this on the right: many Spark jobs. In each Spark job you’ll have some stages with some tasks in them. So this process is parallelized, right? This one model fitting, including the sampling, is parallelized over the cores in your cluster.
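The same two-step loop, stripped of Spark, can be sketched in plain Python. Here the statistic is just a mean instead of the GLM coefficient vector, and the data are invented for the example:

```python
import random
import statistics

def bootstrap(data, statistic, n_iterations=1000, seed=0):
    """Estimate the sampling distribution of `statistic` by resampling.

    Each iteration draws len(data) points with replacement (the
    equivalent of Spark's sample with fraction 1.0) and recomputes the
    statistic. In the Spark pipeline the statistic would be the fitted
    GLM coefficients; here it is any plain callable.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_iterations):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        stats.append(statistic(resample))
    return stats

# Invented data: 1000 visitors with a 3% conversion rate.
conversions = [0] * 970 + [1] * 30
dist = bootstrap(conversions, statistics.mean, n_iterations=200)
```

The spread of `dist` is exactly the uncertainty estimate we need: its quantiles give confidence intervals, and the share of values below zero (for an uplift statistic) gives the probability of a negative effect.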
How do you do bootstrapping efficiently in Spark? Bootstrapping is a so-called embarrassingly parallel problem; the individual iterations don’t depend on each other, so you can run them simultaneously. But how to do that? Spark parallelizes the tasks of the re-sampling and model fitting within one iteration, but if you want to scale this computation, if you really want hundreds of iterations, then you need to parallelize this parallelization, basically. You need to run many instances of this model fitting in parallel. And how to do that in Spark? That was the question that I was [inaudible], and the solution was, in a way, simple, but maybe a bit involved for those of you who do not have experience with Python and Spark. And it’s multithreading. So let’s go into this in a bit more detail.
You need to put the bootstrap iterations into batches. So let’s say you have, in this case, 12 iterations. In real life you would have, for example, 2,000. You put them into batches. In this example on the right, I have four batches, and you can see that each batch consists of three sequential iterations. So within a batch, the iterations will be run one after another. But I prepare many of these batches, and they get executed in parallel, so that their tasks can go in parallel to the different cores in the cluster, because each iteration will perform a Spark action. That means it will produce some tasks that will need to go to some cores. And you need to prepare the data in such a way that the stages in those iterations have fewer tasks than you have cores in the cluster. So you have to repartition your data and size the cluster so that you can actually fit these batches next to each other. I will talk a bit about that in a moment.
Then, once you have these batches prepared, you can submit them using Python multithreading. These iterations get executed one by one within each batch, with all the batches running in parallel, and the tasks from these iterations will get scheduled to the executors in a First In, First Out manner or in a Fair fashion, depending on the settings of your Spark application. I don’t think it really matters which of these two approaches you use.
So within one iteration, in this picture, you can see that I have, for example, one stage. In reality you’ll probably have more stages, but let’s look at one. And I’ve sized this stage in such a way that there are only four tasks in it. And I create so many batches that these four tasks will be pushed, on average, to just two cores. So, first, let’s say task 1 will go to core 1, and task 2 will go to core 2, then task 3 to core 1, task 4 to core 2. And similar things will be happening in the other batches and in the other iterations inside those batches. [inaudible] that batch 2 will push the tasks to cores 3 and 4 and [inaudible]. In the end it doesn’t really matter. It’s not always batch 1 pushing to cores 1 and 2, but it doesn’t matter: you have enough cores to satisfy the needs of all the different batches that are running in parallel, if you size those tasks and their numbers correctly.
So now, to illustrate this in Python code, just briefly: you first need to prepare the batch sizes. So in the first three lines, I just take the desired number of iterations and the desired number of threads and, basically, figure out how many iterations I need to have in each batch, and I create the list of batches. So here I will create the batches in a variable; it will be a list of size four, and each element will hold some reps, some batch size, some number of iterations that I want to run in that one batch. And then using multithreading, using the ThreadPoolExecutor, I can run with n_threads workers, for example, 4 workers in this case, and I will submit a function called run_batch. And this function, run_batch, will run all the iterations for this one batch.
The parameters of this function are the data frame, the DF, the model, and then the number of repetitions, the number of iterations for this one batch. We let it run and then we collect the results.
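A self-contained sketch of this batching and submission logic, with a plain-Python `run_batch` standing in for the Spark re-sampling and model fitting (the function names mirror the ones mentioned above, but the bodies are my own illustration):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def make_batches(n_iterations, n_threads):
    """Split the total iteration count into one batch size per thread."""
    base, extra = divmod(n_iterations, n_threads)
    return [base + (1 if i < extra else 0) for i in range(n_threads)]

def run_batch(data, fit_model, n_reps, seed):
    """Run n_reps sequential bootstrap iterations inside one thread.

    In the real pipeline each iteration would sample the DataFrame with
    replacement and fit the GLM; here we resample a list and apply a
    stand-in statistic.
    """
    rng = random.Random(seed)
    coefs = []
    for _ in range(n_reps):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        coefs.append(fit_model(resample))
    return coefs

data = [0] * 95 + [1] * 5                      # invented conversion data
mean = lambda xs: sum(xs) / len(xs)            # stand-in for the GLM fit
batches = make_batches(n_iterations=12, n_threads=4)
with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    futures = [pool.submit(run_batch, data, mean, reps, seed)
               for seed, reps in enumerate(batches)]
    results = [coef for f in futures for coef in f.result()]
```

With Spark, the threads merely submit jobs; the heavy lifting still happens on the executors, so Python's GIL is not a bottleneck here.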
We also need to think a bit about performance, because we don’t want to waste resources. What use would all this exercise be if, in the end, it didn’t shorten how long the whole thing takes to finish?
So, one question is how many parallel batches, how many threads, do you need? And there’s a simple computation; I have omitted rounding here. Basically, it’s the number of cores divided by the number of tasks in one stage, one iteration. So that would be, in my case, eight divided by four, because I had four tasks in one iteration, times some constant, some number of tasks per core that you want to have, how many tasks you want to push to one core at any given time. And the number of tasks depends on the sizes of the tasks. So you want to repartition the data so that one task has something around 100 to 200 megabytes, not kilobytes, not gigabytes. And the number of tasks per core? What is best is an empirical question, but in our experience, two to four works best.
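That back-of-the-envelope computation, with the rounding put back in, might look like this. The numbers are the ones from the talk: 8 cores, 4 tasks per stage, and 2 tasks queued per core:

```python
import math

def n_parallel_batches(total_cores, tasks_per_stage, tasks_per_core=2):
    """How many batches (threads) to run so every core stays busy.

    total_cores / tasks_per_stage batches fill the cluster exactly once;
    multiplying by tasks_per_core (2 to 4 worked best in our experience)
    keeps a short queue of pending tasks on each core.
    """
    return math.ceil(total_cores / tasks_per_stage * tasks_per_core)

n_threads = n_parallel_batches(total_cores=8, tasks_per_stage=4)  # -> 4
```

With 4 tasks per stage on 8 cores and a queue depth of 2, each stage's tasks land on two cores on average, which matches the scheduling picture described above.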
So you don’t just push one task per core; you want to have a few more such tasks in the queue so the cores are never sitting idle. And then, well, you need to check the Ganglia UI to see whether your configuration, this number of threads and these task sizes, can actually utilize your cluster well. It can look something like this. Here, the utilization is around 70%, which is not great, not terrible. Ideally it would be closer to 80%; that would be better. So it depends, and you can get there with some experimentation.
And then you also need to run a performance test, so time the duration. In this experiment we varied the number of threads, the number of batches running in parallel, and also the number of partitions into which you repartition your data, that means the sizes and numbers of tasks. And you can see that the shortest duration is somewhere around the combination of 8 threads and 16 partitions, or 16 threads and 8 partitions; both of them are very comparable. And, of course, you need to do some typical Spark optimizations, like caching the data before you start the re-sampling and model fitting, because it’s an iterative process. Also, in our experience, it helped that we didn’t just cache it: we first persisted the data using checkpointing and then, after reading from the checkpoint, cached it into memory. That proved to be the fastest solution, which really utilized the CPUs and didn’t waste resources on IO.
So what are the lessons that we have learned from this, what I have personally learned? I come from an experimental background, where we typically wanted to do inferential statistics to figure out whether some experiment works, not to predict something, which is typical in machine learning. And Spark, actually… it seems to be better suited for machine learning than for inferential statistics.
And how do you do inferential statistics in Spark? One way is to use bootstrapping, because you cannot do inferential statistics without quantification of the uncertainty, and bootstrapping can do this for you. But it’s quite expensive, right? Compared to computing, for example, standard errors, it takes a lot of time. On our clusters, bootstrapping can take hours, sometimes tens of hours.
Then there is parallelization of parallelization in Spark. That was very interesting to me, because you often hear about Spark being a parallel computation engine, but for these embarrassingly parallel problems, I really didn’t know how to do it. And this multithreading proved to be a way to do that.
Then I learned that shifting from p-values to these distributions early on was a good idea, because the business users really love the outputs. They say it’s one of the best advantages of our solution for them: they know with what probability a change will help, and by how much.
Then, the core of factorial A/B testing is quite simple. It’s just a simple, orthogonal, factorial design and simple inferential statistics. But there are many interesting challenges in reality. You have to deal with things like overlapping tests, which don’t always start and end at the same time; there can be interactions between the tests; there can be funnels; you have to deal with outliers, with weird distributions of the data, with zero-inflated metrics. And then the business users will ask you, “Well, can you still reduce the time it takes for [inaudible] of the tests?” So you start dealing with variance reduction techniques, and so on. And, well, it pays off to really tackle these challenges, and if you want, we will be happy to help you with that, because we have already gone through this journey and have experience with it.
So thank you for your attention. If you want to know more then just drop me a line and have a nice day.
Ondrej has background in computer science and cognitive science. During his academic career, he has conducted and analyzed data from behavioral and neuroimaging experiments. Since his moving to indust...