In this week’s ‘Sunday Reading Notes’, I am writing about ‘continuous monitoring of A/B tests without pain: optional stopping in Bayesian testing‘ by Alex Deng, Jiannan Lu and Shouyuan Chen at Microsoft. The authors formally proved the validity of Bayesian testing with proper stopping rule when we have a genuine prior in Theorem 1 through an elegant use of change of measure.
Under the bayesian model comparison framework, we have posterior odds equals to prior odds times Bayes Factor. The Bayes Factor is the likelihood ratio under model 1 and model 0, which can be intractable if there are hidden variables in the model. There has been research dedicated to unbiasedly estimating the Bayes factor (like bridge sampling in Gelman and Meng 1998). The posterior odds summaries all the information about an experiment, and more importantly as the authors pointed out:
Conditioning on a posterior odds , rejecting with expose us to a risk of false rejection/discovery with probability
The authors refers to it as the ‘Bayesian promise’: and the main result in this paper says that we still have the Bayesian promise with optional stopping. In other words: with a proper stopping time (which does not depend on any future event), we always have
I really like the simulations in this paper because they explains the concept of ‘Bayesian promise’ really well.
Figure 1 from the paper illustrate the Bayesian promise in a fixed horizon test setting. The prior odds is set to 1 and when data is simulated from as 50-50 mixture model of and , we look at the histogram of the Bayes factors of each of the 100K runs. If we look at the bin with BF = 2.1, there are about 4000 runs from with BF = 2.1 and 2000 runs from In this figure, the top margins which shows the ratio between counts from alternative and null are very close to the x-axis which are the Bayes factors. Conditioning on the Bases factor, the probability of data coming from is BF times that of probability of data coming from .
This Bayesian promise is kept whatever proper stopping rule we decide to use:
These two figures are Figure 2 & 3 from the paper, where the stopping rules are (top left) , (bottom left) and (right) $p-value <0.1 $ and sample size larger than 10.
The result from this paper should be reassuring to those running Bayesian A/B tests. Despite the theoretical guarantees, I feel a lot of people are still not convinced that peeking at Bayesian AB tests is okay. For example, I read the post Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson many times. Although I feel there is still some connection I have not found out yet, I believe these two papers are talking about different things. In Robinson’s post, his focus in Type-I error control, which is not why Bayesians promise to do. In this paper by Deng and Lu, the ‘Bayesian promise’ gives false discovery rate (FDR) control. And I think the authors try to make it clear in Table 1.
Although the FDR is inflated than the fixed horizon test, it is nevertheless still controlled below the promised 10%.
In my opinion, what might hurt the Deng and Lu paper is not that it inflates Type-I error or FDR, but how people could mis-use the optional stopping in Bayesian A/B tests in practice. I want to summarize some bad practices for people to avoid here:
- Not every prior promises FDR control with optional stopping! The prior in Theorem 1 and in the simulations is a genuine prior. It is in other words, the true prior, the prior from which data is generated. If you do not have this true prior or something very close to it, do not use optional stopping. Alex Deng has a paper on learning objective priors from historical data, which I plan to discuss in the future.
- Only proper stopping rules work! Do not peek into the future, only ‘peek’ at present and past. These are included in Example 2 and 3 in the paper.
- Do not test until winning. You will fall into the trap of multiple testing. If we run 5 replications of the same experiment, they all fail to be ‘significant’, but after running it a 6th time we see a ‘significant’ result, we cannot ignore the previous runs and declare a victory.
- Do not expect to unbiasedly measure the average treatment effect (ATE) with an Bayesian AB test with optional stopping. I believe we can confirm its existence with an Bayesian AB test, but to estimate the ATE? I believe we are introducing some bias into the estimation with an optional stopping. I hope the authors discussed more of this in the paper, or maybe I show study this myself.
After all reading this paper has been a enjoyable and thought provoking journey for me. Every time I read it, I understand more of it. I asked many questions while reading it, some I answered myself by re-reading, some I have to ask others, and some I still do not understand.