SRN – Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments by Alex Deng

The Sunday Reading Notes (SRN) series is back! This Sunday’s topic is Bayesian A/B tests. Last month, I wrote a post on Bayesian continuous monitoring, commenting on the DSAA’16 paper Continuous Monitoring of A/B Tests without Pain by Deng et al.

In that paper, the main take away is that peeking is not a problem in Bayesian A/B tests if we use genuine priors and prior odds, and Bayesian A/B tests controls False Discovery Rate (FDR).  In this paper, the authors mention that we can learn an objective prior from empirical data, instead of using non-infomration priors in Bayesian A/B tests.

Practical prior specification for Bayesian statistics is a big topic. Yesterday I met a machine learning engineer at a BBQ and told him that I have been thinking about objective prior learning and earning stopping for Bayesian A/B tests. His reaction was: don’t you know conjugate priors? don’t you know that priors will be overwhelmed by data very soon? My response was: yes you can use conjugate priors because it’s convenient. But I am not sure if we still want to assume we are in the asymptotic regime when we want to do an early stopping in A/B tests. While I convinced by him and myself with this argument, I am not sure whether this is the only reason prior specification matters in A/B tests. I’d be super happy if there can be some discussions in the comments!

Going back to Deng’s paper, to use priors learned from past experiments for a new test, the only assumption we need to make is that we assume the prior behinds these experiments are the same. This would happen because the variants are designed by the same group of engineers and the treatment effort would have the same distribution. We assume that observations in control come i.i.d. from some distribution with mean $\tau_C$ and observations in treatment come i.i.d. from some distribution with mean $\tau_T.$ The null hypothesis is $H_0: \tau_C = \tau_T$ and alternative $H_1: \tau_T \not=\tau_C.$ Let $\sigma^2$ be the pooled variance, then the average treatment effect scaled by $\sigma$ is $\mu := \mathbb{E}(\delta) = \frac{\tau_T - \tau_C}{\sigma}.$ (For details of the t-test, see section 3.)

We need to learn the prior odds $\frac{p}{1-p}$ and the prior density $\mu \sim \pi.$.  The algorithm used to learn the parameters is Expectation-Maximization (EM).   The M-step for $p$ is a straightforward one, and the M-step for $V$ is a generalized M-step using moment matching. (Details in Algorithm 1 in Section 3.2).

Once the model is set-up, the procedure is straight-forward. What I find most interesting in this paper is Section 3.4 where the author discuss problems of the Beta-Bernoulli model, which is what people usually use for conversion rates or click-through-rates experiments. The first practical problem comes from a miss match between the experiment randomizing on the user-level while the model assumes page-views being iid. The second problem, which I honestly do not quite understand, is that the conjugate Beta prior cannot be a genuine prior. I wish the author had elaborated more on this point because conversion rate is tested so often in online experiments.

When the author concludes the paper, he talks about what we should do if there are not thousands experiments from which we can learn an objective prior. He talks about the trade-off between the sample size of using all the experiments and the precision from using only relevant experiments. To this end, he suggests setting up a hierarchical model.

I really like this paper and I have read it several times. Every time I read it, I learn a little more about Bayesian A/B tests. I like how it is a good blend of technical derivations, practical considerations and philosophical discussions.While reading the paper, I kind of feel that the author needs more space than what the page limit had given him. Because there are so many places where I hope he had elaborates on or given more details about. The DSAA’16 paper is a follow up on optional stopping for Bayesian A/B tests. I am personally very intrigued by the Beta-Bernoulli discussions and I also want to learn more about what the author has to say about multiple-testing!

SRN – Winner’s Curse by Lee and Shen from Airbnb

Online controlled experiments (A/B tests) has been my reading theme for this summer. This weekend I decide to read

• Minyong R. Lee and Milan Shen. 2018. Winner’s Curse: Bias Estimation for Total E ects of Features in Online Controlled Experiments. In KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. (link)

This is the conference submission following the illuminating medium post

• Milan Shen and Mining Lee. Selection Bias in Online Experimentation:Thinking through a method for the Winner’s Curse in A/B testing. (link)

Today websites are testing many incremental changes to the webpage at the same time on large-scale experimentation platforms. The average treatment effect of each incremental change is unbiased estimated. Experiments with statistically significant improvements to some metric will be launched to the users. But the total effect of aggregating these incremental changes is over-estimated, if we simply add up the individual estimates. This is the winner’s curse the authors describe, and they quantified the bias in estimating the total effects in this paper.

Suppose after running $N$ experiments, the set of chosen experiments is $A$ (which is a random set!) with true total effect $T_A$ and estimated total effect $S_A$. Then it can be shown that $\beta = \mathbb{E}\left[S_A - T_A\right] > 0$.

The authors provides closed form solution for the bias in the simple case where each estimate follows a normal distribution with known standard deviation $X_i \sim \mathcal{N}(a_i,\sigma^2_i)$. Let’s say $b_i$ is some value choose by analysts running the experiments to select the set $A$. The authors show that $\beta = \sum_{i=1}^n \sigma_i \phi\left(\frac{\sigma_i b_i - a_i}{\sigma_i}\right)$. The point of this equation is that all the epxeriments contribute to the bias, not just those selected through experimentation, because the sum is over all the experiments! As the Medium post pointed out:

If the set of experiments being aggregated A is fixed from the beginning, then no bias would be introduced. In the end, the bias exists because of the process of selecting the successful ones among many experiments we run, and we do this every day in a large scale experimentation platform.

The authors also provide a bootstrap confidence interval for the bias corrected total effect estimator.

I did not have time to go-through Section 5 – applications at Airbnb from the paper this time and I’d like to revisit it at another chance.

I really like the authors’ way of formulating the solving this selective inference problem. Their de-biasing method requires very little assumptions and is straightforward to calculate. I have not given must thought to it but I am wondering how would a Bayesian approach this problem. Would that be setting up a hierarchical model by pulling all the incremental changes together? I would certainly meditate on this question during running or yoga!

Sunday Reading Notes – Continuous Monitoring of AB tests without Pain by Deng et al.

In this week’s ‘Sunday Reading Notes’, I am writing about ‘continuous monitoring of A/B tests without pain: optional stopping in Bayesian testing‘ by Alex Deng, Jiannan Lu and Shouyuan Chen at Microsoft. The authors formally proved the validity of Bayesian testing with proper stopping rule when we have a genuine prior in Theorem 1 through an elegant use of change of measure.

Under the bayesian model comparison framework,  we have posterior odds equals to prior odds times Bayes Factor. The Bayes Factor is the likelihood ratio under model 1 and model 0, which can be intractable if there are hidden variables in the model. There has been research dedicated to unbiasedly estimating the Bayes factor (like bridge sampling in Gelman and Meng 1998). The posterior odds summaries all the information about an experiment, and more importantly as the authors pointed out:

Conditioning on a posterior odds $K$, rejecting $H_1$ with expose us to a risk of false rejection/discovery with probability $P(H_0|data) = \frac{1}{K+1}.$

The authors refers to it as the ‘Bayesian promise’: $\frac{\mathbb{P}(H_1|PostOdds)}{\mathbb{P}(H_0|PostOdds)} = PostOdds.$ and the main result in this paper says that we still have the Bayesian promise with optional stopping. In other words: with a proper stopping time $\tau$ (which does not depend on any future event), we always have $\frac{\mathbb{P}(H_1|PostOdds_\tau)}{\mathbb{P}(H_0|PostOdds_\tau)} = PostOdds_\tau .$

I really like the simulations in this paper because they explains the concept of ‘Bayesian promise’ really well.

Figure 1 from the paper illustrate the Bayesian promise in a fixed horizon test setting. The prior odds is set to 1 and when data is simulated from as 50-50 mixture model of $H_1$ and $H_0$, we look at the histogram of the Bayes factors of each of the 100K runs. If we look at the bin with BF = 2.1, there are about 4000 runs from $H_1$ with BF = 2.1 and 2000 runs from $H_0.$ In this figure, the top margins which shows the ratio between counts from alternative and null are very close to the x-axis which are the Bayes factors. Conditioning on the Bases factor, the probability of data coming from $H_1$ is BF times that of probability of data coming from $H_0$.

This Bayesian promise is kept whatever proper stopping rule we decide to use:

These two figures are Figure 2 & 3 from the paper, where the stopping rules are (top left) $BF >9$, (bottom left) $BF>9 or BF<1/9$ and (right) $p-value <0.1$ and sample size larger than 10.

The result from this paper should be reassuring to those running Bayesian A/B tests. Despite the theoretical guarantees, I feel a lot of people are still not convinced that peeking at Bayesian AB tests is okay. For example, I read the post Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson many times. Although I feel there is still some connection I have not found out yet, I believe these two papers are talking about different things. In Robinson’s post, his focus in Type-I error control, which is not why Bayesians promise to do. In this paper by Deng and Lu, the ‘Bayesian promise’ gives false discovery rate (FDR) control. And I think the authors try to make it clear in Table 1.

Although the FDR is inflated than the fixed horizon test, it is nevertheless still controlled below the promised 10%.

In my opinion, what might hurt the Deng and Lu paper is not that it inflates Type-I error or FDR, but how people could mis-use the optional stopping in Bayesian A/B tests in practice. I want to summarize some bad practices for people to avoid here:

1. Not every prior promises FDR control with optional stopping! The prior in Theorem 1 and in the simulations is a genuine prior. It is in other words, the true prior, the prior from which data is generated. If you do not have this true prior or something very close to it, do not use optional stopping. Alex Deng has a paper on learning objective priors from historical data, which I plan to discuss in the future.
2. Only proper stopping rules work! Do not peek into the future, only ‘peek’ at present and past.  These are included in Example 2 and 3 in the paper.
3. Do not test until winning. You will fall into the trap of multiple testing. If we run 5 replications of the same experiment, they all fail to be ‘significant’, but after running it a 6th time we see a ‘significant’ result, we cannot ignore the previous runs and declare a victory.
4. Do not expect to unbiasedly measure the average treatment effect (ATE) with an Bayesian AB test with optional stopping. I believe we can confirm its existence with an Bayesian AB test, but to estimate the ATE? I believe we are introducing some bias into the estimation with an optional stopping. I hope the authors discussed more of this in the paper, or maybe I show study this myself.

After all reading this paper has been a enjoyable and thought provoking journey for me. Every time I read it, I understand more of it. I asked many questions while reading it, some I answered myself by re-reading, some I have to ask others, and some I still do not understand.

Sunday Reading Notes – Chapter 2 of Sequential Analysis by Abraham Wald

I decide to start a series called ‘Sunday Reading Notes’ posting about papers, book chapters and anything statistics related that I read recently.

Opening remarks:

This week I will write about Chapter 2 of Sequential Analysis by Abraham Wald. Unlike the statistical null hypothesis testing we all learned from textbooks, sequential analysis is a method of inference where number of observations is not determined in advance of the experiment. At every stage of the experiment, the experimenter makes a decision to accept, reject or continue data collection. In Chapter 2 “general discussion” Wald formaly set up the system of sequential test in preparation for Chapter 3 “sequential likelihood ratio test”.

If you do not enjoy technical details, skip the rest and jump to the closing paragraph for some high-level summary.

Reading this chapter is an eye – opening experience for me. As a statistics PhD student with a somewhat traditional training, I am so used to the common problem setup like: given a sequence of data $X_{1:n} \overset{iid}{\sim} F(\theta)$ where $F$ is some distribution parametrized by $\theta$, consider some null hypothesis $H_0$ and some alternative hypothesis $H_1$, determine rejection region of the most powerful level $\alpha$ test; or given a model, a null and an alternative and a level $\alpha$, find the sample size to achieve at least a $1- \beta = 0.95$ power.

Wald sets up the sequential test framework differently. He starts with a hypothesis $H_0$ and any decision rule that terminates in finite time. The decision rule is defined by mutually exclusive sets $R_m^0, R_m^1, R_m$ (accept after $m$ steps, reject after $m$ steps, and make more observations) for all values of $m$.

Given such a decision rule, for every value of parameter $\theta$ we can define the function $L(\theta)$ to be the probability of accepting $H$ if $\theta$ were the data generating value (and yes I think we are in the well-specified model case). This function is called the Operating Characteristic (OC) function.

Another notation before jumping to the principles of a good sequential test is Averaged Sample Number (ASN) function. This function denoted by $\mathbb{E}_n(\theta)$ is the expected number of samples to terminate the test.

At this point the alternative hypothesis is not part of the picture yet, but we are ready to discussion some principles of a sequential test.

Wald goes on to talking about degree of preference for acceptance (yes! the familiar notion of a loss function in decision theory). He partitions the parameter space into three sets: a region with preference for accepting $H_0$, a region with preference for rejection $H$ and a region of indifference, and these regions are characterized by loss functions $w_0(\theta), w_1(\theta)$. We want to make zone of preference for acceptance a subset of $H_0: \theta \in \omega$ because acceptance of $H_0$ should preferred whenever $\theta \in \omega$ but this preference would not be so strong if $\theta$ is close to the boundary of $\omega.$ Likewise an experimenter need to make zone of preference for rejection a subset of $w^c$, the complement of $w.$

At first, this was like NEWS to me because I never thought about hypothesis testing this way. Who talks about partitioning the parameter space into three sets? This is a decision theoretic way to approach a statistical inference problem. Indeed discussion decision theory was part of Stat 210 at Harvard, but that was only after all discussions about hypothesis testing (including nonparametric tests and asymptotic tests) were done.  Here we do not even have an alternative hypothesis.

Another interesting notion is the idea of ‘acceptance’ here. We learn from textbook that we can only ‘reject null with some confidence’ but ‘not reject’ is not the same as ‘acceptance’. Why? One, because we have another option — get more data. Two, because ‘not reject’ might consist of two regions: zone of preference for acceptance and zone of indifference. I believe that the three zones of preference theory would really clear the discussions about “isn’t not reject the same as accept?”.

Ideally we want to have $L(\theta) = 1 \ \forall \theta \in \omega$ and $L(\theta) = 0 \ \forall \theta \not\in \omega.$ But because of randomize in observations, this is not achievable with finite samples. Instead we can give it some slack and require that $1 - L(\theta) \leq \alpha$ for every $\theta$ in the zone of preference for acceptance  and $L(\theta) \leq \beta$ for every $\theta$ in the zone of preference for rejection.  These already looks like power and level from our textbook, except that we do not have an $H_1$ yet! A test satisfying the two equations above are called admissible.

Admissibility, which I believe is the same admissibility from decision theory) is defined using the OC function, which describes how well the test is making correct decisions. In order to define efficiency, we need to examine the ASN function. Recall that for every  set $S$ and parameter $\theta$ the expected number of samples required is $\mathbb{E}_\theta(n|S)$. If there is an admissible test $S_0$ such that $\mathbb{E}_\theta (n|S_0) = \min_{S} \mathbb{E}_\theta(n|S)$ for every $\theta$  then $S_0$  is the ‘uniformly best test.’ In general this is too good to be true because the above equation has to be true for every value of $\theta$.

But we can define ‘optimum test’ if we have simple hypotheses $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1 \not= \theta_0.$ We can use Type I and II errors ($\alpha, \beta$) to define ‘strength’ of a test and if a test $S_0$ exists such that $\mathbb{E}_{\theta_0}(n|S_0) \leq \mathbb{E}_{\theta_0}(n|S)$ and $\mathbb{E}_{\theta_1}(n|S_0) \leq \mathbb{E}_{\theta_1}(n|S)$ for every $S$ of the same strength, then $S_0$ is an optimum test.

Concluding remarks:

Unlike treatment of fixed-sample size NHST controlling Type I and II error with Neyman-Pearson lemma to get ‘most powerful level $\alpha$ test’, in a sequential test we are looking for tests achieving the greatest efficiency (measured by average number of sample required) under some Type I and II error control (characterized by $(\alpha,\beta)$-strength).

Instead of straighforwardly presenting the framework to us, Wald carefully walks the reader through the development of the theory motivated by decision  theoretic considerations like: 1) what is the loss of a bad decisions? 2) what are some good properties of a sequential test? This chapter offers intuitions rather than extensive mathematical derivations and I would recommend it to anyone interested in statical inference and sequential testing in particular. The reader do need a good understanding of notions like power and significance level to appreciate it!

References:

Wald, Abraham. Sequential analysis. Courier Corporation, 1973.