# SRN – Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments by Alex Deng

The Sunday Reading Notes (SRN) series is back! This Sunday’s topic is Bayesian A/B tests. Last month, I wrote a post on Bayesian continuous monitoring, commenting on the DSAA’16 paper Continuous Monitoring of A/B Tests without Pain by Deng et al.

In that paper, the main take away is that peeking is not a problem in Bayesian A/B tests if we use genuine priors and prior odds, and Bayesian A/B tests controls False Discovery Rate (FDR).  In this paper, the authors mention that we can learn an objective prior from empirical data, instead of using non-infomration priors in Bayesian A/B tests.

Practical prior specification for Bayesian statistics is a big topic. Yesterday I met a machine learning engineer at a BBQ and told him that I have been thinking about objective prior learning and earning stopping for Bayesian A/B tests. His reaction was: don’t you know conjugate priors? don’t you know that priors will be overwhelmed by data very soon? My response was: yes you can use conjugate priors because it’s convenient. But I am not sure if we still want to assume we are in the asymptotic regime when we want to do an early stopping in A/B tests. While I convinced by him and myself with this argument, I am not sure whether this is the only reason prior specification matters in A/B tests. I’d be super happy if there can be some discussions in the comments!

Going back to Deng’s paper, to use priors learned from past experiments for a new test, the only assumption we need to make is that we assume the prior behinds these experiments are the same. This would happen because the variants are designed by the same group of engineers and the treatment effort would have the same distribution. We assume that observations in control come i.i.d. from some distribution with mean $\tau_C$ and observations in treatment come i.i.d. from some distribution with mean $\tau_T.$ The null hypothesis is $H_0: \tau_C = \tau_T$ and alternative $H_1: \tau_T \not=\tau_C.$ Let $\sigma^2$ be the pooled variance, then the average treatment effect scaled by $\sigma$ is $\mu := \mathbb{E}(\delta) = \frac{\tau_T - \tau_C}{\sigma}.$ (For details of the t-test, see section 3.)

We need to learn the prior odds $\frac{p}{1-p}$ and the prior density $\mu \sim \pi.$.  The algorithm used to learn the parameters is Expectation-Maximization (EM).   The M-step for $p$ is a straightforward one, and the M-step for $V$ is a generalized M-step using moment matching. (Details in Algorithm 1 in Section 3.2).

Once the model is set-up, the procedure is straight-forward. What I find most interesting in this paper is Section 3.4 where the author discuss problems of the Beta-Bernoulli model, which is what people usually use for conversion rates or click-through-rates experiments. The first practical problem comes from a miss match between the experiment randomizing on the user-level while the model assumes page-views being iid. The second problem, which I honestly do not quite understand, is that the conjugate Beta prior cannot be a genuine prior. I wish the author had elaborated more on this point because conversion rate is tested so often in online experiments.

When the author concludes the paper, he talks about what we should do if there are not thousands experiments from which we can learn an objective prior. He talks about the trade-off between the sample size of using all the experiments and the precision from using only relevant experiments. To this end, he suggests setting up a hierarchical model.