## SRN – evidence and cross-validation

A paper that has intrigued me recently is “On the marginal likelihood and cross-validation” by E. Fong and C.C.Holmes on Biometrika.

Model evidence appears as marginal likelihood or prior predictive in this paper:

$\displaystyle p_{\mathcal{M}}(y_{1:n}) = \int f_{\theta}(y_{1:n}) d\pi(\theta).$

As the paper seeks to establish the connection between model evidence and cross-validation (CV), which is based on predictive performance on held-out test sets, it notes that the log evidence relates to log posterior predictive probability

$\displaystyle \log p_{\mathcal{M}}(y_{1:n}) = \sum_{i=1}^{n} \log p_{\mathcal{M}}(y_i \mid y_{1:i-1}) = \sum_{i=1}^{n} \log \int f_{\theta}(y_i) d\pi(\theta \mid y_{1:i-1}).$

So log evidence can be interpreted as a predictive sequential scoring rule with score function
$\displaystyle s(y_i \mid y_{1:i-1}) = \log p_{\mathcal{M}}(y_i \mid y_{1:i-1}).$

The paper begins by arguing that model evidence is the unique scoring rule that guarantees coherency (data ordering should not change the result of inference when data are exchangeable).

Then the authors move on to show the equivalence between evidence and cumulative cross-validation scores. When we consider leave-${p}$-out CV, there are ${\binom{n}{p}}$ number of held-out test sets and each set has some predictive score when the rest of the data is used for training. The leave-${p}$-out CV score, denoted by ${S_{CV}(y_{1:n};p)}$, is the average of these predictive scores. When the log posterior predictive probability is used as the scoring rule, then we must have

$\displaystyle \log p_{\mathcal{M}}(y_{1:n}) = \sum_{p=1}^n S_{CV}(y_{1:n};p).$

So the log evidence is also the sum of leave-${p}$-out CV scores, for all values of ${p = 1,\ldots,n.}$

## SRN – Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments by Alex Deng

The Sunday Reading Notes (SRN) series is back! This Sunday’s topic is Bayesian A/B tests. Last month, I wrote a post on Bayesian continuous monitoring, commenting on the DSAA’16 paper Continuous Monitoring of A/B Tests without Pain by Deng et al.

In that paper, the main take away is that peeking is not a problem in Bayesian A/B tests if we use genuine priors and prior odds, and Bayesian A/B tests controls False Discovery Rate (FDR).  In this paper, the authors mention that we can learn an objective prior from empirical data, instead of using non-infomration priors in Bayesian A/B tests.

Practical prior specification for Bayesian statistics is a big topic. Yesterday I met a machine learning engineer at a BBQ and told him that I have been thinking about objective prior learning and earning stopping for Bayesian A/B tests. His reaction was: don’t you know conjugate priors? don’t you know that priors will be overwhelmed by data very soon? My response was: yes you can use conjugate priors because it’s convenient. But I am not sure if we still want to assume we are in the asymptotic regime when we want to do an early stopping in A/B tests. While I convinced by him and myself with this argument, I am not sure whether this is the only reason prior specification matters in A/B tests. I’d be super happy if there can be some discussions in the comments!

Going back to Deng’s paper, to use priors learned from past experiments for a new test, the only assumption we need to make is that we assume the prior behinds these experiments are the same. This would happen because the variants are designed by the same group of engineers and the treatment effort would have the same distribution. We assume that observations in control come i.i.d. from some distribution with mean $\tau_C$ and observations in treatment come i.i.d. from some distribution with mean $\tau_T.$ The null hypothesis is $H_0: \tau_C = \tau_T$ and alternative $H_1: \tau_T \not=\tau_C.$ Let $\sigma^2$ be the pooled variance, then the average treatment effect scaled by $\sigma$ is $\mu := \mathbb{E}(\delta) = \frac{\tau_T - \tau_C}{\sigma}.$ (For details of the t-test, see section 3.)

We need to learn the prior odds $\frac{p}{1-p}$ and the prior density $\mu \sim \pi.$.  The algorithm used to learn the parameters is Expectation-Maximization (EM).   The M-step for $p$ is a straightforward one, and the M-step for $V$ is a generalized M-step using moment matching. (Details in Algorithm 1 in Section 3.2).

Once the model is set-up, the procedure is straight-forward. What I find most interesting in this paper is Section 3.4 where the author discuss problems of the Beta-Bernoulli model, which is what people usually use for conversion rates or click-through-rates experiments. The first practical problem comes from a miss match between the experiment randomizing on the user-level while the model assumes page-views being iid. The second problem, which I honestly do not quite understand, is that the conjugate Beta prior cannot be a genuine prior. I wish the author had elaborated more on this point because conversion rate is tested so often in online experiments.

When the author concludes the paper, he talks about what we should do if there are not thousands experiments from which we can learn an objective prior. He talks about the trade-off between the sample size of using all the experiments and the precision from using only relevant experiments. To this end, he suggests setting up a hierarchical model.