SRN – Winner’s Curse by Lee and Shen from Airbnb

Online controlled experiments (A/B tests) have been my reading theme for this summer. This weekend I decided to read

  • Minyong R. Lee and Milan Shen. 2018. Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments. In KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. (link)

This is the conference paper that follows the illuminating Medium post

  • Milan Shen and Minyong Lee. Selection Bias in Online Experimentation: Thinking through a method for the Winner’s Curse in A/B testing. (link)

Today, websites test many incremental changes to a webpage at the same time on large-scale experimentation platforms. The average treatment effect of each incremental change is estimated without bias, and experiments that show a statistically significant improvement in some metric are launched to users. But if we simply add up the individual estimates of the launched experiments, the total effect of these incremental changes is overestimated. This is the winner’s curse the authors describe, and in this paper they quantify the bias in the estimated total effect.

Suppose that after running n experiments, the set of chosen experiments is A (which is a random set!), with true total effect T_A and estimated total effect S_A. Then it can be shown that \beta = \mathbb{E}\left[S_A - T_A\right] > 0.
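A short sketch of why the bias cannot be negative, under an assumption of mine: experiment i is selected exactly when its estimate X_i (with true effect a_i and \mathbb{E}[X_i] = a_i) exceeds some threshold c_i. The symbol c_i is only for illustration here.

```latex
\begin{aligned}
S_A - T_A &= \sum_{i} (X_i - a_i)\,\mathbf{1}\{X_i > c_i\}, \\
\beta = \mathbb{E}\left[S_A - T_A\right]
  &= \sum_{i} \mathbb{E}\left[(X_i - a_i)\,\mathbf{1}\{X_i > c_i\}\right]
   = \sum_{i} \operatorname{Cov}\left(X_i,\ \mathbf{1}\{X_i > c_i\}\right) \ge 0 .
\end{aligned}
```

The middle equality uses \mathbb{E}[X_i - a_i] = 0, and each covariance is nonnegative because the indicator is a nondecreasing function of X_i. The inequality is strict as soon as one experiment has a selection probability strictly between 0 and 1.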

The authors provide a closed-form solution for the bias in the simple case where each estimate follows a normal distribution with known standard deviation, X_i \sim \mathcal{N}(a_i,\sigma^2_i). Let’s say b_i is a cutoff chosen by the analysts running the experiments, so that experiment i enters the set A when X_i / \sigma_i > b_i. The authors show that \beta = \sum_{i=1}^n \sigma_i \phi\left(\frac{\sigma_i b_i - a_i}{\sigma_i}\right), where \phi is the standard normal density. The point of this equation is that all the experiments contribute to the bias, not just those selected through experimentation, because the sum is over all n experiments! As the Medium post pointed out:

If the set of experiments being aggregated A is fixed from the beginning, then no bias would be introduced. In the end, the bias exists because of the process of selecting the successful ones among many experiments we run, and we do this every day in a large scale experimentation platform.
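The closed-form bias can be checked numerically. Below is a minimal sketch, assuming the selection rule X_i > \sigma_i b_i implied by the formula. The true effects, standard deviations, cutoffs, and function names are all made up for illustration and do not come from the paper.

```python
import numpy as np


def normal_pdf(z):
    """Standard normal density phi(z)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)


def closed_form_bias(a, sigma, b):
    """beta = sum_i sigma_i * phi((sigma_i * b_i - a_i) / sigma_i), over ALL experiments."""
    a, sigma, b = (np.asarray(v, dtype=float) for v in (a, sigma, b))
    return float(np.sum(sigma * normal_pdf((sigma * b - a) / sigma)))


def simulated_bias(a, sigma, b, n_rep=200_000, seed=0):
    """Monte Carlo estimate of E[S_A - T_A] when experiment i is selected if X_i > sigma_i * b_i."""
    rng = np.random.default_rng(seed)
    a, sigma, b = (np.asarray(v, dtype=float) for v in (a, sigma, b))
    # Each row is one replication of the whole batch of experiments.
    x = rng.normal(a, sigma, size=(n_rep, a.size))
    selected = x > sigma * b
    s_a = np.sum(x * selected, axis=1)  # naive sum of the selected estimates
    t_a = np.sum(a * selected, axis=1)  # true total effect of the same random set
    return float(np.mean(s_a - t_a))


# Hypothetical inputs: five experiments, one-sided cutoff of 1.645 for each.
a = [0.0, 0.5, 1.0, -0.5, 2.0]
sigma = [1.0, 1.0, 0.5, 2.0, 1.0]
b = [1.645] * 5

print("closed form:", closed_form_bias(a, sigma, b))
print("simulation: ", simulated_bias(a, sigma, b))
```

The two numbers should agree up to Monte Carlo error. Note that the experiment with a_i = 0 still contributes to the closed-form sum even though it is rarely selected.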

The authors also provide a bootstrap confidence interval for the bias-corrected total effect estimator.
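To make the idea concrete, here is a rough sketch of one way such an interval could be built. It is my own assumption rather than the authors' exact procedure: plug the observed X_i in place of the unknown a_i in the bias formula, subtract that from the naive sum, and use a parametric bootstrap with a percentile interval. The data and function names are hypothetical.

```python
import numpy as np


def normal_pdf(z):
    """Standard normal density phi(z)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)


def corrected_total(x, sigma, b):
    """Naive sum over selected experiments minus a plug-in bias estimate.

    Assumption: the plug-in replaces the unknown a_i with the observed x_i.
    Works on a single vector of estimates or on a (n_boot, n) array.
    """
    selected = x > sigma * b
    s_a = np.sum(x * selected, axis=-1)
    beta_hat = np.sum(sigma * normal_pdf((sigma * b - x) / sigma), axis=-1)
    return s_a - beta_hat


def bootstrap_interval(x, sigma, b, n_boot=5000, level=0.95, seed=0):
    """Parametric-bootstrap percentile interval for the corrected total."""
    rng = np.random.default_rng(seed)
    x, sigma, b = (np.asarray(v, dtype=float) for v in (x, sigma, b))
    # Redraw every experiment around its observed estimate, then redo
    # the selection and the correction on each bootstrap batch.
    x_star = rng.normal(x, sigma, size=(n_boot, x.size))
    stats = corrected_total(x_star, sigma, b)
    alpha = 1.0 - level
    lo, hi = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(corrected_total(x, sigma, b)), float(lo), float(hi)


# Hypothetical observed estimates for six experiments with unit standard errors.
x_obs = [2.1, 0.3, 1.8, -0.4, 2.6, 0.9]
sigma_obs = [1.0] * 6
b_obs = [1.645] * 6

point, lower, upper = bootstrap_interval(x_obs, sigma_obs, b_obs)
print("corrected total:", point, "interval:", (lower, upper))
```

Because the selection step is repeated inside every bootstrap draw, the interval reflects the randomness of the set A as well as the noise in each estimate.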

Fig: comparison between (left) the sum of the estimated effects of 6 selected experiments, (middle) the bias-adjusted total effect estimate and (right) the total effect estimated from a separate hold-out. Source: Medium post from Airbnb Engineering and Data Science. This is also Figure 6 in the KDD paper.

I did not have time to go through Section 5 (applications at Airbnb) of the paper this time, and I’d like to revisit it another time.

I really like the authors’ way of formulating and solving this selective inference problem. Their de-biasing method requires very few assumptions and is straightforward to calculate. I have not given much thought to it, but I am wondering how a Bayesian would approach this problem. Would that mean setting up a hierarchical model that pools all the incremental changes together? I will certainly meditate on this question during running or yoga!

Author: PhyllisWithData

Statistics PhD student at Harvard University.
