SRN – Should I follow the crowd? by Cañamares and Castells from SIGIR’18

It’s time for Sunday Reading Notes again. This past week I have been reading the paper ‘Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems’ by Rocío Cañamares and Pablo Castells. It won the best paper award at SIGIR 2018!

Current state-of-the-art recommender systems tend to recommend popular items to their users. As the paper points out in its abstract:

The fundamental question remains open though whether popularity is really a bias we should avoid or not, whether it could be a useful and reliable signal in recommendation, or it may be unfairly rewarded by the experimentation biases.

This paper is concerned with un-personalized recommendations evaluated in offline experiments. The dataset consists of rated items only and is randomly split into training and test sets. The ratings take binary values (positive rating / negative rating). For every item, popularity is defined as the number of positive ratings, which is different from the average rating. In order to make recommendations, we want to measure ‘relevance’, which is a binary random variable with rel(i) = 1 if the user likes item i. From this set-up, we have pop(i) \sim p(rel,rated|i) and avg(i) \sim p(rel|rated,i). But we cannot directly measure p(rel|i).
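To make the difference between the two signals concrete, here is a minimal sketch of how pop(i) and avg(i) could be computed from binary ratings. The function name, the tuple layout and the toy data are my own assumptions for illustration, not anything taken from the paper.

```python
from collections import defaultdict

def popularity_and_average(ratings):
    """Compute pop(i) and avg(i) from binary ratings.

    `ratings` is an iterable of (user, item, liked) tuples with liked in {0, 1}.
    This layout is a hypothetical choice made for this example.
    """
    positives = defaultdict(int)  # number of positive ratings per item
    totals = defaultdict(int)     # number of ratings (positive or negative) per item
    for _user, item, liked in ratings:
        totals[item] += 1
        positives[item] += liked
    pop = dict(positives)                                            # count of positive ratings
    avg = {item: positives[item] / totals[item] for item in totals}  # share of positive ratings
    return pop, avg

# Toy data: item A is rated a lot, items B and C are rated rarely but always liked.
toy_ratings = [
    ("u1", "A", 1), ("u2", "A", 1), ("u3", "A", 0), ("u4", "A", 0),
    ("u1", "B", 1), ("u2", "B", 1),
    ("u3", "C", 1),
]
pop, avg = popularity_and_average(toy_ratings)
```

In this toy example A and B tie on popularity, while the average rating puts B and C ahead of A, so the two signals can rank the same catalogue differently.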

The authors use ‘expected precision’ as the metric for recommendations and emphasize that true precision (which cannot be measured in offline experiments) is different from observed precision (measured with ratings from the test set). They derive the optimal ranking function for each. The expected true precision \mathbb{E}[P@1|f] is optimized by a relevance-driven ranking function, f(i) = p(rel|i)\frac{1-\rho p(rated|rel,i)}{1-\rho p(rated|i)}, while, as a result of the experimental design, the expected observed precision \mathbb{E}[\hat{P}@1|f] is optimized by \hat{f}(i) \sim p(rel|i)\frac{p(rated|rel,i)}{1-\rho p(rated|i)}. Since p(rel|i)p(rated|rel,i) = p(rel,rated|i) is exactly what popularity measures, popularity has an advantage in optimizing the expected observed precision.
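The two ranking functions can disagree, and a small numerical sketch helps me see why. Everything below is an assumption made for illustration: I take \rho to be the fraction of ratings that goes into the training split, and the two items and their probabilities are invented.

```python
def true_precision_score(p_rel, p_rated_given_rel, p_rated, rho):
    """f(i): the ranking score quoted above for expected true precision."""
    return p_rel * (1 - rho * p_rated_given_rel) / (1 - rho * p_rated)

def observed_precision_score(p_rel, p_rated_given_rel, p_rated, rho):
    """f-hat(i): the ranking score quoted above for expected observed precision."""
    return p_rel * p_rated_given_rel / (1 - rho * p_rated)

def marginal_p_rated(p_rel, p_rated_given_rel, p_rated_given_not_rel):
    """p(rated|i) by the law of total probability over relevance."""
    return p_rel * p_rated_given_rel + (1 - p_rel) * p_rated_given_not_rel

RHO = 0.8  # assumed training split ratio

# Hypothetical items: (p(rel|i), p(rated|rel,i), p(rated|not rel,i)).
# "hit" is widely rated but only moderately liked; "gem" is well liked but rarely rated.
items = {"hit": (0.5, 0.6, 0.2), "gem": (0.8, 0.1, 0.02)}

scores = {}
for name, (p_rel, p_rated_rel, p_rated_not_rel) in items.items():
    p_rated = marginal_p_rated(p_rel, p_rated_rel, p_rated_not_rel)
    scores[name] = (
        true_precision_score(p_rel, p_rated_rel, p_rated, RHO),
        observed_precision_score(p_rel, p_rated_rel, p_rated, RHO),
    )

best_for_true = max(scores, key=lambda n: scores[n][0])
best_for_observed = max(scores, key=lambda n: scores[n][1])
```

With these made-up numbers the true-precision score prefers the rarely rated "gem", while the observed-precision score prefers the heavily rated "hit", which is the popularity advantage in miniature.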

Because customers have a tendency to rate items that they like, the rating information is not missing at random. To describe how rating depends on relevance and item, the authors give two simple conditions: conditional item independence and conditional relevance independence. Under both conditions, using pop(i) optimizes the expected observed precision and using avg(i) optimizes the expected true precision.

Because we can only measure the expected observed precision with offline experiments and because popularity has an advantage in optimizing \mathbb{E}(\hat{P}@1), current recommender systems that are based on offline evaluations favor popularity. Although not all recommender systems are designed the same way as described in this paper, I believe this elegant framework provides intuition for the ‘popularity of popularity’.

Acknowledging that a user can rate an item only if he or she has bought the item, the authors further consider ‘discovery bias’, because rating depends on all of relevance, discovery and item. These dependencies are characterized in Figure 2 from the paper.

[Figure 2 from the paper]

In a realistic scenario, our data should fall into category 4 – ‘no assumption’. In this case the expected precision can be approximated with Monte Carlo integration.

The key results and experiments are summarized in Table 1 and Figure 5 in the paper.

[Table 1 and Figure 5 from the paper]

Besides the probability-based framework, what I really like about this paper is that the authors designed and collected a crowdsourced dataset that makes real-data experiments on relevance-independent discovery and item-independent discovery possible. After going through the math and the experiments, I feel quite convinced that

average rating may be in fact a better, safer, more robust signal than the number of positive ratings in terms of true achieved accuracy in most general situations.

Although the metric \mathbb{E}(P@1|R) seems too simple, because essentially it is as if we can only recommend one item to a user, it is a good and tractable measure to start with. The authors suggest that the results empirically generalize to other metrics such as nDCG@k. However, I am not sure how much I agree with this point, because nDCG@k cares about much more than the top-ranked item.

Overall I really like this paper and I think it touches on many fundamental problems in popularity bias and provides enough mathematical clarity. I wonder if this paper suggests we should do more online experiments for recommender systems, because true accuracy cannot be measured with offline experiments. I am also eager to see what the authors have to say about temporal data splitting. Lastly, I hope the authors talk about what we should do with the very common ‘5-star’ rating system.



  1. Rocío Cañamares and Pablo Castells. 2018. Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems. In Proceedings of ACM SIGIR ’18, July 8–12, 2018, Ann Arbor, MI, USA. ACM, New York, NY, USA, 10 pages.

SRN – Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments by Alex Deng

The Sunday Reading Notes (SRN) series is back! This Sunday’s topic is Bayesian A/B tests. Last month, I wrote a post on Bayesian continuous monitoring, commenting on the DSAA’16 paper Continuous Monitoring of A/B Tests without Pain by Deng et al.

In that paper, the main takeaway is that peeking is not a problem in Bayesian A/B tests if we use genuine priors and prior odds, and that Bayesian A/B tests control the False Discovery Rate (FDR). In this paper, the authors mention that we can learn an objective prior from empirical data, instead of using non-informative priors in Bayesian A/B tests.

Practical prior specification for Bayesian statistics is a big topic. Yesterday I met a machine learning engineer at a BBQ and told him that I have been thinking about objective prior learning and early stopping for Bayesian A/B tests. His reaction was: don’t you know conjugate priors? Don’t you know that priors will be overwhelmed by data very soon? My response was: yes, you can use conjugate priors because they are convenient. But I am not sure we still want to assume we are in the asymptotic regime when we want to stop an A/B test early. While I convinced both him and myself with this argument, I am not sure whether this is the only reason prior specification matters in A/B tests. I’d be super happy if there can be some discussion in the comments!

Going back to Deng’s paper, to use priors learned from past experiments for a new test, the only assumption we need to make is that the prior behind these experiments is the same. This would hold if the variants are designed by the same group of engineers, so that the treatment effects share the same distribution. We assume that observations in control come i.i.d. from some distribution with mean \tau_C and observations in treatment come i.i.d. from some distribution with mean \tau_T. The null hypothesis is H_0: \tau_C = \tau_T and the alternative is H_1: \tau_T \not=\tau_C. Let \sigma^2 be the pooled variance; then the average treatment effect scaled by \sigma is \mu := \mathbb{E}(\delta) = \frac{\tau_T - \tau_C}{\sigma}. (For details of the t-test, see Section 3.)
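As a concrete illustration of the scaled effect, here is a minimal sketch that computes the observed \delta from two samples using a pooled variance. The function name and the `n_eff` quantity (an effective sample size, 1/(1/n_C + 1/n_T)) are my own notation for this example and may not match the paper’s symbols exactly.

```python
import math
import statistics

def scaled_effect(control, treatment):
    """Return (delta, n_eff, z) for a two-sample comparison.

    delta is the difference in means divided by the pooled standard deviation,
    n_eff = 1 / (1/n_C + 1/n_T), and z = delta * sqrt(n_eff) is the usual
    pooled two-sample test statistic.
    """
    n_c, n_t = len(control), len(treatment)
    mean_c, mean_t = statistics.fmean(control), statistics.fmean(treatment)
    var_c, var_t = statistics.variance(control), statistics.variance(treatment)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2)
    sigma = math.sqrt(pooled_var)
    delta = (mean_t - mean_c) / sigma
    n_eff = 1.0 / (1.0 / n_c + 1.0 / n_t)
    z = delta * math.sqrt(n_eff)
    return delta, n_eff, z

# Tiny made-up samples, just to exercise the function.
delta, n_eff, z = scaled_effect([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Working with \delta rather than the raw difference puts experiments on different metrics onto a common scale, which is what makes pooling past experiments to learn a prior plausible.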

We need to learn the prior odds \frac{p}{1-p} and the prior density \mu \sim \pi. The algorithm used to learn these parameters is Expectation-Maximization (EM). The M-step for p is straightforward, and the M-step for V, the parameter of the prior \pi, is a generalized M-step using moment matching (details are in Algorithm 1 in Section 3.2).
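To convince myself that this kind of EM fit is feasible, I wrote a minimal sketch under assumptions of my own: I take p to be the prior probability of H_1, I take \pi to be a zero-mean normal with standard deviation V, and I assume each observed \delta_i is normal around \mu_i with variance 1/n_eff. This is not a transcription of the paper’s Algorithm 1; the function names and the simulated data are hypothetical.

```python
import numpy as np

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance `var`, evaluated at x."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def fit_two_group_prior(delta, n_eff, n_iter=200):
    """EM-style fit of (p, V) from the scaled effects of past experiments.

    Assumed model: under H0, delta_i ~ N(0, 1/n_eff); under H1,
    mu_i ~ N(0, V^2), so marginally delta_i ~ N(0, V^2 + 1/n_eff).
    """
    delta = np.asarray(delta, dtype=float)
    noise_var = 1.0 / np.asarray(n_eff, dtype=float)
    p, v2 = 0.5, max(float(np.var(delta)), 1e-12)  # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each experiment has a real effect.
        like_h1 = p * normal_pdf(delta, v2 + noise_var)
        like_h0 = (1 - p) * normal_pdf(delta, noise_var)
        post_h1 = like_h1 / (like_h1 + like_h0)
        # M-step for p: average posterior probability of H1.
        p = float(post_h1.mean())
        # Generalized M-step for V^2: match the weighted second moment,
        # after subtracting the sampling noise.
        v2 = max(float(np.sum(post_h1 * (delta**2 - noise_var)) / np.sum(post_h1)), 1e-12)
    return p, float(np.sqrt(v2))

# Simulated history of experiments (all numbers are invented).
rng = np.random.default_rng(0)
n_experiments, true_p, true_v, n_eff = 5000, 0.3, 0.05, 10_000
has_effect = rng.random(n_experiments) < true_p
mu = np.where(has_effect, rng.normal(0.0, true_v, n_experiments), 0.0)
delta = mu + rng.normal(0.0, 1.0 / np.sqrt(n_eff), n_experiments)

p_hat, v_hat = fit_two_group_prior(delta, n_eff)
```

On this simulated history the fit should land close to the values used to generate the data, which is reassuring, although real experiment archives are far messier than this toy set-up.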

Once the model is set up, the procedure is straightforward. What I find most interesting in this paper is Section 3.4, where the author discusses problems with the Beta-Bernoulli model, which is what people usually use for conversion-rate or click-through-rate experiments. The first practical problem comes from a mismatch between the experiment randomizing at the user level and the model assuming page views are i.i.d. The second problem, which I honestly do not quite understand, is that the conjugate Beta prior cannot be a genuine prior. I wish the author had elaborated more on this point, because conversion rate is tested so often in online experiments.

When the author concludes the paper, he talks about what we should do if there are not thousands of experiments from which we can learn an objective prior. He talks about the trade-off between the sample size gained from using all the experiments and the precision gained from using only relevant experiments. To this end, he suggests setting up a hierarchical model.

I really like this paper and I have read it several times. Every time I read it, I learn a little more about Bayesian A/B tests. I like how it is a good blend of technical derivations, practical considerations and philosophical discussions. While reading the paper, I kind of feel that the author needed more space than the page limit gave him, because there are so many places where I wish he had elaborated or given more details. The DSAA’16 paper is a follow-up on optional stopping for Bayesian A/B tests. I am personally very intrigued by the Beta-Bernoulli discussions and I also want to learn more about what the author has to say about multiple testing!




SRN – Winner’s Curse by Lee and Shen from Airbnb

Online controlled experiments (A/B tests) have been my reading theme for this summer. This weekend I decided to read

  • Minyong R. Lee and Milan Shen. 2018. Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments. In KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. (link)

This is the conference submission following the illuminating Medium post

  • Milan Shen and Minyong Lee. Selection Bias in Online Experimentation: Thinking through a method for the Winner’s Curse in A/B testing. (link)

Today websites test many incremental changes to a webpage at the same time on large-scale experimentation platforms. The average treatment effect of each incremental change is estimated without bias. Experiments with statistically significant improvements to some metric are launched to the users. But the total effect of aggregating these incremental changes is over-estimated if we simply add up the individual estimates. This is the winner’s curse the authors describe, and in this paper they quantify the bias in estimating the total effect.

Suppose after running N experiments, the set of chosen experiments is A (which is a random set!) with true total effect T_A and estimated total effect S_A. Then it can be shown that \beta = \mathbb{E}\left[S_A - T_A\right] > 0.

The authors provide a closed-form expression for the bias in the simple case where each estimate follows a normal distribution with known standard deviation, X_i \sim \mathcal{N}(a_i,\sigma^2_i). Let’s say b_i is some threshold chosen by the analysts running the experiments to select the set A, so that experiment i is selected when X_i / \sigma_i > b_i. The authors show that \beta = \sum_{i=1}^N \sigma_i \phi\left(\frac{\sigma_i b_i - a_i}{\sigma_i}\right), where \phi is the standard normal density. The point of this equation is that all the experiments contribute to the bias, not just those selected through experimentation, because the sum is over all the experiments! As the Medium post pointed out:

If the set of experiments being aggregated A is fixed from the beginning, then no bias would be introduced. In the end, the bias exists because of the process of selecting the successful ones among many experiments we run, and we do this every day in a large scale experimentation platform.
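The closed-form expression for \beta is easy to check numerically. Below is a minimal sketch that evaluates the formula and compares it with a brute-force simulation, assuming experiment i is selected when X_i / \sigma_i > b_i. The function names and the three toy experiments are made up for illustration.

```python
import math
import random

def std_normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def selection_bias(a, sigma, b):
    """Closed-form bias: sum over ALL experiments of sigma_i * phi((sigma_i*b_i - a_i)/sigma_i)."""
    return sum(s * std_normal_pdf((s * b_i - a_i) / s) for a_i, s, b_i in zip(a, sigma, b))

def simulated_bias(a, sigma, b, n_sims=200_000, seed=0):
    """Monte Carlo estimate of E[S_A - T_A] under the assumed selection rule."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        for a_i, s, b_i in zip(a, sigma, b):
            x = rng.gauss(a_i, s)   # observed effect estimate X_i
            if x / s > b_i:         # experiment i is selected into A
                total += x - a_i    # its contribution to S_A - T_A
    return total / n_sims

# Three hypothetical experiments: true effects, standard errors, thresholds.
true_effects = [0.0, 0.5, 1.0]
std_errors = [1.0, 1.0, 0.5]
thresholds = [1.96, 1.96, 1.96]

closed_form = selection_bias(true_effects, std_errors, thresholds)
monte_carlo = simulated_bias(true_effects, std_errors, thresholds)
```

Note that the first toy experiment has no true effect at all and is rarely selected, yet it still adds a positive term to the bias, which is the "sum over all experiments" point made above.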

The authors also provide a bootstrap confidence interval for the bias-corrected total effect estimator.

Fig: comparison between (left) adding up the estimated effects of 6 selected experiments, (middle) the bias-adjusted total effect estimate and (right) the total effect estimated from a separate hold-out. Source: Medium post from Airbnb Engineering and Data Science. This is also Figure 6 from the KDD paper.

I did not have time to go through Section 5 (applications at Airbnb) this time, and I’d like to revisit it another time.

I really like the authors’ way of formulating and solving this selective inference problem. Their de-biasing method requires very few assumptions and is straightforward to calculate. I have not given much thought to it, but I am wondering how a Bayesian would approach this problem. Would that mean setting up a hierarchical model by pooling all the incremental changes together? I will certainly meditate on this question during running or yoga!