SRN – Should I follow the crowd? by Canamares and Castells from SIGIR’18

It’s time for Sunday Reading Notes again. This past week I have been reading the paper ‘Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems‘ by Rocio Canamares and Pablo Castells. It won the best paper award at SIGIR 2018!

Current state of the art recommender systems tend to recommend popular items to their users. As the paper points out in its abstract:

The fundamental questions remains open though whether popularity is really a bias we should avoid or not, whether it could be a useful and reliable signal in recommendation, or it may be unfairly rewarded by the experimentation biases.

This paper is concerned with un-personalized recommendations evaluated in off-line experiments. The dataset consists of rated items only and is randomly split into training and test. The ratings takes binary values (positive rating / negative rating). For every item, popularity is defined in terms of number of positive ratings, which is different from average rating. In order to make recommendations, we want to measure ‘relevance‘, which is a binary random variable and res(i) = 1 if the user likes item i. From this set-up, we have pop(i) \sim p(rel,rate|i) and avg(i) \sim p(rel|rate,i). But we cannot directly measure p(rel|i).

The author uses the metric ‘expected precision’ as the metric for recommendations and emphasizes that true precision (cannot be measured from offline experiments) is different from observed precision (measured with rating from the test set). The authors find the optimal ranking function for both true precision and observed precision. Using relevance optimizes the true precision : f(i) = p(rel|i)\frac{1-\rho p(rated|rel,i)}{1-\rho p(rated|i)} is the optimal ranking function for \mathbb{E}[P@1|f], and as a result of experimental design the expected observed precision \mathbb{E}[\hat{p}@1|f] is optimized by \hat{f}(i) \sim p(rel|i)\frac{p(rated|rel,i)}{1-\rho p(rated|i)}. As we can see, popularity has advantage in optimizing the expected observed precision.

Because customers have a tendency to rate items that they like, the rating information is not missing at random. To understand rated given relevance and item, the authors gives two simple conditions: conditional item independence and conditional relevance independence. Under both conditions, using pop(i) optimizes the expected observed precision and using avg(i) optimized the expected true precision.

Because we can only measure the expected observed precision with offline experiments and because popularity has advantage optimizing \mathbb{E}(\hat{P}@1), the current recommender systems that are based on offline evaluations favors popularity. Although not all recommender systems are designed the same way as described in this paper, I believe this elegant framework provides intuitions because the ‘popularity of popularity’.

Acknowledging that a user can rate an item only if he or she has bought the item, the authors further consider ‘discovery bias’, because rating depends on all of relevance, discovery and item. This dependencies are characterized in Figure 2 from the paper.

Screen Shot 2018-08-26 at 4.37.58 PM.png

In a realistic senario, our data should fall into category 4 – ‘no assumption’. In this case the expected precision can be approximated with Monte Carlo integration.

The key results and experiment are summaries in Table 1 and Figure 5 in the paper.

Screen Shot 2018-08-26 at 4.42.49 PMScreen Shot 2018-08-26 at 4.42.26 PM

Besides the probability based framework, what I really like about this paper is that the authors designed and collected a crowdsourced dataset that makes real data experiments of relevance-independent discovery and item-independent discovery possible. After reading going through the math and the experiments, I feel quite convinced that

average rating may be in fact a better, safer, more robust signal than the number of positive ratings in terms of true achieved accuracy in most general situations.

Although the metric \mathbb{E}(P@1|R) seems too simple because essentially it is like we can only recommend one item to a user, it is a good and tractable measure to start with. The authors suggest that it empirically consistently generalize other metric such as nDCG@k. However I am not sure how much I would agree with this point, because nDCG@k cares much beyond the top ranked item.

Overall I really like this paper and I think it touches many fundamental problems in popularity bias and provides enough mathematical clairty. I wonder if this paper suggests we should do more online experiments for recommender systems because true accuracy cannot be measured with offline experiments. I am also eager to see what the author has to say about temporal data splitting. Lastly I hope the authors talk about what we should do with the very common ‘5-star’ rating system.



  1. Rocío Cañamares and Pablo Castells. 2018. Should I Follow the Crowd? A Prob- abilistic Analysis of the Effectiveness of Popularity in Recommender Systems. In Proceedings of ACM SIGIR ’18, July 8–12, 2018, Ann Arbor, MI, USA.
    ACM, NY, NY, USA, 10 pages.