SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part I)

This week I am reading ‘Informed proposals for local MCMC in discrete spaces‘ by Giacomo Zanella. This paper is about designing MCMC algorithms for discrete-values high-dimensional parameters, and the goal is similar to the papers discussed in previous posts (Hamming ball sampler & auxiliary-variable HMC). I decide to split the Sunday Reading Notes on this paper into two parts, because I find many interesting ideas in this paper.

In this paper, Zanella come up with locally-balanced proposals. Suppose \pi(x) is the target density and K_{\sigma}(x,dy) is an uninformed proposal. We assume that as \sigma \to 0 the kernel K_{\sigma}(x,dy) converges to the delta measure. Zanella seeks to modify this uninformed proposal so that it incorporates information about the target \pi and is biased towards areas with higher density. An example of locally-balanced proposals is Q_{\sqrt{\pi}} (x,dy) = \frac{\sqrt{\pi(y) }K_{\sigma}(x,dy)}{(\sqrt{\pi} * K_{\sigma})(x)}. This kernel is reversible with respect to \sqrt{\pi(x)}(\sqrt{\pi} * K_{\sigma})(x), which converges to \pi(x)dx as x \to 0. [Note the normalizing constatn is the convolution \sqrt{\pi(x)}* K_{\sigma} = \int \sqrt{\pi(y)} K_{\sigma}(x,dy)].]

More generally, Zanella considers a class of pointwise informed proposals that has the structure Q_{g,\sigma} = \frac{1}{Z_{g}}\cdot g\left(\frac{\pi(y)}{\pi(x)}\right) K_{\sigma}(x,dy). It is suggested that the function g satisfy g(t) = t g(1/t).

I will save the discussion on locally-balanced proposals and Peskun optimality to Part II. In this part, I want to discuss Section 5: Connection to MALA and gradient-based MCMC. In continuous space, the point-wise informed proposal Q_{g,\sigma} would be infeasible to sample from because of the term g\left(\frac{\pi(y)}{\pi(x)}\right) . If we take a first-order Taylor expansion, we would have Q_{g,\sigma}^{(1)} \propto g \left( \exp ( \nabla  \log \pi(x) (y-x)) \right) K_{\sigma}(x,dy). If we choose g(t) = \sqrt{t} and K_{\sigma}(x,\cdot) =N(x,\sigma^2), this is the MALA proposal.

I find this connection very interesting, although I do not have a good intuition about where this connection comes from. One way to explain it is that gradient-based MCMC in continuous space is using local information to design informed proposals. In the conclusions, the author mentions that this connection should improve robustness of gradient-based MCMC schemes and help with parameter tuning.


  •  Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.

SRN – Auxiliary-variable Exact HMC samplers for Binary Distributions by Pakman and Paninski

It’s time for Sunday Reading Notes again! This week I am discussing another computational statistics paper: ‘Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions’ by Ari Pakman and Liam Paninski from Columbia University. This paper is published at NIPS 2013.

In the previous Hamming Ball sampler SRN post, the algorithm uses data augmentation to sample from discrete distributions. In this week’s paper, the goal is to sample from generic binary distributions with data augmentation into continuous variables.

Let’s say we want to sample from the distribution p(s) defined over s \in \{\pm 1 \}^d given an un-normalized density f(s). The authors propose augmenting with a continuous variable y \in \mathbb{R}^d with joint density p(s,y) = p(s)p(y|s) where p(y|s) is a density we can design but it must satisfy s_i = \mathrm{sgn}(y_i) for all i = 1,\cdots,d. The marginal distribution of $y$ is $p(y) = p(s)p(y|s)$ as a result of this constraint. It turns out that at this point we transformed a d-dimentional binary  problem on s into a d-dimensional continuous problem on y.

To sample from y, the authors suggest using Hamiltonian Monte Carlo, the potential energy is U(y) = - \log p(y) = -\log p(y|s) - log f(s) and the kinetic energy terms is K(q) = <q,q>/2. The HMC sampler involves simulating a trajectory of y that preserves the Hamiltonian H(y,q) = U(y) + K(q) and typically leap-frog simulation is used. With the constraint in p(y|s), the potential function is defined only piece-wise and we need to be careful when the trajectory crosses regions. o this end, the authors insist we choose p(y|s) such that \log p(y|s) is quadratic, so that the trajectory is deterministic and approximations methods are not necessary.

Because U(y) has a jump at y_i = 0, the value of momentum q_i should change when we cross boundaries. This is, in my opinion, the most interesting part of the paper. Suppose at time t_j we have y_j = 0, then a change in trajectory must happen and let’s say the momentum just before and after y_j = 0 are q_j(t_j^-) and q_j(t_j^+). Conservation of energy says we must have q_j^2(t_j^+)/2+ U(y_j = 0, s_j = -1) = q_j^2(t_j^-)/ 2+ U(y_j = 0, s_j = +1) if y_j <0 before the y_j = 0. From this equation, if q_j^2(t_j^+)>0 then we continue the trajectory with q_j(t_j^+) =q_j(t_j^-); however, if q_j^2(t_j^+)<0  then the particle is reflected from a wall at y_j = 0 and the trajectory gets reflected with q_j(t_j^+) = - q_j(t_j^-).

The two augmentations mentioned in this paper are Gaussian augmentation and exponential augmentation. Both results in quadratic log likelihood. The Gaussian augmentation is very interesting because there is a fixed order that each coordinate y_j reaches zero and the successive hits occur at t_j + n \pi. The authors makes an observation that:

Interestingly, the rate at which wall y_j = 0 is crossed coincides with the acceptance rate in a Metropolis algorithm that samples uniformly a value for i and makes a proposal of flipping the binary variable s_i.

To me this is a sanity check rather than a surprise because each coordinate hits the boundary the same number of times and making a decision to continue or to bounce back in y_j is the same as deciding whether we should flip the sign of s_i. But I think the authors give a very help comment pointing out that although the acceptance probability is the same,  the method proposed is still different from Metropolis because

in HMC the order in which the walls are hit is fixed given the initial velocity, and the values of q_i^2 at successive hits of y_i = 0 within the same iteration are not independent.

What’s interesting for the exponential augmentation method is that

particles moves away faster from areas of lower probability.

This is certainly a nice feature to have so that the sample mixes well.

In the simulation examples, the authors compared Gaussian HMC and Metropolis on 1d and 2d ising models and showed that:

  1. ‘the HMC sampler explores faster the samples space once chain has reached equilibrium distribution.’
  2. ‘the HMC sampler is faster in reaching the equilibrium distribution.’

I think the take away from this paper is the continuous data augmentation to sample discrete variables and their dealing with piece-wise defined potential function.



  • Pakman, A., & Paninski, L. (2013). Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in neural information processing systems (pp. 2490-2498).
  •  Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo2(11), 2.

SRN – The Bayesian Lasso by Park and Casella

‘The Bayesian Lasso’ paper by Park and Casella has been on my desk for a while. Thanks to the Sunday Reading Notes (SRN) series, I finally sit down to read and think about this paper.

Lasso (Least absolute shrinkage and selection operator) is a regression method where comparing to ordinary least squares some estimators shrinks to zero while others are set to zero. Lasso is able to do both variable selection (done with step-wise regression before Lasso) and shrinkage to avoid overfitting (for example with ridge regression) at the same time. The objective function for Lasso is \min_{\beta} (y-X\beta)^T(y-X\beta) + \lambda \sum_{j=1}^p |\beta_j|. In the 1996 paper, Tibshirani points out that Lasso estimates can be interpreted as posterior mode estimates (MAP) when we put i.i.d. Lapalace priors (\pi(\beta) \propto \prod_{j=1}^p \exp(-\lambda |\beta_j|).

When I took the Bayesian Data Analysis class (STAT 220 at Harvard), our textbook BDA3 contains an exercise on Lasso regularization and it claims that ‘a fully Bayesian lasso is the same model as lasso but assigning a hyperprior to \lambda. ‘ But ‘The Bayesian Lasso’ paper sets up the hierarchy with a different route:


The full conditional distributions on \beta, \sigma^2 and \vec{\tau^2} have closed-form solutions and are provided in Section 2 of the paper. In Section 3, the authors discuss ways of choosing the Lasso parameter: 1) Empirical Bayes by marginal maximum likelihood: iteratively update \lambda^{(k)} = \sqrt{\frac{2p}{\sum_{j=1}^p \mathbb{E}_{\lambda^{(k-1)}} [\tau_j^2|y] }}; 2) hyper-priors on \lambda^2.

The examples in this paper come from the diabetes dataset from the Tibshirani 1996 paper and can be found in the ‘lars‘ R package. On this dataset the results from a Bayesian lasso are very similar to a regular Lasso, as seen in Figure 1 & 2 from the paper or the plot below where I borrowed code from the Rdocumentation of the ‘monomvn‘ R package.


In the plot above, we can see the shrinkage effect and the variable selection happening. Compared to ordinary least squares (green), the both the lasso estimates and samples from Bayesian lasso shrink towards zero (except for b.3) and the lasso estimate fixes some parameters (b.1,2,5,6,8,10) at zero. Comparing to Lasso where we do not have uncertainty estimates unless with bootstrap, the Bayesian lasso is providing these uncertainty estimates.


Hastie, T., & Efron, B. (2013). lars: Least angle regression, lasso and forward stagewise, 2013.

Gramacy, R. B., & Gramacy, M. R. B. (2013). Package ‘monomvn’.

Park, T., & Casella, G. (2008). The bayesian lasso. Journal of the American Statistical Association103(482), 681-686.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.

SRN – The Hamming Ball Sampler by Titsias and Yau

My Sunday Reading Notes (SRN) this semester will mostly be about Bayesian Computations. This week’s post is on The Hamming Ball Sampler proposed by Titsias and Yau.The hamming ball sampler is a MCMC algorithm for high-dimensional discrete-valued vectors or matrices.  While reading this paper, I also found a blog post about it from Xi’an’s OG, which provided some high-level intuitions and background knowledge.

The paper considers a state space model with discrete hidden space X with parameters \theta and observations y.  Factorial Hidden Markov Model (fHMM) is an example of such a model. In state space models,  the complete data likelihood can be factorized with p(y,X,\theta) =  p(X,\theta) \prod_{i=1}^N p(y_i|X,\theta). Given some prior, we want to sample from the posterior distribution X,\theta | y.

When the dimension of X is large, we would suffer from ‘the curse of dimensionality’. Using a Gibbs sampler, we can iteratively sample \theta  \sim \cdot | X,y and $\theta X \sim \cdot | \theta,y$. Because the dimension of $X$ is high, we should also consider blocked Gibbs sampling on X by for example updating one row (or column) of X at a time. While this is conceptually straightforward and potentially also easy to implement, as the authors pointed out:

Conditional sampling may lead to an inability to escape from local modes in the posterior distribution particularly if the elements of X exhibit strong correlations with each other and together with \theta.

The Hamming Ball Sampler (HBS) introduces an auxiliary variable U that has the same dimension as the latent space X. The augmented joint probability can be factorized with as p(y,X,\theta,U) = p(U|X) p(y,X,\theta). The conditional distribution p(U|X) is chosen to be uniform over a neighborhood set \mathcal{H}_m(X). This set \mathcal{H}_m(X) is a Hamming Ball and it basically says that if U,X are K \times N matrices, then U and X can be different on at most m positions among the K rows. With the auxiliary variable U, the Hamming Ball Sampler alternate between the steps U \gets p(U|X) and (\theta, X) \gets p(\theta,X|U,y).

The Hamming Ball Sampler is like slice-sampling in discrete spaces, and each Hamming Ball \mathcal{H}_m(X) is a slice. Introducing the slice introduces random exploration, and makes it easier to escape from local modes. For the simplest example where X is a K \times N matrix and the hamming distance is defined the the number of different elements each column, if we set m = K/2 then we can potentially change all elements of X in one update. But when m is large, the algorithm complexity also increases.

In this paper the authors provided several numerical examples comparing the Hamming Ball Sampler with block Gibbs Samplers. In the fHMM examples (Figure 4 in the paper) we can see that HBS with m = 2 or 3 achieves having joint posterior density much faster than the block Gibbs Samplers. They also conclude that HB-2 is best balances computational time and sampling efficiency.


Titsias, M. K., & Yau, C. (2017). The Hamming ball sampler. Journal of the American Statistical Association112(520), 1598-1611.

Ghahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In Advances in Neural Information Processing Systems (pp. 472-478).

SRN – Should I follow the crowd? by Canamares and Castells from SIGIR’18

It’s time for Sunday Reading Notes again. This past week I have been reading the paper ‘Should I Follow the Crowd? A Probabilistic Analysis of the Effectiveness of Popularity in Recommender Systems‘ by Rocio Canamares and Pablo Castells. It won the best paper award at SIGIR 2018!

Current state of the art recommender systems tend to recommend popular items to their users. As the paper points out in its abstract:

The fundamental questions remains open though whether popularity is really a bias we should avoid or not, whether it could be a useful and reliable signal in recommendation, or it may be unfairly rewarded by the experimentation biases.

This paper is concerned with un-personalized recommendations evaluated in off-line experiments. The dataset consists of rated items only and is randomly split into training and test. The ratings takes binary values (positive rating / negative rating). For every item, popularity is defined in terms of number of positive ratings, which is different from average rating. In order to make recommendations, we want to measure ‘relevance‘, which is a binary random variable and res(i) = 1 if the user likes item i. From this set-up, we have pop(i) \sim p(rel,rate|i) and avg(i) \sim p(rel|rate,i). But we cannot directly measure p(rel|i).

The author uses the metric ‘expected precision’ as the metric for recommendations and emphasizes that true precision (cannot be measured from offline experiments) is different from observed precision (measured with rating from the test set). The authors find the optimal ranking function for both true precision and observed precision. Using relevance optimizes the true precision : f(i) = p(rel|i)\frac{1-\rho p(rated|rel,i)}{1-\rho p(rated|i)} is the optimal ranking function for \mathbb{E}[P@1|f], and as a result of experimental design the expected observed precision \mathbb{E}[\hat{p}@1|f] is optimized by \hat{f}(i) \sim p(rel|i)\frac{p(rated|rel,i)}{1-\rho p(rated|i)}. As we can see, popularity has advantage in optimizing the expected observed precision.

Because customers have a tendency to rate items that they like, the rating information is not missing at random. To understand rated given relevance and item, the authors gives two simple conditions: conditional item independence and conditional relevance independence. Under both conditions, using pop(i) optimizes the expected observed precision and using avg(i) optimized the expected true precision.

Because we can only measure the expected observed precision with offline experiments and because popularity has advantage optimizing \mathbb{E}(\hat{P}@1), the current recommender systems that are based on offline evaluations favors popularity. Although not all recommender systems are designed the same way as described in this paper, I believe this elegant framework provides intuitions because the ‘popularity of popularity’.

Acknowledging that a user can rate an item only if he or she has bought the item, the authors further consider ‘discovery bias’, because rating depends on all of relevance, discovery and item. This dependencies are characterized in Figure 2 from the paper.

Screen Shot 2018-08-26 at 4.37.58 PM.png

In a realistic senario, our data should fall into category 4 – ‘no assumption’. In this case the expected precision can be approximated with Monte Carlo integration.

The key results and experiment are summaries in Table 1 and Figure 5 in the paper.

Screen Shot 2018-08-26 at 4.42.49 PMScreen Shot 2018-08-26 at 4.42.26 PM

Besides the probability based framework, what I really like about this paper is that the authors designed and collected a crowdsourced dataset that makes real data experiments of relevance-independent discovery and item-independent discovery possible. After reading going through the math and the experiments, I feel quite convinced that

average rating may be in fact a better, safer, more robust signal than the number of positive ratings in terms of true achieved accuracy in most general situations.

Although the metric \mathbb{E}(P@1|R) seems too simple because essentially it is like we can only recommend one item to a user, it is a good and tractable measure to start with. The authors suggest that it empirically consistently generalize other metric such as nDCG@k. However I am not sure how much I would agree with this point, because nDCG@k cares much beyond the top ranked item.

Overall I really like this paper and I think it touches many fundamental problems in popularity bias and provides enough mathematical clairty. I wonder if this paper suggests we should do more online experiments for recommender systems because true accuracy cannot be measured with offline experiments. I am also eager to see what the author has to say about temporal data splitting. Lastly I hope the authors talk about what we should do with the very common ‘5-star’ rating system.



  1. Rocío Cañamares and Pablo Castells. 2018. Should I Follow the Crowd? A Prob- abilistic Analysis of the Effectiveness of Popularity in Recommender Systems. In Proceedings of ACM SIGIR ’18, July 8–12, 2018, Ann Arbor, MI, USA.
    ACM, NY, NY, USA, 10 pages.

SRN – Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments by Alex Deng

The Sunday Reading Notes (SRN) series is back! This Sunday’s topic is Bayesian A/B tests. Last month, I wrote a post on Bayesian continuous monitoring, commenting on the DSAA’16 paper Continuous Monitoring of A/B Tests without Pain by Deng et al.

In that paper, the main take away is that peeking is not a problem in Bayesian A/B tests if we use genuine priors and prior odds, and Bayesian A/B tests controls False Discovery Rate (FDR).  In this paper, the authors mention that we can learn an objective prior from empirical data, instead of using non-infomration priors in Bayesian A/B tests.

Practical prior specification for Bayesian statistics is a big topic. Yesterday I met a machine learning engineer at a BBQ and told him that I have been thinking about objective prior learning and earning stopping for Bayesian A/B tests. His reaction was: don’t you know conjugate priors? don’t you know that priors will be overwhelmed by data very soon? My response was: yes you can use conjugate priors because it’s convenient. But I am not sure if we still want to assume we are in the asymptotic regime when we want to do an early stopping in A/B tests. While I convinced by him and myself with this argument, I am not sure whether this is the only reason prior specification matters in A/B tests. I’d be super happy if there can be some discussions in the comments!

Going back to Deng’s paper, to use priors learned from past experiments for a new test, the only assumption we need to make is that we assume the prior behinds these experiments are the same. This would happen because the variants are designed by the same group of engineers and the treatment effort would have the same distribution. We assume that observations in control come i.i.d. from some distribution with mean \tau_C and observations in treatment come i.i.d. from some distribution with mean \tau_T. The null hypothesis is H_0: \tau_C = \tau_T and alternative H_1: \tau_T \not=\tau_C. Let \sigma^2 be the pooled variance, then the average treatment effect scaled by \sigma is \mu := \mathbb{E}(\delta) = \frac{\tau_T - \tau_C}{\sigma}. (For details of the t-test, see section 3.)

We need to learn the prior odds \frac{p}{1-p} and the prior density \mu \sim \pi..  The algorithm used to learn the parameters is Expectation-Maximization (EM).   The M-step for p is a straightforward one, and the M-step for V is a generalized M-step using moment matching. (Details in Algorithm 1 in Section 3.2).

Once the model is set-up, the procedure is straight-forward. What I find most interesting in this paper is Section 3.4 where the author discuss problems of the Beta-Bernoulli model, which is what people usually use for conversion rates or click-through-rates experiments. The first practical problem comes from a miss match between the experiment randomizing on the user-level while the model assumes page-views being iid. The second problem, which I honestly do not quite understand, is that the conjugate Beta prior cannot be a genuine prior. I wish the author had elaborated more on this point because conversion rate is tested so often in online experiments.

When the author concludes the paper, he talks about what we should do if there are not thousands experiments from which we can learn an objective prior. He talks about the trade-off between the sample size of using all the experiments and the precision from using only relevant experiments. To this end, he suggests setting up a hierarchical model.

I really like this paper and I have read it several times. Every time I read it, I learn a little more about Bayesian A/B tests. I like how it is a good blend of technical derivations, practical considerations and philosophical discussions.While reading the paper, I kind of feel that the author needs more space than what the page limit had given him. Because there are so many places where I hope he had elaborates on or given more details about. The DSAA’16 paper is a follow up on optional stopping for Bayesian A/B tests. I am personally very intrigued by the Beta-Bernoulli discussions and I also want to learn more about what the author has to say about multiple-testing!




SRN – Winner’s Curse by Lee and Shen from Airbnb

Online controlled experiments (A/B tests) has been my reading theme for this summer. This weekend I decide to read

  • Minyong R. Lee and Milan Shen. 2018. Winner’s Curse: Bias Estimation for Total E ects of Features in Online Controlled Experiments. In KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. (link)

This is the conference submission following the illuminating medium post

  • Milan Shen and Mining Lee. Selection Bias in Online Experimentation:Thinking through a method for the Winner’s Curse in A/B testing. (link)

Today websites are testing many incremental changes to the webpage at the same time on large-scale experimentation platforms. The average treatment effect of each incremental change is unbiased estimated. Experiments with statistically significant improvements to some metric will be launched to the users. But the total effect of aggregating these incremental changes is over-estimated, if we simply add up the individual estimates. This is the winner’s curse the authors describe, and they quantified the bias in estimating the total effects in this paper.

Suppose after running N experiments, the set of chosen experiments is A (which is a random set!) with true total effect T_A and estimated total effect S_A. Then it can be shown that \beta = \mathbb{E}\left[S_A - T_A\right] > 0.

The authors provides closed form solution for the bias in the simple case where each estimate follows a normal distribution with known standard deviation X_i \sim \mathcal{N}(a_i,\sigma^2_i). Let’s say b_i is some value choose by analysts running the experiments to select the set A. The authors show that \beta = \sum_{i=1}^n \sigma_i \phi\left(\frac{\sigma_i b_i - a_i}{\sigma_i}\right). The point of this equation is that all the epxeriments contribute to the bias, not just those selected through experimentation, because the sum is over all the experiments! As the Medium post pointed out:

If the set of experiments being aggregated A is fixed from the beginning, then no bias would be introduced. In the end, the bias exists because of the process of selecting the successful ones among many experiments we run, and we do this every day in a large scale experimentation platform.

The authors also provide a bootstrap confidence interval for the bias corrected total effect estimator.

Screen Shot 2018-08-05 at 11.44.08 PM.png
Fig: comparison between (left) adding up the estimated effect of 6 selected experiments, (middle) bias adjusted total effect estimate and (right) total effect estimated from a separate hold-out. Source: Medium post from Airbnb Engineering and Data Science. This is also Figure 6 from KDD paper.

I did not have time to go-through Section 5 – applications at Airbnb from the paper this time and I’d like to revisit it at another chance.

I really like the authors’ way of formulating the solving this selective inference problem. Their de-biasing method requires very little assumptions and is straightforward to calculate. I have not given must thought to it but I am wondering how would a Bayesian approach this problem. Would that be setting up a hierarchical model by pulling all the incremental changes together? I would certainly meditate on this question during running or yoga!

SRN – A near-optimal exploration-exploitation approach for assortment selection by Agrawal et al.

I am very interested in exploration-exploitation trade-off. In a previous Sunday Reading Notes post, I discussed bayesian optimization for learning optimal hyper-parameters for machine learning algorithms, as an example for this trade-off. Today I study another exploration-exploitation algorithm for learning the best assortment: A Near-Optimal Exploration-Exploitation Approach for Assortment Selection by Agrawal et al. It has applications in online advertisement display and recommendation systems for e-commerce.

If we take the recommendation system for e-commerce websites as an example. Suppose this website has N products in total, and we can only recommend a subset S to customers and this subset is at most size K.  This subset S is an assortment and K is a display constraint. When an assortment S is offered to a customer, the probability of the customer purchasing the i-th item in the assortment is p_i(S)  \propto \nu_i where i=0,...,|S| and p_o(S) is the probability of not purchasing any item from S. Further more if the price of each item in this assortment is r_i then the expected revenue from assortment S is R(S) = \sum_{i\in S}. If the probabilities are known, then we will have a static assortment optimization problem of S^\star  = \max_{s \in S} R(S). But if the preferences are not known or if they change over time, then we would want to learn user preference among assortments (explore) and recommend the optimal assortment to them (exploitation), and this is called a (dynamic) assortment optimization problem.  

In the optimization, we want to minimize the cumulative regret until time T which is reg(T) = \sum_{t=1}^T R(S^\star) - \mathbb{E}\left[R(S_t) \right]. In this paper, the user preferences which are described through probabilities p_i(S) are modeled with Multinomial Logistic model (MNL). With the MNL model, we have to estimate the model parameter \nu_i with some multi-armed bandit algorithm to balance exploration and exploitation. Therefore the authors refer to this problem as bandit-MNL. The main contribution of this paper is Theorem 4.1, in which they give a regret bound of O(\sqrt{NT} \log T + N \log^3T) with the proposed Algorithm 1.

Section 4 gives the proof of the main theorem and it is in my opinion very well-written. Proof outline is given in Section 4.1 and the authors marked the 3 steps of the proof, which are detailed later in Sections 4.2 – 4.4.

The exploration-exportation algorithm for bandit-MNL algorithm in Algorithm I iterations between learning the current best possible recommendation with using static assortment optimization methods. After recommending the calculated assortment to a customer, they observe purchasing decisions of customers until a no-purchase happens. Using information form user purchases, we can learn upper confidence bounds of model parameters $later \nu_{i}.$ The UCB estimates are used to find the assortment for the next iteration. This process of going back between offering assortments and observing purchase decisions continuous until T steps set before starting the experiment. In  Section 4.2, the authors should that the UCBs converge to the true value with high probability. (Be careful this is not a convergence in probability!).

In my opinion, this paper lies on the theoretical end of the spectrum and their is no simulation results or real-data examples accompanying the theoretical results. It would be nice to see some visualization of the cumulative regret with this algorithm and how it compare to the theoretical lower bound presented in Section 5.

I have two idea of how this problems can be extended to accommodate for complex applied problems. For examples, 1)  item prices are assumed to be known for now. It would be interesting extension for this paper to see if we can treat r_i‘s as random variables or something we can learn to optimize. 2) In this algorithm we have to assume customers are homogenous. Personalized assortment optimization would be an interesting direction to explore.


  • Agrawal, Shipra, et al. “A near-optimal exploration-exploitation approach for assortment selection.” Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 2016.



SRN – applications of embeddings in search ranking and recommendations

In this Sunday’s Sunday Reading Notes (which actually is posted on Monday), I am venturing into the applied machine learning world and discussing two blog posts about the application of embeddings at AirBnB and Etsy.

It is a fascinating read for me because I am deeply attracted by the versatility of the algorithm. Neither blog posts focused on the details on the training of embeddings. Instead they are written to motivate readers to understand the intuition behind the algorithms and how to adapt the loss function to each websites specific needs.

Semantic embeddings are invented in Natural Language Process fields to learn continuous low-dimensional representations of high-dimensional sparse-vectors. Needless to say, working in a lower dimension makes computations faster. These neural network based algorithms are trained with large text data sets and are based on the intuitions that words that appear together a lot are related. AirBnB and Etsy used embedding to model user behavior. Using the analogy at Nishan provided in the Etsy article, each user session is a sentence and the sequence of actions by the user are the words.

The AirBnB article provided more details on the negative sampling, the algorithm used to  train embeddings.1*dOy3xKj5ts1FW_YypI1MAQ

This diagram from the AirBnB post illustrate the training process of the embeddings. The ‘booked listing’ (in purple) is what the AirBnB team added to the algorithm. The booked listing serves as a global context token, aiding the prediction of eventual booking in the embedding training process. In addition to training the central listing with context listings and the booked listing, because the travelers often only search with-in the same market, AirBnB engineers also added a randomly selected listing from the same market of the central listing. As a result listings within the same market should be closer in the embedding space. Indeed encoding geographical information in the embedding achieved as evident from the plot below (from the AirBnB post).1*JhWHx2mwuD898RiE18dF8Q

What the Etsy engineer’s do differently is that they added some ‘implicit contextual information’ into the lose function:

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

This is quite interesting because they considered the ordering of items (words/clicks) and thus both concurrence and ordering are considered in the loss function.

Both articles give concrete examples of how embeddings improved its performance on search ranking and similar items recommendations. I highly encouraged interested readers to check them out because you do not want to spoil your fun with my post.

What I find interesting in this paper is how engineers from both firms adopted this model from natural language processing and used in to serve its own customers. I have been so narrow minded about using NLP algorithms for NLP problems only. More than that, due to their respective needs they modified the loss function. I really enjoyed thinking through why they needed this adaptations.


Sunday Reading Notes – Bayesian Optimization

For this week’s Sunday Reading Notes, I am switching topics towards bayesian computations and machine learning. This week’s paper  is ‘Practice Bayesian Optimization of Machine Learning Algorithms‘ by Jasper Snoek, Hugo Larochelle and Ryan Adams, and it appeared on NIPS 2012.

On the high level, Bayesian optimization is about fitting Gaussian Process(GP) regression on data currently observed about some black-box function f, and choosing the next point x to get f(x) with result of the GP regression. The premise of such a procedure is that the black-box function f that we want to maximize is very expensive to evaluate. In this case, it make sense that we choose where to evaluate this function smartly based on current information. This is an interesting case for the exploration-exploitation trade-off.

When I first read the paper, I was asking myself: why do we need Bayesian optimization? How does it compare to other optimization methods we learned, for example Gradient Descent? Wouldn’t a grid search on the domain give a much better optimization result? I think to answer these questions, we have to keep remind ourselves that we are optimizing some black-box function, whose gradient is unknown. And what’s more, this function is very expensive to evaluate and we can not afford to perform a grid-search on it. As an example, think about tuning parameters for a neural network. Admittedly Bayesian optimization can be expensive as well, therefore we would choose to use Bayesian optimization over grid-search when all the integration and maximization involved in choosing x_{n+1} is much less expensive then evaluating f(x_{n+1}).

We start with putting on GP prior on f and assume each observation is some noisy realization of the true function value: y_n \sim \mathcal{N}(f(x_n),\nu). The posterior distribution f|\{x_n,y_n\},\theta is fully characterized by a predictive mean function \nu(x; \{x_n,y_n\},\theta) and the predictive variance function \sigma^2(x; \{x_n,y_n\},\theta)for every x on the domain of f: x\sim \mathcal{N}(\mu(x;\{x_n,y_n\},\theta), \sigma^2(x;\{x_n,y_n\},\theta). For more details about Gaussian Process regression, you can read the book Gaussian Processes for Machine Learning by Rasmussen and Williams.

If the current best value is x_{min} = \arg\min_{x_i}f(x_i), then for every x on the domain, the probability of f(x) smaller then f(x_{min}) is \alpha_{PI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x)) where \gamma(x) = \frac{f(x_{min} - \mu(x; \{x_n,y_n\},\theta)}{\sigma(x;\{x_n,y_n\},\theta)}. In Bayesian optimization, we call this probability of improvement an acquisition function. There are other acquisition functions such as expectation of improvement (EI) and lower confidence bounds (LCB). We optimize the acquisition function to choose the next point x_{n+1} \gets argmax \ \alpha_{PI}.

The most interesting section in this paper is Section 3: practical considerations, where the authors goes through

  1. choice of covariance function;
  2. treatment of hyper-parameter;
  3. modeling cost;
  4. parallelization.

The first two issues also appear in the GPML book. Basically in this paper the authors  recommend the Matern 5/2 kernel because it only assumes twice-twice-differentiabiltiy. The squared exponential  kernel is the default choice for GP regression, but its smoothness assumption is unrealistic more most machine learning algorithms.

For choosing hyper-parameters, we can choose hyper-parameters that maximize the marginal likelihood function. For a full Bayesian treatment, we have to marginalize over the hyper-parameters and work with the integrated acquisition function \hat{\alpha}(x;\{x_n,y_n\}) = \int \alpha(x;\{x_n,y_n\},\theta) p(\theta|\{x_n,y_n\} d\theta. To approximate this integral, the authors used a slice-sampler with step-in and step-out procedure. The details of this slice-sampler is described in Slice sampling covariance hyperparameters of latent Gaussian models by Murray and Adams. There are some tricks like operating on a long MCMC chain so that we do not waste too many samples, but this could still be an expensive computation. However, the cost is justified by lower cost comparing to evaluating f:

As both opti- mization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving an N -dimensional linear system (and our function evaluations are assumed to be much more expen- sive anyway), the fully-Bayesian treatment is sensible and our empirical evaluations bear this out.

I have played with this algorithm myself on some simple functions like Brainin-Hoo function. I’d like to try how this function works on more complicated functions, like online LDA, Latent Structured SVM, and convolutional neural nets.

Lastly, to get some sense of how expensive the computations are, I want to show Figure 4 in the paper.

Screen Shot 2018-07-15 at 10.22.56 PM

If we look at (4b) on the axis, the unit of time is days! In terms of Function evaluations, the Bayesian optimization algorithm dominates a random grid search from a very early stage. In (4a) GP EI MCMC, which uses the least parallelizations is the most efficient in terms of function evaluations, but when we look at (4b) it could take longer time (measured by days!).