SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part I)

This week I am reading ‘Informed proposals for local MCMC in discrete spaces‘ by Giacomo Zanella. This paper is about designing MCMC algorithms for discrete-values high-dimensional parameters, and the goal is similar to the papers discussed in previous posts (Hamming ball sampler & auxiliary-variable HMC). I decide to split the Sunday Reading Notes on this paper into two parts, because I find many interesting ideas in this paper.

In this paper, Zanella come up with locally-balanced proposals. Suppose \pi(x) is the target density and K_{\sigma}(x,dy) is an uninformed proposal. We assume that as \sigma \to 0 the kernel K_{\sigma}(x,dy) converges to the delta measure. Zanella seeks to modify this uninformed proposal so that it incorporates information about the target \pi and is biased towards areas with higher density. An example of locally-balanced proposals is Q_{\sqrt{\pi}} (x,dy) = \frac{\sqrt{\pi(y) }K_{\sigma}(x,dy)}{(\sqrt{\pi} * K_{\sigma})(x)}. This kernel is reversible with respect to \sqrt{\pi(x)}(\sqrt{\pi} * K_{\sigma})(x), which converges to \pi(x)dx as x \to 0. [Note the normalizing constatn is the convolution \sqrt{\pi(x)}* K_{\sigma} = \int \sqrt{\pi(y)} K_{\sigma}(x,dy)].]

More generally, Zanella considers a class of pointwise informed proposals that has the structure Q_{g,\sigma} = \frac{1}{Z_{g}}\cdot g\left(\frac{\pi(y)}{\pi(x)}\right) K_{\sigma}(x,dy). It is suggested that the function g satisfy g(t) = t g(1/t).

I will save the discussion on locally-balanced proposals and Peskun optimality to Part II. In this part, I want to discuss Section 5: Connection to MALA and gradient-based MCMC. In continuous space, the point-wise informed proposal Q_{g,\sigma} would be infeasible to sample from because of the term g\left(\frac{\pi(y)}{\pi(x)}\right) . If we take a first-order Taylor expansion, we would have Q_{g,\sigma}^{(1)} \propto g \left( \exp ( \nabla  \log \pi(x) (y-x)) \right) K_{\sigma}(x,dy). If we choose g(t) = \sqrt{t} and K_{\sigma}(x,\cdot) =N(x,\sigma^2), this is the MALA proposal.

I find this connection very interesting, although I do not have a good intuition about where this connection comes from. One way to explain it is that gradient-based MCMC in continuous space is using local information to design informed proposals. In the conclusions, the author mentions that this connection should improve robustness of gradient-based MCMC schemes and help with parameter tuning.

References:(x)

  •  Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.

SRN – Auxiliary-variable Exact HMC samplers for Binary Distributions by Pakman and Paninski

It’s time for Sunday Reading Notes again! This week I am discussing another computational statistics paper: ‘Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions’ by Ari Pakman and Liam Paninski from Columbia University. This paper is published at NIPS 2013.

In the previous Hamming Ball sampler SRN post, the algorithm uses data augmentation to sample from discrete distributions. In this week’s paper, the goal is to sample from generic binary distributions with data augmentation into continuous variables.

Let’s say we want to sample from the distribution p(s) defined over s \in \{\pm 1 \}^d given an un-normalized density f(s). The authors propose augmenting with a continuous variable y \in \mathbb{R}^d with joint density p(s,y) = p(s)p(y|s) where p(y|s) is a density we can design but it must satisfy s_i = \mathrm{sgn}(y_i) for all i = 1,\cdots,d. The marginal distribution of $y$ is $p(y) = p(s)p(y|s)$ as a result of this constraint. It turns out that at this point we transformed a d-dimentional binary  problem on s into a d-dimensional continuous problem on y.

To sample from y, the authors suggest using Hamiltonian Monte Carlo, the potential energy is U(y) = - \log p(y) = -\log p(y|s) - log f(s) and the kinetic energy terms is K(q) = <q,q>/2. The HMC sampler involves simulating a trajectory of y that preserves the Hamiltonian H(y,q) = U(y) + K(q) and typically leap-frog simulation is used. With the constraint in p(y|s), the potential function is defined only piece-wise and we need to be careful when the trajectory crosses regions. o this end, the authors insist we choose p(y|s) such that \log p(y|s) is quadratic, so that the trajectory is deterministic and approximations methods are not necessary.

Because U(y) has a jump at y_i = 0, the value of momentum q_i should change when we cross boundaries. This is, in my opinion, the most interesting part of the paper. Suppose at time t_j we have y_j = 0, then a change in trajectory must happen and let’s say the momentum just before and after y_j = 0 are q_j(t_j^-) and q_j(t_j^+). Conservation of energy says we must have q_j^2(t_j^+)/2+ U(y_j = 0, s_j = -1) = q_j^2(t_j^-)/ 2+ U(y_j = 0, s_j = +1) if y_j <0 before the y_j = 0. From this equation, if q_j^2(t_j^+)>0 then we continue the trajectory with q_j(t_j^+) =q_j(t_j^-); however, if q_j^2(t_j^+)<0  then the particle is reflected from a wall at y_j = 0 and the trajectory gets reflected with q_j(t_j^+) = - q_j(t_j^-).

The two augmentations mentioned in this paper are Gaussian augmentation and exponential augmentation. Both results in quadratic log likelihood. The Gaussian augmentation is very interesting because there is a fixed order that each coordinate y_j reaches zero and the successive hits occur at t_j + n \pi. The authors makes an observation that:

Interestingly, the rate at which wall y_j = 0 is crossed coincides with the acceptance rate in a Metropolis algorithm that samples uniformly a value for i and makes a proposal of flipping the binary variable s_i.

To me this is a sanity check rather than a surprise because each coordinate hits the boundary the same number of times and making a decision to continue or to bounce back in y_j is the same as deciding whether we should flip the sign of s_i. But I think the authors give a very help comment pointing out that although the acceptance probability is the same,  the method proposed is still different from Metropolis because

in HMC the order in which the walls are hit is fixed given the initial velocity, and the values of q_i^2 at successive hits of y_i = 0 within the same iteration are not independent.

What’s interesting for the exponential augmentation method is that

particles moves away faster from areas of lower probability.

This is certainly a nice feature to have so that the sample mixes well.

In the simulation examples, the authors compared Gaussian HMC and Metropolis on 1d and 2d ising models and showed that:

  1. ‘the HMC sampler explores faster the samples space once chain has reached equilibrium distribution.’
  2. ‘the HMC sampler is faster in reaching the equilibrium distribution.’

I think the take away from this paper is the continuous data augmentation to sample discrete variables and their dealing with piece-wise defined potential function.

 

Reference:

  • Pakman, A., & Paninski, L. (2013). Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in neural information processing systems (pp. 2490-2498).
  •  Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo2(11), 2.

SRN – The Bayesian Lasso by Park and Casella

‘The Bayesian Lasso’ paper by Park and Casella has been on my desk for a while. Thanks to the Sunday Reading Notes (SRN) series, I finally sit down to read and think about this paper.

Lasso (Least absolute shrinkage and selection operator) is a regression method where comparing to ordinary least squares some estimators shrinks to zero while others are set to zero. Lasso is able to do both variable selection (done with step-wise regression before Lasso) and shrinkage to avoid overfitting (for example with ridge regression) at the same time. The objective function for Lasso is \min_{\beta} (y-X\beta)^T(y-X\beta) + \lambda \sum_{j=1}^p |\beta_j|. In the 1996 paper, Tibshirani points out that Lasso estimates can be interpreted as posterior mode estimates (MAP) when we put i.i.d. Lapalace priors (\pi(\beta) \propto \prod_{j=1}^p \exp(-\lambda |\beta_j|).

When I took the Bayesian Data Analysis class (STAT 220 at Harvard), our textbook BDA3 contains an exercise on Lasso regularization and it claims that ‘a fully Bayesian lasso is the same model as lasso but assigning a hyperprior to \lambda. ‘ But ‘The Bayesian Lasso’ paper sets up the hierarchy with a different route:

graphicalmodel

The full conditional distributions on \beta, \sigma^2 and \vec{\tau^2} have closed-form solutions and are provided in Section 2 of the paper. In Section 3, the authors discuss ways of choosing the Lasso parameter: 1) Empirical Bayes by marginal maximum likelihood: iteratively update \lambda^{(k)} = \sqrt{\frac{2p}{\sum_{j=1}^p \mathbb{E}_{\lambda^{(k-1)}} [\tau_j^2|y] }}; 2) hyper-priors on \lambda^2.

The examples in this paper come from the diabetes dataset from the Tibshirani 1996 paper and can be found in the ‘lars‘ R package. On this dataset the results from a Bayesian lasso are very similar to a regular Lasso, as seen in Figure 1 & 2 from the paper or the plot below where I borrowed code from the Rdocumentation of the ‘monomvn‘ R package.

BayesianLassoVisual

In the plot above, we can see the shrinkage effect and the variable selection happening. Compared to ordinary least squares (green), the both the lasso estimates and samples from Bayesian lasso shrink towards zero (except for b.3) and the lasso estimate fixes some parameters (b.1,2,5,6,8,10) at zero. Comparing to Lasso where we do not have uncertainty estimates unless with bootstrap, the Bayesian lasso is providing these uncertainty estimates.

References:

Hastie, T., & Efron, B. (2013). lars: Least angle regression, lasso and forward stagewise, 2013.

Gramacy, R. B., & Gramacy, M. R. B. (2013). Package ‘monomvn’.

Park, T., & Casella, G. (2008). The bayesian lasso. Journal of the American Statistical Association103(482), 681-686.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.

SRN – The Hamming Ball Sampler by Titsias and Yau

My Sunday Reading Notes (SRN) this semester will mostly be about Bayesian Computations. This week’s post is on The Hamming Ball Sampler proposed by Titsias and Yau.The hamming ball sampler is a MCMC algorithm for high-dimensional discrete-valued vectors or matrices.  While reading this paper, I also found a blog post about it from Xi’an’s OG, which provided some high-level intuitions and background knowledge.

The paper considers a state space model with discrete hidden space X with parameters \theta and observations y.  Factorial Hidden Markov Model (fHMM) is an example of such a model. In state space models,  the complete data likelihood can be factorized with p(y,X,\theta) =  p(X,\theta) \prod_{i=1}^N p(y_i|X,\theta). Given some prior, we want to sample from the posterior distribution X,\theta | y.

When the dimension of X is large, we would suffer from ‘the curse of dimensionality’. Using a Gibbs sampler, we can iteratively sample \theta  \sim \cdot | X,y and $\theta X \sim \cdot | \theta,y$. Because the dimension of $X$ is high, we should also consider blocked Gibbs sampling on X by for example updating one row (or column) of X at a time. While this is conceptually straightforward and potentially also easy to implement, as the authors pointed out:

Conditional sampling may lead to an inability to escape from local modes in the posterior distribution particularly if the elements of X exhibit strong correlations with each other and together with \theta.

The Hamming Ball Sampler (HBS) introduces an auxiliary variable U that has the same dimension as the latent space X. The augmented joint probability can be factorized with as p(y,X,\theta,U) = p(U|X) p(y,X,\theta). The conditional distribution p(U|X) is chosen to be uniform over a neighborhood set \mathcal{H}_m(X). This set \mathcal{H}_m(X) is a Hamming Ball and it basically says that if U,X are K \times N matrices, then U and X can be different on at most m positions among the K rows. With the auxiliary variable U, the Hamming Ball Sampler alternate between the steps U \gets p(U|X) and (\theta, X) \gets p(\theta,X|U,y).

The Hamming Ball Sampler is like slice-sampling in discrete spaces, and each Hamming Ball \mathcal{H}_m(X) is a slice. Introducing the slice introduces random exploration, and makes it easier to escape from local modes. For the simplest example where X is a K \times N matrix and the hamming distance is defined the the number of different elements each column, if we set m = K/2 then we can potentially change all elements of X in one update. But when m is large, the algorithm complexity also increases.

In this paper the authors provided several numerical examples comparing the Hamming Ball Sampler with block Gibbs Samplers. In the fHMM examples (Figure 4 in the paper) we can see that HBS with m = 2 or 3 achieves having joint posterior density much faster than the block Gibbs Samplers. They also conclude that HB-2 is best balances computational time and sampling efficiency.

Reference:

Titsias, M. K., & Yau, C. (2017). The Hamming ball sampler. Journal of the American Statistical Association112(520), 1598-1611.

Ghahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In Advances in Neural Information Processing Systems (pp. 472-478).