# SRN – Auxiliary-variable Exact HMC samplers for Binary Distributions by Pakman and Paninski

It’s time for Sunday Reading Notes again! This week I am discussing another computational statistics paper: ‘Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions’ by Ari Pakman and Liam Paninski from Columbia University. The paper was published at NIPS 2013.

In the previous Hamming Ball sampler SRN post, the algorithm uses data augmentation to sample from discrete distributions. In this week’s paper, the goal is to sample from generic binary distributions with data augmentation into continuous variables.

Let’s say we want to sample from a distribution $p(s)$ defined over $s \in \{\pm 1 \}^d$, given an un-normalized density $f(s).$ The authors propose augmenting with a continuous variable $y \in \mathbb{R}^d$ with joint density $p(s,y) = p(s)p(y|s)$, where $p(y|s)$ is a density we can design, as long as it is supported only where $s_i = \mathrm{sgn}(y_i)$ for all $i = 1,\cdots,d.$ As a result of this constraint, the sum over $s$ in the marginal of $y$ collapses to a single term, and $p(y) = p(s)p(y|s)$ with $s = \mathrm{sgn}(y).$ At this point we have transformed a d-dimensional binary problem on $s$ into a d-dimensional continuous problem on $y$.
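
To make the sign constraint concrete, here is a minimal sketch of drawing $y$ given $s$. It assumes a half-Gaussian form for $p(y|s)$ (each $|y_i|$ is the absolute value of a standard normal), and the function names `sample_y_given_s` and `sign_vector` are my own, chosen for illustration.

```python
import random

def sample_y_given_s(s, rng):
    """Draw y with sgn(y_i) = s_i.

    Assumption: p(y|s) is a half-Gaussian, so |y_i| is the absolute
    value of a standard normal draw and the sign is copied from s_i.
    """
    return [s_i * abs(rng.gauss(0.0, 1.0)) for s_i in s]

def sign_vector(y):
    """Recover the binary vector s from the continuous vector y."""
    return [1 if y_i >= 0 else -1 for y_i in y]

rng = random.Random(0)
s = [1, -1, -1, 1]
y = sample_y_given_s(s, rng)
recovered = sign_vector(y)  # equals s, because the signs of y encode s
```

Since $s$ can always be read off from the signs of $y$, a sampler that targets $p(y)$ gives samples of $s$ for free.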

To sample $y$, the authors suggest using Hamiltonian Monte Carlo, where the potential energy is $U(y) = - \log p(y) = -\log p(y|s) - \log f(s)$ (up to an additive constant) and the kinetic energy term is $K(q) = \langle q, q \rangle/2.$ The HMC sampler involves simulating a trajectory of $y$ that preserves the Hamiltonian $H(y,q) = U(y) + K(q)$, and typically leap-frog simulation is used. With the constraint in $p(y|s)$, the potential function is defined only piece-wise, and we need to be careful when the trajectory crosses from one region to another. To this end, the authors insist we choose $p(y|s)$ such that $\log p(y|s)$ is quadratic, so that the trajectory within each region can be computed exactly and approximation methods are not necessary.

Because $U(y)$ has a jump at $y_i = 0$, the value of the momentum $q_i$ should change when we cross boundaries. This is, in my opinion, the most interesting part of the paper. Suppose at time $t_j$ we have $y_j = 0$; then a change in trajectory must happen. Let’s say the momenta just before and after $y_j = 0$ are $q_j(t_j^-)$ and $q_j(t_j^+).$ If $y_j <0$ before hitting $y_j = 0$, conservation of energy says we must have $q_j^2(t_j^+)/2+ U(y_j = 0, s_j = +1) = q_j^2(t_j^-)/ 2+ U(y_j = 0, s_j = -1)$, because $s_j = -1$ before the crossing and $s_j = +1$ after it. If the value of $q_j^2(t_j^+)$ implied by this equation is positive, the particle crosses the wall and we continue the trajectory with the rescaled momentum $q_j(t_j^+) = \mathrm{sgn}(q_j(t_j^-)) \sqrt{q_j^2(t_j^+)}$; however, if the implied $q_j^2(t_j^+)$ is negative, the particle does not have enough energy to cross, so it is reflected from the wall at $y_j = 0$ with $q_j(t_j^+) = - q_j(t_j^-).$
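
The wall rule can be sketched in a few lines. This is only an illustration under the assumption that energy is conserved across the wall, so the momentum magnitude is rescaled on a successful crossing. The function name `cross_or_reflect` and the argument `delta` (the potential just before the wall minus the potential just after it) are my own labels, not the paper's code.

```python
import math

def cross_or_reflect(q_minus, delta):
    """Update the momentum of one coordinate when it hits y_j = 0.

    q_minus: momentum just before the wall.
    delta:   U(before the wall) - U(after the wall); positive means the
             particle gains kinetic energy by crossing (hypothetical name).
    Returns (q_plus, crossed).
    """
    # Energy conservation: q_plus^2 / 2 = q_minus^2 / 2 + delta
    q_plus_sq = q_minus ** 2 + 2.0 * delta
    if q_plus_sq > 0.0:
        # Enough energy: cross the wall, keep the direction, rescale the speed.
        return math.copysign(math.sqrt(q_plus_sq), q_minus), True
    # Not enough energy: bounce back with the momentum reversed.
    return -q_minus, False

q_plus, crossed = cross_or_reflect(-2.0, 2.5)   # crosses with a larger speed
q_back, bounced = cross_or_reflect(1.0, -1.0)   # reflected, momentum flipped
```

Whenever `crossed` is true, the sign of $y_j$ changes, which is exactly a flip of the binary variable $s_j$.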

The two augmentations mentioned in this paper are Gaussian augmentation and exponential augmentation. Both result in a $\log p(y|s)$ that is at most quadratic in $y$. The Gaussian augmentation is very interesting because there is a fixed order in which the coordinates $y_j$ reach zero, and the successive hits of each coordinate occur at $t_j + n \pi.$ The authors make the observation that:

Interestingly, the rate at which wall $y_j = 0$ is crossed coincides with the acceptance rate in a Metropolis algorithm that samples uniformly a value for $i$ and makes a proposal of flipping the binary variable $s_i.$

To me this is a sanity check rather than a surprise, because each coordinate hits the boundary the same number of times, and deciding whether to continue or to bounce back in $y_j$ is the same as deciding whether we should flip the sign of $s_j.$ But I think the authors give a very helpful comment pointing out that, although the acceptance probability is the same, the proposed method is still different from Metropolis because

in HMC the order in which the walls are hit is fixed given the initial velocity, and the values of $q_i^2$ at successive hits of $y_i = 0$ within the same iteration are not independent.
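
To see why the order of wall hits is fixed, here is a small sketch. It assumes that under the Gaussian augmentation each coordinate follows unit-frequency harmonic motion $y_i(t) = y_i(0)\cos t + q_i(0)\sin t$ between hits; the names `first_hit_time` and `wall_order` are hypothetical.

```python
import math

def first_hit_time(y0, q0):
    """First t > 0 with y0*cos(t) + q0*sin(t) = 0 (assumed harmonic motion)."""
    # Write y(t) = A * sin(t + phase); zeros are at t = n*pi - phase.
    phase = math.atan2(y0, q0)
    t = (-phase) % math.pi
    # If the particle starts exactly on the wall, the next hit is half a period later.
    return t if t > 0.0 else math.pi

def wall_order(y, q):
    """Indices of the coordinates sorted by their first hit time."""
    times = [first_hit_time(y_i, q_i) for y_i, q_i in zip(y, q)]
    return sorted(range(len(y)), key=lambda i: times[i])

order = wall_order([1.0, -1.0], [1.0, 1.0])  # coordinate 1 hits first
```

Every coordinate returns to its wall every $\pi$ units of time, and neither crossing nor reflecting changes that spacing in this sketch, so the order computed from the initial state repeats for the whole trajectory.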

What’s interesting for the exponential augmentation method is that

particles moves away faster from areas of lower probability.

This is certainly a nice feature to have, since it should help the sampler mix well.

In the simulation examples, the authors compare Gaussian HMC and Metropolis on 1D and 2D Ising models and show that:

1. ‘the HMC sampler explores faster the samples space once chain has reached equilibrium distribution.’
2. ‘the HMC sampler is faster in reaching the equilibrium distribution.’
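
For reference, the Metropolis baseline in this kind of comparison can be sketched as a single-site flip sampler on a 1D Ising chain. The coupling value, the free boundary conditions, and the function names below are my assumptions for illustration, not the paper's experimental settings.

```python
import math
import random

def ising_1d_log_f(s, coupling=1.0):
    """Unnormalized log-probability of a 1D Ising chain with free boundaries."""
    return coupling * sum(s[i] * s[i + 1] for i in range(len(s) - 1))

def metropolis_flip(s, rng, coupling=1.0):
    """One Metropolis step: pick a site uniformly and propose flipping it."""
    i = rng.randrange(len(s))
    proposal = list(s)
    proposal[i] = -proposal[i]
    log_ratio = ising_1d_log_f(proposal, coupling) - ising_1d_log_f(s, coupling)
    # Accept with probability min(1, f(proposal) / f(s)).
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return list(s)

rng = random.Random(0)
state = [1, -1, 1, 1, -1]
for _ in range(100):
    state = metropolis_flip(state, rng)
```

Each step recomputes the full energy for clarity; a practical implementation would only look at the neighbours of the flipped site.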

I think the takeaway from this paper is the use of continuous data augmentation to sample discrete variables, and the authors’ treatment of the piece-wise defined potential function.

References:

• Pakman, A., & Paninski, L. (2013). Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in neural information processing systems (pp. 2490-2498).
• Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2.