SRN – The Hamming Ball Sampler by Titsias and Yau

My Sunday Reading Notes (SRN) this semester will mostly be about Bayesian Computations. This week’s post is on The Hamming Ball Sampler proposed by Titsias and Yau.The hamming ball sampler is a MCMC algorithm for high-dimensional discrete-valued vectors or matrices.  While reading this paper, I also found a blog post about it from Xi’an’s OG, which provided some high-level intuitions and background knowledge.

The paper considers a state space model with discrete hidden space X with parameters \theta and observations y.  Factorial Hidden Markov Model (fHMM) is an example of such a model. In state space models,  the complete data likelihood can be factorized with p(y,X,\theta) =  p(X,\theta) \prod_{i=1}^N p(y_i|X,\theta). Given some prior, we want to sample from the posterior distribution X,\theta | y.

When the dimension of X is large, we would suffer from ‘the curse of dimensionality’. Using a Gibbs sampler, we can iteratively sample \theta  \sim \cdot | X,y and $\theta X \sim \cdot | \theta,y$. Because the dimension of $X$ is high, we should also consider blocked Gibbs sampling on X by for example updating one row (or column) of X at a time. While this is conceptually straightforward and potentially also easy to implement, as the authors pointed out:

Conditional sampling may lead to an inability to escape from local modes in the posterior distribution particularly if the elements of X exhibit strong correlations with each other and together with \theta.

The Hamming Ball Sampler (HBS) introduces an auxiliary variable U that has the same dimension as the latent space X. The augmented joint probability can be factorized with as p(y,X,\theta,U) = p(U|X) p(y,X,\theta). The conditional distribution p(U|X) is chosen to be uniform over a neighborhood set \mathcal{H}_m(X). This set \mathcal{H}_m(X) is a Hamming Ball and it basically says that if U,X are K \times N matrices, then U and X can be different on at most m positions among the K rows. With the auxiliary variable U, the Hamming Ball Sampler alternate between the steps U \gets p(U|X) and (\theta, X) \gets p(\theta,X|U,y).

The Hamming Ball Sampler is like slice-sampling in discrete spaces, and each Hamming Ball \mathcal{H}_m(X) is a slice. Introducing the slice introduces random exploration, and makes it easier to escape from local modes. For the simplest example where X is a K \times N matrix and the hamming distance is defined the the number of different elements each column, if we set m = K/2 then we can potentially change all elements of X in one update. But when m is large, the algorithm complexity also increases.

In this paper the authors provided several numerical examples comparing the Hamming Ball Sampler with block Gibbs Samplers. In the fHMM examples (Figure 4 in the paper) we can see that HBS with m = 2 or 3 achieves having joint posterior density much faster than the block Gibbs Samplers. They also conclude that HB-2 is best balances computational time and sampling efficiency.

Reference:

Titsias, M. K., & Yau, C. (2017). The Hamming ball sampler. Journal of the American Statistical Association112(520), 1598-1611.

Ghahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In Advances in Neural Information Processing Systems (pp. 472-478).