# SRN – A Geometric Interpretation of the Metropolis-Hastings Algorithm by Billera and Diaconis

Coming back to the Sunday Reading Notes, this week I discuss the paper 'A Geometric Interpretation of the Metropolis-Hastings Algorithm' by Louis J. Billera and Persi Diaconis, from Statistical Science. This paper was suggested to me by Joe Blitzstein.

In Section 4 of 'Informed proposals for local MCMC in discrete spaces' by Giacomo Zanella (see my SRN Part I and Part II), Zanella mentions that the Metropolis-Hastings acceptance probability function (APF) $\min\left(1,\frac{\pi(y)p(y,x)}{\pi(x)p(x,y)}\right)$ is not the only APF that makes the resulting kernel $\pi$-reversible: any APF satisfying detailed balance will do. This came as a 'surprise' to me at first, as I had never seen another APF in practice. But I quickly realized that this fact was mentioned in both Stat 213 and Stat 220 at Harvard, and that I had read about it in Section 5.3, 'Why Does the Metropolis Algorithm Work?', of 'Monte Carlo Strategies in Scientific Computing' by Jun S. Liu. Unfortunately, I did not pay enough attention. Joe suggested this article to me after I posted on Facebook about being upset at not knowing such a basic fact.

In this paper, Billera and Diaconis focus on a finite state space $X$ and view the MH kernel as the projection from the set of stochastic matrices (row sums all equal to 1 and all entries non-negative, denoted by $\mathcal{S}(X)$) onto the set of $\pi$-reversible Markov chains (stochastic matrices satisfying detailed balance $\pi(x)M(x,y) = \pi(y)M(y,x)$, denoted by $R(\pi)$). The projection is with respect to the metric on stochastic matrices $d(K,K') = \sum_{x} \sum_{y \not= x} \pi(x) |K(x,y)-K'(x,y)|.$

The key result of the paper is Theorem 1. The authors prove that the Metropolis map, defined off the diagonal by $M(K)(x,y) = \min\left( K(x,y), \frac{\pi(y)}{\pi(x)}K(y,x)\right)$, minimizes the distance $d$ from the proposal kernel $K$ to $R(\pi)$: $M(K)$ is the unique closest element of $R(\pi)$ that is coordinate-wise smaller than $K$ on the off-diagonal entries. So $M(K)$ is, in this sense, the closest reversible kernel to the original kernel $K$.
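As a quick numerical check of the objects in Theorem 1, here is a minimal NumPy sketch of the Metropolis map and the metric $d$; the 3-state target $\pi$ and proposal $K$ are made-up numbers for illustration only.

```python
import numpy as np

# Toy target and proposal on a 3-state space (made-up numbers for illustration).
pi = np.array([0.5, 0.3, 0.2])
K = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

def metropolis_map(K, pi):
    """Off-diagonal: M(x,y) = min(K(x,y), pi(y)/pi(x) * K(y,x));
    diagonal entries are then set so that each row sums to 1."""
    M = np.minimum(K, (pi[None, :] / pi[:, None]) * K.T)
    np.fill_diagonal(M, 0.0)
    np.fill_diagonal(M, 1.0 - M.sum(axis=1))
    return M

def dist(K1, K2, pi):
    """d(K, K') = sum_x sum_{y != x} pi(x) |K(x,y) - K'(x,y)|."""
    D = np.abs(K1 - K2)
    np.fill_diagonal(D, 0.0)
    return float((pi[:, None] * D).sum())

M = metropolis_map(K, pi)
# M is pi-reversible: pi(x) M(x,y) == pi(y) M(y,x).
assert np.allclose(pi[:, None] * M, (pi[:, None] * M).T)
# Off-diagonal entries of M never exceed those of K, and rows still sum to 1.
offdiag = ~np.eye(3, dtype=bool)
assert np.all(M[offdiag] <= K[offdiag] + 1e-12)
assert np.allclose(M.sum(axis=1), 1.0)
```

The assertions confirm the two properties the theorem singles out: $M(K)$ satisfies detailed balance and sits coordinate-wise below $K$ off the diagonal.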

I think this geometric interpretation offers great intuition about how the MH algorithm works: we start with a kernel $K$ and change it into another kernel with stationary distribution $\pi$, and the change must occur as follows:

From $x$, choose $y$ from $K(x,y)$ and decide whether to accept $y$ or stay at $x$; this last choice may be stochastic, with acceptance probability $F(x,y) \in [0,1]$. This gives a new chain with transition probabilities $K(x,y) F(x,y)$ for $x \not= y$; the diagonal entries are changed so that each row sums to 1. Indeed, this procedure describes exactly how the MH algorithm works. If we insist on $\pi$-reversibility, we must have $0 \leq F(x,y) \leq \min(1,R(x,y))$, where $R(x,y) = \frac{\pi(y)K(y,x)}{\pi(x)K(x,y)}.$ So the MH choice of APF is the one that maximizes the chance of moving from $x$ to $y$. The resulting MH kernel $M$ has the largest spectral gap (1 minus the second-largest eigenvalue), and by Peskun's theorem it must have the minimum asymptotic variance for estimating additive functionals. In Remark 3.2, the authors point out that if we consider only APFs that are functions of $R(x,y)$, then the function must satisfy $g(t) = t\,g(1/t)$, which is exactly the characterization of balancing functions in Zanella's 'informed proposals' paper.

This paper allowed me to study the Metropolis-Hastings algorithm from another angle and to review facts I had neglected in my coursework.

References:

• Billera, L. J., & Diaconis, P. (2001). A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 335-339.
• Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.
• Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.

# SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part II)

This is a super delayed Sunday Reading Notes post, and today I come back to finish discussing the informed proposals paper by Giacomo Zanella. In Part I of my discussion, I reviewed pointwise informed proposals and the locally-informed criterion. In Part II, I focus on the asymptotic optimality of locally-balanced proposals and their simulation studies.
The author uses Peskun ordering to compare the efficiency of MH schemes. He deduces that 'locally-balanced proposals are asymptotically optimal in terms of Peskun ordering' as the dimensionality of the underlying state space increases. The conditions for asymptotic optimality are indeed mild (Proposition 1) for the three illustrative examples: 1) independent binary components, 2) weighted permutations and 3) the Ising model.

In Section 4, Zanella points out a connection between balancing functions and acceptance probability functions (APFs) for MH algorithms, which I find very interesting. He also shows that the optimal proposal for independent binary variables is the Barker choice $g(t) = \frac{t}{1+t}.$ The proof goes by finding the limiting continuous-time process of the MH chain and then finding the optimal $g$ for the limiting process.

The simulation studies use the illustrative examples of weighted permutations and Ising models. The comparisons are in terms of 1) acceptance rate, 2) number of successful flips per unit of computation time and 3) mixing of some summary statistics. The second criterion concerns the trade-off between computational cost (of calculating the informed proposal) and statistical efficiency (producing efficient moves). For simple target distributions (such as the uniform), using locally-balanced proposals does not bring much benefit, but they achieve a much higher number of flips per unit time for more complicated and 'rough' targets; see the second row of Figure 1 of the paper.

I find it interesting that globally-balanced proposals (aka 'naively-informed', $g(t) = t$) are extremely sensitive to initialization. Looking at effective sample size (ESS) per unit time, the chains from GB have much more stable behavior if initialized from stationarity; see GB vs. GB (station.) in Figure 2. But in Figure 5, this phenomenon does not show up for Ising models: initializing from stationarity does not yield ESS performance comparable to that of LB.
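The balancing condition $g(t) = t\,g(1/t)$ that links balancing functions to APFs is easy to check numerically. A quick sketch, verifying it for the Barker choice, the square root, and the Metropolis-Hastings choice:

```python
import math

# Candidate balancing functions discussed in the paper.
barker = lambda t: t / (1.0 + t)   # Barker choice g(t) = t / (1 + t)
sqrt_g = lambda t: math.sqrt(t)    # square-root choice
mh_g   = lambda t: min(1.0, t)     # Metropolis-Hastings choice

# Each should satisfy the balancing condition g(t) = t * g(1/t).
for g in (barker, sqrt_g, mh_g):
    for t in (0.1, 0.5, 2.0, 7.3):
        assert abs(g(t) - t * g(1.0 / t)) < 1e-12

# A non-balanced choice, e.g. g(t) = t**2, fails the condition.
g_bad = lambda t: t**2
assert abs(g_bad(2.0) - 2.0 * g_bad(0.5)) > 1e-12
```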
In the simulation studies section, the author emphasizes the cost-vs-efficiency trade-off, which I find very important. I feel I have ignored this aspect of designing MCMC algorithms and should think more about it in my future studies. The author indicates that 'computations required to sample from locally-balanced proposals are trivially parallelizable'. This is also something very interesting to me, and I hope to learn more about multi-core computations during my PhD.

In his discussion, the author makes reference to Multiple-Try Metropolis. The connection between LB proposals and MT proposals is not entirely obvious to me, but I intuitively agree with the author's comment that the weight function in MT serves a purpose very similar to that of the balancing function term $g\left(\frac{\pi(y)}{\pi(x)}\right)$ in LB.

Side note: in my downloaded version of the paper, the transition kernel in the first equation on Page 1 is wrong, because the $Q(x,dy)$ term needs to be multiplied by its acceptance probability $a(x,y).$

References:

• Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.
• Liu, J. S., Liang, F., & Wong, W. H. (2000). The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449), 121-134.

# SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part I)

This week I am reading 'Informed proposals for local MCMC in discrete spaces' by Giacomo Zanella. This paper is about designing MCMC algorithms for discrete-valued, high-dimensional parameters, and its goal is similar to that of the papers discussed in previous posts (the Hamming ball sampler & auxiliary-variable HMC). I have decided to split the Sunday Reading Notes on this paper into two parts, because I find many interesting ideas in it.

In this paper, Zanella comes up with locally-balanced proposals. Suppose $\pi(x)$ is the target density and $K_{\sigma}(x,dy)$ is an uninformed proposal.
We assume that as $\sigma \to 0$ the kernel $K_{\sigma}(x,dy)$ converges to the delta measure. Zanella seeks to modify this uninformed proposal so that it incorporates information about the target $\pi$ and is biased towards areas with higher density. An example of a locally-balanced proposal is $Q_{\sqrt{\pi}} (x,dy) = \frac{\sqrt{\pi(y)}K_{\sigma}(x,dy)}{(\sqrt{\pi} * K_{\sigma})(x)}$. This kernel is reversible with respect to $\sqrt{\pi(x)}(\sqrt{\pi} * K_{\sigma})(x)$, which converges to $\pi(x)dx$ as $\sigma \to 0.$ [Note: the normalizing constant is the convolution $(\sqrt{\pi} * K_{\sigma})(x) = \int \sqrt{\pi(y)} K_{\sigma}(x,dy).$]

More generally, Zanella considers a class of pointwise informed proposals of the form $Q_{g,\sigma}(x,dy) = \frac{1}{Z_{g}}\, g\left(\frac{\pi(y)}{\pi(x)}\right) K_{\sigma}(x,dy).$ It is suggested that the function $g$ satisfy $g(t) = t\,g(1/t).$

I will save the discussion of locally-balanced proposals and Peskun optimality for Part II. In this part, I want to discuss Section 5: connections to MALA and gradient-based MCMC. In continuous spaces, the pointwise informed proposal $Q_{g,\sigma}$ would be infeasible to sample from because of the term $g\left(\frac{\pi(y)}{\pi(x)}\right).$ If we take a first-order Taylor expansion, we get $Q_{g,\sigma}^{(1)} \propto g \left( \exp ( \nabla \log \pi(x) (y-x)) \right) K_{\sigma}(x,dy).$ If we choose $g(t) = \sqrt{t}$ and $K_{\sigma}(x,\cdot) = N(x,\sigma^2)$, this is the MALA proposal. I find this connection very interesting, although I do not have good intuition about where it comes from. One way to explain it is that gradient-based MCMC in continuous spaces uses local information to design informed proposals. In the conclusions, the author mentions that this connection should improve the robustness of gradient-based MCMC schemes and help with parameter tuning.

References:

• Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces.
arXiv preprint arXiv:1711.07424.

# SRN – Auxiliary-variable Exact HMC samplers for Binary Distributions by Pakman and Paninski

It's time for Sunday Reading Notes again! This week I am discussing another computational statistics paper: 'Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions' by Ari Pakman and Liam Paninski from Columbia University. This paper was published at NIPS 2013.

In the previous Hamming ball sampler SRN post, the algorithm used data augmentation to sample from discrete distributions. In this week's paper, the goal is to sample from generic binary distributions by augmenting them with continuous variables.

Say we want to sample from the distribution $p(s)$ defined over $s \in \{\pm 1 \}^d$, given an un-normalized density $f(s).$ The authors propose augmenting with a continuous variable $y \in \mathbb{R}^d$ with joint density $p(s,y) = p(s)p(y|s)$, where $p(y|s)$ is a density we are free to design, subject to the constraint that $s_i = \mathrm{sgn}(y_i)$ for all $i = 1,\cdots,d.$ As a result of this constraint, the marginal density of $y$ is $p(y) = p(s)p(y|s)$ with $s = \mathrm{sgn}(y)$. At this point, we have transformed a d-dimensional binary problem on $s$ into a d-dimensional continuous problem on $y$.

To sample $y$, the authors suggest using Hamiltonian Monte Carlo: the potential energy is $U(y) = - \log p(y) = -\log p(y|s) - \log f(s)$ and the kinetic energy is $K(q) = \langle q,q \rangle/2.$ The HMC sampler simulates a trajectory of $y$ that preserves the Hamiltonian $H(y,q) = U(y) + K(q)$; typically leap-frog simulation is used. With the constraint built into $p(y|s)$, the potential function is only defined piece-wise, and we need to be careful when the trajectory crosses regions. To this end, the authors insist we choose $p(y|s)$ such that $\log p(y|s)$ is quadratic, so that the trajectory can be computed exactly and approximation methods are not necessary.
Because $U(y)$ has a jump at $y_i = 0$, the value of the momentum $q_i$ should change when we cross a boundary. This is, in my opinion, the most interesting part of the paper. Suppose that at time $t_j$ we have $y_j = 0$; then a change in the trajectory must happen. Let's say the momenta just before and after $y_j = 0$ are $q_j(t_j^-)$ and $q_j(t_j^+).$ If $y_j < 0$ just before $t_j$, conservation of energy says we must have $q_j^2(t_j^+)/2 + U(y_j = 0, s_j = +1) = q_j^2(t_j^-)/2 + U(y_j = 0, s_j = -1).$ From this equation, if $q_j^2(t_j^+) > 0$ then we continue the trajectory with the momentum $q_j(t_j^+)$ determined by this equation; however, if $q_j^2(t_j^+) < 0$ then the particle hits a wall at $y_j = 0$ and the trajectory is reflected, with $q_j(t_j^+) = - q_j(t_j^-).$

The two augmentations mentioned in this paper are the Gaussian augmentation and the exponential augmentation. Both result in quadratic log-likelihoods. The Gaussian augmentation is very interesting because there is a fixed order in which the coordinates $y_j$ reach zero, and the successive hits occur at $t_j + n \pi.$ The authors make the observation that:

> Interestingly, the rate at which wall $y_j = 0$ is crossed coincides with the acceptance rate in a Metropolis algorithm that samples uniformly a value for $i$ and makes a proposal of flipping the binary variable $s_i.$

To me this is a sanity check rather than a surprise, because each coordinate hits the boundary the same number of times, and making the decision to continue or to bounce back in $y_j$ is the same as deciding whether we should flip the sign of $s_j.$ But I think the authors give a very helpful comment pointing out that, although the acceptance probability is the same, the proposed method is still different from Metropolis: in HMC the order in which the walls are hit is fixed given the initial velocity, and the values of $q_i^2$ at successive hits of $y_i = 0$ within the same iteration are not independent.
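The wall-crossing rule can be sketched in a few lines using the energy-conservation equation above. The function name and the numbers below are hypothetical; $\Delta U$ denotes the jump in potential across the wall.

```python
import math

def cross_wall(q_minus, delta_U):
    """Momentum update when coordinate y_j hits 0.

    delta_U = U(y_j = 0, s_j flipped) - U(y_j = 0, s_j current) is the
    potential jump across the wall. Energy conservation gives
    q_plus^2 / 2 = q_minus^2 / 2 - delta_U.
    """
    qp2 = q_minus**2 - 2.0 * delta_U
    if qp2 > 0.0:
        # Enough kinetic energy: cross the wall, keeping the direction of motion.
        return math.copysign(math.sqrt(qp2), q_minus)
    # Not enough energy: the particle bounces off the wall, momentum flips sign.
    return -q_minus

# Illustrative numbers (not from the paper):
assert cross_wall(2.0, 1.5) == 1.0    # 4 - 3 = 1 > 0: crosses with |q| = 1
assert cross_wall(1.0, 1.0) == -1.0   # 1 - 2 < 0: reflected
```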
What's interesting about the exponential augmentation method is that particles move away faster from areas of lower probability. This is certainly a nice feature to have, so that the sampler mixes well.

In the simulation examples, the authors compare Gaussian HMC and Metropolis on 1d and 2d Ising models and show that: 1. 'the HMC sampler explores faster the samples space once chain has reached equilibrium distribution.' 2. 'the HMC sampler is faster in reaching the equilibrium distribution.'

I think the takeaways from this paper are the continuous data augmentation for sampling discrete variables and the treatment of the piece-wise defined potential function.

References:

• Pakman, A., & Paninski, L. (2013). Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems (pp. 2490-2498).
• Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2.

# SRN – The Hamming Ball Sampler by Titsias and Yau

My Sunday Reading Notes (SRN) this semester will mostly be about Bayesian computations. This week's post is on the Hamming ball sampler proposed by Titsias and Yau. The Hamming ball sampler is an MCMC algorithm for high-dimensional discrete-valued vectors or matrices. While reading this paper, I also found a blog post about it from Xi'an's Og, which provides some high-level intuition and background knowledge.

The paper considers state space models with a discrete hidden space $X$, parameters $\theta$ and observations $y.$ The factorial hidden Markov model (fHMM) is an example of such a model. In state space models, the complete data likelihood can be factorized as $p(y,X,\theta) = p(X,\theta) \prod_{i=1}^N p(y_i|X,\theta).$ Given some prior, we want to sample from the posterior distribution of $X,\theta | y.$ When the dimension of $X$ is large, we suffer from 'the curse of dimensionality'.
Using a Gibbs sampler, we can iteratively sample $\theta \sim \cdot \,|\, X,y$ and $X \sim \cdot \,|\, \theta,y$. Because the dimension of $X$ is high, we should also consider blocked Gibbs sampling on $X$, for example updating one row (or column) of $X$ at a time. While this is conceptually straightforward and potentially also easy to implement, as the authors point out:

> Conditional sampling may lead to an inability to escape from local modes in the posterior distribution particularly if the elements of $X$ exhibit strong correlations with each other and together with $\theta$.

The Hamming ball sampler (HBS) introduces an auxiliary variable $U$ that has the same dimension as the latent space $X$. The augmented joint probability can be factorized as $p(y,X,\theta,U) = p(U|X)\, p(y,X,\theta).$ The conditional distribution $p(U|X)$ is chosen to be uniform over a neighborhood set $\mathcal{H}_m(X).$ This set $\mathcal{H}_m(X)$ is a Hamming ball; roughly speaking, if $U,X$ are $K \times N$ matrices, then each column of $U$ can differ from the corresponding column of $X$ in at most $m$ of the $K$ rows. With the auxiliary variable $U$, the Hamming ball sampler alternates between the steps $U \gets p(U|X)$ and $(\theta, X) \gets p(\theta,X|U,y).$
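A minimal sketch of the auxiliary step $U \gets p(U|X)$ under the assumptions above (binary entries, per-column Hamming ball of radius $m$); the function name and the sampling scheme by shell size are my own illustration, not code from the paper.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def sample_hamming_ball(X, m, rng):
    """Draw U uniformly from H_m(X): each column of U differs from the
    corresponding column of the binary matrix X in at most m rows."""
    K, N = X.shape
    U = X.copy()
    # For binary entries, the number of columns at Hamming distance exactly d
    # is C(K, d), so uniform sampling weights each shell by its size.
    weights = np.array([comb(K, d) for d in range(m + 1)], dtype=float)
    weights /= weights.sum()
    for j in range(N):
        d = rng.choice(m + 1, p=weights)             # distance, weighted by shell size
        rows = rng.choice(K, size=d, replace=False)  # which rows to flip
        U[rows, j] = 1 - U[rows, j]
    return U

X = rng.integers(0, 2, size=(5, 4))
U = sample_hamming_ball(X, m=2, rng=rng)
# Every column of U stays within Hamming distance m of X's column.
assert np.all((U != X).sum(axis=0) <= 2)
```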

The Hamming ball sampler is like slice sampling in discrete spaces, and each Hamming ball $\mathcal{H}_m(X)$ is a slice. Introducing the slice brings in random exploration and makes it easier to escape from local modes. For the simplest example, where $X$ is a $K \times N$ matrix and the Hamming distance is defined as the number of differing elements in each column, if we set $m = K/2$ then we can potentially change all elements of $X$ in one update. But when $m$ is large, the algorithm's complexity also increases.

In this paper the authors provide several numerical examples comparing the Hamming ball sampler with block Gibbs samplers. In the fHMM examples (Figure 4 in the paper), we can see that HBS with $m = 2$ or $3$ reaches regions of high joint posterior density much faster than the block Gibbs samplers. The authors also conclude that HB-2 best balances computational time and sampling efficiency.

References:

• Titsias, M. K., & Yau, C. (2017). The Hamming ball sampler. Journal of the American Statistical Association, 112(520), 1598-1611.
• Ghahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In Advances in Neural Information Processing Systems (pp. 472-478).

# Sunday Reading Notes – Bayesian Optimization

For this week's Sunday Reading Notes, I am switching topics towards Bayesian computations and machine learning. This week's paper is 'Practical Bayesian Optimization of Machine Learning Algorithms' by Jasper Snoek, Hugo Larochelle and Ryan Adams, and it appeared at NIPS 2012.

At a high level, Bayesian optimization is about fitting a Gaussian process (GP) regression on the data currently observed about some black-box function $f$, and using the result of the GP regression to choose the next point $x$ at which to evaluate $f$. The premise of such a procedure is that the black-box function $f$ that we want to maximize is very expensive to evaluate. In this case, it makes sense to choose where to evaluate the function smartly, based on current information. This is an interesting case of the exploration-exploitation trade-off.

When I first read the paper, I asked myself: why do we need Bayesian optimization? How does it compare to other optimization methods we have learned, for example gradient descent? Wouldn't a grid search on the domain give a much better optimization result? To answer these questions, we have to keep reminding ourselves that we are optimizing some black-box function whose gradient is unknown. What's more, this function is very expensive to evaluate, so we cannot afford to perform a grid search on it. As an example, think about tuning the parameters of a neural network. Admittedly, Bayesian optimization can be expensive as well; therefore we would choose Bayesian optimization over grid search when all the integration and maximization involved in choosing $x_{n+1}$ is much less expensive than evaluating $f(x_{n+1}).$

We start by putting a GP prior on $f$ and assume each observation is a noisy realization of the true function value: $y_n \sim \mathcal{N}(f(x_n),\nu).$ The posterior distribution $f|\{x_n,y_n\},\theta$ is fully characterized by a predictive mean function $\mu(x; \{x_n,y_n\},\theta)$ and a predictive variance function $\sigma^2(x; \{x_n,y_n\},\theta)$ for every $x$ in the domain of $f$: $f(x) \sim \mathcal{N}(\mu(x;\{x_n,y_n\},\theta), \sigma^2(x;\{x_n,y_n\},\theta))$. For more details about Gaussian process regression, you can read the book Gaussian Processes for Machine Learning by Rasmussen and Williams.
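As a reference point for the predictive mean and variance above, here is a minimal sketch of the GP predictive equations on a toy 1-d problem. The kernel choice, hyper-parameter values and data are all made up for illustration.

```python
import numpy as np

def rbf(A, B, ell=1.0, amp=1.0):
    """Squared-exponential kernel (hyper-parameters ell, amp are assumptions)."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return amp * np.exp(-0.5 * d2 / ell**2)

def gp_predict(x_star, x_obs, y_obs, noise=0.1):
    """Predictive mean and variance of f(x_star) given noisy observations."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_star = rbf(x_star, x_obs)
    mu = k_star @ np.linalg.solve(K, y_obs)
    var = rbf(x_star, x_star).diagonal() - np.einsum(
        "ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
    return mu, var

# Toy 1-d data (made up for illustration).
x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.sin(x_obs)
mu, var = gp_predict(np.array([0.5, 1.5]), x_obs, y_obs)
# Predictive variance is positive and below the prior variance near the data.
assert np.all(var > 0) and np.all(var < 1.0)
```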

If the current best value is at $x_{min} = \arg\min_{x_i} f(x_i)$, then for every $x$ in the domain, the probability that $f(x)$ is smaller than $f(x_{min})$ is $\alpha_{PI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x))$, where $\gamma(x) = \frac{f(x_{min}) - \mu(x; \{x_n,y_n\},\theta)}{\sigma(x;\{x_n,y_n\},\theta)}$. In Bayesian optimization, this probability of improvement (PI) is called an acquisition function. There are other acquisition functions, such as the expected improvement (EI) and lower confidence bound (LCB). We optimize the acquisition function to choose the next point: $x_{n+1} \gets \arg\max_x \alpha_{PI}(x).$
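The probability-of-improvement formula above is simple to compute once $\mu$ and $\sigma$ are available; a self-contained sketch using the standard normal CDF:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_improvement(mu, sigma, f_min):
    """alpha_PI(x) = Phi((f_min - mu(x)) / sigma(x))."""
    return phi((f_min - mu) / sigma)

# If the predictive mean equals the current best, improvement has probability 1/2.
assert abs(prob_improvement(mu=0.3, sigma=0.2, f_min=0.3) - 0.5) < 1e-12
# A lower predictive mean makes improvement more likely.
assert prob_improvement(0.1, 0.2, 0.3) > prob_improvement(0.3, 0.2, 0.3)
```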

The most interesting section of this paper is Section 3, on practical considerations, where the authors go through

1. choice of covariance function;
2. treatment of hyper-parameter;
3. modeling cost;
4. parallelization.

The first two issues also appear in the GPML book. In this paper the authors recommend the Matérn 5/2 kernel because it only assumes twice-differentiability. The squared exponential kernel is the default choice for GP regression, but its smoothness assumption is unrealistic for most machine learning algorithms.

For choosing hyper-parameters, we can pick the values that maximize the marginal likelihood. For a fully Bayesian treatment, we have to marginalize over the hyper-parameters and work with the integrated acquisition function $\hat{\alpha}(x;\{x_n,y_n\}) = \int \alpha(x;\{x_n,y_n\},\theta)\, p(\theta|\{x_n,y_n\})\, d\theta.$ To approximate this integral, the authors use a slice sampler with a step-in and step-out procedure. The details of this slice sampler are described in 'Slice sampling covariance hyperparameters of latent Gaussian models' by Murray and Adams. There are some tricks, like operating on a long MCMC chain so that we do not waste too many samples, but this could still be an expensive computation. However, the cost is justified because it remains lower than the cost of evaluating $f$:

> As both optimization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving an N-dimensional linear system (and our function evaluations are assumed to be much more expensive anyway), the fully-Bayesian treatment is sensible and our empirical evaluations bear this out.
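The integrated acquisition function can be sketched as a simple Monte Carlo average over hyper-parameter samples; the acquisition function and the samples below are hypothetical stand-ins for the slice-sampler draws.

```python
import numpy as np

def integrated_acquisition(alpha, x, theta_samples):
    """Monte Carlo estimate of the integrated acquisition function:
    hat_alpha(x) ~ (1/S) * sum_s alpha(x; theta_s), with theta_s drawn
    from p(theta | {x_n, y_n}) (here: hypothetical pre-drawn samples)."""
    return np.mean([alpha(x, th) for th in theta_samples])

# Toy acquisition depending on a length-scale hyper-parameter (illustrative only).
alpha = lambda x, ell: np.exp(-x**2 / (2 * ell**2))
theta_samples = [0.5, 1.0, 2.0]   # stand-ins for slice-sampler draws of theta
a_hat = integrated_acquisition(alpha, 1.0, theta_samples)
assert 0.0 < a_hat < 1.0
```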

I have played with this algorithm myself on some simple functions like the Branin-Hoo function. I'd like to try the algorithm on more complicated problems, like online LDA, latent structured SVM, and convolutional neural nets.

Lastly, to get some sense of how expensive the computations are, I want to show Figure 4 in the paper.

If we look at the time axis in (4b), the unit of time is days! In terms of function evaluations, the Bayesian optimization algorithms dominate a random grid search from a very early stage. In (4a), GP EI MCMC, which uses the least parallelization, is the most efficient in terms of function evaluations, but looking at (4b), it can take a longer time (measured in days!).