SRN – Markovian Score Climbing by Naesseth et al. from NeurIPS 2020

This Sunday Reading Notes post is about a recent article on variational inference (VI).

In variational inference, we can approximate a posterior distribution ${p(z \mid x)}$ by finding a distribution ${q(z ; \lambda^{\star})}$ that is the closest’ to ${p(z \mid x)}$ among a collection of functions ${Q = \{q(z;\lambda)\}}$. Once a divergence between ${p}$ and ${q}$ has been chosen, we can rely on optimization algorithms such as stochastic gradient descent to find ${\lambda^{\star}.}$

The exclusive’ Kullback-Leiber (KL) divergence has been popular in VI, due to the ease of working with an expectation with respect to the approximating distribution ${q}$. This article, however, considers the inclusive’ KL

$\displaystyle \mathbb{E}(p \| q) = \mathbb{E}_{p(z \mid x)} \left[ \log \frac{p(z \mid x)}{q(z ; \lambda)} \right].$

Minimizing ${\mathrm{KL}(p\| q)}$ is equivalent to minimizing the cross entropy ${L_{\mathrm{KL}} = \mathbb{E}_{p(z \mid )}[ - \log q(z ; \lambda)],}$ whose gradient is
$\displaystyle g_{\mathrm{KL}}(\lambda) := \nabla L_{KL}(\lambda) = \mathbb{E}_{p(z \mid x)}\left[- \nabla_{\lambda} \log q(z; \lambda)\right].$

If we can find unbiased estimates of ${g_{\mathrm{KL}}(\lambda)}$, then with a Robbins-Monroe schedule ${\{\varepsilon_k\}_{k=1}^{\infty}}$, we can use stochastic gradient descent to approximate ${\lambda^{\star}.}$

This article propose Markovian Score Climbing (MSC) as another way to approximate ${\lambda^{\star}}$. Given an Markov kernel ${M(z' \mid z;\lambda)}$ that leases the posterior distribution ${p(z \mid x)}$ invariant, one step of the MSC iterations operates as follows.

(*) Sample ${z_k \sim M( \cdot \mid z_{k-1}; \lambda_{k-1})}$.
(*) Compute the gradient ${\nabla \log q(z_k; \lambda_{k-1}).}$
(*) Set ${\lambda_{k} = \lambda_{k-1} + \varepsilon_k \nabla \log q(z_k; \lambda_{k-1}).}$

The authors prove that ${\lambda_k \to \lambda^{\star}}$ almost surely and illustrate it on the skew normal distribution. One advantage of MSC is that only one sample is required per ${\lambda}$ update. Also, the Markov kernel ${M(z' \mid z;\lambda)}$ provides a systematic way of incorporating information from current sample ${z_k}$ and current parameter ${\lambda_k}$. As the authors point out, one example of such a proposal is a conditional SMC update [Section 2.4.3 of Andrieu et al., 2010].

While this article definitely provides a general purpose VI method, I am more intrigued by the MCMC samples ${z_k}$. What can we say about the samples ${\{z_k\}}$? Can we make use of them?

References:

Naesseth, C. A., Lindsten, F., & Blei, D. (2020). Markovian score climbing: Variational inference with KL (p|| q). arXiv preprint arXiv:2003.10374.

Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342.

SRN – about the magical 0.234 acceptance rate

Sunday Reading Notes series is back : Let’s understand the magical rule of ‘tuning your MH algorithm so that the acceptance rate is roughly 25%’ together!

‘Tune your MH algorithm so that the acceptance rate is roughly 25%’ has been general advice given to students in Bayesian statistics classes. It has been almost 4 years since I first read about it from the book Bayesian Data Analysis, but I never read the original paper where this result first appeared. This Christmas, I decided to read the paper ‘Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms’ by Roberts, Gelman and Gilks and to resume by Sunday Reading Notes series with a short exposition of this paper.

In Roberts, Gelman and Gilk (1997), the authors obtain a weak convergence result for the sequence of algorithms targeting the sequence of distributions ${\pi_d(x^d) = \prod_{i=1}^{d} f(x_i^d)}$ converging to a Langevin diffusion. The asymptotic optimal scaling problem becomes a matter optimizing the speed of the Langevin diffusion, and it is related to the asymptotic acceptance rate of proposed moves.

A one-sentence summary of the paper would be

if you have a d-dimensional target that is independent in each coordinate, then choose the step size of random walk kernel to be 2.38 / sqrt(d) or tune your acceptance rate to be around 1/4.

Unfortunately, in practice the ‘if’ condition is often overlooked and people are tuning the acceptance rate to be 0.25 as long as the proposal is random walk, no matter what the target distribution is. It has been 20 years since the publication of the 0.234 result and we are witnessing the use of MCMC algorithms on more complicated target distributions, for example parameter inference for state-space models. I feel that this is good time that we revisit and appreciate the classical results while re-educating ourselves on their limitations.

Reference:

Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability7(1), 110-120.

——–TECHNICAL EXPOSITION——-

Assumption 1 The marginal density of each component $f$ is such that ${f'/f}$ is Lipschitz continuous and

$\displaystyle \mathbb{E}_f\left[\left(\frac{f'(X)}{f(X)}\right)^8\right] = M < \infty, \ \ \ \ \ (1)$

$\displaystyle \mathbb{E}_f\left[\left(\frac{f''(X)}{f(X)}\right)^4\right] < \infty. \ \ \ \ \ (2)$

Roberts et al. (1997) considers random walk proposal ${y^d - x^d \sim \mathcal{N}(0,\sigma_d I_d)}$ where ${\sigma_d^2 = l^2 / (d-1).}$ We use ${X^d = (X_0^d,X_1^d,\ldots)}$ to denote the Markov chain and define another Markov process ${(Z^d)}$ with ${Z_t^d = X_{[dt]}^d}$, which is the speed-up version of ${X^d}$. Let ${[a]}$ denote the floor of ${a \in \mathbb{R}}$. Define ${U^d_t= X^d_{[dt],1}}$, the first component of ${X_{[dt]}^d = Z^d_t}$.

Theorem 1 (diffusion limit of first component) Suppose ${f}$ is positive and in ${\mathbb{C}^2}$ and that (1)-(2) hold. Let ${X_0^{\infty} = (X^1_{0,1},X^{2}_{0,2},\ldots)}$ be such that all components are distributed according to ${f}$ and assume ${X^{i}_{0,j} = X^{j}_{0,j}}$ for all ${i \le j}$. Then as ${d \to \infty}$,
$\displaystyle U^d \to U.$

The ${U_0 \sim f}$ and ${U}$ satisfies the Langevin SDE
$\displaystyle dU_t = (h(l))^{1/2}dB_t + h(l)\frac{f'(U_t)}{2f(U_t)}dt \ \ \ \ \ (3)$

and
$\displaystyle h(l) = 2 l^2 \Phi(-l\sqrt{I}/2)$

with ${\Phi}$ being the standard normal cdf and
$\displaystyle I = \mathbb{E}_f\left[\left(f'(X)/ f(X)\right)^2\right].$

Here ${h(l)}$ is the speed measure of the diffusion process and the most efficient’ asymptotic diffusion has the largest speed measure. ${I}$ measures the roughness’ of ${f}$.

Example 1 If ${f}$ is normal, then ${f(x) = (2\pi\sigma^2_f)^{-1/2}\exp(-x^2/(2\sigma_f^2)).}$
$\displaystyle I = \mathbb{E}_f\left[\left(f'(x) / f(x) \right)^2\right] = (\sigma_f)^{-4}\mathbb{E}_f\left[x^2\right] = 1/\sigma^2_f.$

So when the target density ${f}$ is normal, then the optimal value of ${l}$ is scaled by ${1 / \sqrt{I}}$, which coincides with the standard deviation of ${f}$.

Proof: (of Theorem 1.1) This is a proof sketch. The strategy is to prove that the generator of ${Z^n}$, defined by

$\displaystyle G_n V(x^n) = n \mathbb{E}\left[\left(V(Y^n) - V(x^n)\right) \left( 1 \wedge \frac{\pi_n(Y^n)}{\pi_n(x^n)}\right)\right].$

converges to the generator of the limiting Langevin diffusion, defined by
$\displaystyle GV(x) = h(l) \left[\frac{1}{2} V''(x) + \frac{1}{2} \frac{d}{dx}(\log f)(x) V'(x)\right].$

Here the function ${V}$ is a function of the first component only.
First define a set

$\displaystyle F_d = \{|R_d(x_2,\ldots,x_d) - I| < d^{-1/8}\} \cap \{|S_d(x_2,\ldots,x_d) - I| < d^{-1/8}\},$

where
$\displaystyle R_d(x_2,\ldots,x_d) = (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))'\right]^2$

and
$\displaystyle S_d(x_2,\ldots,x_d) = - (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))''\right].$

For fixed ${t}$, one can show that ${\mathbb{P}\left(Z^d_s \in F_d , 0 \le s \le t\right)}$ goes to 1 as ${d \to \infty}$. On these sets ${\{F_d\}}$, we have
$\displaystyle \sup_{x^d \in F_d} |G_d V(x^d) - G V(x_1)| \to 0 \quad \text{as } d \to \infty ,$

which essentially says ${G_d \to G}$, because we have uniform convergence for vectors contained in a set of limiting probability 1.
$\Box$

Corollary 2 (heuristics for RWMH) Let
$\displaystyle a_d(l) = \int \int \pi_d(x^d)\alpha(x^d,y^d)q_d(x^d,y^d)dx^d dy^d$

be the average acceptance rate of the random walk MH in ${d}$ dimensions.
We must have ${lim_{d\to\infty} a_d(l) \to a(l)}$ where ${a(l) = 2 \Phi(-l\sqrt{I}/2)}$.
${h(l)}$ is maximized by ${l = \hat{l} = 2.38 / \sqrt{I}}$ and ${a(\hat{l}) = 0.23}$ and ${h(\hat{l}) = 1.3 / I.}$
The authors consider two extensions of the target density ${\pi_d}$, where the convergence and optimal scaling properties will still hold. The first extension concerns the case where ${f_i}$‘s are different, but there is an law of large numbers on these density functions. Another extension concerns the case ${\pi_d(x^d) = f_1(x_1) \prod_{i=2}^{d} P(x_{i-1}, x_{i})}$, with some conditions on ${P}$.

[TM] Change of Variables in MCMC

This post is about change of variables in Markov chain Monte Carlo (MCMC), which is used quite often when the target distribution is supported on a subset of ${\mathbb{R}^n}$. For example, the Exponential distribution and the Log-Normal distribution are only supported on positive reals.

Consider a target distribution ${\pi(x)}$ that is supported on a subset ${S \subset \mathbb{R}^n}$. If we use a random walk proposal ${q_X(x' \mid x) = \mathrm{MVN}(x' ; x,\Sigma)}$, then we might end up with a proposal ${x'}$ such that ${\pi(x') = 0}$ and, this might cause too few acceptance in the MCMC chain. If we can find a transformation ${h:D \to \mathbb{R}^n}$ that is one-to-one, differentiable and spans ${\mathbb{R}^n}$, then we can consider a proposal ${x' = h^{-1}(y')}$ where ${y' \sim q_Y(\cdot \mid y = h(x))}$. This proposal always yields a proposal ${x'}$ such that ${\pi(x') > 0.}$

Of course, when we employ such a transformation in the proposal kernel, we need to be careful about evaluating the proposal densities. We know that the acceptance probability is ${\alpha(x,x') = 1 \wedge \frac{\pi(x') q(x \mid x')}{\pi(x) q(x' \mid x)}}$, and it should be no surprise that ${q_X(x' \mid x) \not= q_Y(y' \mid y)}$ unless ${h}$ is the identity map.

Let’s work out the acceptance ratio together carefully. Recall that change of variables proceeds as follows: when ${Y \sim f_Y(y) }$ and we consider the transformation ${ y = h^{-1}(x)}$, the pdf of ${X = h^{-1}(Y)}$ is

$\displaystyle f_X(x) = f_Y(h(x))|J_{h}(x)|.$

When we apply this to the kernels ${q_Y}$ and ${q_X}$ we get that

$\displaystyle q_X(x' \mid x ) = q_Y(h(x') \mid h(x)) \cdot |J_{h}(x')|.$

Example 1 {Symmetric proposal on transformed space} If ${q_Y(y' \mid y)}$ is a symmetric proposal, then the acceptance probability becomes

$\displaystyle \alpha(x,x') = 1 \wedge \frac{\pi(x') |J_{h}(x)|}{\pi(x)|J_{h}(x')|} .$

Here are two common transformations.

Example 2 (Log-transformation for ${x}$ supported on ${\mathbb{R}_+}$)

If ${h = \log}$, then ${|J_{h}(x)| = 1 / x}$ and acceptance probability is

$\displaystyle \alpha(x,x') = 1 \wedge \frac{\pi(x')x'}{\pi(x)x}.$

Example 3 (Logit transformation for ${x}$ supported on ${(0,1)}$) If ${h(x) = \log(\frac{x}{1 - x})}$,then the inverse transformation is ${h^{-1}(y) = \frac{exp(y)}{1 + \exp(y)}.}$ The acceptance probability is

$\displaystyle \alpha(x,x') = 1 \wedge \frac{\pi(x')x'(1-x')}{\pi(x)x(1-x)}.$

This post is the first one in the category trivial matters’, where I formally write down some notes to myself about tricks and facts that I repeatedly use but (unfortunately) need to re-derive everytime I use them.

SRN – “Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets” by Cornish et al.

When faced with large datasets, standard MCMC methods have a cost of O(n) per step due to evaluating the log-likelihood for n data points. Can we make it faster?

My COVID-19 self-isolation plan includes trying to finish as many to-read articles as possible and to blog my progress on this blog as much as possible. This weekend, I read the paper “Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets” by Robert Cornish, Paul Vanetti, Alexandre Bouchard-Cȏté, George Deligiannidis, and Arnaud Doucet. This is an ICML 2019 paper and it is a dense one. It attempts to solve one problem. When faced with large datasets,  standard MCMC methods have a cost of O(n) per step due to evaluating the log-likelihood for n data points. Can we make it faster? In a typical MH step (with symmetric proposals), we accept the proposal with probability $\alpha(\theta,\theta') = \min(1, \frac{\pi(\theta')}{\pi(\theta)})$. This is typically implemented with

log_alpha <- log_post(theta_proposal) - log_post(theta_current);
if (log(runif(1)) < log_alpha ){
theta_current <- theta_proposal;
}


The authors realize that it is wasteful having to evaluate the log-likelihood at all data points before rejecting the proposal. Their solution, Salable Metropolis-Hastings (SMH), is based on a combination of three ideas: (1) factorization of posterior density (trivial factorization is 1 + n factors), (2) fast simulation of rejection using thinning of a Poisson point process, and (3) posterior concentration around mode when n is large.

I first get excited about this paper seeing the fast simulation of Bernoulli random variables (in Algorithm 1) because there is a very similar problem on simulating a Poisson process with inhomogenous rates in Stat 210, for which I was a Teaching Fellow (TF). I am very happy to see this result being useful in cutting-edge research. The fast simulation relies on access to a lower bound for each factor in the posterior distribution and the tighter this lower bound the more efficient this implementation. Suppose the target distribution has m factors, then once we pay a once-off O(m) cost, each acceptance step can be implemented in O(upper bound of negative-log-posterior) number of steps on average.

Another highlight of the paper is the smart application of the Bernstein-von Mises (BvM) phenomenon, that which states that posterior distribution concentrates around the mode with rate $1/\sqrt{n}$ as n goes to infinity. BvM was an important topic in Stat 213, which is another class that I TFed. I have seen BvM used (1) to select step-size in random-walk MH, (2) to initialize an MH chain, and (3) to motivate Laplace approximation of the posterior distribution. In this paper, it motivates a Taylor expansion of the potential function near posterior mode, which subsequently leads to the factorization behind the SMH kernel. The SMH kernel leaves the target distribution invariant, and, if not mistaken by me, this property explains the use of “exact” in the title. More importantly, since the acceptance probability in SMH is factorizable, we can implement the acceptance step with the aforementioned fast simulation, provided that log-density can be differentiated enough times. Intuitively, the higher the differentiability, the closer the Taylor expansion, the tighter the bound for acceptance rate and the fewer the number of evaluations we need. In fact, as later justified with Theorem 3.1, the average computational cost decays geometrically with respect to the order of Taylor expansion. With a second-order Taylor expansion, which requires 3rd-order differentiability, the expected computational cost behaves like $O(1/\sqrt{n})$. Hooray! I am always a big fan of “the bigger the problem, the easier the solution”.

I learn a lot from this paper, especially from the opening paragraph of Section 3 and Section 3.2. But there is a question that I keep asking myself: if n is very large and we have an estimate of posterior mode that is $O(1/\sqrt{n})$ distance from the true mode, how long does it take for SMH to be better than that declaring the posterior is Gaussian by invoking BvM and sleeping on the result?

SRN – A Framework for Adaptive MCMC Targeting Multimodal Distribution by Pompe et al.

After a very long break, I come back to the Sunday Reading Notes and write about the paper “A Frame for Adaptive MCMC Targeting Multimodal Distributions” by Emilia Pompe et al. This paper proposes a Monte Carlo method to sample from multimodal distributions.

After a very long break, I come back to the Sunday Reading Notes and write about the paper “A Frame for Adaptive MCMC Targeting Multimodal Distributions” by Emilia Pompe et al. This paper proposes a Monte Carlo method to sample from multimodal distributions. The “divide and conquer” idea separates the sampling task into (1) identifying the modes, (2) sample efficiently within modes and (3) moving between the modes. While the paper emphasizes more heavily on addressing task (2) and (3) and establishing ergodicity , it provides a framework that unites all three tasks.

The main algorithm in this paper is Jumping Adaptive Multimodal Sampler (JAMS). It has a burn-in algorithm that uses optimization-based procedure to identify modes, to merge them and to estimate covariance matrices around the modes. After task (1) is solved by the burn-in algorithm, JAMS alternates between the later two tasks with local moves that samples around a mode and jump moves that facilitate crossing the low density regions. The local moves are adaptive in the sense that it relies on local kernels with parameters that are learned on the fly. The jump moves propose a new mode and choose a new point around the new mode.

This paper proves ergodicity of JAMS by considering it as a case of a general class of Auxiliary Variable Adaptive MCMC. It also provides extensive simulation studies on multivariate Gaussian (d = 200) , mixtures of t-distributions and bananas and a Bayesian model for sensor network localisation. The results are compared against Adaptive Parallel Tempering (APT). In all the cases tested, JAMS out performs APT in terms of root mean squared error scaled by dimension of the problem. JAMS is also more robust to initialization in term of recovering all the modes.

I usually think of Monte Carlo integration and optimization as very similar tasks and even neglect the area of optimization. This paper is a wake up call for me, as I believe the current JAMS algorithm relies on a successful mode identifying stage. The discussions in Section 6 also suggest mode finding as the bottle neck.

SRN – Bayesian Calibration of Microsimulation Models by Rutter et al.

The Sunday Reading Notes paper for this week is ‘Bayesian Calibration of Microsimulation Models’ by Carolyn Rutter, Diana Miglioretti and James Savarino. This is a 2009 JASA Applications and Case Studies paper.

According to a 2012 Review paper by Rutter et al,

Microsimulation models (MSMs) for health outcomes simulate individual event histories associated with key components of a disease process; these simulated life histories can be aggregated to estimate population-level effects of treatment on disease outcomes and the comparative effectiveness of treatments. Although MSMs are used to address a wide range of research questions, methodological improvements in MSM approaches have been slowed by the lack of communication among modelers. In addition, there are few resources to guide individuals who may wish to use MSM projections to inform decisions.

In this paper, the authors propose a Bayesian method to calibrate microsimulations models, using Markov Chain Monte Carlo. The case study in this paper is the history of colorectal cancer (CRC). In this paper, the authors assume all CRCs arise from an adenoma and the history of CRC consists of four components: 1) adenoma risk, 2) adenoma growth, 3) transition from adenoma to preclinical cancer and 4) transition from preclinical cancel to clinical cancer. These four components are not observed directly and the calibration data consists of prevalence of adenomas and preclinical cancers and the size and/or number of adenomas from many (independent) studies from different years and about different subpopulations and using different colonoscopy methods.

The authors use $\theta$ to denote MSM parameters and these parameters can be separated from $K$ components with independent priors: $\pi(\theta) = \prod_{i=1}^k \pi_i(\theta_i).$ The calibration data come from $N$ independent sources and each follows some distribution  $y_j|\theta \sim f_j(g_j(\theta))$ parametrized by some unknown functions of the MSM parameters. The likelihood is therefore $f(Y | \theta) = f(Y|g(\theta) )= \prod_{j=1}^n f_j(y_j|g_j(\theta))$.

What makes the calibration problem hard is the unknown function $g(\theta)$. Suppose we want to simulate from the posterior distribution of $\theta|Y$ using a Metropolis-Hastings(MH) algorithm, we need to know $r(\theta,\theta^*) = \frac{\pi(\theta)f(y|g(\theta)}{\pi(\theta^*)f(y|g(\theta^*)}$. But because $g$ is unknown, the MH step cannot be performed.

So the authors propose an approximate MH algorithm that includes a step to estimate $g(\theta)$ for both the current value $\theta$ and the proposed value $\theta^*$. To me this feels like an ‘EM’ step: simulate $m$ copies of $\tilde{y}_{ji}$ from $f_j(y_j| g_j(\theta))$ and calculate the MLE $\hat{g}_j(\theta).$ With this approximation, the resulting transition probability function $\hat{\alpha}(\theta,\theta^*)$for the Metropolis-within-Gibbs step on $\theta_i$ is based on $\frac{\pi(\theta_i)}{\pi(\theta_i^*)}\cdot \frac{\prod_{j=1}^N f_j(y_j|\hat{g}_j(\theta^*_i, \theta_{(-i)})}{\prod_{j=1}^n f_j(y_j|\hat{g}_j(\theta))}.$ In the Appendix, the authors prove that this approximation satisfies the detailed-balanced condition in the limit of $m$ goes to infinity.

I think this paper provides an interesting example of how to incorporate data from multiple sources. As the authors point out,

how closely the model should calibrate to observed data is unclear, especially when calibration data are variable and many provide conflicting interest. […] It depends on how modelers trade-off concerns about possibly overparameterizing and overfitting calibration data relative to the importance of exactly replicating observed or expected results.

References:

• Rutter, C. M., Miglioretti, D. L., & Savarino, J. E. (2009). Bayesian calibration of microsimulation models. Journal of the American Statistical Association104(488), 1338-1350.
• Rutter, C. M., Zaslavsky, A. M., & Feuer, E. J. (2011). Dynamic microsimulation models for health outcomes: a review. Medical Decision Making31(1), 10-18.

SRN – A Geometric Interpretation of the Metropolis-Hastings Algorithm by Billera and Diaconis

Coming back to the Sunday Reading Notes, this week I discuss the paper ‘A Geometric Interpretation of the Metropolis-Hastings Algorithm’ by Louis J. Billera and Persi Diaconis from Statistical Science. This paper is suggested to me by Joe Blitzstein.

In Section 4 of ‘Informed proposals for local MCMC in discrete spaces’ by Giacomo Zanella (see my SRN Part I and II), Zanella mentions that the Metropolis-Hasting acceptance probability function(APF) $\min\left(1,\frac{\pi(y)p(x,y)}{\pi(x)p(y,x)}\right)$ is not the only APF that makes the resulting kernel $\pi$-reversible as long as detailed-balance is satisfied. This comes first as a ‘surprise’ to me as I have never seen another APF in practice. But very quickly I realize that this fact was mentioned in both Stat 213 & Stat 220 at Harvard and I have read about it from Section 5.3 – ‘Why Does the Metropolis Algorithm Work?‘ of ‘Monte Carlo Strategies in Scientific Computing‘ by Jun S. Liu. Unfortunately, I did not pay enough attention. Joe suggested this article to me after I posted on Facebook about being upset with not knowing such a basic fact.

In this Billera and Diaconis paper, the authors focus on the finite state space case $X$ and considers the MH kernel as the projection of stochastic matrices (row sums are all 1 and all entries are non-negative, denoted by$\mathcal{s}(X)$) onto the set of $\pi$-reversible Markov chains (stochastic matrices that satisfy detailed balance $\pi(x)M(x,y) = \pi(y)M(y,x)$, denoted by $R(\pi)).$ If we introduce a metric on the stochastic matrices: $d(K,K') = \sum_{x} \sum_{x\not=y} \pi(x) |K(x,y)-K'(x,y)|$.

The key result in this paper is Theorem 1. The authors prove that the Metropolis maps $M := M(K)(x,y) = \min\left( K(x,y), \frac{\pi(y}{\pi(x)}K(y,x)\right)$ minimizes the distance $d$ from the proposal kernel $K$ to $R(\pi).$ This means that $M(K)$ is the unique closest element in $R(\pi)$ that is coordinate-wise smaller than $K$ on its off-diagonal entries. So $M$ is in a sense the closest reversible kernel to the original kernel $K$.

I think this geometric interpretation offers great intuition about how the MH algorithm works: we start with a kernel $K$ and change it to another kernel with stationary distribution $\pi$. And the change must occur as follows:

from $x$, choose $y$ from $K(x,y)$ and decide to accept $x$ or stay at $y$; this last choice may be stochastic with acceptance probabilty $F(x,y) \in [0,1]$. This gives the new chain with transition probabilities: $K(x,y) F(x,y)$, x \not =y$. The diagonal entries are changed so that each row sums to 1. Indeed the above procedure describes how the MH algorithm works. If we insist on $\pi$-reversibility, we must have $0 \leq 0 \leq \min(1,R(x,y)$ where $R(x,y) = \frac{\pi(y)K(y,x)}{\pi(x)K(x,y)}.$ So the MH choice of APF is one that maximizes the chance of moving from $x$ to $y$. The resulting MH kernel $M$ has the largest spectral gap (1 – second largest eigenvalue) and by Peksun’s theorem must have the minimum asymptotic variance estimating additive functionals. In Remark 3.2, the authors point out if we consider only APF that are functions of $R(x,y)$, then the function must satisfy $g(x) = x g(1/x)$ which is the characteristic of balancing functions in Zanella’s ‘informed proposals’ paper. This paper allows me to study Metropolis-Hastings algorithm from another angle and review facts I have neglected in my coursework. References: • Billera, L. J., & Diaconis, P. (2001). A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 335-339. • Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424. • Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media. SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part II) This is a super delayed Sunday Reading Notes post and today I come back to finish discussing the informed proposals paper by Giacomo Zanella. In Part I of my discussions, I reviewed point-wise informed proposals and locally-informed criterion. In Part II, I hope to focus on asymptotical optimality of locally informed proposals and their simulation studies. The author uses Peskun ordering to compare efficiency of MH schemes. He deduces that ‘locally-balanced proposals are asymptotically optimal in terms of Peskun ordering’ as dimensionality of the underlying state space increases. Conditions for asymptotic optimality are indeed mild (Proposition 1) for the three illustrative examples: 1) independent binary components, 2) weighted permutation and 3) Ising model. In Section 4, Zanella points out a connection between the balancing function and acceptance probability function (APF) for MH algorithms, which I find very interesting. He also shows that the optimal proposal for independent binary variables is Barker choice $g(t) = \frac{t}{1+t}.$ The proof goes by finding the limiting continuous-time process of the MH chain and finding the optimal $g$ for the limiting process. The simulation studies use the illustrative examples: weighted permutations and Ising models. The comparisons are in terms of 1) acceptance rate, 2) number of successful flips per computation time and 3) mixing of some summary statistics. The second criterion concerns the trade-off between computational cost (of calculating the informed proposal) and statistical accuracy (by producing efficient moves). For simple target distributions (such as Uniform), using locally-balanced proposals does not bring much benefits but it achieves a much higher number of flips per unit time for more complicated and ‘rough’ targets. See second row of Figure 1 of the paper. I find it interesting that globally-balanced proposals (aka ‘naively-informed’ or $g(t) = t$) are extremely sensitive to initialization. Looking at the effective sample size (ESS) per time, the chains from GB has much more stable behavior if initialized from stationarity. See GB v.s. GB( station.) in Figure 2. But in Figure 5, this phenomenon does not show up for Ising models: initializing from stationarity does not yield ESS performance comparable to that of LBs. In the simulation studies section, the author emphasizes the cost-vs-efficiency trade-off, which I find very important. I feel I have ignored this aspect of designing MCMC algorithms and should think more about it in my future studies. The author indicates that ‘computations required to sample from locally-balanced proposals are trivially parallelizable’. This is also something very interesting to me and I hope to learn more about multi-core computations during my PhD. In his discussions, the author makes reference to Multiple-Try Metropolis. The connection between LB proposals and MT proposals is not entirely obvious to me, but I intuitively agree with the author’s comment that the weight function in MT serves a very similar purpose as the balancing function term $g\left(\frac{\pi(y)}{\pi(x)}\right)$ in LB. Side note: From my downloaded version of the paper, the transition kernel of the first equation on Page 1 is wrong, because the $Q(x,dy)$ terms needs to be multiplied by its acceptance probability $a(x,y).$ References: • Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424. • Liu, J. S., Liang, F., & Wong, W. H. (2000). The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association95(449), 121-134. SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part I) This week I am reading ‘Informed proposals for local MCMC in discrete spaces‘ by Giacomo Zanella. This paper is about designing MCMC algorithms for discrete-values high-dimensional parameters, and the goal is similar to the papers discussed in previous posts (Hamming ball sampler & auxiliary-variable HMC). I decide to split the Sunday Reading Notes on this paper into two parts, because I find many interesting ideas in this paper. In this paper, Zanella come up with locally-balanced proposals. Suppose $\pi(x)$ is the target density and $K_{\sigma}(x,dy)$ is an uninformed proposal. We assume that as $\sigma \to 0$ the kernel $K_{\sigma}(x,dy)$ converges to the delta measure. Zanella seeks to modify this uninformed proposal so that it incorporates information about the target $\pi$ and is biased towards areas with higher density. An example of locally-balanced proposals is $Q_{\sqrt{\pi}} (x,dy) = \frac{\sqrt{\pi(y) }K_{\sigma}(x,dy)}{(\sqrt{\pi} * K_{\sigma})(x)}$. This kernel is reversible with respect to $\sqrt{\pi(x)}(\sqrt{\pi} * K_{\sigma})(x)$, which converges to $\pi(x)dx$ as $x \to 0.$ [Note the normalizing constatn is the convolution $\sqrt{\pi(x)}* K_{\sigma} = \int \sqrt{\pi(y)} K_{\sigma}(x,dy)].$] More generally, Zanella considers a class of pointwise informed proposals that has the structure $Q_{g,\sigma} = \frac{1}{Z_{g}}\cdot g\left(\frac{\pi(y)}{\pi(x)}\right) K_{\sigma}(x,dy).$ It is suggested that the function $g$ satisfy $g(t) = t g(1/t).$ I will save the discussion on locally-balanced proposals and Peskun optimality to Part II. In this part, I want to discuss Section 5: Connection to MALA and gradient-based MCMC. In continuous space, the point-wise informed proposal $Q_{g,\sigma}$ would be infeasible to sample from because of the term $g\left(\frac{\pi(y)}{\pi(x)}\right) .$ If we take a first-order Taylor expansion, we would have $Q_{g,\sigma}^{(1)} \propto g \left( \exp ( \nabla \log \pi(x) (y-x)) \right) K_{\sigma}(x,dy).$ If we choose $g(t) = \sqrt{t}$ and $K_{\sigma}(x,\cdot) =N(x,\sigma^2)$, this is the MALA proposal. I find this connection very interesting, although I do not have a good intuition about where this connection comes from. One way to explain it is that gradient-based MCMC in continuous space is using local information to design informed proposals. In the conclusions, the author mentions that this connection should improve robustness of gradient-based MCMC schemes and help with parameter tuning. References:(x) • Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424. SRN – The Hamming Ball Sampler by Titsias and Yau My Sunday Reading Notes (SRN) this semester will mostly be about Bayesian Computations. This week’s post is on The Hamming Ball Sampler proposed by Titsias and Yau.The hamming ball sampler is a MCMC algorithm for high-dimensional discrete-valued vectors or matrices. While reading this paper, I also found a blog post about it from Xi’an’s OG, which provided some high-level intuitions and background knowledge. The paper considers a state space model with discrete hidden space $X$ with parameters $\theta$ and observations $y.$ Factorial Hidden Markov Model (fHMM) is an example of such a model. In state space models, the complete data likelihood can be factorized with $p(y,X,\theta) = p(X,\theta) \prod_{i=1}^N p(y_i|X,\theta).$ Given some prior, we want to sample from the posterior distribution $X,\theta | y.$ When the dimension of $X$ is large, we would suffer from ‘the curse of dimensionality’. Using a Gibbs sampler, we can iteratively sample $\theta \sim \cdot | X,y$ and$\theta X \sim \cdot | \theta,y$. Because the dimension of$X\$ is high, we should also consider blocked Gibbs sampling on $X$ by for example updating one row (or column) of $X$ at a time. While this is conceptually straightforward and potentially also easy to implement, as the authors pointed out:

Conditional sampling may lead to an inability to escape from local modes in the posterior distribution particularly if the elements of $X$ exhibit strong correlations with each other and together with $\theta$.

The Hamming Ball Sampler (HBS) introduces an auxiliary variable $U$ that has the same dimension as the latent space $X$. The augmented joint probability can be factorized with as $p(y,X,\theta,U) = p(U|X) p(y,X,\theta).$ The conditional distribution $p(U|X)$ is chosen to be uniform over a neighborhood set $\mathcal{H}_m(X).$ This set $\mathcal{H}_m(X)$ is a Hamming Ball and it basically says that if $U,X$ are $K \times N$ matrices, then $U$ and $X$ can be different on at most $m$ positions among the $K$ rows. With the auxiliary variable $U$, the Hamming Ball Sampler alternate between the steps $U \gets p(U|X)$ and $(\theta, X) \gets p(\theta,X|U,y).$

The Hamming Ball Sampler is like slice-sampling in discrete spaces, and each Hamming Ball $\mathcal{H}_m(X)$ is a slice. Introducing the slice introduces random exploration, and makes it easier to escape from local modes. For the simplest example where $X$ is a $K \times N$ matrix and the hamming distance is defined the the number of different elements each column, if we set $m = K/2$ then we can potentially change all elements of $X$ in one update. But when $m$ is large, the algorithm complexity also increases.

In this paper the authors provided several numerical examples comparing the Hamming Ball Sampler with block Gibbs Samplers. In the fHMM examples (Figure 4 in the paper) we can see that HBS with $m = 2$ or $3$ achieves having joint posterior density much faster than the block Gibbs Samplers. They also conclude that HB-2 is best balances computational time and sampling efficiency.

Reference:

Titsias, M. K., & Yau, C. (2017). The Hamming ball sampler. Journal of the American Statistical Association112(520), 1598-1611.

Ghahramani, Z., & Jordan, M. I. (1996). Factorial hidden Markov models. In Advances in Neural Information Processing Systems (pp. 472-478).