SRN – the magical 0.574

Hello 2021! Here is the first post in the Sunday Reading Notes series in the year of 2021. This week, I keep reading about optimal scaling of Markov chain Monte Carlo algorithms and move on to a new class of algorithms: MALA. This time, the magical number is 0.574.

Roberts and Rosenthal (1998, JRSSB) concerns the optimal scaling of the Metropolis-adjusted Langevin algorithm (MALA) for target distributions of the form {\pi^d(x) = \prod_{i=1}^d f(x_i)}. Unlike the random walk Metropolis algorithm, the MALA proposal uses the gradient of {\log \pi^d} at the current state {x_t}:

\displaystyle x^{\star} = x_t + \sigma_d Z_{t+1} + \frac{\sigma_d^2}{2} \nabla \log \pi^d(x_t).

Here {\sigma_d^2} is the step variance and {Z_{t+1} \sim \mathcal{N}(0,I_d)} is a d-dimensional standard normal. Following this proposal, we perform an accept / reject step, so that the resulting Markov chain has the desired stationary distribution:
\displaystyle \alpha_d(x_t,x^{\star}) = \min \left\{ 1, \frac{\pi^d(x^{\star})}{\pi^d(x_t)} \frac{q_d(x^{\star},x_t)}{q_d(x_t,x^{\star})}\right\}.
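To make the proposal and the accept/reject step concrete, here is a minimal Python sketch of one MALA iteration (my own illustration, not code from the paper). The example target is a product of standard normal marginals, and the step size follows the {\sigma_d^2 = l^2 d^{-1/3}} scaling discussed below, with {l} an arbitrary tuning value.

```python
import numpy as np

def mala_step(x, sigma, log_pi, grad_log_pi, rng):
    """One MALA iteration: Langevin proposal + Metropolis-Hastings accept/reject."""
    # Proposal: x* = x + sigma * Z + (sigma^2 / 2) * grad log pi(x), with Z ~ N(0, I_d)
    x_star = x + sigma * rng.standard_normal(x.shape[0]) + 0.5 * sigma**2 * grad_log_pi(x)

    # log q(a, b): log-density (up to a constant) of proposing b from a
    def log_q(a, b):
        mean = a + 0.5 * sigma**2 * grad_log_pi(a)
        return -np.sum((b - mean) ** 2) / (2 * sigma**2)

    # Acceptance probability alpha(x, x*) = min{1, [pi(x*) q(x*, x)] / [pi(x) q(x, x*)]}
    log_alpha = log_pi(x_star) - log_pi(x) + log_q(x_star, x) - log_q(x, x_star)
    if np.log(rng.uniform()) < log_alpha:
        return x_star, True
    return x, False

# Illustration: product of d standard normal marginals, so grad log pi(x) = -x.
d, l = 100, 1.5                      # l is an arbitrary tuning value here
sigma = l * d ** (-1.0 / 6.0)        # sigma_d^2 = l^2 * d^(-1/3)
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
log_pi = lambda z: -0.5 * np.sum(z ** 2)
grad_log_pi = lambda z: -z

accepted = 0
for _ in range(2000):
    x, acc = mala_step(x, sigma, log_pi, grad_log_pi, rng)
    accepted += acc
print("empirical acceptance rate:", accepted / 2000)
```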

There are some smoothness assumptions on the log-density {g = \log f}: the absolute values of the first eight derivatives of {g} are bounded, {g'} is Lipschitz, and all moments of {f} exist.

Theorem 3 (diffusion limit of first component)

Consider the chain {\{X^d\}} starting from stationarity and following the MALA algorithm with step variance {\sigma_d^2 = l^2 d^{-1/3}} for some {l > 0}. Let {\{J_t\}} be a Poisson process with rate {d^{1/3}}, set {\Gamma^d_t = X^d_{J_t}}, and let {U^d} be the first component of {\Gamma^d}, i.e. {U^d_t = X^d_{J_t,1}}. Then, as {d \to \infty}, {U^d \to U}, where {U} is the Langevin diffusion defined by
\displaystyle dU_t = \sqrt{h(l)} dB_t + \frac{1}{2} h(l) \frac{d}{dx} \log f(U_t)\, dt.

Here {h(l) = 2 l^2 \Phi(-Kl^3/2)} is the speed measure and {K = \sqrt{\mathbb{E}_f \left[\frac{5 (g'''(X))^2 - 3 (g''(X))^3}{48}\right]} > 0}.
The skeleton of the proof is similar to that of the diffusion limit for random walk Metropolis: we find sets {F_d^{\star} \subseteq \mathbb{R}^d} such that {\mathbb{P}\left(\Gamma_s^d \in F_d^{\star} \ \forall \ 0 \le s \le t\right) \to 1} and on these sets the generators converge uniformly.

Theorem 4 (convergence of acceptance rate)
\displaystyle \lim_{d \to \infty} a_d(l) = a(l), where {a(l) = 2 \Phi(-Kl^3/2)} and {a_d(l)} denotes the average acceptance rate in dimension {d}.
As a result, at the value {\hat{l}} that maximizes {h(l)}, the acceptance rate is {a(\hat{l}) \approx 0.574}, and the step variance should be tuned to {\sigma_d^2 = \hat{l}^2 d^{-1/3}}.
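As a quick numerical sanity check (my own sketch, assuming SciPy is available), one can maximize {h(l) = 2l^2\Phi(-Kl^3/2)} over {l} and evaluate the limiting acceptance rate at the maximizer; the value of {K} only rescales {\hat{l}}, not the optimal acceptance rate, which comes out near 0.574.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import norm

K = 1.0  # any K > 0: it rescales the optimal l but not the optimal acceptance rate

h = lambda l: 2 * l**2 * norm.cdf(-K * l**3 / 2)   # speed of the limiting diffusion
a = lambda l: 2 * norm.cdf(-K * l**3 / 2)          # limiting acceptance rate

res = minimize_scalar(lambda l: -h(l), bounds=(1e-6, 10.0), method="bounded")
l_hat = res.x
print("optimal l:", round(l_hat, 3), "acceptance at optimum:", round(a(l_hat), 3))  # ~0.574
```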

MALA enjoys a faster convergence rate, of order {d^{1/3}} compared to order {d} for RWM. But let's also keep in mind that each MALA iteration requires an extra step to compute the gradient {\nabla \log \pi^d(x)}, whose {i}-th component is {g'(x_i)}, which takes {O(d)} operations.

Reference:

Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1), 255-268.

SRN – about the magical 0.234 acceptance rate

The Sunday Reading Notes series is back: let's understand the magical rule of 'tune your MH algorithm so that the acceptance rate is roughly 25%' together!

‘Tune your MH algorithm so that the acceptance rate is roughly 25%’ has been general advice given to students in Bayesian statistics classes. It has been almost 4 years since I first read about it in the book Bayesian Data Analysis, but I never read the original paper where this result first appeared. This Christmas, I decided to read the paper ‘Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms’ by Roberts, Gelman and Gilks and to resume my Sunday Reading Notes series with a short exposition of this paper.

In Roberts, Gelman and Gilks (1997), the authors obtain a weak convergence result: the sequence of random walk Metropolis algorithms targeting the sequence of distributions {\pi_d(x^d) = \prod_{i=1}^{d} f(x_i^d)} converges to a Langevin diffusion. The asymptotic optimal scaling problem then becomes a matter of optimizing the speed of the Langevin diffusion, and it is related to the asymptotic acceptance rate of proposed moves.

A one-sentence summary of the paper would be

if you have a d-dimensional target that is independent in each coordinate, then choose the step size of the random walk kernel to be 2.38 / sqrt(d), or tune your acceptance rate to be around 1/4.

Unfortunately, in practice the ‘if’ condition is often overlooked, and people tune the acceptance rate to be 0.25 whenever the proposal is a random walk, no matter what the target distribution is. It has been more than 20 years since the publication of the 0.234 result, and we are witnessing the use of MCMC algorithms on more complicated target distributions, for example parameter inference for state-space models. I feel this is a good time to revisit and appreciate the classical results while re-educating ourselves on their limitations.

Reference:

Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), 110-120.

——–TECHNICAL EXPOSITION——-

Assumption 1 The marginal density {f} of each component is such that {f'/f} is Lipschitz continuous and

\displaystyle \mathbb{E}_f\left[\left(\frac{f'(X)}{f(X)}\right)^8\right] = M < \infty, \ \ \ \ \ (1)

\displaystyle \mathbb{E}_f\left[\left(\frac{f''(X)}{f(X)}\right)^4\right] < \infty. \ \ \ \ \ (2)

Roberts et al. (1997) consider the random walk proposal {y^d - x^d \sim \mathcal{N}(0,\sigma_d^2 I_d)} where {\sigma_d^2 = l^2 / (d-1)}. We use {X^d = (X_0^d,X_1^d,\ldots)} to denote the Markov chain and define another Markov process {(Z^d)} with {Z_t^d = X_{[dt]}^d}, which is the sped-up version of {X^d}. Here {[a]} denotes the floor of {a \in \mathbb{R}}. Define {U^d_t= X^d_{[dt],1}}, the first component of {X_{[dt]}^d = Z^d_t}.

Theorem 1 (diffusion limit of first component) Suppose {f} is positive and in {C^2} and that (1)-(2) hold. Let {X_0^{\infty} = (X^1_{0,1},X^{2}_{0,2},\ldots)} be such that all components are i.i.d. according to {f}, and assume {X^{i}_{0,j} = X^{j}_{0,j}} for all {i \le j}. Then as {d \to \infty},
\displaystyle U^d \to U.

Here {U_0 \sim f} and {U} satisfies the Langevin SDE
\displaystyle dU_t = (h(l))^{1/2}dB_t + h(l)\frac{f'(U_t)}{2f(U_t)}dt \ \ \ \ \ (3)

and
\displaystyle h(l) = 2 l^2 \Phi(-l\sqrt{I}/2)

with {\Phi} being the standard normal cdf and
\displaystyle I = \mathbb{E}_f\left[\left(f'(X)/ f(X)\right)^2\right].

Here {h(l)} is the speed measure of the diffusion process, and the ‘most efficient’ asymptotic diffusion has the largest speed measure. {I} measures the ‘roughness’ of {f}.

Example 1 If {f} is normal with variance {\sigma_f^2}, then {f(x) = (2\pi\sigma^2_f)^{-1/2}\exp(-x^2/(2\sigma_f^2))} and
\displaystyle I = \mathbb{E}_f\left[\left(f'(X) / f(X) \right)^2\right] = \sigma_f^{-4}\,\mathbb{E}_f\left[X^2\right] = 1/\sigma^2_f.

So when the target density {f} is normal, the optimal value of {l} scales with {1/\sqrt{I}}, which coincides with the standard deviation of {f}.
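A small simulation (my own sketch, not from the paper) illustrates this: with standard normal marginals ({\sigma_f = 1}, so {I = 1}) and proposal variance {\sigma_d^2 = \hat{l}^2/(d-1)} with {\hat{l} = 2.38} (the optimal value from Corollary 2 below), the empirical acceptance rate of random walk Metropolis comes out close to 0.234.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100
l_hat = 2.38                        # optimal l when I = 1 (standard normal marginals)
sigma = l_hat / np.sqrt(d - 1)      # proposal scale, sigma_d^2 = l^2 / (d - 1)

log_pi = lambda x: -0.5 * np.sum(x ** 2)   # product of standard normal marginals (up to a constant)

x = rng.standard_normal(d)          # start at stationarity
accepted, n_iter = 0, 5000
for _ in range(n_iter):
    y = x + sigma * rng.standard_normal(d)             # random walk proposal
    if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # symmetric proposal: the q terms cancel
        x, accepted = y, accepted + 1
print("empirical acceptance rate:", accepted / n_iter)  # should be near 0.234
```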

Proof: (of Theorem 1) This is a proof sketch. The strategy is to prove that the generator of {Z^d}, defined by

\displaystyle G_d V(x^d) = d\, \mathbb{E}\left[\left(V(Y^d) - V(x^d)\right) \left( 1 \wedge \frac{\pi_d(Y^d)}{\pi_d(x^d)}\right)\right],

converges to the generator of the limiting Langevin diffusion, defined by
\displaystyle GV(x) = h(l) \left[\frac{1}{2} V''(x) + \frac{1}{2} \frac{d}{dx}(\log f)(x) V'(x)\right].

Here {V} is a function of the first component only.
First define a set

\displaystyle F_d = \{|R_d(x_2,\ldots,x_d) - I| < d^{-1/8}\} \cap \{|S_d(x_2,\ldots,x_d) - I| < d^{-1/8}\},

where
\displaystyle R_d(x_2,\ldots,x_d) = (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))'\right]^2

and
\displaystyle S_d(x_2,\ldots,x_d) = - (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))''\right].

For fixed {t}, one can show that {\mathbb{P}\left(Z^d_s \in F_d , 0 \le s \le t\right)} goes to 1 as {d \to \infty}. On these sets {\{F_d\}}, we have
\displaystyle \sup_{x^d \in F_d} |G_d V(x^d) - G V(x_1)| \to 0 \quad \text{as } d \to \infty ,

which essentially says {G_d \to G}, because we have uniform convergence for vectors contained in a set of limiting probability 1.
\Box

Corollary 2 (heuristics for RWMH) Let
\displaystyle a_d(l) = \int \int \pi_d(x^d)\alpha(x^d,y^d)q_d(x^d,y^d)dx^d dy^d

be the average acceptance rate of the random walk MH in {d} dimensions.
Then {\lim_{d\to\infty} a_d(l) = a(l)}, where {a(l) = 2 \Phi(-l\sqrt{I}/2)}.
{h(l)} is maximized by {l = \hat{l} = 2.38 / \sqrt{I}}, with {a(\hat{l}) \approx 0.234} and {h(\hat{l}) \approx 1.3 / I}.
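The constants in the corollary are easy to verify numerically; here is a short sketch (my own, assuming SciPy) that maximizes {h(l) = 2l^2\Phi(-l\sqrt{I}/2)} for an arbitrary roughness {I} and recovers {\hat{l}\sqrt{I} \approx 2.38}, {a(\hat{l}) \approx 0.234}, and {h(\hat{l}) \approx 1.3/I}.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

I = 2.0  # an arbitrary roughness value; the products printed below do not depend on it

h = lambda l: 2 * l**2 * norm.cdf(-l * np.sqrt(I) / 2)   # speed measure h(l)
a = lambda l: 2 * norm.cdf(-l * np.sqrt(I) / 2)          # limiting acceptance rate a(l)

res = minimize_scalar(lambda l: -h(l), bounds=(1e-6, 20.0), method="bounded")
l_hat = res.x
print("l_hat * sqrt(I):", round(l_hat * np.sqrt(I), 3))   # ~2.38
print("a(l_hat):       ", round(a(l_hat), 3))             # ~0.234
print("h(l_hat) * I:   ", round(h(l_hat) * I, 3))         # ~1.3
```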
The authors consider two extensions of the target density {\pi_d} for which the convergence and optimal scaling properties still hold. The first extension concerns the case where the {f_i}'s are different but satisfy a law of large numbers; the second concerns targets of the form {\pi_d(x^d) = f_1(x_1) \prod_{i=2}^{d} P(x_{i-1}, x_{i})}, with some conditions on {P}.