SRN – the magical 0.574

Hello 2021! This week, I keep reading about optimal scaling of Markov chain Monte Carlo algorithms and I move on to a new class of algorithms: MALA. This time, the magical number is 0.574. 

Hello 2021! Here is the first post in the Sunday Reading Notes series in the year of 2021. This week, I keep reading about optimal scaling of Markov chain Monte Carlo algorithms and I move on to a new class of algorithms: MALA.

Roberts and Rosenthal (1998, JRSSB) concerns the optimal scaling of Metropolis-Adjusted Langevin algorithm (MALA) for target distributions of the form {\pi^d(x) = \prod_{i=1}^d f(x_i)}. Unlike random walk Metropolis algorithm, in MALA, the proposal distribution uses the gradient of {\pi^d} at {x_t}:

\displaystyle x^{\star} = x_t + \sigma_n Z_{t+1} + \frac{\sigma^2}{2} \nabla \log (\pi^d(x_t)).

Here {\sigma_n} is the step variance and {Z_{t+1} \sim \mathcal{N}(0,I_d)} is a d-dimensional standard normal. Following this proposal, we perform an accept / reject step, so that the resulting Markov chain has the desired stationary distribution:
\displaystyle \alpha_d(x_t,x^{\star}) = \min \left\{ 1, \frac{\pi^d(x^{\star})}{\pi^d(x_t)} \frac{q_d(x^{\star},x_t)}{q_d(x_t,y_{t+1})}\right\}.

There are some smoothness assumptions on the log-density {g = \log f}: absolute value of the first 8 order derivatives are bounded, {g'} is Lipschitz, and all moments exist for {f}.

Theorem 3 (diffusion limit of first component)

Consider the chain {\{X\}} starting from stationarity and following the MALA algorithm with variance {\sigma_d^2 = l^2 n^{-1/3}} for some {l > 0}. Let {\{J\}} be a Poisson process with rate {n^{1/3}}, {\Gamma^n_t = X_{J_t}} and {U^d} is the first component of {\Gamma^n}, i.e. {U^d_t = X^d_{J_t,1}}. We must have, as {d \to \infty}, the process {U^d \to U}, where {U} is a Langevin diffusion defined by
\displaystyle dU_t = \sqrt{h(l)} dB_t + \frac{1}{2} h(l) \frac{d}{dx} \log (\pi_1(U_t))dt.

Here {h(l) = 2 l^2 \Phi(-Kl^3/2)} is the speed measure and {K = \sqrt{\mathbb{E} \left[\frac{5g'''(X)^3 - 3g''(X)}{48}\right]} > 0}.
The skeleton of the proof is similar to that of proving diffusion limit of random walk Metropolis. We will find sets {F_d^* \subseteq \mathbb{R}^d} such that {\mathbb{P}\left(\Gamma_s^d \in F_d^{\star} \ \forall \ 0 \le s \le t\right) \to 1} and on these sets the generators converges uniformly.

Theorem 4 (convergence of acceptance rate)
\displaystyle \lim_{d \to \infty} \alpha_d(l) = \alpha(l) where {a(l) = 2 \Phi(-Kl^3/2).}
As a result, when {h(l)} is maximized, the acceptance rate {\alpha(l) \approx 0.574}. Also the step variance should be tuned to be {\sigma_d^2 = (\hat{l}_d)^2 n^{-1/3}.}

MALA enjoys a faster convergence rate of order {n^{1/3}} compared to order {n} of RWM. But let’s also keep in mind that MALA requires an extra step to compute the gradients {\log \pi(x) = \sum_{i=1}^n g'(x)}, which takes {n} steps to compute.

Reference:

Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology)60(1), 255-268.

Author: PhyllisWithData

Statistics PhD student at Harvard University.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s