‘Tune your MH algorithm so that the acceptance rate is roughly 25%’ is standard advice given to students in Bayesian statistics classes. It has been almost 4 years since I first read about it in the book Bayesian Data Analysis, but I had never read the original paper where this result first appeared. This Christmas, I decided to read ‘Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms’ by Roberts, Gelman and Gilks, and to resume my Sunday Reading Notes series with a short exposition of this paper.
In Roberts, Gelman and Gilks (1997), the authors obtain a weak convergence result: as the dimension $d \to \infty$, a suitably sped-up version of the random walk Metropolis chain converges to a Langevin diffusion. The asymptotic optimal scaling problem then becomes a matter of optimizing the speed of this limiting diffusion, and the optimal speed turns out to be tied to the asymptotic acceptance rate of proposed moves.
A one-sentence summary of the paper would be:

if you have a $d$-dimensional target that is independent in each coordinate, then choose the step size of the random walk kernel to be $2.38/\sqrt{d}$, or tune your acceptance rate to be around $1/4$.
Unfortunately, in practice the ‘if’ condition is often overlooked, and people tune the acceptance rate to 0.25 whenever the proposal is a random walk, no matter what the target distribution is. It has been 20 years since the publication of the 0.234 result, and we are now applying MCMC algorithms to much more complicated target distributions, for example in parameter inference for state-space models. I feel this is a good time to revisit and appreciate the classical results while re-educating ourselves on their limitations.
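To make the recipe concrete, here is a minimal random walk Metropolis sketch in Python (the standard normal product target, dimension, and chain length are my own illustrative choices, not from the paper) using the $2.38/\sqrt{d}$ step size:

```python
import numpy as np

def rwmh(log_target, x0, step, n_iters, rng):
    """Random walk Metropolis with spherical Gaussian proposals."""
    x = np.array(x0, dtype=float)
    lp = log_target(x)
    n_accept = 0
    for _ in range(n_iters):
        y = x + step * rng.standard_normal(x.size)  # propose y = x + sigma_d * Z
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:       # Metropolis accept/reject
            x, lp = y, lp_y
            n_accept += 1
    return x, n_accept / n_iters

d = 50
log_target = lambda x: -0.5 * np.sum(x**2)          # iid N(0,1) components
_, acc_rate = rwmh(log_target, np.zeros(d), 2.38 / np.sqrt(d), 50_000,
                   np.random.default_rng(2024))
print(f"acceptance rate: {acc_rate:.3f}")           # roughly 0.234 for large d
```

In practice one often inverts this: adjust `step` during an adaptation phase until the empirical acceptance rate sits near 0.234, which is the ‘tune to 25%’ advice.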
Reference:
Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), 110-120.
——–TECHNICAL EXPOSITION——-
Assumption 1 The marginal density $f$ of each component is such that $f'/f$ is Lipschitz continuous and

$$\mathbb{E}_f\!\left[\left(\frac{f'(X)}{f(X)}\right)^{8}\right] < \infty, \quad (1)$$

$$\mathbb{E}_f\!\left[\left(\frac{f''(X)}{f(X)}\right)^{4}\right] < \infty. \quad (2)$$
Roberts et al. (1997) consider the random walk proposal

$$y^d = x^d + \sigma_d Z^d, \quad Z^d \sim N(0, I_d), \quad \sigma_d^2 = \ell^2/(d-1).$$
We use $X^d = (X^d_0, X^d_1, \ldots)$ to denote the Metropolis Markov chain and define another Markov process $Z^d$ with $Z^d_t = X^d_{[dt]}$, which is the sped-up version of $X^d$. Here $[dt]$ denotes the floor of $dt$. Define $U^d_t = Z^d_{t,1}$, the first component of $Z^d$.
Theorem 1 (diffusion limit of first component) Suppose $f$ is positive and in $C^2$ and that (1)-(2) hold. Let $X^d_0$ be such that all components are distributed according to $f$, i.e. $X^d_{0,i} \overset{\text{iid}}{\sim} f$ for all $i$. Then as $d \to \infty$,

$$U^d \Rightarrow U,$$

where $U_0 \sim f$ and $U$ satisfies the Langevin SDE

$$dU_t = \left(h(\ell)\right)^{1/2} dB_t + \frac{h(\ell)}{2}\, \frac{f'(U_t)}{f(U_t)}\, dt$$

and

$$h(\ell) = 2\ell^2\, \Phi\!\left(-\frac{\ell\sqrt{I}}{2}\right),$$

with $\Phi$ being the standard normal cdf and

$$I = \mathbb{E}_f\!\left[\left(\frac{f'(X)}{f(X)}\right)^2\right].$$
Here $h(\ell)$ is the speed measure of the diffusion process, and the `most efficient' asymptotic diffusion is the one with the largest speed measure. The quantity $I$ measures the `roughness' of $f$.
Example 1 If $f$ is the density of $N(0, \sigma^2)$, then $I = \sigma^{-2}$. So when the target density is normal, the optimal value of $\ell$ is scaled by $\sigma$, which coincides with the standard deviation of $f$.
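As a quick numerical check of this example (reusing the `rwmh` sketch above; the particular dimensions and chain length are arbitrary choices), the empirical acceptance rate under the $2.38/\sqrt{d}$ scaling should approach 0.234 as $d$ grows:

```python
# Acceptance rate under the 2.38/sqrt(d) scaling for a standard normal target;
# the theory predicts convergence to about 0.234 as d grows.
for d in (5, 20, 100, 500):
    _, acc_rate = rwmh(lambda x: -0.5 * np.sum(x**2), np.zeros(d),
                       2.38 / np.sqrt(d), 20_000, np.random.default_rng(d))
    print(f"d = {d:4d}: acceptance rate = {acc_rate:.3f}")
```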
Proof: (of Theorem 1) This is a proof sketch. The strategy is to prove that the generator of $Z^d$, defined by

$$G_d V(x^d) = d\, \mathbb{E}\left[\left(V(y^d) - V(x^d)\right)\left(1 \wedge \frac{\pi_d(y^d)}{\pi_d(x^d)}\right)\right],$$

converges to the generator of the limiting Langevin diffusion, defined by

$$G V(x) = h(\ell)\left[\frac{1}{2} V''(x) + \frac{1}{2}\big(\log f(x)\big)'\, V'(x)\right].$$

Here the function $V$ is a function of the first component only.
First define a set

$$F_d = \left\{ x \in \mathbb{R}^d : |R_d(x) - I| < d^{-1/8} \ \text{and}\ |S_d(x) - I| < d^{-1/8} \right\},$$

where

$$R_d(x) = \frac{1}{d-1}\sum_{i=2}^d \left[\big(\log f(x_i)\big)'\right]^2$$

and

$$S_d(x) = -\frac{1}{d-1}\sum_{i=2}^d \big(\log f(x_i)\big)''.$$
For fixed $V$, one can show that $\pi_d(F_d)$ goes to 1 as $d \to \infty$. On these sets $F_d$, we have

$$\sup_{x^d \in F_d} \left| G_d V(x^d) - G V(x^d) \right| \to 0,$$

which essentially says $G_d \to G$, because we have uniform convergence for vectors contained in a set of limiting probability 1. $\Box$
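As an aside (this heuristic is my own summary of the standard argument, not a verbatim step from the paper), the factor $2\Phi(-\ell\sqrt{I}/2)$ in $h(\ell)$ can be traced by Taylor expanding the log acceptance ratio:

$$\log \frac{\pi_d(y^d)}{\pi_d(x^d)} = \sum_{i=1}^d \left[\log f(x_i + \sigma_d Z_i) - \log f(x_i)\right] \approx \sigma_d \sum_{i=1}^d \big(\log f\big)'(x_i)\, Z_i + \frac{\sigma_d^2}{2} \sum_{i=1}^d \big(\log f\big)''(x_i)\, Z_i^2.$$

In stationarity, the law of large numbers and the central limit theorem imply that this quantity converges to $W \sim N(-\ell^2 I/2,\ \ell^2 I)$, and for a Gaussian whose mean is minus half its variance one can check that $\mathbb{E}[1 \wedge e^W] = 2\Phi(-\ell\sqrt{I}/2)$.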
Corollary 2 (heuristics for RWMH) Let $a_d(\ell)$ be the average acceptance rate of the random walk MH in $d$ dimensions. We must have

$$\lim_{d \to \infty} a_d(\ell) = a(\ell), \quad \text{where } a(\ell) = 2\Phi\!\left(-\frac{\ell\sqrt{I}}{2}\right).$$

The speed $h(\ell)$ is maximized by $\hat{\ell} = 2.38/\sqrt{I}$, at which point $a(\hat{\ell}) \approx 0.234$ and $h(\hat{\ell}) \approx 1.33/I$.
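Corollary 2 is easy to verify numerically. Below is a short Python sketch (using scipy; setting $I = 1$ and the particular search interval are my own illustrative choices) that maximizes the speed $h(\ell)$ and reports the optimal $\ell$ and the implied acceptance rate:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

I = 1.0   # 'roughness' of f; I = 1 for a standard normal target

def h(ell):
    """Speed measure h(l) = 2 l^2 Phi(-l sqrt(I)/2) of the limiting diffusion."""
    return 2.0 * ell**2 * norm.cdf(-ell * np.sqrt(I) / 2.0)

res = minimize_scalar(lambda ell: -h(ell), bounds=(0.01, 10.0), method="bounded")
ell_hat = res.x
print(f"optimal l      : {ell_hat:.3f}")                                   # about 2.38/sqrt(I)
print(f"acceptance rate: {2 * norm.cdf(-ell_hat * np.sqrt(I) / 2):.3f}")   # about 0.234
```

For $I \neq 1$, the optimal $\hat{\ell}$ rescales by $1/\sqrt{I}$ while the optimal acceptance rate stays at 0.234, which is why the 0.234 rule is scale-free within this framework.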
The authors consider two extensions of the target density $\pi_d$ for which the weak convergence and optimal scaling properties still hold. The first extension concerns the case where the marginal densities $f_i$ differ across components, but a law of large numbers holds for these density functions. Another extension concerns targets of the form $\pi_d(x^d) = \prod_{i=1}^d C_i\, f(C_i x_i)$, with some conditions on the scaling constants $C_i$.