Category Archives: Sunday Reading Notes

SRN – “Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets” by Cornish et al.

My COVID-19 self-isolation plan includes trying to finish as many to-read articles as possible and to blog my progress on this blog as much as possible. This weekend, I read the paper “Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets” by Robert Cornish, Paul Vanetti, Alexandre Bouchard-Cȏté, George Deligiannidis, and Arnaud Doucet. This is an ICML 2019 paper and it is a dense one. It attempts to solve one problem. When faced with large datasets,  standard MCMC methods have a cost of O(n) per step due to evaluating the log-likelihood for n data points. Can we make it faster? In a typical MH step (with symmetric proposals), we accept the proposal with probability \alpha(\theta,\theta') = \min(1, \frac{\pi(\theta')}{\pi(\theta)}). This is typically implemented with

log_alpha <- log_post(theta_proposal) - log_post(theta_current);
if (log(runif(1)) < log_alpha ){
    theta_current <- theta_proposal;
}

The authors realize that it is wasteful having to evaluate the log-likelihood at all data points before rejecting the proposal. Their solution, Salable Metropolis-Hastings (SMH), is based on a combination of three ideas: (1) factorization of posterior density (trivial factorization is 1 + n factors), (2) fast simulation of rejection using thinning of a Poisson point process, and (3) posterior concentration around mode when n is large. 

I first get excited about this paper seeing the fast simulation of Bernoulli random variables (in Algorithm 1) because there is a very similar problem on simulating a Poisson process with inhomogenous rates in Stat 210, for which I was a Teaching Fellow (TF). I am very happy to see this result being useful in cutting-edge research. The fast simulation relies on access to a lower bound for each factor in the posterior distribution and the tighter this lower bound the more efficient this implementation. Suppose the target distribution has m factors, then once we pay a once-off O(m) cost, each acceptance step can be implemented in O(upper bound of negative-log-posterior) number of steps on average. 

Another highlight of the paper is the smart application of the Bernstein-von Mises (BvM) phenomenon, that which states that posterior distribution concentrates around the mode with rate 1/\sqrt{n} as n goes to infinity. BvM was an important topic in Stat 213, which is another class that I TFed. I have seen BvM used (1) to select step-size in random-walk MH, (2) to initialize an MH chain, and (3) to motivate Laplace approximation of the posterior distribution. In this paper, it motivates a Taylor expansion of the potential function near posterior mode, which subsequently leads to the factorization behind the SMH kernel. The SMH kernel leaves the target distribution invariant, and, if not mistaken by me, this property explains the use of “exact” in the title. More importantly, since the acceptance probability in SMH is factorizable, we can implement the acceptance step with the aforementioned fast simulation, provided that log-density can be differentiated enough times. Intuitively, the higher the differentiability, the closer the Taylor expansion, the tighter the bound for acceptance rate and the fewer the number of evaluations we need. In fact, as later justified with Theorem 3.1, the average computational cost decays geometrically with respect to the order of Taylor expansion. With a second-order Taylor expansion, which requires 3rd-order differentiability, the expected computational cost behaves like O(1/\sqrt{n}). Hooray! I am always a big fan of “the bigger the problem, the easier the solution”. 

I learn a lot from this paper, especially from the opening paragraph of Section 3 and Section 3.2. But there is a question that I keep asking myself: if n is very large and we have an estimate of posterior mode that is O(1/\sqrt{n}) distance from the true mode, how long does it take for SMH to be better than that declaring the posterior is Gaussian by invoking BvM and sleeping on the result?

SRN – my second attempt to understand ‘Controlled SMC’ by Heng et al.

It has been two years since I last tried to understand the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng et al. Over the past two years, I picked this paper up many times, and every time I read it, I understood it a little better. I am finally ready to share my understanding of this paper on this blog.

This paper opens with an introduction to the Feynman-Kac path measure. A triple of (1) initial distribution, (2) a collection of Markov kernels, and (3) potential functions determines a Feynman-Kac path measure. For readers familiar with state-space models, this is not something new: the triple can correspond to (1) the initial distribution of hidden process, (2) Markov kernels of the hidden process and (3) observation densities.

The Markov chain on path space
Feynman-Kac path measure defined by the Markov chain and the potentials.

For each Feynman-Kac path measure, there is a sequential Monte Carlo algorithm (aka particle filters) that provides a particle approximation of the path measure. Furthermore, Feynman-Kac path measures can be twisted. Although all the kernels and the potentials are changed, the path measure is the same at the terminal time and the normalizing constant at the terminal time Z_T stays the same.

The Markov chain twisted by policy \phi
Eq. (13): The twisted potentials. Eq. (12): rewrite the path measure with the twisted potentials and twisted kernels.

To infer state-space models, we are interested in the smoothing distribution x_{0:T} given y_{0:T} or the marginal likelihood p(Y_{0:T}). The smoothing distribution has Feynman-Kac measure representations, with the marginal likelihood being the normalizing constant.  The most natural one is described above, and it corresponds to the bootstrap particle filter. But it is not unique. For example, the path measure associated with a 1-step look ahead kernel corresponds to the popular ‘fully-adapted auxiliary particle filter’. As mentioned above, each of these path measures can be twisted. Proposition 1 defines an optimal policy with respect to each path measure. Optimality is defined in the sense of approximating the normalizing constant almost surely. The optimal policy with respect to the path measure corresponding to the bootstrap particle filter is the backward information filter \phi^{\star}(x_t) = p(y_{t:T}|x_t)

The optimal policies \phi^{\star} are usually intractable, but they can be approximated. After running any SMC algorithm, we have a particle approximation of the path measure, which gives us a particle approximation of the optimal policy. We can use regression to learn a simple policy \phi that minimized the difference between \log\phi and particle approximation of \log\phi^{\star}. This regression is done recursively starting at t = T and ending at t = 0, and it is called approximate dynamic programming in the paper.  Given this policy, we have a new path measure, which means that we can run another particle filter according to it and find the optimal policy with respect to the new measure. This process is called policy refinement. According to the authors, in practice, only a few iterations of policy refinement is necessary. They also offer performance monitoring criteria based on effective sample size and regression residuals. 

What fascinates (and puzzles) me is that errors do not accumulate over either iteration of policy refinement or approximate dynamic programming. I hope to read this paper is greater detail and be able to report back to my readers soon. 

SRN – Further and stronger analogy between sampling and optimization by Arnak Dalalyan

Lately, I have been pondering the connection between Monte Carlo based sampling methods and optimation methods. It brings me to read a paper that I have been wanting to read in detail long ago. In this week’s Sunday Reading Notes series on Phyllis with Data, I discuss the paper “Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent” by Arnak Dalalyan. 

This paper focuses on the Langevin Monte Carlo algorithm, which I believe is also called the Unadjusted Langevin algorithm (ULA) in contrast to the Metropolis adjusted Langevin algorithm (MALA). Using the Wasserstein distance as the metric, this paper establishes an upper bound of the Wasserstein distance, more precisely it shows that one needs O(\frac{d}{\epsilon^2}\log(\frac{d}{\epsilon})) number of iterations to reach the precision level . Considering the close connection between ULA and gradient descent algorithm for optimization, the author proves an upper bound for the L2 distance for gradient descent by cleverly utilizing a tempering argument (details in section 3 of the paper). The well-known result for optimization is that it takes O(\log(\frac{1}{\epsilon})) iterations to reach precision in gradient descent. Intuitively sampling should be a more difficult task than optimization because it requires an exploration of the whole parameter space, which can be of high-dimension. The bounds in this paper confirm this intuition. More importantly, it points out how these two problems are “continuously” connected (as the temperature converges to 0) and that this connection naturally leads to the deduction of the optimization bound from the sampling bound. 

Admittedly this is not a new paper as it was first arxived in 2016, there have been many papers tackling similar problems in sampling, for example using the KL-divergence (Cheng and Bartleet 2019) or extending the results to MALA (Dwivedi et al. 2019). When I first picked up this paper two years ago as a second-year Ph.D. student, I was intimidated by the proof, which in fact was only 6-pages long. This time, as I sat down and went through the proof line by line, I found it quite accessible and elegant. I have been equipped with the mathematical knowledge required to understand the proof in college. This time, I was defeated by my own fear. For me, this paper is not only a delightful read, but also a wake-up call to trust myself. 

SRN – Controlled Sequential Monte Carlo by Heng et al. (Part 1)

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet. You can read Xi’an’s blog post discussing this paper.

As pointed out in Xi’an’s blog, this is a massive paper, and I have only managed to go through Section 3. This paper starts with introducing Feynman-Kac models and twisted Feynman-Kac models, which is motivated by designing observations dependent kernels for sequential Monte Carlo (SMC) methods. At the terminal time T, the twisted model and original model are the same and at all times t = 0,1,2,\cdots, T the two models have the same normalizing constant which we try to approximate with SMC. With a current policy \psi, there is a policy refinement \psi^\star = \psi \cdot \phi^\star that is a admissible policy and \psi^star is referred to as the optimal policy by the authors because of its connection with Kullback-Leibler optimal control problem.

The policy refinements can be found using backward recursion \psi^\star_T = G_T^\phi and \psi^\star_t = Q_t^\phi \phi^{\star}_{t+1} for t = 0,\cdots, T-1 where Q are Bellman operators. Because the policies \psi^\star are intractable, the authors introduce logarithmic projection as a means of approximation. This is the Approximate Dynamic Programming (ADP) algorithm that the authors outlined in Algorithm 2. The final algorithm (Algorithm 3) alternates between performing a twisted SMC and ADP to find the next policy refinement.


References:

Heng, J., Bishop, A. N., Deligiannidis, G., & Doucet, A. (2017). Controlled Sequential Monte Carlo. arXiv preprint arXiv:1708.08396.

SRN – The Iterated Auxiliary Particle Filter by Guarniero et al.

In this week’s Sunday Reading Post, I want to discuss the paper ‘The Iterated Auxiliary Particle Filter’ (iAPF) by Pieralberto Guarniero, Adam Johansen and Antony Lee.  

The algorithm proposed, iAPF, approximates an idealized particle filter, which would have zero variance in terms of estimating the marginal likelihood of a hidden Markov model. The motivation is that if we were to use an idealized particle filter, we would only need one particle instead of a system of particles for marginal likelihood estimations. The idealized particle filter would be the backward-smoothing provided a perfect forward-filtering step. 

This paper builds upon previous studies of particle filters, including twisted particle filter and the Auxiliary Particle Filter (APF). The idea is to introduce a family of ‘twisted’ models through functions \psi in order to define a new model that would have the same marginal likelihood as the original model. Because the optimal policy \psi^\star,which corresponds to the idealized particle filter, is intractable in general cases, the authors use a series of functional approximations \psi^0,\psi^1, \cdots, \psi^l to approximate \psi^\star.

The key the question to me is how to and how well can we approximate \psi^\star. The ‘how to’ qustions is addressed in Section 3.3. The series of backward recursions that are used to approximate \psi^\star relies on the iterative relationship \psi^\star(x_t) = p(y_t|x_t) f(x_t,\psi^\star_{t+1}) = p(y_t|x_t)\int_X f(x_{t},x_{t+1})\psi^{\star}_{t+1}(x_{t+1}) dx_{t+1}. In Algorithms 3, the authors suggest that we ‘choose \psi^{t} on the basis of \psi_{t}^{1:N} and \psi_t^{1:N}. Later in Section 5, they mention that this step is implemented with a parametric optimization step for their experiments. In the ‘Discussions’ section, the authors mention alternatives such as nonparametric estimates which would require much higher cost. More importantly is the ‘how well’ part, because it is also possible that the class of functions \Psi that we consider does not contain the optimal function \psi. I think this is a very interesting and importantly case and the authors report ‘fairly well’ performance of iAPF in this case. 

As demonstrated in the ‘Applications and Examples’ section, “iAPF can provide substantially better estimates of the marginal likelihood than the Bootstrap Particle Filter (BPS) at the same computational cost.

Log-likelihood estimates using BPF with N = 1000,10000 particles and iAPF with N = 100 particles. We can see from the width of the box plots that iAPF has smaller variances compared BPF by using fewer particles.  Source: (Guarniero et al 2017)

References:

  • Guarniero, P., Johansen, A. M., & Lee, A. (2017). The iterated auxiliary particle filter. Journal of the American Statistical Association112(520), 1636-1647.
  • Pitt, M. K., & Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association94(446), 590-599
  • Whiteley, N., & Lee, A. (2014). Twisted particle filters. The Annals of Statistics42(1), 115-141.

SRN – Racing Thompson by Zhou et al.

While browsing the accepted papers list of ICML 2018, I discovered this paper ‘Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors‘ by Zhou, Zhu, and Zhuo. Thompson sampling is a popular algorithm for exploration-exploitation tradeoff problems and is also known as Bayesian bandits. I decided to write my Sunday Reading Notes post on this paper because have been interested in the exploration-exploitation tradeoff for a while and explored this topic through Bayesian optimization and my WSDM’19 paper on sequential A/B testing.

Suppose we want to identify the best arm among K arms and we have some prior knowledge about their rewards \mu \sim \pi. Thompson sampling (TS) balances exploring unexplored arms getting rewards from arms already yielding high rewards by choosing the kth arm according to its the posterior probability of being the optimal arm P_{it} = \pi\left( \mu_i = \max_j \mu_j \right). The computational challenge is to compute the probabilities $P_{it}$. Because TS is often used as an online algorithm, efficient calculation of the posterior probabilities is very important. In the conjugate prior case, this calculation is done in O(K). With non-conjugate priors, I have seen in the literature that people using Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). The authors recognize this probability as an expectation and propose to use an Importance Sampling (IS) step combined with Gumbel-Max trick, which transforms the sampling problem to an optimization problem, to sample k at time k according to probabilities \pi\left( \mu_i = \max_j \mu_j \right) =  \mathbb{E}_{\mu\sim \pi(\cdot|X(1:t))}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \right] = \mathbb{E}_{\mu\sim B_t}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \frac{\pi(\mu|X(1:t)}{B_t(\mu}\right].

The benefits of this IS step comes from flexibility to choose $B_t$ at each time step and also the authors leveraged the stopping rule of racing algorithms to deterime the number of IS samples needed to approximate the expectation.

The resulting algorithm, which combines benefits from Importance Sampling, Gumble-Max trick, and the racing algorithm, is proved to be (\delta,\sigma)-PAC, which is asymptotic good in the sense the total variance distance between the true value P_{it} and its estimate converges to zero.

Screen Shot 2018-11-11 at 3.47.24 PM.png

The regret of bandits with (b) Bernoulli bandits & non-conjugate prior and (c) Gaussian bandits & non-conjugate prior. Source: Figure 2 from Zhou et al.

What I find very interesting from the regret analysis section is the fact that the racing TS in this paper can provide much lower regret compared to Thompson sampling and prior-swapping (PS) even though it uses much few particles than SMC and PS. It is not intuitive to me why this should happen. But upon a little further investigation, I found that the priors used for TS and PS & Racing are different in both plots. For ease of implementation, the authors have chosen a conjugate prior for TS. This leaves me wondering what the results would be if we were to use MCMC or SMC with more particles as the baseline for the regret analysis.

References:

  • Zhou, Y., Zhu, J. & Zhuo, J.. (2018). Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors. Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:6000-6008

SRN – Scalable Bayesian Inference for Excitatory Point Process Networks by Linderman and Adams

The topic of this week’s SRN is latent network discovery. The paper ‘Scalable bayesian inference for excitatory point process networks‘ by Scott Linderman and Ryan Adams is suggested to me by Gonzalo Mena.

example_hawkes_poisson.png

Hawkes process v.s. Poisson process, by Steven Mores. Source: https://stmorse.github.io/journal/Hawkes-python.html.

 

In a multivariate Hawkes process, there are K individual point processes and events on one constituent process are more likely to happen right after an event from a neighboring process. To put it in other words, events are ‘induce’ subsequent events according to a network describing their interactions. If we imagine a network of neurons, if neuron A and B are connected and when neuron A fires it can excite neuron B and neuron B has a higher chance of firing. When we want to infer connectivity, if neuron B consistently after neuron A fires, we might  infer that there is likely an excitatory connection between them.

In the work by Linderman and Adams, they extend previous works on continuous time Hawkes process and study the discrete time network Hawkes model. This discretization is motivated by computational concerns. In Simma and Jordan (2010), the authors invoke auxiliary variable z_{k,n} that which denotes the origin of the n-th event on process k. So the number of auxiliary variables needed is large for highly active networks, which could be a computational bottleneck. But working in the discrete model assuming a time scale \Delta t, we can ‘bin’ events and ignore interactions within the same bin. This reduction in number of auxiliary variables improves the run time of inference algorithms such as Gibbs sampler.

Screen Shot 2018-11-04 at 4.43.49 PM.png

Comparison of run time per Gibbs sweep. Source: Fig 2 of Linderman and Adams (2015) https://arxiv.org/pdf/1507.03228.pdf

Besides Gibbs sampler, Linderman and Adams also derived stochastic variational inference algorithm for the discrete time network Hawkes model. This work focuses on scaling to long duration T and the problem of scaling with size of network K remains open.

References:

  • Linderman, S. W., & Adams, R. P. (2015). Scalable bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228.
  • Linderman, S., & Adams, R. (2014, January). Discovering latent network structure in point process data. In International Conference on Machine Learning (pp. 1413-1421).
  • Simma, A., & Jordan, M. I. (2012). Modeling events with cascades of Poisson processes. arXiv preprint arXiv:1203.3516.
  • Steven Morse (2017, June). Python class for Hawkes Process. [Blog post] https://stmorse.github.io/journal/Hawkes-python.html.

SRN – Bayesian Calibration of Microsimulation Models by Rutter et al.

The Sunday Reading Notes paper for this week is ‘Bayesian Calibration of Microsimulation Models’ by Carolyn Rutter, Diana Miglioretti and James Savarino. This is a 2009 JASA Applications and Case Studies paper.

According to a 2012 Review paper by Rutter et al,

Microsimulation models (MSMs) for health outcomes simulate individual event histories associated with key components of a disease process; these simulated life histories can be aggregated to estimate population-level effects of treatment on disease outcomes and the comparative effectiveness of treatments. Although MSMs are used to address a wide range of research questions, methodological improvements in MSM approaches have been slowed by the lack of communication among modelers. In addition, there are few resources to guide individuals who may wish to use MSM projections to inform decisions.

In this paper, the authors propose a Bayesian method to calibrate microsimulations models, using Markov Chain Monte Carlo. The case study in this paper is the history of colorectal cancer (CRC). In this paper, the authors assume all CRCs arise from an adenoma and the history of CRC consists of four components: 1) adenoma risk, 2) adenoma growth, 3) transition from adenoma to preclinical cancer and 4) transition from preclinical cancel to clinical cancer. These four components are not observed directly and the calibration data consists of prevalence of adenomas and preclinical cancers and the size and/or number of adenomas from many (independent) studies from different years and about different subpopulations and using different colonoscopy methods.

The authors use \theta to denote MSM parameters and these parameters can be separated from K components with independent priors: \pi(\theta) = \prod_{i=1}^k \pi_i(\theta_i). The calibration data come from N independent sources and each follows some distribution  y_j|\theta \sim f_j(g_j(\theta)) parametrized by some unknown functions of the MSM parameters. The likelihood is therefore f(Y | \theta) = f(Y|g(\theta) )= \prod_{j=1}^n f_j(y_j|g_j(\theta)).

What makes the calibration problem hard is the unknown function g(\theta). Suppose we want to simulate from the posterior distribution of \theta|Y using a Metropolis-Hastings(MH) algorithm, we need to know r(\theta,\theta^*) = \frac{\pi(\theta)f(y|g(\theta)}{\pi(\theta^*)f(y|g(\theta^*)}. But because g is unknown, the MH step cannot be performed.

So the authors propose an approximate MH algorithm that includes a step to estimate $g(\theta)$ for both the current value \theta and the proposed value \theta^*. To me this feels like an ‘EM’ step: simulate m copies of \tilde{y}_{ji} from f_j(y_j| g_j(\theta)) and calculate the MLE \hat{g}_j(\theta). With this approximation, the resulting transition probability function \hat{\alpha}(\theta,\theta^*)for the Metropolis-within-Gibbs step on \theta_i is based on \frac{\pi(\theta_i)}{\pi(\theta_i^*)}\cdot \frac{\prod_{j=1}^N f_j(y_j|\hat{g}_j(\theta^*_i, \theta_{(-i)})}{\prod_{j=1}^n f_j(y_j|\hat{g}_j(\theta))}. In the Appendix, the authors prove that this approximation satisfies the detailed-balanced condition in the limit of m goes to infinity.

I think this paper provides an interesting example of how to incorporate data from multiple sources. As the authors point out,

how closely the model should calibrate to observed data is unclear, especially when calibration data are variable and many provide conflicting interest. […] It depends on how modelers trade-off concerns about possibly overparameterizing and overfitting calibration data relative to the importance of exactly replicating observed or expected results.

 

References:

  • Rutter, C. M., Miglioretti, D. L., & Savarino, J. E. (2009). Bayesian calibration of microsimulation models. Journal of the American Statistical Association104(488), 1338-1350.
  • Rutter, C. M., Zaslavsky, A. M., & Feuer, E. J. (2011). Dynamic microsimulation models for health outcomes: a review. Medical Decision Making31(1), 10-18.

SRN – A Geometric Interpretation of the Metropolis-Hastings Algorithm by Billera and Diaconis

Coming back to the Sunday Reading Notes, this week I discuss the paper ‘A Geometric Interpretation of the Metropolis-Hastings Algorithm’ by Louis J. Billera and Persi Diaconis from Statistical Science. This paper is suggested to me by Joe Blitzstein.

In Section 4 of ‘Informed proposals for local MCMC in discrete spaces’ by Giacomo Zanella (see my SRN Part I and II), Zanella mentions that the Metropolis-Hasting acceptance probability function(APF) \min\left(1,\frac{\pi(y)p(x,y)}{\pi(x)p(y,x)}\right) is not the only APF that makes the resulting kernel \pi-reversible as long as detailed-balance is satisfied. This comes first as a ‘surprise’ to me as I have never seen another APF in practice. But very quickly I realize that this fact was mentioned in both Stat 213 & Stat 220 at Harvard and I have read about it from Section 5.3 – ‘Why Does the Metropolis Algorithm Work?‘ of ‘Monte Carlo Strategies in Scientific Computing‘ by Jun S. Liu. Unfortunately, I did not pay enough attention. Joe suggested this article to me after I posted on Facebook about being upset with not knowing such a basic fact.

In this Billera and Diaconis paper, the authors focus on the finite state space case X and considers the MH kernel as the projection of stochastic matrices (row sums are all 1 and all entries are non-negative, denoted by\mathcal{s}(X)) onto the set of \pi-reversible Markov chains (stochastic matrices that satisfy detailed balance \pi(x)M(x,y) = \pi(y)M(y,x), denoted by R(\pi)). If we introduce a metric on the stochastic matrices: d(K,K') = \sum_{x} \sum_{x\not=y} \pi(x) |K(x,y)-K'(x,y)|.

The key result in this paper is Theorem 1. The authors prove that the Metropolis maps M := M(K)(x,y) = \min\left( K(x,y), \frac{\pi(y}{\pi(x)}K(y,x)\right) minimizes the distance d from the proposal kernel K to R(\pi). This means that M(K) is the unique closest element in R(\pi) that is coordinate-wise smaller than K on its off-diagonal entries. So M is in a sense the closest reversible kernel to the original kernel K.

I think this geometric interpretation offers great intuition about how the MH algorithm works: we start with a kernel K and change it to another kernel with stationary distribution \pi. And the change must occur as follows:

from x, choose y from K(x,y) and decide to accept x or stay at y; this last choice may be stochastic with acceptance probabilty F(x,y) \in [0,1]. This gives the new chain with transition probabilities: K(x,y) F(x,y), x \not =y$. The diagonal entries are changed so that each row sums to 1.

Indeed the above procedure describes how the MH algorithm works. If we insist on \pi-reversibility, we must have 0 \leq 0 \leq \min(1,R(x,y) where R(x,y) =  \frac{\pi(y)K(y,x)}{\pi(x)K(x,y)}. So the MH choice of APF is one that maximizes the chance of moving from x to y. The resulting MH kernel M has the largest spectral gap (1 – second largest eigenvalue) and by Peksun’s theorem must have the minimum asymptotic variance estimating additive functionals.

In Remark 3.2, the authors point out if we consider only APF that are functions of R(x,y), then the function must satisfy g(x) = x g(1/x) which is the characteristic of balancing functions in Zanella’s ‘informed proposals’ paper.

This paper allows me to study Metropolis-Hastings algorithm from another angle and review facts I have neglected in my coursework.

References:

  • Billera, L. J., & Diaconis, P. (2001). A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 335-339.
  • Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.
  • Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.

SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part II)

This is a super delayed Sunday Reading Notes post and today I come back to finish discussing the informed proposals paper by Giacomo Zanella. In Part I of my discussions, I reviewed point-wise informed proposals and locally-informed criterion. In Part II, I hope to focus on asymptotical optimality of locally informed proposals and their simulation studies.

The author uses Peskun ordering to compare efficiency of MH schemes. He deduces that ‘locally-balanced proposals are asymptotically optimal in terms of Peskun ordering’ as dimensionality of the underlying state space increases. Conditions for asymptotic optimality are indeed mild (Proposition 1) for the three illustrative examples: 1) independent binary components, 2) weighted permutation and 3) Ising model.

In Section 4, Zanella points out a connection between the balancing function and acceptance probability function (APF) for MH algorithms, which I find very interesting. He also shows that the optimal proposal for independent binary variables is Barker choice g(t) = \frac{t}{1+t}. The proof goes by finding the limiting continuous-time process of the MH chain and finding the optimal g for the limiting process.

The simulation studies use the illustrative examples: weighted permutations and Ising models. The comparisons are in terms of 1) acceptance rate, 2) number of successful flips per computation time and 3) mixing of some summary statistics. The second criterion concerns the trade-off between computational cost (of calculating the informed proposal) and statistical accuracy (by producing efficient moves). For simple target distributions (such as Uniform), using locally-balanced proposals does not bring much benefits but it achieves a much higher number of flips per unit time for more complicated and ‘rough’ targets. See second row of Figure 1 of the paper.

Screen Shot 2018-10-14 at 5.37.28 PM

I find it interesting that globally-balanced proposals (aka ‘naively-informed’ or g(t) = t) are extremely sensitive to initialization. Looking at the effective sample size (ESS) per time, the chains from GB has much more stable behavior if initialized from stationarity. See GB v.s. GB( station.) in Figure 2. But in Figure 5, this phenomenon does not show up for Ising models: initializing from stationarity does not yield ESS performance comparable to that of LBs.

Screen Shot 2018-10-14 at 5.44.23 PM.png

In the simulation studies section, the author emphasizes the cost-vs-efficiency trade-off, which I find very important. I feel I have ignored this aspect of designing MCMC algorithms and should think more about it in my future studies. The author indicates that ‘computations required to sample from locally-balanced proposals are trivially parallelizable’. This is also something very interesting to me and I hope to learn more about multi-core computations during my PhD.

In his discussions, the author makes reference to Multiple-Try Metropolis. The connection between LB proposals and MT proposals is not entirely obvious to me, but I intuitively agree with the author’s comment that the weight function in MT serves a very similar purpose as the balancing function term g\left(\frac{\pi(y)}{\pi(x)}\right) in LB.

Side note:
From my downloaded version of the paper, the transition kernel of the first equation on Page 1 is wrong, because the Q(x,dy) terms needs to be multiplied by its acceptance probability a(x,y).

References:

  •  Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.
  • Liu, J. S., Liang, F., & Wong, W. H. (2000). The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association95(449), 121-134.