SRN – Bayesian Calibration of Microsimulation Models by Rutter et al.

The Sunday Reading Notes paper for this week is ‘Bayesian Calibration of Microsimulation Models’ by Carolyn Rutter, Diana Miglioretti and James Savarino. This is a 2009 JASA Applications and Case Studies paper.

According to a 2011 review paper by Rutter et al.,

Microsimulation models (MSMs) for health outcomes simulate individual event histories associated with key components of a disease process; these simulated life histories can be aggregated to estimate population-level effects of treatment on disease outcomes and the comparative effectiveness of treatments. Although MSMs are used to address a wide range of research questions, methodological improvements in MSM approaches have been slowed by the lack of communication among modelers. In addition, there are few resources to guide individuals who may wish to use MSM projections to inform decisions.

In this paper, the authors propose a Bayesian method to calibrate microsimulation models using Markov chain Monte Carlo. The case study is the history of colorectal cancer (CRC). The authors assume all CRCs arise from an adenoma, so the history of CRC consists of four components: 1) adenoma risk, 2) adenoma growth, 3) transition from adenoma to preclinical cancer and 4) transition from preclinical cancer to clinical cancer. These four components are not observed directly. The calibration data consist of the prevalence of adenomas and preclinical cancers, and the size and/or number of adenomas, taken from many (independent) studies that cover different years and subpopulations and use different colonoscopy methods.

The authors use $\theta$ to denote the MSM parameters, which can be separated into $K$ components with independent priors: $\pi(\theta) = \prod_{i=1}^K \pi_i(\theta_i).$ The calibration data come from $N$ independent sources, and each follows some distribution $y_j|\theta \sim f_j(g_j(\theta))$ parametrized by an unknown function $g_j$ of the MSM parameters. The likelihood is therefore $f(Y | \theta) = f(Y|g(\theta)) = \prod_{j=1}^N f_j(y_j|g_j(\theta))$.

What makes the calibration problem hard is the unknown function $g(\theta)$. Suppose we want to simulate from the posterior distribution of $\theta|Y$ using a Metropolis-Hastings (MH) algorithm. To move from the current value $\theta$ to a proposed value $\theta^*$ (with a symmetric proposal), we need to know the ratio $r(\theta,\theta^*) = \frac{\pi(\theta^*)f(Y|g(\theta^*))}{\pi(\theta)f(Y|g(\theta))}$. But because $g$ is unknown, the MH step cannot be performed.

So the authors propose an approximate MH algorithm that includes a step to estimate $g(\theta)$ for both the current value $\theta$ and the proposed value $\theta^*$. To me this feels like an ‘EM’ step: simulate $m$ copies $\tilde{y}_{ji}$, $i = 1, \dots, m$, from $f_j(y_j| g_j(\theta))$ and calculate the MLE $\hat{g}_j(\theta).$ With this approximation, the resulting transition probability function $\hat{\alpha}(\theta,\theta^*)$ for the Metropolis-within-Gibbs step on $\theta_i$ is based on $\frac{\pi_i(\theta_i^*)}{\pi_i(\theta_i)}\cdot \frac{\prod_{j=1}^N f_j(y_j|\hat{g}_j(\theta^*_i, \theta_{(-i)}))}{\prod_{j=1}^N f_j(y_j|\hat{g}_j(\theta))}.$ In the Appendix, the authors prove that this approximation satisfies the detailed balance condition in the limit as $m$ goes to infinity.
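To make the approximate MH step concrete, here is a minimal Python sketch on a toy problem. Everything in it is an assumption for illustration and not the authors' CRC model: a single parameter $\theta$, one calibration target (a binomial count `y` out of `n`), an Exponential(1) prior, and a toy 'microsimulation' whose true $g(\theta) = 1 - e^{-\theta}$ is hidden from the sampler. The names `simulate_g_hat` and `approximate_mh` are hypothetical.

```python
import math
import random


def simulate_g_hat(theta, m, rng):
    """Estimate g(theta) by simulating m individuals from a toy microsimulation.

    Toy assumption: each individual has the event with probability
    1 - exp(-theta). The sampler only ever sees the simulated proportion.
    """
    p = 1.0 - math.exp(-theta)
    hits = sum(1 for _ in range(m) if rng.random() < p)
    # Binomial MLE, nudged away from 0 and 1 so the log-likelihood stays finite.
    return min(max(hits / m, 0.5 / m), 1.0 - 0.5 / m)


def log_lik(y, n, p):
    """Binomial log-likelihood (up to a constant) for y events among n subjects."""
    return y * math.log(p) + (n - y) * math.log(1.0 - p)


def log_prior(theta):
    """Exponential(1) prior on theta > 0 (a hypothetical choice)."""
    return -theta if theta > 0 else -math.inf


def approximate_mh(y, n, n_iter=1000, m=500, step=0.3, seed=0):
    rng = random.Random(seed)
    theta = 1.0
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)  # symmetric random-walk proposal
        if proposal > 0:
            # Re-estimate g at BOTH the current and the proposed value.
            g_cur = simulate_g_hat(theta, m, rng)
            g_new = simulate_g_hat(proposal, m, rng)
            log_r = (log_prior(proposal) + log_lik(y, n, g_new)
                     - log_prior(theta) - log_lik(y, n, g_cur))
            if rng.random() < math.exp(min(0.0, log_r)):
                theta = proposal
        draws.append(theta)
    return draws


draws = approximate_mh(y=30, n=100)
posterior_mean = sum(draws[500:]) / len(draws[500:])
```

Larger `m` makes $\hat{g}$ less noisy, so the chain behaves more like exact MH, at the price of more simulation per iteration.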

I think this paper provides an interesting example of how to incorporate data from multiple sources. As the authors point out,

how closely the model should calibrate to observed data is unclear, especially when calibration data are variable and many provide conflicting interest. […] It depends on how modelers trade-off concerns about possibly overparameterizing and overfitting calibration data relative to the importance of exactly replicating observed or expected results.

References:

• Rutter, C. M., Miglioretti, D. L., & Savarino, J. E. (2009). Bayesian calibration of microsimulation models. Journal of the American Statistical Association, 104(488), 1338-1350.
• Rutter, C. M., Zaslavsky, A. M., & Feuer, E. J. (2011). Dynamic microsimulation models for health outcomes: a review. Medical Decision Making, 31(1), 10-18.

SRN – A Geometric Interpretation of the Metropolis-Hastings Algorithm by Billera and Diaconis

Coming back to the Sunday Reading Notes, this week I discuss the paper ‘A Geometric Interpretation of the Metropolis-Hastings Algorithm’ by Louis J. Billera and Persi Diaconis from Statistical Science. This paper was suggested to me by Joe Blitzstein.

In Section 4 of ‘Informed proposals for local MCMC in discrete spaces’ by Giacomo Zanella (see my SRN Part I and II), Zanella mentions that the Metropolis-Hastings acceptance probability function (APF) $\min\left(1,\frac{\pi(y)p(y,x)}{\pi(x)p(x,y)}\right)$ is not the only APF that makes the resulting kernel $\pi$-reversible: any APF for which detailed balance holds will do. This came as a ‘surprise’ to me at first, as I have never seen another APF in practice. But I quickly realized that this fact was mentioned in both Stat 213 & Stat 220 at Harvard, and that I had read about it in Section 5.3 – ‘Why Does the Metropolis Algorithm Work?’ of ‘Monte Carlo Strategies in Scientific Computing’ by Jun S. Liu. Unfortunately, I did not pay enough attention. Joe suggested this article to me after I posted on Facebook about being upset with not knowing such a basic fact.

In this paper, Billera and Diaconis focus on a finite state space $X$ and consider the MH kernel as the projection of stochastic matrices (all entries are non-negative and each row sums to 1; the set is denoted by $\mathcal{S}(X)$) onto the set of $\pi$-reversible Markov chains (stochastic matrices that satisfy detailed balance $\pi(x)M(x,y) = \pi(y)M(y,x)$; the set is denoted by $R(\pi)$). To make ‘projection’ precise, they introduce a metric on the stochastic matrices: $d(K,K') = \sum_{x} \sum_{y \neq x} \pi(x) |K(x,y)-K'(x,y)|$.

The key result in this paper is Theorem 1. The authors prove that the Metropolis map $M(K)(x,y) = \min\left( K(x,y), \frac{\pi(y)}{\pi(x)}K(y,x)\right)$ for $x \neq y$ minimizes the distance $d$ from the proposal kernel $K$ to $R(\pi).$ Moreover, $M(K)$ is the unique closest element in $R(\pi)$ that is coordinate-wise smaller than $K$ on its off-diagonal entries. So $M$ is in a sense the closest reversible kernel to the original kernel $K$.
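As a small numerical illustration of the Metropolis map and the metric $d$, here is a sketch in plain Python on a three-state example. The target `pi` and proposal kernel `K` are made-up numbers, and the function names are my own. The comparison with a 'lazy' kernel is only a sanity check on one alternative, not a verification of the theorem.

```python
def metropolis_map(K, pi):
    """Off-diagonal: M[x][y] = min(K[x][y], pi[y] / pi[x] * K[y][x])."""
    n = len(pi)
    M = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                M[x][y] = min(K[x][y], pi[y] / pi[x] * K[y][x])
        M[x][x] = 1.0 - sum(M[x])  # leftover mass stays at x
    return M


def distance(K1, K2, pi):
    """d(K, K') = sum over x != y of pi(x) * |K(x,y) - K'(x,y)|."""
    n = len(pi)
    return sum(pi[x] * abs(K1[x][y] - K2[x][y])
               for x in range(n) for y in range(n) if x != y)


def is_reversible(M, pi, tol=1e-12):
    """Check detailed balance pi(x) M(x,y) = pi(y) M(y,x) for all pairs."""
    n = len(pi)
    return all(abs(pi[x] * M[x][y] - pi[y] * M[y][x]) <= tol
               for x in range(n) for y in range(n))


def halve_off_diagonal(M):
    """Another pi-reversible kernel: move with half the probability of M."""
    n = len(M)
    L = [[0.5 * M[x][y] if x != y else 0.0 for y in range(n)] for x in range(n)]
    for x in range(n):
        L[x][x] = 1.0 - sum(L[x])
    return L


pi = [0.5, 0.3, 0.2]            # hypothetical target distribution
K = [[0.2, 0.5, 0.3],           # hypothetical proposal kernel
     [0.4, 0.2, 0.4],
     [0.6, 0.3, 0.1]]

M = metropolis_map(K, pi)
lazy = halve_off_diagonal(M)
d_metropolis = distance(K, M, pi)
d_lazy = distance(K, lazy, pi)
```

Both `M` and `lazy` satisfy detailed balance, but `M` sits closer to `K` in the metric $d$.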

I think this geometric interpretation offers great intuition about how the MH algorithm works: we start with a kernel $K$ and change it to another kernel with stationary distribution $\pi$. And the change must occur as follows:

from $x$, choose $y$ from $K(x,y)$ and decide to accept $y$ or stay at $x$; this last choice may be stochastic with acceptance probability $F(x,y) \in [0,1]$. This gives the new chain with transition probabilities: $K(x,y) F(x,y)$, $x \neq y$. The diagonal entries are changed so that each row sums to 1.

Indeed the above procedure describes how the MH algorithm works. If we insist on $\pi$-reversibility, we must have $0 \leq F(x,y) \leq \min(1,R(x,y))$ where $R(x,y) = \frac{\pi(y)K(y,x)}{\pi(x)K(x,y)}.$ So the MH choice of APF is the one that maximizes the chance of moving from $x$ to $y$. Among these reversible kernels, the resulting MH kernel $M$ has the largest spectral gap (1 – second largest eigenvalue) and by Peskun’s theorem must have the minimum asymptotic variance for estimating additive functionals.
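To see that the MH choice really is the largest admissible APF, one can build the kernel $K(x,y)F(x,y)$ for two different choices of $F$ and compare them. The sketch below uses plain Python with a made-up `pi` and `K`, and compares the MH acceptance $\min(1, R)$ with the Barker acceptance $R/(1+R)$. The helper names are hypothetical.

```python
def kernel_from_apf(K, pi, apf):
    """Build the chain with off-diagonal entries K(x,y) * F(x,y), F = apf(R(x,y))."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y and K[x][y] > 0:
                R = pi[y] * K[y][x] / (pi[x] * K[x][y])
                P[x][y] = K[x][y] * apf(R)
        P[x][x] = 1.0 - sum(P[x])  # diagonal fixed so the row sums to 1
    return P


def mh_apf(r):
    return min(1.0, r)


def barker_apf(r):
    return r / (1.0 + r)


def is_reversible(P, pi, tol=1e-12):
    n = len(pi)
    return all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
               for x in range(n) for y in range(n))


def dominates_off_diagonal(P, Q, tol=1e-12):
    """True if P(x,y) >= Q(x,y) for all x != y (the premise of Peskun ordering)."""
    n = len(P)
    return all(P[x][y] >= Q[x][y] - tol
               for x in range(n) for y in range(n) if x != y)


pi = [0.5, 0.3, 0.2]            # hypothetical target distribution
K = [[0.2, 0.5, 0.3],           # hypothetical proposal kernel
     [0.4, 0.2, 0.4],
     [0.6, 0.3, 0.1]]

P_mh = kernel_from_apf(K, pi, mh_apf)
P_barker = kernel_from_apf(K, pi, barker_apf)
```

Both kernels are $\pi$-reversible, and every off-diagonal entry of `P_mh` is at least as large as the corresponding entry of `P_barker`, which is the condition Peskun's theorem needs.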

In Remark 3.2, the authors point out that if we consider only APFs that are functions of $R(x,y)$, then the function must satisfy $g(x) = x g(1/x)$, which is exactly the defining property of balancing functions in Zanella’s ‘informed proposals’ paper.
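The condition $g(x) = x g(1/x)$ is easy to check numerically. The snippet below is a quick sketch in plain Python that tests a few candidate functions on a handful of points. The candidate list and the helper name are my own choices. Note that $\sqrt{x}$ satisfies the identity but exceeds 1, so it is a balancing function and not a valid APF.

```python
import math


def satisfies_balance(g, points, tol=1e-12):
    """Check g(t) == t * g(1/t) on the given positive points."""
    return all(abs(g(t) - t * g(1.0 / t)) <= tol * max(1.0, t) for t in points)


candidates = {
    "metropolis": lambda t: min(1.0, t),   # MH acceptance
    "barker": lambda t: t / (1.0 + t),     # Barker acceptance
    "sqrt": math.sqrt,                     # balancing function, but not bounded by 1
    "identity": lambda t: t,               # fails: t * g(1/t) = 1
}

points = [0.1, 0.5, 1.0, 2.0, 7.5]
results = {name: satisfies_balance(g, points) for name, g in candidates.items()}
```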

This paper allowed me to study the Metropolis-Hastings algorithm from another angle and to review facts I had neglected in my coursework.

References:

• Billera, L. J., & Diaconis, P. (2001). A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 335-339.
• Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.
• Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.

SRN – Informed proposals for local MCMC in discrete spaces by Zanella (Part II)

This is a super delayed Sunday Reading Notes post and today I come back to finish discussing the informed proposals paper by Giacomo Zanella. In Part I of my discussion, I reviewed point-wise informed proposals and the locally-informed criterion. In Part II, I hope to focus on the asymptotic optimality of locally informed proposals and on the simulation studies.

The author uses Peskun ordering to compare efficiency of MH schemes. He deduces that ‘locally-balanced proposals are asymptotically optimal in terms of Peskun ordering’ as dimensionality of the underlying state space increases. Conditions for asymptotic optimality are indeed mild (Proposition 1) for the three illustrative examples: 1) independent binary components, 2) weighted permutation and 3) Ising model.

In Section 4, Zanella points out a connection between the balancing function and the acceptance probability function (APF) for MH algorithms, which I find very interesting. He also shows that the optimal proposal for independent binary variables is the Barker choice $g(t) = \frac{t}{1+t}.$ The proof goes by finding the limiting continuous-time process of the MH chain and finding the optimal $g$ for the limiting process.
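To see what an informed proposal with a balancing function looks like in code, here is a rough Python sketch for the independent-binary-components example. It assumes the pointwise informed proposal takes the form $Q_g(x,y) \propto g\left(\frac{\pi(y)}{\pi(x)}\right)K(x,y)$ with $K$ uniform over single-bit flips, uses the Barker choice for $g$, and adds the usual MH correction. The function names and the marginal probabilities `p` are made up for illustration; this is not the author's code.

```python
import random


def barker(t):
    return t / (1.0 + t)


def flip_ratio(x, p, i):
    """pi(y) / pi(x) when y is x with bit i flipped and bits are independent Bernoulli(p[i])."""
    return (1.0 - p[i]) / p[i] if x[i] == 1 else p[i] / (1.0 - p[i])


def proposal_weights(x, p, g):
    """Unnormalised weights g(pi(y)/pi(x)) over the single-bit-flip neighbours of x."""
    return [g(flip_ratio(x, p, i)) for i in range(len(x))]


def informed_mh(p, g=barker, n_iter=20000, seed=0):
    rng = random.Random(seed)
    d = len(p)
    x = [0] * d
    counts = [0] * d
    for _ in range(n_iter):
        w_x = proposal_weights(x, p, g)
        i = rng.choices(range(d), weights=w_x)[0]   # informed choice of which bit to flip
        ratio = flip_ratio(x, p, i)
        y = list(x)
        y[i] = 1 - y[i]
        w_y = proposal_weights(y, p, g)
        q_xy = w_x[i] / sum(w_x)   # probability of proposing y from x
        q_yx = w_y[i] / sum(w_y)   # probability of proposing x from y
        if rng.random() < min(1.0, ratio * q_yx / q_xy):
            x = y
        for j in range(d):
            counts[j] += x[j]
    return [c / n_iter for c in counts]


p = [0.1, 0.5, 0.9, 0.3]          # hypothetical marginal probabilities
estimates = informed_mh(p)        # should be close to p
```

Because $g(t) = t\,g(1/t)$, the $g$ terms cancel against the target ratio in the acceptance probability, which leaves only the ratio of the two normalising constants.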

The simulation studies use the illustrative examples: weighted permutations and Ising models. The comparisons are in terms of 1) acceptance rate, 2) number of successful flips per computation time and 3) mixing of some summary statistics. The second criterion concerns the trade-off between computational cost (of calculating the informed proposal) and statistical accuracy (by producing efficient moves). For simple target distributions (such as Uniform), using locally-balanced proposals does not bring much benefit, but it achieves a much higher number of flips per unit time for more complicated and ‘rough’ targets. See the second row of Figure 1 of the paper.

I find it interesting that globally-balanced proposals (aka ‘naively-informed’ or $g(t) = t$) are extremely sensitive to initialization. Looking at the effective sample size (ESS) per time, the chains from GB have much more stable behavior if initialized from stationarity. See GB vs. GB (station.) in Figure 2. But in Figure 5, this phenomenon does not show up for Ising models: initializing from stationarity does not yield ESS performance comparable to that of LBs.

In the simulation studies section, the author emphasizes the cost-vs-efficiency trade-off, which I find very important. I feel I have ignored this aspect of designing MCMC algorithms and should think more about it in my future studies. The author indicates that ‘computations required to sample from locally-balanced proposals are trivially parallelizable’. This is also something very interesting to me and I hope to learn more about multi-core computations during my PhD.

In his discussion, the author makes reference to Multiple-Try Metropolis. The connection between LB proposals and MT proposals is not entirely obvious to me, but I intuitively agree with the author’s comment that the weight function in MT serves a very similar purpose to the balancing function term $g\left(\frac{\pi(y)}{\pi(x)}\right)$ in LB.

Side note:
In my downloaded version of the paper, the transition kernel in the first equation on Page 1 is wrong, because the $Q(x,dy)$ term needs to be multiplied by its acceptance probability $a(x,y).$

References:

• Zanella, G. (2017). Informed proposals for local MCMC in discrete spaces. arXiv preprint arXiv:1711.07424.
• Liu, J. S., Liang, F., & Wong, W. H. (2000). The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449), 121-134.