SRN – Markovian Score Climbing by Naesseth et al. from NeurIPS 2020

This Sunday Reading Notes post is about a recent article on variational inference (VI). 

In variational inference, we can approximate a posterior distribution {p(z \mid x)} by finding a distribution {q(z ; \lambda^{\star})} that is the `closest’ to {p(z \mid x)} among a collection of functions {Q = \{q(z;\lambda)\}}. Once a divergence between {p} and {q} has been chosen, we can rely on optimization algorithms such as stochastic gradient descent to find {\lambda^{\star}.}

The `exclusive’ Kullback-Leiber (KL) divergence has been popular in VI, due to the ease of working with an expectation with respect to the approximating distribution {q}. This article, however, considers the `inclusive’ KL

\displaystyle \mathbb{E}(p \| q) = \mathbb{E}_{p(z \mid x)} \left[ \log \frac{p(z \mid x)}{q(z ; \lambda)} \right].

Minimizing {\mathrm{KL}(p\| q)} is equivalent to minimizing the cross entropy {L_{\mathrm{KL}} = \mathbb{E}_{p(z \mid )}[ - \log q(z ; \lambda)],} whose gradient is
\displaystyle g_{\mathrm{KL}}(\lambda) := \nabla L_{KL}(\lambda) = \mathbb{E}_{p(z \mid x)}\left[- \nabla_{\lambda} \log q(z; \lambda)\right].

If we can find unbiased estimates of {g_{\mathrm{KL}}(\lambda)}, then with a Robbins-Monroe schedule {\{\varepsilon_k\}_{k=1}^{\infty}}, we can use stochastic gradient descent to approximate {\lambda^{\star}.}

This article propose Markovian Score Climbing (MSC) as another way to approximate {\lambda^{\star}}. Given an Markov kernel {M(z' \mid z;\lambda)} that leases the posterior distribution {p(z \mid x)} invariant, one step of the MSC iterations operates as follows.

(*) Sample {z_k \sim M( \cdot \mid z_{k-1}; \lambda_{k-1})}.
(*) Compute the gradient {\nabla \log q(z_k; \lambda_{k-1}).}
(*) Set {\lambda_{k} = \lambda_{k-1} + \varepsilon_k \nabla \log q(z_k; \lambda_{k-1}).}

The authors prove that {\lambda_k \to \lambda^{\star}} almost surely and illustrate it on the skew normal distribution. One advantage of MSC is that only one sample is required per {\lambda} update. Also, the Markov kernel {M(z' \mid z;\lambda)} provides a systematic way of incorporating information from current sample {z_k} and current parameter {\lambda_k}. As the authors point out, one example of such a proposal is a conditional SMC update [Section 2.4.3 of Andrieu et al., 2010].

While this article definitely provides a general purpose VI method, I am more intrigued by the MCMC samples {z_k}. What can we say about the samples {\{z_k\}}? Can we make use of them?


Naesseth, C. A., Lindsten, F., & Blei, D. (2020). Markovian score climbing: Variational inference with KL (p|| q). arXiv preprint arXiv:2003.10374.

Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342.

Online workshop: Measuring the quality of MCMC output

“Quality control” for MCMC methods is such an important but overlooked topic. Look forward to this workshop!


Yes, we need a better logo.

Hi all,

With Leah South from QUT we are organizing an online workshop on the topic of “Measuring the quality of MCMC output”. The event website is here with more info:

This is part of ISBABayesComp section’s efforts to organize activities while waiting for the next “big” in-person meeting, hopefully in 2023. The event benefits from the generous support of QUT Centre for Data Science. The event’s website will be regularly updated between now and the event in October 2021, with three live sessions:

  • 11am-2pm UTC on Wednesday 6th October,
  • 1pm-4pm UTC on Thursday 14th October,
  • 3pm-6pm UTC on Friday 22nd October.

Registration is free but compulsory (form here) as we want to make sure the live sessions remain convivial and focused; hence the rather specific theme, but it’s an exciting topic with lots of very much open questions…

View original post 74 more words