# SRN – Markovian Score Climbing by Naesseth et al. from NeurIPS 2020

This Sunday Reading Notes post is about a recent article on variational inference (VI).

In variational inference, we can approximate a posterior distribution ${p(z \mid x)}$ by finding a distribution ${q(z ; \lambda^{\star})}$ that is the closest’ to ${p(z \mid x)}$ among a collection of functions ${Q = \{q(z;\lambda)\}}$. Once a divergence between ${p}$ and ${q}$ has been chosen, we can rely on optimization algorithms such as stochastic gradient descent to find ${\lambda^{\star}.}$

The exclusive’ Kullback-Leiber (KL) divergence has been popular in VI, due to the ease of working with an expectation with respect to the approximating distribution ${q}$. This article, however, considers the `inclusive’ KL

$\displaystyle \mathbb{E}(p \| q) = \mathbb{E}_{p(z \mid x)} \left[ \log \frac{p(z \mid x)}{q(z ; \lambda)} \right].$

Minimizing ${\mathrm{KL}(p\| q)}$ is equivalent to minimizing the cross entropy ${L_{\mathrm{KL}} = \mathbb{E}_{p(z \mid )}[ - \log q(z ; \lambda)],}$ whose gradient is
$\displaystyle g_{\mathrm{KL}}(\lambda) := \nabla L_{KL}(\lambda) = \mathbb{E}_{p(z \mid x)}\left[- \nabla_{\lambda} \log q(z; \lambda)\right].$

If we can find unbiased estimates of ${g_{\mathrm{KL}}(\lambda)}$, then with a Robbins-Monroe schedule ${\{\varepsilon_k\}_{k=1}^{\infty}}$, we can use stochastic gradient descent to approximate ${\lambda^{\star}.}$

This article propose Markovian Score Climbing (MSC) as another way to approximate ${\lambda^{\star}}$. Given an Markov kernel ${M(z' \mid z;\lambda)}$ that leases the posterior distribution ${p(z \mid x)}$ invariant, one step of the MSC iterations operates as follows.

(*) Sample ${z_k \sim M( \cdot \mid z_{k-1}; \lambda_{k-1})}$.
(*) Compute the gradient ${\nabla \log q(z_k; \lambda_{k-1}).}$
(*) Set ${\lambda_{k} = \lambda_{k-1} + \varepsilon_k \nabla \log q(z_k; \lambda_{k-1}).}$

The authors prove that ${\lambda_k \to \lambda^{\star}}$ almost surely and illustrate it on the skew normal distribution. One advantage of MSC is that only one sample is required per ${\lambda}$ update. Also, the Markov kernel ${M(z' \mid z;\lambda)}$ provides a systematic way of incorporating information from current sample ${z_k}$ and current parameter ${\lambda_k}$. As the authors point out, one example of such a proposal is a conditional SMC update [Section 2.4.3 of Andrieu et al., 2010].

While this article definitely provides a general purpose VI method, I am more intrigued by the MCMC samples ${z_k}$. What can we say about the samples ${\{z_k\}}$? Can we make use of them?

References:

Naesseth, C. A., Lindsten, F., & Blei, D. (2020). Markovian score climbing: Variational inference with KL (p|| q). arXiv preprint arXiv:2003.10374.

Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342.

## Author: PhyllisWithData

Statistics PhD student at Harvard University.