SRN – Markovian Score Climbing by Naesseth et al. from NeurIPS 2020

This Sunday Reading Notes post is about a recent article on variational inference (VI). 

In variational inference, we can approximate a posterior distribution {p(z \mid x)} by finding a distribution {q(z ; \lambda^{\star})} that is the `closest’ to {p(z \mid x)} among a collection of functions {Q = \{q(z;\lambda)\}}. Once a divergence between {p} and {q} has been chosen, we can rely on optimization algorithms such as stochastic gradient descent to find {\lambda^{\star}.}

The `exclusive’ Kullback-Leiber (KL) divergence has been popular in VI, due to the ease of working with an expectation with respect to the approximating distribution {q}. This article, however, considers the `inclusive’ KL

\displaystyle \mathbb{E}(p \| q) = \mathbb{E}_{p(z \mid x)} \left[ \log \frac{p(z \mid x)}{q(z ; \lambda)} \right].

Minimizing {\mathrm{KL}(p\| q)} is equivalent to minimizing the cross entropy {L_{\mathrm{KL}} = \mathbb{E}_{p(z \mid )}[ - \log q(z ; \lambda)],} whose gradient is
\displaystyle g_{\mathrm{KL}}(\lambda) := \nabla L_{KL}(\lambda) = \mathbb{E}_{p(z \mid x)}\left[- \nabla_{\lambda} \log q(z; \lambda)\right].

If we can find unbiased estimates of {g_{\mathrm{KL}}(\lambda)}, then with a Robbins-Monroe schedule {\{\varepsilon_k\}_{k=1}^{\infty}}, we can use stochastic gradient descent to approximate {\lambda^{\star}.}

This article propose Markovian Score Climbing (MSC) as another way to approximate {\lambda^{\star}}. Given an Markov kernel {M(z' \mid z;\lambda)} that leases the posterior distribution {p(z \mid x)} invariant, one step of the MSC iterations operates as follows.

(*) Sample {z_k \sim M( \cdot \mid z_{k-1}; \lambda_{k-1})}.
(*) Compute the gradient {\nabla \log q(z_k; \lambda_{k-1}).}
(*) Set {\lambda_{k} = \lambda_{k-1} + \varepsilon_k \nabla \log q(z_k; \lambda_{k-1}).}

The authors prove that {\lambda_k \to \lambda^{\star}} almost surely and illustrate it on the skew normal distribution. One advantage of MSC is that only one sample is required per {\lambda} update. Also, the Markov kernel {M(z' \mid z;\lambda)} provides a systematic way of incorporating information from current sample {z_k} and current parameter {\lambda_k}. As the authors point out, one example of such a proposal is a conditional SMC update [Section 2.4.3 of Andrieu et al., 2010].

While this article definitely provides a general purpose VI method, I am more intrigued by the MCMC samples {z_k}. What can we say about the samples {\{z_k\}}? Can we make use of them?

References:

Naesseth, C. A., Lindsten, F., & Blei, D. (2020). Markovian score climbing: Variational inference with KL (p|| q). arXiv preprint arXiv:2003.10374.

Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342.

Author: PhyllisWithData

Statistics PhD student at Harvard University.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: