SRN – Further and stronger analogy between sampling and optimization by Arnak Dalalyan

Lately, I have been pondering the connection between Monte Carlo based sampling methods and optimization methods. It brought me back to a paper that I have been meaning to read in detail for a long time. In this week’s Sunday Reading Notes series on Phyllis with Data, I discuss the paper “Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent” by Arnak Dalalyan.

This paper focuses on the Langevin Monte Carlo algorithm, which I believe is also called the unadjusted Langevin algorithm (ULA), in contrast to the Metropolis-adjusted Langevin algorithm (MALA). Using the Wasserstein distance as the metric, the paper establishes an upper bound on the distance between the ULA output and the target distribution: more precisely, it shows that one needs O(\frac{d}{\epsilon^2}\log(\frac{d}{\epsilon})) iterations to reach precision level \epsilon. Given the close connection between ULA and the gradient descent algorithm for optimization, the author also proves an upper bound on the L2 distance to the minimizer for gradient descent, by cleverly using a tempering argument (details in Section 3 of the paper). The well-known result for optimization is that gradient descent takes O(\log(\frac{1}{\epsilon})) iterations to reach precision \epsilon. Intuitively, sampling should be a harder task than optimization because it requires exploring the whole parameter space, which can be high-dimensional, and the bounds in this paper confirm this intuition. More importantly, the paper points out how the two problems are “continuously” connected (as the temperature converges to 0) and that this connection naturally leads to the deduction of the optimization bound from the sampling bound.
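To make the analogy concrete, here is a minimal toy sketch (my own, not from the paper) of the two update rules on a Gaussian target: gradient descent climbs the log-density deterministically, while ULA uses the same drift plus Gaussian noise of size \sqrt{2h} at every step. The target, step size h, and iteration count are arbitrary illustrative choices.

```python
# Toy comparison of gradient descent and the unadjusted Langevin algorithm (ULA)
# on a Gaussian target N(mu, I); all constants here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 2
mu = np.array([1.0, -2.0])           # target mean

def grad_log_p(theta):               # gradient of the log N(mu, I) density
    return -(theta - mu)

h = 0.1                              # step size
theta_gd = np.zeros(d)               # gradient descent iterate
theta_ula = np.zeros(d)              # ULA iterate

for _ in range(1000):
    # gradient descent: deterministic ascent on log p (i.e. descent on -log p)
    theta_gd = theta_gd + h * grad_log_p(theta_gd)
    # ULA: same drift plus Gaussian noise scaled by sqrt(2h)
    theta_ula = theta_ula + h * grad_log_p(theta_ula) \
        + np.sqrt(2 * h) * rng.standard_normal(d)

print("GD iterate (converges to the mode):", theta_gd)
print("ULA iterate (one approximate draw from the target):", theta_ula)
```

The only difference between the two loops is the injected noise, which is exactly the sense in which sampling and optimization are "continuously" connected: shrinking the temperature shrinks the noise and ULA degenerates into gradient descent.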

Admittedly this is not a new paper, as it was first posted on arXiv in 2016, and there have since been many papers tackling similar problems in sampling, for example using the KL divergence (Cheng and Bartlett, 2019) or extending the results to MALA (Dwivedi et al., 2019). When I first picked up this paper two years ago as a second-year Ph.D. student, I was intimidated by the proof, which in fact is only six pages long. This time, as I sat down and went through the proof line by line, I found it quite accessible and elegant. I had already acquired the mathematical knowledge required to understand the proof in college; back then, I was simply defeated by my own fear. For me, this paper is not only a delightful read, but also a wake-up call to trust myself.

SRN – Distilling Importance Sampling

In this week’s Sunday Reading Notes, I discuss the paper “Distilling Importance Sampling” by Dennis Prangle. It aims to find the normalising flow q that minimizes the distance to the target distribution p. DIS uses the inclusive Kullback–Leibler divergence (the KL divergence from the target distribution p to the approximating distribution q) in order to avoid over-concentration. “Distilling” refers to using a sequence of tempered distributions p_{\epsilon} that converges to p. In each step, DIS uses stochastic gradient descent to minimize the KL divergence from p_{\epsilon} to q, where the gradients are estimated with self-normalized importance sampling.
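To make the distilling step concrete, here is a minimal toy sketch (my own construction, not Prangle's code) of one DIS-style update with a one-dimensional Gaussian q: sample from q, compute self-normalized importance weights against a tempered target p_\epsilon, and take a stochastic gradient step on the inclusive KL. The tempering schedule, target, learning rate, and parameterisation are all assumptions made for illustration.

```python
# Toy DIS-style update: SGD on the inclusive KL(p_eps || q) with gradients
# estimated by self-normalized importance sampling; q is a 1-d Gaussian with
# parameters (m, log_s).  Everything here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)

def log_p_eps(x, eps):
    # illustrative tempered target: a Gaussian around 3 that widens with eps
    return -0.5 * (x - 3.0) ** 2 / (1.0 + eps)

def log_q(x, m, log_s):
    s = np.exp(log_s)
    return -0.5 * ((x - m) / s) ** 2 - log_s

def dis_step(m, log_s, eps, n=500, lr=0.05):
    s = np.exp(log_s)
    x = m + s * rng.standard_normal(n)               # sample from q
    logw = log_p_eps(x, eps) - log_q(x, m, log_s)     # unnormalized log weights
    w = np.exp(logw - logw.max())
    w /= w.sum()                                      # self-normalized weights
    # per-sample gradients of log q with respect to (m, log_s)
    g_m = (x - m) / s ** 2
    g_ls = ((x - m) / s) ** 2 - 1.0
    # grad KL(p_eps || q) ~ -sum_i w_i * grad log q(x_i); descent step on KL
    m += lr * np.sum(w * g_m)
    log_s += lr * np.sum(w * g_ls)
    return m, log_s

m, log_s = 0.0, 0.0
for eps in np.linspace(1.0, 0.0, 200):                # anneal eps toward 0
    m, log_s = dis_step(m, log_s, eps)
print("fitted mean and std:", m, np.exp(log_s))
```

The key identity is that the gradient of the inclusive KL with respect to the parameters of q is an expectation under p, which can be estimated with samples from q and self-normalized weights p(x)/q(x); in the real method, q is a normalising flow rather than a single Gaussian.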

Recently I have come across many papers involving both Monte Carlo methods and neural networks, and I have learned many new concepts along the way; among them is the normalising flow. A normalising flow is a bijective transformation applied to a simple base distribution such as a normal or uniform. This paper considers non-volume-preserving (NVP) normalising flows, particularly the coupling layer. A coupling layer transforms an input vector u into an output vector v by fixing d coordinates and shifting and scaling the remaining coordinates by amounts given by neural network outputs. The resulting family of densities can be sampled from rapidly and their gradients are computable.
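Here is a minimal sketch (assumed toy code, not the paper's implementation) of such a coupling layer: the first d coordinates pass through unchanged, and the remaining coordinates are scaled and shifted by functions of those fixed coordinates. The "neural networks" are replaced by tiny random linear maps with a tanh, just to keep the example self-contained.

```python
# Toy real-NVP-style coupling layer: fix the first d coordinates, scale and
# shift the rest using functions of the fixed coordinates.
import numpy as np

rng = np.random.default_rng(2)
D, d = 4, 2                               # total dims, number of fixed dims
W_s = rng.standard_normal((D - d, d))
W_t = rng.standard_normal((D - d, d))

def scale_net(u_fixed):                   # stand-in for a neural network s(.)
    return np.tanh(W_s @ u_fixed)

def shift_net(u_fixed):                   # stand-in for a neural network t(.)
    return np.tanh(W_t @ u_fixed)

def coupling_forward(u):
    u1, u2 = u[:d], u[d:]
    s, t = scale_net(u1), shift_net(u1)
    v = np.concatenate([u1, u2 * np.exp(s) + t])    # fix u1, scale/shift u2
    log_det_jac = s.sum()                           # log|det J| for the density
    return v, log_det_jac

def coupling_inverse(v):
    v1, v2 = v[:d], v[d:]
    s, t = scale_net(v1), shift_net(v1)
    return np.concatenate([v1, (v2 - t) * np.exp(-s)])

u = rng.standard_normal(D)
v, ldj = coupling_forward(u)
print("invertible:", np.allclose(coupling_inverse(v), u), "log|det J| =", ldj)
```

Because the scale and shift depend only on the coordinates that are left fixed, the Jacobian is triangular and its log-determinant is just the sum of the scale outputs, which is what makes the density of the flow cheap to evaluate.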

One question I have while reading this paper is about convergence guarantees: what does the distribution q^{\star} converge to? Numerical experiments comparing DIS output to MCMC output suggest that DIS gets very close to the target distribution, although, as Prangle acknowledges, the DIS output has less accurate tails than MCMC. Intuitively, the DIS output should converge to the argmin of the inclusive KL divergence from the target distribution over the family of normalising flows. But does the distilling procedure guarantee convergence, given the noisy gradient estimates and the tempering steps? I also keep wondering what the SMC counterpart of DIS would be.

Figure 3 from Prangle, 2019. Comparing MCMC output and DIS output for the M/G/1 queueing example.

Reference:

Prangle, D. (2019). Distilling importance sampling. arXiv preprint arXiv:1910.03632. https://arxiv.org/abs/1910.03632