## SRN – Markovian Score Climbing by Naesseth et al. from NeurIPS 2020

This Sunday Reading Notes post is about a recent article on variational inference (VI).

In variational inference, we can approximate a posterior distribution ${p(z \mid x)}$ by finding a distribution ${q(z ; \lambda^{\star})}$ that is the closest’ to ${p(z \mid x)}$ among a collection of functions ${Q = \{q(z;\lambda)\}}$. Once a divergence between ${p}$ and ${q}$ has been chosen, we can rely on optimization algorithms such as stochastic gradient descent to find ${\lambda^{\star}.}$

The exclusive’ Kullback-Leiber (KL) divergence has been popular in VI, due to the ease of working with an expectation with respect to the approximating distribution ${q}$. This article, however, considers the inclusive’ KL

$\displaystyle \mathbb{E}(p \| q) = \mathbb{E}_{p(z \mid x)} \left[ \log \frac{p(z \mid x)}{q(z ; \lambda)} \right].$

Minimizing ${\mathrm{KL}(p\| q)}$ is equivalent to minimizing the cross entropy ${L_{\mathrm{KL}} = \mathbb{E}_{p(z \mid )}[ - \log q(z ; \lambda)],}$ whose gradient is
$\displaystyle g_{\mathrm{KL}}(\lambda) := \nabla L_{KL}(\lambda) = \mathbb{E}_{p(z \mid x)}\left[- \nabla_{\lambda} \log q(z; \lambda)\right].$

If we can find unbiased estimates of ${g_{\mathrm{KL}}(\lambda)}$, then with a Robbins-Monroe schedule ${\{\varepsilon_k\}_{k=1}^{\infty}}$, we can use stochastic gradient descent to approximate ${\lambda^{\star}.}$

This article propose Markovian Score Climbing (MSC) as another way to approximate ${\lambda^{\star}}$. Given an Markov kernel ${M(z' \mid z;\lambda)}$ that leases the posterior distribution ${p(z \mid x)}$ invariant, one step of the MSC iterations operates as follows.

(*) Sample ${z_k \sim M( \cdot \mid z_{k-1}; \lambda_{k-1})}$.
(*) Compute the gradient ${\nabla \log q(z_k; \lambda_{k-1}).}$
(*) Set ${\lambda_{k} = \lambda_{k-1} + \varepsilon_k \nabla \log q(z_k; \lambda_{k-1}).}$

The authors prove that ${\lambda_k \to \lambda^{\star}}$ almost surely and illustrate it on the skew normal distribution. One advantage of MSC is that only one sample is required per ${\lambda}$ update. Also, the Markov kernel ${M(z' \mid z;\lambda)}$ provides a systematic way of incorporating information from current sample ${z_k}$ and current parameter ${\lambda_k}$. As the authors point out, one example of such a proposal is a conditional SMC update [Section 2.4.3 of Andrieu et al., 2010].

While this article definitely provides a general purpose VI method, I am more intrigued by the MCMC samples ${z_k}$. What can we say about the samples ${\{z_k\}}$? Can we make use of them?

References:

Naesseth, C. A., Lindsten, F., & Blei, D. (2020). Markovian score climbing: Variational inference with KL (p|| q). arXiv preprint arXiv:2003.10374.

Andrieu, C., Doucet, A., & Holenstein, R. (2010). Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342.

## SRN – evidence and cross-validation

A paper that has intrigued me recently is “On the marginal likelihood and cross-validation” by E. Fong and C.C.Holmes on Biometrika.

Model evidence appears as marginal likelihood or prior predictive in this paper:

$\displaystyle p_{\mathcal{M}}(y_{1:n}) = \int f_{\theta}(y_{1:n}) d\pi(\theta).$

As the paper seeks to establish the connection between model evidence and cross-validation (CV), which is based on predictive performance on held-out test sets, it notes that the log evidence relates to log posterior predictive probability

$\displaystyle \log p_{\mathcal{M}}(y_{1:n}) = \sum_{i=1}^{n} \log p_{\mathcal{M}}(y_i \mid y_{1:i-1}) = \sum_{i=1}^{n} \log \int f_{\theta}(y_i) d\pi(\theta \mid y_{1:i-1}).$

So log evidence can be interpreted as a predictive sequential scoring rule with score function
$\displaystyle s(y_i \mid y_{1:i-1}) = \log p_{\mathcal{M}}(y_i \mid y_{1:i-1}).$

The paper begins by arguing that model evidence is the unique scoring rule that guarantees coherency (data ordering should not change the result of inference when data are exchangeable).

Then the authors move on to show the equivalence between evidence and cumulative cross-validation scores. When we consider leave-${p}$-out CV, there are ${\binom{n}{p}}$ number of held-out test sets and each set has some predictive score when the rest of the data is used for training. The leave-${p}$-out CV score, denoted by ${S_{CV}(y_{1:n};p)}$, is the average of these predictive scores. When the log posterior predictive probability is used as the scoring rule, then we must have

$\displaystyle \log p_{\mathcal{M}}(y_{1:n}) = \sum_{p=1}^n S_{CV}(y_{1:n};p).$

So the log evidence is also the sum of leave-${p}$-out CV scores, for all values of ${p = 1,\ldots,n.}$

## SRN – Bayesian regret of Thompson Sampling

Hello 2021! This is my second Sunday Reading Notes post of the year. This week I have the chance of reading about Thompson Sampling, a topic I have been interested in since 2018. The paper “On Thompson Sampling for Smoother-than-Lipschitz Bandits” by Grant and Leslie, from Lancaster University, concerns the bounds on the regret for Thompson Sampling for continuum bandits. The regret bound is in terms of eluder dimension and ball-width function of the function class $\mathcal{F}$ of the reward function $f$.

Both eluder dimension and ball-width function are new concepts to me. I was able to read more about them in Russo and Roy (2014) and was intrigued by the concept of $\epsilon$-eluder dimension, which is a measure of complexity of a set of functions. An action $a$ is independent of actions $\{a_1,\ldots,a_n\}$ with respect to $\mathcal{F}$ if we can find two functions in $\mathcal{F}$ can have similar function values at $\{a_1,\ldots,a_n\}$ but very different function values at $a$. Based on this notion of $\epsilon$-independence, we can define the notion of $\epsilon$-eluder dimension in a similar fashion to how we define dimension based on linear independence in linear algebra.

A key assumption of this framework is that the model is well-specified. The true reward function $f$ is not only included in the function class $\mathcal{F}$, but also a sample from the posterior distribution $p_0$. Being a Bayesian statistician, I am naturally curious about the ‘miss-specified’ case. Will the regret bound decompose in terms of distance from $f$ to $\mathcal{F}$ and complexity of $\mathcal{F}$. Is the analysis feasible and what techniques should we use?

Reference:

Grant, J. & Leslie, D.. (2020). On Thompson Sampling for Smoother-than-Lipschitz Bandits. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:2612-2622

Russo, D., & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research39(4), 1221-1243.

## SRN – the magical 0.574

Hello 2021! This week, I keep reading about optimal scaling of Markov chain Monte Carlo algorithms and I move on to a new class of algorithms: MALA. This time, the magical number is 0.574.

Hello 2021! Here is the first post in the Sunday Reading Notes series in the year of 2021. This week, I keep reading about optimal scaling of Markov chain Monte Carlo algorithms and I move on to a new class of algorithms: MALA.

Roberts and Rosenthal (1998, JRSSB) concerns the optimal scaling of Metropolis-Adjusted Langevin algorithm (MALA) for target distributions of the form ${\pi^d(x) = \prod_{i=1}^d f(x_i)}$. Unlike random walk Metropolis algorithm, in MALA, the proposal distribution uses the gradient of ${\pi^d}$ at ${x_t}$:

$\displaystyle x^{\star} = x_t + \sigma_n Z_{t+1} + \frac{\sigma^2}{2} \nabla \log (\pi^d(x_t)).$

Here ${\sigma_n}$ is the step variance and ${Z_{t+1} \sim \mathcal{N}(0,I_d)}$ is a d-dimensional standard normal. Following this proposal, we perform an accept / reject step, so that the resulting Markov chain has the desired stationary distribution:
$\displaystyle \alpha_d(x_t,x^{\star}) = \min \left\{ 1, \frac{\pi^d(x^{\star})}{\pi^d(x_t)} \frac{q_d(x^{\star},x_t)}{q_d(x_t,y_{t+1})}\right\}.$

There are some smoothness assumptions on the log-density ${g = \log f}$: absolute value of the first 8 order derivatives are bounded, ${g'}$ is Lipschitz, and all moments exist for ${f}$.

Theorem 3 (diffusion limit of first component)

Consider the chain ${\{X\}}$ starting from stationarity and following the MALA algorithm with variance ${\sigma_d^2 = l^2 n^{-1/3}}$ for some ${l > 0}$. Let ${\{J\}}$ be a Poisson process with rate ${n^{1/3}}$, ${\Gamma^n_t = X_{J_t}}$ and ${U^d}$ is the first component of ${\Gamma^n}$, i.e. ${U^d_t = X^d_{J_t,1}}$. We must have, as ${d \to \infty}$, the process ${U^d \to U}$, where ${U}$ is a Langevin diffusion defined by
$\displaystyle dU_t = \sqrt{h(l)} dB_t + \frac{1}{2} h(l) \frac{d}{dx} \log (\pi_1(U_t))dt.$

Here ${h(l) = 2 l^2 \Phi(-Kl^3/2)}$ is the speed measure and ${K = \sqrt{\mathbb{E} \left[\frac{5g'''(X)^3 - 3g''(X)}{48}\right]} > 0}$.
The skeleton of the proof is similar to that of proving diffusion limit of random walk Metropolis. We will find sets ${F_d^* \subseteq \mathbb{R}^d}$ such that ${\mathbb{P}\left(\Gamma_s^d \in F_d^{\star} \ \forall \ 0 \le s \le t\right) \to 1}$ and on these sets the generators converges uniformly.

Theorem 4 (convergence of acceptance rate)
$\displaystyle \lim_{d \to \infty} \alpha_d(l) = \alpha(l)$ where ${a(l) = 2 \Phi(-Kl^3/2).}$
As a result, when ${h(l)}$ is maximized, the acceptance rate ${\alpha(l) \approx 0.574}$. Also the step variance should be tuned to be ${\sigma_d^2 = (\hat{l}_d)^2 n^{-1/3}.}$

MALA enjoys a faster convergence rate of order ${n^{1/3}}$ compared to order ${n}$ of RWM. But let’s also keep in mind that MALA requires an extra step to compute the gradients ${\log \pi(x) = \sum_{i=1}^n g'(x)}$, which takes ${n}$ steps to compute.

Reference:

Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology)60(1), 255-268.

## SRN – about the magical 0.234 acceptance rate

Sunday Reading Notes series is back : Let’s understand the magical rule of ‘tuning your MH algorithm so that the acceptance rate is roughly 25%’ together!

‘Tune your MH algorithm so that the acceptance rate is roughly 25%’ has been general advice given to students in Bayesian statistics classes. It has been almost 4 years since I first read about it from the book Bayesian Data Analysis, but I never read the original paper where this result first appeared. This Christmas, I decided to read the paper ‘Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms’ by Roberts, Gelman and Gilks and to resume by Sunday Reading Notes series with a short exposition of this paper.

In Roberts, Gelman and Gilk (1997), the authors obtain a weak convergence result for the sequence of algorithms targeting the sequence of distributions ${\pi_d(x^d) = \prod_{i=1}^{d} f(x_i^d)}$ converging to a Langevin diffusion. The asymptotic optimal scaling problem becomes a matter optimizing the speed of the Langevin diffusion, and it is related to the asymptotic acceptance rate of proposed moves.

A one-sentence summary of the paper would be

if you have a d-dimensional target that is independent in each coordinate, then choose the step size of random walk kernel to be 2.38 / sqrt(d) or tune your acceptance rate to be around 1/4.

Unfortunately, in practice the ‘if’ condition is often overlooked and people are tuning the acceptance rate to be 0.25 as long as the proposal is random walk, no matter what the target distribution is. It has been 20 years since the publication of the 0.234 result and we are witnessing the use of MCMC algorithms on more complicated target distributions, for example parameter inference for state-space models. I feel that this is good time that we revisit and appreciate the classical results while re-educating ourselves on their limitations.

Reference:

Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability7(1), 110-120.

——–TECHNICAL EXPOSITION——-

Assumption 1 The marginal density of each component $f$ is such that ${f'/f}$ is Lipschitz continuous and

$\displaystyle \mathbb{E}_f\left[\left(\frac{f'(X)}{f(X)}\right)^8\right] = M < \infty, \ \ \ \ \ (1)$

$\displaystyle \mathbb{E}_f\left[\left(\frac{f''(X)}{f(X)}\right)^4\right] < \infty. \ \ \ \ \ (2)$

Roberts et al. (1997) considers random walk proposal ${y^d - x^d \sim \mathcal{N}(0,\sigma_d I_d)}$ where ${\sigma_d^2 = l^2 / (d-1).}$ We use ${X^d = (X_0^d,X_1^d,\ldots)}$ to denote the Markov chain and define another Markov process ${(Z^d)}$ with ${Z_t^d = X_{[dt]}^d}$, which is the speed-up version of ${X^d}$. Let ${[a]}$ denote the floor of ${a \in \mathbb{R}}$. Define ${U^d_t= X^d_{[dt],1}}$, the first component of ${X_{[dt]}^d = Z^d_t}$.

Theorem 1 (diffusion limit of first component) Suppose ${f}$ is positive and in ${\mathbb{C}^2}$ and that (1)-(2) hold. Let ${X_0^{\infty} = (X^1_{0,1},X^{2}_{0,2},\ldots)}$ be such that all components are distributed according to ${f}$ and assume ${X^{i}_{0,j} = X^{j}_{0,j}}$ for all ${i \le j}$. Then as ${d \to \infty}$,
$\displaystyle U^d \to U.$

The ${U_0 \sim f}$ and ${U}$ satisfies the Langevin SDE
$\displaystyle dU_t = (h(l))^{1/2}dB_t + h(l)\frac{f'(U_t)}{2f(U_t)}dt \ \ \ \ \ (3)$

and
$\displaystyle h(l) = 2 l^2 \Phi(-l\sqrt{I}/2)$

with ${\Phi}$ being the standard normal cdf and
$\displaystyle I = \mathbb{E}_f\left[\left(f'(X)/ f(X)\right)^2\right].$

Here ${h(l)}$ is the speed measure of the diffusion process and the most efficient’ asymptotic diffusion has the largest speed measure. ${I}$ measures the roughness’ of ${f}$.

Example 1 If ${f}$ is normal, then ${f(x) = (2\pi\sigma^2_f)^{-1/2}\exp(-x^2/(2\sigma_f^2)).}$
$\displaystyle I = \mathbb{E}_f\left[\left(f'(x) / f(x) \right)^2\right] = (\sigma_f)^{-4}\mathbb{E}_f\left[x^2\right] = 1/\sigma^2_f.$

So when the target density ${f}$ is normal, then the optimal value of ${l}$ is scaled by ${1 / \sqrt{I}}$, which coincides with the standard deviation of ${f}$.

Proof: (of Theorem 1.1) This is a proof sketch. The strategy is to prove that the generator of ${Z^n}$, defined by

$\displaystyle G_n V(x^n) = n \mathbb{E}\left[\left(V(Y^n) - V(x^n)\right) \left( 1 \wedge \frac{\pi_n(Y^n)}{\pi_n(x^n)}\right)\right].$

converges to the generator of the limiting Langevin diffusion, defined by
$\displaystyle GV(x) = h(l) \left[\frac{1}{2} V''(x) + \frac{1}{2} \frac{d}{dx}(\log f)(x) V'(x)\right].$

Here the function ${V}$ is a function of the first component only.
First define a set

$\displaystyle F_d = \{|R_d(x_2,\ldots,x_d) - I| < d^{-1/8}\} \cap \{|S_d(x_2,\ldots,x_d) - I| < d^{-1/8}\},$

where
$\displaystyle R_d(x_2,\ldots,x_d) = (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))'\right]^2$

and
$\displaystyle S_d(x_2,\ldots,x_d) = - (d-1)^{-1} \sum_{i=2}^d \left[(\log f(x_i))''\right].$

For fixed ${t}$, one can show that ${\mathbb{P}\left(Z^d_s \in F_d , 0 \le s \le t\right)}$ goes to 1 as ${d \to \infty}$. On these sets ${\{F_d\}}$, we have
$\displaystyle \sup_{x^d \in F_d} |G_d V(x^d) - G V(x_1)| \to 0 \quad \text{as } d \to \infty ,$

which essentially says ${G_d \to G}$, because we have uniform convergence for vectors contained in a set of limiting probability 1.
$\Box$

Corollary 2 (heuristics for RWMH) Let
$\displaystyle a_d(l) = \int \int \pi_d(x^d)\alpha(x^d,y^d)q_d(x^d,y^d)dx^d dy^d$

be the average acceptance rate of the random walk MH in ${d}$ dimensions.
We must have ${lim_{d\to\infty} a_d(l) \to a(l)}$ where ${a(l) = 2 \Phi(-l\sqrt{I}/2)}$.
${h(l)}$ is maximized by ${l = \hat{l} = 2.38 / \sqrt{I}}$ and ${a(\hat{l}) = 0.23}$ and ${h(\hat{l}) = 1.3 / I.}$
The authors consider two extensions of the target density ${\pi_d}$, where the convergence and optimal scaling properties will still hold. The first extension concerns the case where ${f_i}$‘s are different, but there is an law of large numbers on these density functions. Another extension concerns the case ${\pi_d(x^d) = f_1(x_1) \prod_{i=2}^{d} P(x_{i-1}, x_{i})}$, with some conditions on ${P}$.

## SRN – “Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets” by Cornish et al.

When faced with large datasets, standard MCMC methods have a cost of O(n) per step due to evaluating the log-likelihood for n data points. Can we make it faster?

My COVID-19 self-isolation plan includes trying to finish as many to-read articles as possible and to blog my progress on this blog as much as possible. This weekend, I read the paper “Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets” by Robert Cornish, Paul Vanetti, Alexandre Bouchard-Cȏté, George Deligiannidis, and Arnaud Doucet. This is an ICML 2019 paper and it is a dense one. It attempts to solve one problem. When faced with large datasets,  standard MCMC methods have a cost of O(n) per step due to evaluating the log-likelihood for n data points. Can we make it faster? In a typical MH step (with symmetric proposals), we accept the proposal with probability $\alpha(\theta,\theta') = \min(1, \frac{\pi(\theta')}{\pi(\theta)})$. This is typically implemented with

log_alpha <- log_post(theta_proposal) - log_post(theta_current);
if (log(runif(1)) < log_alpha ){
theta_current <- theta_proposal;
}
`

The authors realize that it is wasteful having to evaluate the log-likelihood at all data points before rejecting the proposal. Their solution, Salable Metropolis-Hastings (SMH), is based on a combination of three ideas: (1) factorization of posterior density (trivial factorization is 1 + n factors), (2) fast simulation of rejection using thinning of a Poisson point process, and (3) posterior concentration around mode when n is large.

I first get excited about this paper seeing the fast simulation of Bernoulli random variables (in Algorithm 1) because there is a very similar problem on simulating a Poisson process with inhomogenous rates in Stat 210, for which I was a Teaching Fellow (TF). I am very happy to see this result being useful in cutting-edge research. The fast simulation relies on access to a lower bound for each factor in the posterior distribution and the tighter this lower bound the more efficient this implementation. Suppose the target distribution has m factors, then once we pay a once-off O(m) cost, each acceptance step can be implemented in O(upper bound of negative-log-posterior) number of steps on average.

Another highlight of the paper is the smart application of the Bernstein-von Mises (BvM) phenomenon, that which states that posterior distribution concentrates around the mode with rate $1/\sqrt{n}$ as n goes to infinity. BvM was an important topic in Stat 213, which is another class that I TFed. I have seen BvM used (1) to select step-size in random-walk MH, (2) to initialize an MH chain, and (3) to motivate Laplace approximation of the posterior distribution. In this paper, it motivates a Taylor expansion of the potential function near posterior mode, which subsequently leads to the factorization behind the SMH kernel. The SMH kernel leaves the target distribution invariant, and, if not mistaken by me, this property explains the use of “exact” in the title. More importantly, since the acceptance probability in SMH is factorizable, we can implement the acceptance step with the aforementioned fast simulation, provided that log-density can be differentiated enough times. Intuitively, the higher the differentiability, the closer the Taylor expansion, the tighter the bound for acceptance rate and the fewer the number of evaluations we need. In fact, as later justified with Theorem 3.1, the average computational cost decays geometrically with respect to the order of Taylor expansion. With a second-order Taylor expansion, which requires 3rd-order differentiability, the expected computational cost behaves like $O(1/\sqrt{n})$. Hooray! I am always a big fan of “the bigger the problem, the easier the solution”.

I learn a lot from this paper, especially from the opening paragraph of Section 3 and Section 3.2. But there is a question that I keep asking myself: if n is very large and we have an estimate of posterior mode that is $O(1/\sqrt{n})$ distance from the true mode, how long does it take for SMH to be better than that declaring the posterior is Gaussian by invoking BvM and sleeping on the result?

## SRN – my second attempt to understand ‘Controlled SMC’ by Heng et al.

It has been two years since I last tried to understand the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng et al. Over the past two years, I picked this paper up many times, and every time I read it, I understood it a little better. I am finally ready to share my understanding of this paper on this blog.

It has been two years since I last tried to understand the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng et al. Over the past two years, I picked this paper up many times, and every time I read it, I understood it a little better. I am finally ready to share my understanding of this paper on this blog.

This paper opens with an introduction to the Feynman-Kac path measure. A triple of (1) initial distribution, (2) a collection of Markov kernels, and (3) potential functions determines a Feynman-Kac path measure. For readers familiar with state-space models, this is not something new: the triple can correspond to (1) the initial distribution of hidden process, (2) Markov kernels of the hidden process and (3) observation densities.

For each Feynman-Kac path measure, there is a sequential Monte Carlo algorithm (aka particle filters) that provides a particle approximation of the path measure. Furthermore, Feynman-Kac path measures can be twisted. Although all the kernels and the potentials are changed, the path measure is the same at the terminal time and the normalizing constant at the terminal time $Z_T$ stays the same.

To infer state-space models, we are interested in the smoothing distribution $x_{0:T}$ given $y_{0:T}$ or the marginal likelihood $p(Y_{0:T})$. The smoothing distribution has Feynman-Kac measure representations, with the marginal likelihood being the normalizing constant.  The most natural one is described above, and it corresponds to the bootstrap particle filter. But it is not unique. For example, the path measure associated with a 1-step look ahead kernel corresponds to the popular ‘fully-adapted auxiliary particle filter’. As mentioned above, each of these path measures can be twisted. Proposition 1 defines an optimal policy with respect to each path measure. Optimality is defined in the sense of approximating the normalizing constant almost surely. The optimal policy with respect to the path measure corresponding to the bootstrap particle filter is the backward information filter $\phi^{\star}(x_t) = p(y_{t:T}|x_t)$

The optimal policies $\phi^{\star}$ are usually intractable, but they can be approximated. After running any SMC algorithm, we have a particle approximation of the path measure, which gives us a particle approximation of the optimal policy. We can use regression to learn a simple policy $\phi$ that minimized the difference between $\log\phi$ and particle approximation of $\log\phi^{\star}$. This regression is done recursively starting at $t = T$ and ending at $t = 0$, and it is called approximate dynamic programming in the paper.  Given this policy, we have a new path measure, which means that we can run another particle filter according to it and find the optimal policy with respect to the new measure. This process is called policy refinement. According to the authors, in practice, only a few iterations of policy refinement is necessary. They also offer performance monitoring criteria based on effective sample size and regression residuals.

What fascinates (and puzzles) me is that errors do not accumulate over either iteration of policy refinement or approximate dynamic programming. I hope to read this paper is greater detail and be able to report back to my readers soon.

## SRN – Further and stronger analogy between sampling and optimization by Arnak Dalalyan

Lately, I have been pondering the connection between Monte Carlo based sampling methods and optimation methods. It brings me to read a paper that I have been wanting to read in detail long ago. In this week’s Sunday Reading Notes series on Phyllis with Data, I discuss the paper “Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent” by Arnak Dalalyan.

Lately, I have been pondering the connection between Monte Carlo based sampling methods and optimation methods. It brings me to read a paper that I have been wanting to read in detail long ago. In this week’s Sunday Reading Notes series on Phyllis with Data, I discuss the paper “Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent” by Arnak Dalalyan.

This paper focuses on the Langevin Monte Carlo algorithm, which I believe is also called the Unadjusted Langevin algorithm (ULA) in contrast to the Metropolis adjusted Langevin algorithm (MALA). Using the Wasserstein distance as the metric, this paper establishes an upper bound of the Wasserstein distance, more precisely it shows that one needs $O(\frac{d}{\epsilon^2}\log(\frac{d}{\epsilon}))$ number of iterations to reach the precision level . Considering the close connection between ULA and gradient descent algorithm for optimization, the author proves an upper bound for the L2 distance for gradient descent by cleverly utilizing a tempering argument (details in section 3 of the paper). The well-known result for optimization is that it takes $O(\log(\frac{1}{\epsilon}))$ iterations to reach precision in gradient descent. Intuitively sampling should be a more difficult task than optimization because it requires an exploration of the whole parameter space, which can be of high-dimension. The bounds in this paper confirm this intuition. More importantly, it points out how these two problems are “continuously” connected (as the temperature converges to 0) and that this connection naturally leads to the deduction of the optimization bound from the sampling bound.

Admittedly this is not a new paper as it was first arxived in 2016, there have been many papers tackling similar problems in sampling, for example using the KL-divergence (Cheng and Bartleet 2019) or extending the results to MALA (Dwivedi et al. 2019). When I first picked up this paper two years ago as a second-year Ph.D. student, I was intimidated by the proof, which in fact was only 6-pages long. This time, as I sat down and went through the proof line by line, I found it quite accessible and elegant. I have been equipped with the mathematical knowledge required to understand the proof in college. This time, I was defeated by my own fear. For me, this paper is not only a delightful read, but also a wake-up call to trust myself.

## SRN – Controlled Sequential Monte Carlo by Heng et al. (Part 1)

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet.

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet. You can read Xi’an’s blog post discussing this paper.

As pointed out in Xi’an’s blog, this is a massive paper, and I have only managed to go through Section 3. This paper starts with introducing Feynman-Kac models and twisted Feynman-Kac models, which is motivated by designing observations dependent kernels for sequential Monte Carlo (SMC) methods. At the terminal time $T$, the twisted model and original model are the same and at all times $t = 0,1,2,\cdots, T$ the two models have the same normalizing constant which we try to approximate with SMC. With a current policy $\psi$, there is a policy refinement $\psi^\star = \psi \cdot \phi^\star$ that is a admissible policy and $\psi^star$ is referred to as the optimal policy by the authors because of its connection with Kullback-Leibler optimal control problem.

The policy refinements can be found using backward recursion $\psi^\star_T = G_T^\phi$ and $\psi^\star_t = Q_t^\phi \phi^{\star}_{t+1}$ for $t = 0,\cdots, T-1$ where $Q$ are Bellman operators. Because the policies $\psi^\star$ are intractable, the authors introduce logarithmic projection as a means of approximation. This is the Approximate Dynamic Programming (ADP) algorithm that the authors outlined in Algorithm 2. The final algorithm (Algorithm 3) alternates between performing a twisted SMC and ADP to find the next policy refinement.

References:

Heng, J., Bishop, A. N., Deligiannidis, G., & Doucet, A. (2017). Controlled Sequential Monte Carlo. arXiv preprint arXiv:1708.08396.

## SRN – The Iterated Auxiliary Particle Filter by Guarniero et al.

In this week’s Sunday Reading Post, I want to discuss the paper ‘The Iterated Auxiliary Particle Filter’ by Pieralberto Guarniero, Adam Johansen and Antony Lee.  The algorithm proposed, iAPF, approximates an idealized particle filter, which would have zero variance in terms of estimating the marginal likelihood of a hidden Markov model. It is worth mentioning that finding if we were to use an idealized particle filter, we would only need one particle instead of a system of particles for marginal likelihood estimations and that the idealized particle filter would be the backward smoothing provided a perfect forward filtering step.

In this week’s Sunday Reading Post, I want to discuss the paper ‘The Iterated Auxiliary Particle Filter’ (iAPF) by Pieralberto Guarniero, Adam Johansen and Antony Lee.

The algorithm proposed, iAPF, approximates an idealized particle filter, which would have zero variance in terms of estimating the marginal likelihood of a hidden Markov model. The motivation is that if we were to use an idealized particle filter, we would only need one particle instead of a system of particles for marginal likelihood estimations. The idealized particle filter would be the backward-smoothing provided a perfect forward-filtering step.

This paper builds upon previous studies of particle filters, including twisted particle filter and the Auxiliary Particle Filter (APF). The idea is to introduce a family of ‘twisted’ models through functions $\psi$ in order to define a new model that would have the same marginal likelihood as the original model. Because the optimal policy $\psi^\star$,which corresponds to the idealized particle filter, is intractable in general cases, the authors use a series of functional approximations $\psi^0,\psi^1, \cdots, \psi^l$ to approximate $\psi^\star.$

The key the question to me is how to and how well can we approximate $\psi^\star$. The ‘how to’ qustions is addressed in Section 3.3. The series of backward recursions that are used to approximate $\psi^\star$ relies on the iterative relationship $\psi^\star(x_t) = p(y_t|x_t) f(x_t,\psi^\star_{t+1}) = p(y_t|x_t)\int_X f(x_{t},x_{t+1})\psi^{\star}_{t+1}(x_{t+1}) dx_{t+1}.$ In Algorithms 3, the authors suggest that we ‘choose $\psi^{t}$ on the basis of $\psi_{t}^{1:N}$ and $\psi_t^{1:N}.$ Later in Section 5, they mention that this step is implemented with a parametric optimization step for their experiments. In the ‘Discussions’ section, the authors mention alternatives such as nonparametric estimates which would require much higher cost. More importantly is the ‘how well’ part, because it is also possible that the class of functions $\Psi$ that we consider does not contain the optimal function $\psi$. I think this is a very interesting and importantly case and the authors report ‘fairly well’ performance of iAPF in this case.

As demonstrated in the ‘Applications and Examples’ section, “iAPF can provide substantially better estimates of the marginal likelihood than the Bootstrap Particle Filter (BPS) at the same computational cost.

References:

• Guarniero, P., Johansen, A. M., & Lee, A. (2017). The iterated auxiliary particle filter. Journal of the American Statistical Association112(520), 1636-1647.
• Pitt, M. K., & Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association94(446), 590-599
• Whiteley, N., & Lee, A. (2014). Twisted particle filters. The Annals of Statistics42(1), 115-141.