## SRN – my second attempt to understand ‘Controlled SMC’ by Heng et al.

It has been two years since I last tried to understand the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng et al. Over the past two years, I picked this paper up many times, and every time I read it, I understood it a little better. I am finally ready to share my understanding of this paper on this blog.

It has been two years since I last tried to understand the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng et al. Over the past two years, I picked this paper up many times, and every time I read it, I understood it a little better. I am finally ready to share my understanding of this paper on this blog.

This paper opens with an introduction to the Feynman-Kac path measure. A triple of (1) initial distribution, (2) a collection of Markov kernels, and (3) potential functions determines a Feynman-Kac path measure. For readers familiar with state-space models, this is not something new: the triple can correspond to (1) the initial distribution of hidden process, (2) Markov kernels of the hidden process and (3) observation densities.

For each Feynman-Kac path measure, there is a sequential Monte Carlo algorithm (aka particle filters) that provides a particle approximation of the path measure. Furthermore, Feynman-Kac path measures can be twisted. Although all the kernels and the potentials are changed, the path measure is the same at the terminal time and the normalizing constant at the terminal time $Z_T$ stays the same.

To infer state-space models, we are interested in the smoothing distribution $x_{0:T}$ given $y_{0:T}$ or the marginal likelihood $p(Y_{0:T})$. The smoothing distribution has Feynman-Kac measure representations, with the marginal likelihood being the normalizing constant.  The most natural one is described above, and it corresponds to the bootstrap particle filter. But it is not unique. For example, the path measure associated with a 1-step look ahead kernel corresponds to the popular ‘fully-adapted auxiliary particle filter’. As mentioned above, each of these path measures can be twisted. Proposition 1 defines an optimal policy with respect to each path measure. Optimality is defined in the sense of approximating the normalizing constant almost surely. The optimal policy with respect to the path measure corresponding to the bootstrap particle filter is the backward information filter $\phi^{\star}(x_t) = p(y_{t:T}|x_t)$

The optimal policies $\phi^{\star}$ are usually intractable, but they can be approximated. After running any SMC algorithm, we have a particle approximation of the path measure, which gives us a particle approximation of the optimal policy. We can use regression to learn a simple policy $\phi$ that minimized the difference between $\log\phi$ and particle approximation of $\log\phi^{\star}$. This regression is done recursively starting at $t = T$ and ending at $t = 0$, and it is called approximate dynamic programming in the paper.  Given this policy, we have a new path measure, which means that we can run another particle filter according to it and find the optimal policy with respect to the new measure. This process is called policy refinement. According to the authors, in practice, only a few iterations of policy refinement is necessary. They also offer performance monitoring criteria based on effective sample size and regression residuals.

What fascinates (and puzzles) me is that errors do not accumulate over either iteration of policy refinement or approximate dynamic programming. I hope to read this paper is greater detail and be able to report back to my readers soon.

## SRN – Controlled Sequential Monte Carlo by Heng et al. (Part 1)

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet.

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet. You can read Xi’an’s blog post discussing this paper.

As pointed out in Xi’an’s blog, this is a massive paper, and I have only managed to go through Section 3. This paper starts with introducing Feynman-Kac models and twisted Feynman-Kac models, which is motivated by designing observations dependent kernels for sequential Monte Carlo (SMC) methods. At the terminal time $T$, the twisted model and original model are the same and at all times $t = 0,1,2,\cdots, T$ the two models have the same normalizing constant which we try to approximate with SMC. With a current policy $\psi$, there is a policy refinement $\psi^\star = \psi \cdot \phi^\star$ that is a admissible policy and $\psi^star$ is referred to as the optimal policy by the authors because of its connection with Kullback-Leibler optimal control problem.

The policy refinements can be found using backward recursion $\psi^\star_T = G_T^\phi$ and $\psi^\star_t = Q_t^\phi \phi^{\star}_{t+1}$ for $t = 0,\cdots, T-1$ where $Q$ are Bellman operators. Because the policies $\psi^\star$ are intractable, the authors introduce logarithmic projection as a means of approximation. This is the Approximate Dynamic Programming (ADP) algorithm that the authors outlined in Algorithm 2. The final algorithm (Algorithm 3) alternates between performing a twisted SMC and ADP to find the next policy refinement.

References:

Heng, J., Bishop, A. N., Deligiannidis, G., & Doucet, A. (2017). Controlled Sequential Monte Carlo. arXiv preprint arXiv:1708.08396.

## SRN – Racing Thompson by Zhou et al.

While browsing the accepted papers list of ICML 2018, I discovered this paper ‘Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors‘ by Zhou, Zhu, and Zhuo. Thompson sampling is a popular algorithm for exploration-exploitation tradeoff problems and is also known as Bayesian bandits. I decided to write my Sunday Reading Notes post on this paper because have been interested in the exploration-exploitation tradeoff for a while and explored this topic through Bayesian optimization and my WSDM’19 paper on sequential A/B testing.

Suppose we want to identify the best arm among $K$ arms and we have some prior knowledge about their rewards $\mu \sim \pi.$ Thompson sampling (TS) balances exploring unexplored arms getting rewards from arms already yielding high rewards by choosing the $k$th arm according to its the posterior probability of being the optimal arm $P_{it} = \pi\left( \mu_i = \max_j \mu_j \right).$ The computational challenge is to compute the probabilities $P_{it}$. Because TS is often used as an online algorithm, efficient calculation of the posterior probabilities is very important. In the conjugate prior case, this calculation is done in $O(K)$. With non-conjugate priors, I have seen in the literature that people using Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). The authors recognize this probability as an expectation and propose to use an Importance Sampling (IS) step combined with Gumbel-Max trick, which transforms the sampling problem to an optimization problem, to sample $k$ at time $k$ according to probabilities $\pi\left( \mu_i = \max_j \mu_j \right) = \mathbb{E}_{\mu\sim \pi(\cdot|X(1:t))}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \right] = \mathbb{E}_{\mu\sim B_t}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \frac{\pi(\mu|X(1:t)}{B_t(\mu}\right].$

The benefits of this IS step comes from flexibility to choose $B_t$ at each time step and also the authors leveraged the stopping rule of racing algorithms to deterime the number of IS samples needed to approximate the expectation.

The resulting algorithm, which combines benefits from Importance Sampling, Gumble-Max trick, and the racing algorithm, is proved to be $(\delta,\sigma)-$PAC, which is asymptotic good in the sense the total variance distance between the true value $P_{it}$ and its estimate converges to zero.

What I find very interesting from the regret analysis section is the fact that the racing TS in this paper can provide much lower regret compared to Thompson sampling and prior-swapping (PS) even though it uses much few particles than SMC and PS. It is not intuitive to me why this should happen. But upon a little further investigation, I found that the priors used for TS and PS & Racing are different in both plots. For ease of implementation, the authors have chosen a conjugate prior for TS. This leaves me wondering what the results would be if we were to use MCMC or SMC with more particles as the baseline for the regret analysis.

References:

• Zhou, Y., Zhu, J. & Zhuo, J.. (2018). Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors. Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:6000-6008