# SRN – Controlled Sequential Monte Carlo by Heng et al. (Part 1)

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet. You can read Xi’an’s blog post discussing this paper.

As pointed out in Xi’an’s blog, this is a massive paper, and I have only managed to go through Section 3. This paper starts with introducing Feynman-Kac models and twisted Feynman-Kac models, which is motivated by designing observations dependent kernels for sequential Monte Carlo (SMC) methods. At the terminal time $T$, the twisted model and original model are the same and at all times $t = 0,1,2,\cdots, T$ the two models have the same normalizing constant which we try to approximate with SMC. With a current policy $\psi$, there is a policy refinement $\psi^\star = \psi \cdot \phi^\star$ that is a admissible policy and $\psi^star$ is referred to as the optimal policy by the authors because of its connection with Kullback-Leibler optimal control problem.

The policy refinements can be found using backward recursion $\psi^\star_T = G_T^\phi$ and $\psi^\star_t = Q_t^\phi \phi^{\star}_{t+1}$ for $t = 0,\cdots, T-1$ where $Q$ are Bellman operators. Because the policies $\psi^\star$ are intractable, the authors introduce logarithmic projection as a means of approximation. This is the Approximate Dynamic Programming (ADP) algorithm that the authors outlined in Algorithm 2. The final algorithm (Algorithm 3) alternates between performing a twisted SMC and ADP to find the next policy refinement.

References:

Heng, J., Bishop, A. N., Deligiannidis, G., & Doucet, A. (2017). Controlled Sequential Monte Carlo. arXiv preprint arXiv:1708.08396.

# SRN – Racing Thompson by Zhou et al.

While browsing the accepted papers list of ICML 2018, I discovered this paper ‘Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors‘ by Zhou, Zhu, and Zhuo. Thompson sampling is a popular algorithm for exploration-exploitation tradeoff problems and is also known as Bayesian bandits. I decided to write my Sunday Reading Notes post on this paper because have been interested in the exploration-exploitation tradeoff for a while and explored this topic through Bayesian optimization and my WSDM’19 paper on sequential A/B testing.

Suppose we want to identify the best arm among $K$ arms and we have some prior knowledge about their rewards $\mu \sim \pi.$ Thompson sampling (TS) balances exploring unexplored arms getting rewards from arms already yielding high rewards by choosing the $k$th arm according to its the posterior probability of being the optimal arm $P_{it} = \pi\left( \mu_i = \max_j \mu_j \right).$ The computational challenge is to compute the probabilities $P_{it}$. Because TS is often used as an online algorithm, efficient calculation of the posterior probabilities is very important. In the conjugate prior case, this calculation is done in $O(K)$. With non-conjugate priors, I have seen in the literature that people using Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). The authors recognize this probability as an expectation and propose to use an Importance Sampling (IS) step combined with Gumbel-Max trick, which transforms the sampling problem to an optimization problem, to sample $k$ at time $k$ according to probabilities $\pi\left( \mu_i = \max_j \mu_j \right) = \mathbb{E}_{\mu\sim \pi(\cdot|X(1:t))}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \right] = \mathbb{E}_{\mu\sim B_t}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \frac{\pi(\mu|X(1:t)}{B_t(\mu}\right].$

The benefits of this IS step comes from flexibility to choose $B_t$ at each time step and also the authors leveraged the stopping rule of racing algorithms to deterime the number of IS samples needed to approximate the expectation.

The resulting algorithm, which combines benefits from Importance Sampling, Gumble-Max trick, and the racing algorithm, is proved to be $(\delta,\sigma)-$PAC, which is asymptotic good in the sense the total variance distance between the true value $P_{it}$ and its estimate converges to zero.

The regret of bandits with (b) Bernoulli bandits & non-conjugate prior and (c) Gaussian bandits & non-conjugate prior. Source: Figure 2 from Zhou et al.

What I find very interesting from the regret analysis section is the fact that the racing TS in this paper can provide much lower regret compared to Thompson sampling and prior-swapping (PS) even though it uses much few particles than SMC and PS. It is not intuitive to me why this should happen. But upon a little further investigation, I found that the priors used for TS and PS & Racing are different in both plots. For ease of implementation, the authors have chosen a conjugate prior for TS. This leaves me wondering what the results would be if we were to use MCMC or SMC with more particles as the baseline for the regret analysis.

References:

• Zhou, Y., Zhu, J. & Zhuo, J.. (2018). Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors. Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:6000-6008