SRN – Controlled Sequential Monte Carlo by Heng et al. (Part 1)

Returning to my Sunday Reading Notes series from Thanksgiving recess, I want to discuss the paper ‘Controlled Sequential Monte Carlo’ by Jeremy Heng, Adrian N. Bishop, George Deligiannidis and Arnaud Doucet. You can read Xi’an’s blog post discussing this paper.

As pointed out in Xi’an’s blog, this is a massive paper, and I have only managed to go through Section 3. The paper starts by introducing Feynman-Kac models and twisted Feynman-Kac models, which are motivated by the design of observation-dependent kernels for sequential Monte Carlo (SMC) methods. At the terminal time $T$, the twisted model and the original model are the same, and at all times $t = 0,1,2,\cdots, T$ the two models have the same normalizing constant, which we try to approximate with SMC. Given a current policy $\psi$, there is a policy refinement $\psi^\star = \psi \cdot \phi^\star$ that is an admissible policy, and $\psi^\star$ is referred to as the optimal policy by the authors because of its connection with a Kullback-Leibler optimal control problem.

The policy refinements can be found using the backward recursion $\phi^\star_T = G_T^\psi$ and $\phi^\star_t = Q_t^\psi \phi^{\star}_{t+1}$ for $t = 0,\cdots, T-1$, where the $Q_t^\psi$ are Bellman operators. Because the policies $\psi^\star$ are intractable, the authors introduce a logarithmic projection as a means of approximation. This is the Approximate Dynamic Programming (ADP) algorithm that the authors outline in Algorithm 2. The final algorithm (Algorithm 3) alternates between running a twisted SMC and running ADP to find the next policy refinement.
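To make the logarithmic projection a bit more concrete, here is a minimal sketch, assuming a one-dimensional state and a function class of exponentiated quadratics. Both are my choices for illustration and not necessarily the setup in the paper, and the helper name `fit_log_quadratic` is hypothetical. Under these assumptions, projecting in log space reduces to ordinary least squares on the log of the target values evaluated at the particles.

```python
import numpy as np

def fit_log_quadratic(particles, target_values):
    """Least-squares fit of log(target) by a + b*x + c*x^2 at the particle locations."""
    # Design matrix with columns 1, x, x^2.
    design = np.column_stack([np.ones_like(particles), particles, particles**2])
    coeffs, *_ = np.linalg.lstsq(design, np.log(target_values), rcond=None)
    return coeffs  # (a, b, c): the fitted function is exp(a + b*x + c*x^2)

rng = np.random.default_rng(0)
particles = rng.normal(size=200)  # stand-in for SMC particles (assumption)

# Toy positive target whose log is exactly quadratic, so the fit should be exact.
true_coeffs = np.array([0.5, -1.0, -2.0])
target_values = np.exp(
    true_coeffs[0] + true_coeffs[1] * particles + true_coeffs[2] * particles**2
)

fitted = fit_log_quadratic(particles, target_values)
print(fitted)
```

In this toy case the target lies inside the function class, so the coefficients are recovered up to numerical error. When the target is outside the class, the same regression returns the best approximation in the least-squares sense on the log scale.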

References:

Heng, J., Bishop, A. N., Deligiannidis, G., & Doucet, A. (2017). Controlled Sequential Monte Carlo. arXiv preprint arXiv:1708.08396.

SRN – The Iterated Auxiliary Particle Filter by Guarniero et al.

In this week’s Sunday Reading Post, I want to discuss the paper ‘The Iterated Auxiliary Particle Filter’ (iAPF) by Pieralberto Guarniero, Adam Johansen and Antony Lee.

The algorithm proposed, iAPF, approximates an idealized particle filter, which would have zero variance in terms of estimating the marginal likelihood of a hidden Markov model. The motivation is that if we were to use an idealized particle filter, we would only need one particle instead of a system of particles for marginal likelihood estimations. The idealized particle filter would be the backward-smoothing provided a perfect forward-filtering step.

This paper builds upon previous studies of particle filters, including twisted particle filters and the Auxiliary Particle Filter (APF). The idea is to introduce a family of ‘twisted’ models through functions $\psi$ in order to define a new model that has the same marginal likelihood as the original model. Because the optimal policy $\psi^\star$, which corresponds to the idealized particle filter, is intractable in general, the authors use a series of functional approximations $\psi^0,\psi^1, \cdots, \psi^l$ to approximate $\psi^\star.$

The key question to me is how, and how well, we can approximate $\psi^\star$. The ‘how to’ question is addressed in Section 3.3. The series of backward recursions used to approximate $\psi^\star$ relies on the iterative relationship $\psi^\star_t(x_t) = p(y_t|x_t) f(x_t,\psi^\star_{t+1}) = p(y_t|x_t)\int_X f(x_{t},x_{t+1})\psi^{\star}_{t+1}(x_{t+1}) dx_{t+1}.$ In Algorithm 3, the authors suggest that we ‘choose $\psi_{t}$ on the basis of $\xi_{t}^{1:N}$ and $\psi_t^{1:N}$’. Later, in Section 5, they mention that this step is implemented as a parametric optimization step in their experiments. In the ‘Discussions’ section, the authors mention alternatives such as nonparametric estimates, which would come at a much higher cost. More important is the ‘how well’ part, because it is also possible that the class of functions $\Psi$ that we consider does not contain the optimal function $\psi^\star$. I think this is a very interesting and important case, and the authors report that iAPF performs ‘fairly well’ in it.
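The backward recursion is easiest to see on a finite state space, where the integral becomes a matrix-vector product and $\psi^\star$ can be computed exactly. The following is a minimal sketch under that assumption. The two-state model and all of its numbers are made up for illustration, and the function names are hypothetical. It checks that integrating $\psi^\star_1$ against the initial distribution gives the same marginal likelihood as the standard forward algorithm.

```python
import numpy as np

def optimal_twisting_functions(transition, likelihoods):
    """Backward recursion psi*_t(x) = g_t(x) * sum_x' f(x, x') psi*_{t+1}(x'),
    started from psi*_{T+1} = 1. Row t of the result holds psi*_{t+1} (0-based)."""
    n_steps, n_states = likelihoods.shape
    psi = np.ones((n_steps + 1, n_states))  # last row is the terminal condition
    for t in range(n_steps - 1, -1, -1):
        psi[t] = likelihoods[t] * (transition @ psi[t + 1])
    return psi[:n_steps]

def forward_likelihood(initial, transition, likelihoods):
    """Marginal likelihood p(y_{1:T}) from the standard forward algorithm."""
    alpha = initial * likelihoods[0]
    for g in likelihoods[1:]:
        alpha = (alpha @ transition) * g
    return alpha.sum()

# Toy two-state hidden Markov model (illustrative numbers only).
initial = np.array([0.6, 0.4])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])            # f(x, x'), rows sum to one
likelihoods = np.array([[0.5, 0.1],
                        [0.3, 0.7],
                        [0.2, 0.6]])           # g_t(x) = p(y_t | x), one row per time

psi_star = optimal_twisting_functions(transition, likelihoods)
z_from_psi = initial @ psi_star[0]
z_forward = forward_likelihood(initial, transition, likelihoods)
print(z_from_psi, z_forward)
```

In continuous state spaces the product `transition @ psi[t + 1]` is the intractable integral, which is why iAPF replaces the exact recursion with function approximations fitted at particle locations.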

As demonstrated in the ‘Applications and Examples’ section, iAPF can provide substantially better estimates of the marginal likelihood than the Bootstrap Particle Filter (BPF) at the same computational cost.

References:

• Guarniero, P., Johansen, A. M., & Lee, A. (2017). The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520), 1636-1647.
• Pitt, M. K., & Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446), 590-599.
• Whiteley, N., & Lee, A. (2014). Twisted particle filters. The Annals of Statistics, 42(1), 115-141.

SRN – Racing Thompson by Zhou et al.

While browsing the accepted papers list of ICML 2018, I discovered the paper ‘Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors’ by Zhou, Zhu, and Zhuo. Thompson sampling is a popular algorithm for exploration-exploitation tradeoff problems and is also known as Bayesian bandits. I decided to write my Sunday Reading Notes post on this paper because I have been interested in the exploration-exploitation tradeoff for a while and have explored this topic through Bayesian optimization and my WSDM’19 paper on sequential A/B testing.

Suppose we want to identify the best arm among $K$ arms and we have some prior knowledge about their rewards $\mu \sim \pi.$ Thompson sampling (TS) balances exploring unexplored arms against collecting rewards from arms already yielding high rewards by choosing the $i$th arm according to its posterior probability of being the optimal arm, $P_{it} = \pi\left( \mu_i = \max_j \mu_j \mid X(1:t) \right).$ The computational challenge is to compute the probabilities $P_{it}$. Because TS is often used as an online algorithm, efficient calculation of the posterior probabilities is very important. In the conjugate prior case, this calculation is done in $O(K)$. With non-conjugate priors, I have seen people in the literature use Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). The authors recognize this probability as an expectation and propose to use an Importance Sampling (IS) step combined with the Gumbel-Max trick, which transforms the sampling problem into an optimization problem, to sample arm $i$ at time $t$ according to the probabilities $P_{it} = \mathbb{E}_{\mu\sim \pi(\cdot|X(1:t))}\left[\mathbb{I}[\mu_i = \max_j {\mu_j}] \right] = \mathbb{E}_{\mu\sim B_t}\left[\mathbb{I}[\mu_i = \max_j {\mu_j}] \frac{\pi(\mu|X(1:t))}{B_t(\mu)}\right].$
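As a small illustration of the Gumbel-Max trick on its own, here is a minimal sketch, assuming the arm probabilities are already known. This is not the authors' algorithm, where the probabilities are unknown and have to be estimated with IS. The probabilities below are made-up numbers and `gumbel_max_sample` is a hypothetical helper name. Adding independent Gumbel noise to the log-probabilities and taking the argmax yields a draw from the corresponding categorical distribution.

```python
import numpy as np

def gumbel_max_sample(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights)."""
    gumbel_noise = rng.gumbel(size=log_weights.shape)
    return int(np.argmax(log_weights + gumbel_noise))

rng = np.random.default_rng(1)
probs = np.array([0.2, 0.5, 0.3])  # stand-in for the P_{it} (illustrative values)

# Repeat the trick many times and compare empirical frequencies with probs.
draws = np.array([gumbel_max_sample(np.log(probs), rng) for _ in range(20000)])
empirical = np.bincount(draws, minlength=probs.size) / draws.size
print(empirical)
```

The appeal of this formulation is that choosing an arm becomes a comparison of perturbed scores, and comparing noisy scores is the kind of problem racing-style stopping rules are built for.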

The benefits of this IS step come from the flexibility to choose $B_t$ at each time step. The authors also leverage the stopping rule of racing algorithms to determine the number of IS samples needed to approximate the expectation.

The resulting algorithm, which combines benefits from Importance Sampling, the Gumbel-Max trick, and the racing algorithm, is proved to be $(\delta,\sigma)$-PAC, which is asymptotically good in the sense that the total variation distance between the true value $P_{it}$ and its estimate converges to zero.

What I find very interesting in the regret analysis section is that the racing TS in this paper can achieve much lower regret than Thompson sampling and prior-swapping (PS), even though it uses many fewer particles than SMC and PS. It is not intuitive to me why this should happen. Upon a little further investigation, I found that in both plots the prior used for TS differs from the one used for PS and Racing. For ease of implementation, the authors have chosen a conjugate prior for TS. This leaves me wondering what the results would be if we were to use MCMC, or SMC with more particles, as the baseline for the regret analysis.

References:

• Zhou, Y., Zhu, J., & Zhuo, J. (2018). Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors. Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:6000-6008.

SRN – Scalable Bayesian Inference for Excitatory Point Process Networks by Linderman and Adams

The topic of this week’s SRN is latent network discovery. The paper ‘Scalable Bayesian inference for excitatory point process networks’ by Scott Linderman and Ryan Adams was suggested to me by Gonzalo Mena.

In a multivariate Hawkes process, there are $K$ individual point processes, and events on one constituent process are more likely to happen right after an event on a neighboring process. In other words, events ‘induce’ subsequent events according to a network describing their interactions. Imagine a network of neurons: if neurons A and B are connected, then when neuron A fires it can excite neuron B, giving neuron B a higher chance of firing. When we want to infer connectivity, if neuron B consistently fires after neuron A fires, we might infer that there is likely an excitatory connection between them.

Linderman and Adams extend previous work on continuous-time Hawkes processes and study the discrete-time network Hawkes model. This discretization is motivated by computational concerns. In Simma and Jordan (2010), the authors introduce an auxiliary variable $z_{k,n}$ that denotes the origin of the $n$-th event on process $k$. The number of auxiliary variables needed is therefore large for highly active networks, which can be a computational bottleneck. By working in the discrete model with a time scale $\Delta t$, we can ‘bin’ events and ignore interactions within the same bin. This reduction in the number of auxiliary variables improves the run time of inference algorithms such as the Gibbs sampler.
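The binning step itself can be sketched in a few lines. This is a minimal illustration, assuming events are stored as parallel arrays of times and process indices. The function name `bin_events`, the data layout, and the example events are my assumptions and are not taken from the paper.

```python
import numpy as np

def bin_events(event_times, event_processes, n_processes, duration, dt):
    """Return an (n_bins, n_processes) array of event counts per time bin."""
    n_bins = int(np.ceil(duration / dt))
    counts = np.zeros((n_bins, n_processes), dtype=int)
    # Map each event time to its bin; clip so t == duration lands in the last bin.
    bin_index = np.minimum((event_times / dt).astype(int), n_bins - 1)
    # np.add.at accumulates correctly when several events share a (bin, process) cell.
    np.add.at(counts, (bin_index, event_processes), 1)
    return counts

# Five made-up events on K = 2 processes over one time unit.
event_times = np.array([0.05, 0.12, 0.14, 0.31, 0.95])
event_processes = np.array([0, 1, 1, 0, 1])

counts = bin_events(event_times, event_processes, n_processes=2, duration=1.0, dt=0.1)
print(counts)
```

The size of `counts` depends only on the number of bins and on $K$, not on the number of events, which is where the savings for highly active networks come from. The two events on process 1 at times 0.12 and 0.14 fall in the same bin, so any interaction between them is ignored.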

Besides the Gibbs sampler, Linderman and Adams also derive a stochastic variational inference algorithm for the discrete-time network Hawkes model. This work focuses on scaling to long durations $T$, and the problem of scaling with the size of the network $K$ remains open.

References:

• Linderman, S. W., & Adams, R. P. (2015). Scalable bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228.
• Linderman, S., & Adams, R. (2014, January). Discovering latent network structure in point process data. In International Conference on Machine Learning (pp. 1413-1421).
• Simma, A., & Jordan, M. I. (2012). Modeling events with cascades of Poisson processes. arXiv preprint arXiv:1203.3516.
• Steven Morse (2017, June). Python class for Hawkes Process. [Blog post] https://stmorse.github.io/journal/Hawkes-python.html.