SRN – Bayesian regret of Thompson Sampling

Hello 2021! This is my second Sunday Reading Notes post of the year. This week I had the chance to read about Thompson Sampling, a topic I have been interested in since 2018. The paper “On Thompson Sampling for Smoother-than-Lipschitz Bandits” by Grant and Leslie, from Lancaster University, establishes bounds on the Bayesian regret of Thompson Sampling for continuum-armed bandits. The regret bound is stated in terms of the eluder dimension and the ball-width function of the function class \mathcal{F} containing the reward function f.

Both the eluder dimension and the ball-width function are new concepts to me. I was able to read more about them in Russo and Van Roy (2014) and was intrigued by the concept of \epsilon-eluder dimension, which is a measure of the complexity of a set of functions. An action a is \epsilon-independent of actions \{a_1,\ldots,a_n\} with respect to \mathcal{F} if we can find two functions in \mathcal{F} that have similar function values at \{a_1,\ldots,a_n\} but very different function values at a. Based on this notion of \epsilon-independence, we can define the \epsilon-eluder dimension in a similar fashion to how dimension is defined through linear independence in linear algebra.
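
To make this precise, here is the definition as I understand it from Russo and Van Roy (2014), with notation lightly adapted. An action a is \epsilon-dependent on \{a_1,\ldots,a_n\} with respect to \mathcal{F} if

\displaystyle \sqrt{\sum_{i=1}^n \big(f(a_i) - \tilde{f}(a_i)\big)^2} \le \epsilon \ \Longrightarrow \ |f(a) - \tilde{f}(a)| \le \epsilon \quad \text{for all } f, \tilde{f} \in \mathcal{F},

and \epsilon-independent of \{a_1,\ldots,a_n\} otherwise. The \epsilon-eluder dimension of \mathcal{F} is then the length of the longest sequence of actions in which every action is \epsilon'-independent of its predecessors for some \epsilon' \ge \epsilon.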

A key assumption of this framework is that the model is well-specified: the true reward function f is not only included in the function class \mathcal{F}, but is also a draw from the prior distribution p_0. Being a Bayesian statistician, I am naturally curious about the ‘mis-specified’ case. Will the regret bound decompose into a term measuring the distance from f to \mathcal{F} and a term measuring the complexity of \mathcal{F}? Is the analysis feasible, and what techniques should we use?

Reference:

Grant, J., & Leslie, D. (2020). On Thompson Sampling for Smoother-than-Lipschitz Bandits. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 108:2612-2622.

Russo, D., & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4), 1221-1243.

SRN – the magical 0.574

Hello 2021! Here is the first post in the Sunday Reading Notes series in the year of 2021. This week, I keep reading about optimal scaling of Markov chain Monte Carlo algorithms and move on to a new class of algorithms: MALA. This time, the magical number is 0.574.

Roberts and Rosenthal (1998, JRSSB) concerns the optimal scaling of the Metropolis-Adjusted Langevin algorithm (MALA) for target distributions of the form {\pi^d(x) = \prod_{i=1}^d f(x_i)}. Unlike the random walk Metropolis algorithm, in MALA the proposal distribution uses the gradient of {\log \pi^d} at {x_t}:

\displaystyle x^{\star} = x_t + \sigma_d Z_{t+1} + \frac{\sigma_d^2}{2} \nabla \log (\pi^d(x_t)).

Here {\sigma_d^2} is the step variance and {Z_{t+1} \sim \mathcal{N}(0,I_d)} is a d-dimensional standard normal. Following this proposal, we perform an accept / reject step, so that the resulting Markov chain has the desired stationary distribution:
\displaystyle \alpha_d(x_t,x^{\star}) = \min \left\{ 1, \frac{\pi^d(x^{\star})}{\pi^d(x_t)} \frac{q_d(x^{\star},x_t)}{q_d(x_t,x^{\star})}\right\}.
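
To make the algorithm concrete, here is a minimal sketch of one MALA iteration (my own illustration, not code from the paper), assuming user-supplied functions log_pi and grad_log_pi for {\log \pi^d} and its gradient; the demo uses i.i.d. standard normal components and a step size in the {\sigma_d^2 = l^2 d^{-1/3}} regime.

import numpy as np

def mala_step(x, sigma, log_pi, grad_log_pi, rng):
    # Langevin proposal: x_star = x + sigma * Z + (sigma^2 / 2) * grad log pi(x)
    x_star = x + sigma * rng.standard_normal(x.shape[0]) + 0.5 * sigma**2 * grad_log_pi(x)
    # log q(frm, to): Gaussian proposal density centered at the Langevin drift (normalizing constants cancel)
    log_q = lambda frm, to: -np.sum((to - frm - 0.5 * sigma**2 * grad_log_pi(frm))**2) / (2 * sigma**2)
    # Metropolis-Hastings accept/reject keeps pi^d invariant
    log_alpha = log_pi(x_star) - log_pi(x) + log_q(x_star, x) - log_q(x, x_star)
    return (x_star, True) if np.log(rng.uniform()) < log_alpha else (x, False)

# demo: i.i.d. standard normal components, g(x) = -x^2/2, so the gradient of log pi is -x
d, rng = 1000, np.random.default_rng(0)
sigma = 1.65 * d ** (-1 / 6)   # sigma_d^2 = l^2 d^{-1/3}, with l near its optimum for this target
x, accepts = rng.standard_normal(d), 0
for _ in range(2000):
    x, acc = mala_step(x, sigma, lambda y: -0.5 * np.sum(y**2), lambda y: -y, rng)
    accepts += acc
print("empirical acceptance rate:", accepts / 2000)   # should land near 0.574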

There are some smoothness assumptions on the log-density {g = \log f}: the absolute values of the first eight derivatives of g are bounded, {g'} is Lipschitz, and all moments of {f} exist.

Theorem 3 (diffusion limit of first component)

Consider the chain {\{X\}} started from stationarity and following the MALA algorithm with step variance {\sigma_d^2 = l^2 d^{-1/3}} for some {l > 0}. Let {\{J\}} be a Poisson process with rate {d^{1/3}}, {\Gamma^d_t = X_{J_t}}, and let {U^d} be the first component of {\Gamma^d}, i.e. {U^d_t = X^d_{J_t,1}}. Then, as {d \to \infty}, the process {U^d} converges to {U}, where {U} is a Langevin diffusion defined by
\displaystyle dU_t = \sqrt{h(l)} dB_t + \frac{1}{2} h(l) \frac{d}{dx} \log (\pi_1(U_t))dt.

Here {h(l) = 2 l^2 \Phi(-Kl^3/2)} is the speed measure and {K = \sqrt{\mathbb{E} \left[\frac{5g'''(X)^2 - 3g''(X)^3}{48}\right]} > 0}.
The skeleton of the proof is similar to that for the diffusion limit of random walk Metropolis. We find sets {F_d^{\star} \subseteq \mathbb{R}^d} such that {\mathbb{P}\left(\Gamma_s^d \in F_d^{\star} \ \forall \ 0 \le s \le t\right) \to 1} and, on these sets, the generators converge uniformly.

Theorem 4 (convergence of acceptance rate)
\displaystyle \lim_{d \to \infty} \alpha_d(l) = \alpha(l), \quad \text{where } \alpha(l) = 2 \Phi(-Kl^3/2).
As a result, when {h(l)} is maximized, the acceptance rate is {\alpha(\hat{l}) \approx 0.574}, where {\hat{l}} denotes the maximizer. The step variance should therefore be tuned to {\sigma_d^2 = \hat{l}^2 d^{-1/3}.}
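
As a quick sanity check (my own, not from the paper), we can maximize {h(l)} numerically. For a standard normal {f} we have {g''(x) = -1} and {g'''(x) = 0}, so {K = 1/4}; the maximizer comes out around {\hat{l} \approx 1.65} and the acceptance rate at the optimum is about 0.574. In fact the optimal acceptance rate does not depend on {K} at all, since a rescaling of {l} absorbs {K}.

from scipy.optimize import minimize_scalar
from scipy.stats import norm

K = 0.25   # standard normal target: g'' = -1, g''' = 0, so K = sqrt(3/48) = 1/4

def h(l):
    # speed measure h(l) = 2 l^2 Phi(-K l^3 / 2)
    return 2 * l**2 * norm.cdf(-K * l**3 / 2)

# maximize h over l > 0 by minimizing -h on a bracketing interval
res = minimize_scalar(lambda l: -h(l), bounds=(1e-6, 10), method="bounded")
l_hat = res.x
print("optimal l:", l_hat)                                              # about 1.65 for this K
print("acceptance rate at optimum:", 2 * norm.cdf(-K * l_hat**3 / 2))   # about 0.574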

MALA enjoys a faster mixing rate, of order {d^{1/3}}, compared to order {d} for RWM. But let’s also keep in mind that each MALA iteration requires the extra step of computing the gradient {\nabla \log \pi^d(x)}, whose i-th component is {g'(x_i)}, which takes order {d} operations.

Reference:

Roberts, G. O., & Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1), 255-268.