Hello 2021! This is my second Sunday Reading Notes post of the year. This week I had the chance to read about Thompson Sampling, a topic I have been interested in since 2018. The paper “On Thompson Sampling for Smoother-than-Lipschitz Bandits” by Grant and Leslie, from Lancaster University, concerns regret bounds for Thompson Sampling on continuum-armed bandits. The regret bound is stated in terms of the eluder dimension and the ball-width function of the function class $\mathcal{F}$ containing the reward function.
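For context (and in notation of my own choosing, which may differ from the paper's), the quantity being bounded is the Bayesian regret of Thompson Sampling over $T$ rounds:

\[
\mathrm{BayesRegret}(T) \;=\; \mathbb{E}\Bigg[\sum_{t=1}^{T}\big(f^\star(a^\star) - f^\star(a_t)\big)\Bigg],
\]

where $f^\star$ is the true reward function, $a^\star \in \arg\max_{a} f^\star(a)$, $a_t$ is the action chosen at round $t$, and the expectation is taken over the prior on $f^\star$, the observation noise, and the algorithm's own randomization.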
Both the eluder dimension and the ball-width function are new concepts to me. I was able to read more about them in Russo and Van Roy (2014) and was intrigued by the concept of $\epsilon$-eluder dimension, which is a measure of the complexity of a set of functions. An action $a$ is $\epsilon$-independent of actions $a_1, \ldots, a_n$ with respect to $\mathcal{F}$ if we can find two functions in $\mathcal{F}$ that have similar function values at $a_1, \ldots, a_n$ but very different function values at $a$. Based on this notion of $\epsilon$-independence, we can define the $\epsilon$-eluder dimension in a similar fashion to how we define dimension based on linear independence in linear algebra.
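Paraphrasing Russo and Van Roy (2014), with their notation lightly adapted (see their paper for the precise statement): an action $a$ is $\epsilon$-dependent on actions $a_1, \ldots, a_n$ with respect to $\mathcal{F}$ if every pair $f, \tilde{f} \in \mathcal{F}$ satisfying

\[
\sqrt{\sum_{i=1}^{n} \big(f(a_i) - \tilde{f}(a_i)\big)^2} \;\le\; \epsilon
\]

also satisfies $|f(a) - \tilde{f}(a)| \le \epsilon$; the action $a$ is $\epsilon$-independent of $a_1, \ldots, a_n$ if it is not $\epsilon$-dependent on them. The $\epsilon$-eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is then the length of the longest sequence of actions in which every element is $\epsilon'$-independent of its predecessors for some $\epsilon' \ge \epsilon$.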
A key assumption of this framework is that the model is well-specified: the true reward function $f^\star$ is not only included in the function class $\mathcal{F}$, but is also a sample from the prior distribution over $\mathcal{F}$. Being a Bayesian statistician, I am naturally curious about the mis-specified case. Would the regret bound decompose in terms of the distance from $f^\star$ to $\mathcal{F}$ and the complexity of $\mathcal{F}$? Is the analysis feasible, and what techniques should we use?
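To make this question concrete for myself, here is a minimal Thompson Sampling simulation (my own sketch, not from the paper): a finite function class on a discretized action set, Gaussian reward noise, and exact posterior updates over the class. The helper `thompson_sampling`, the toy quadratic function class, and the noise level are all hypothetical choices for illustration; the mis-specified run simply uses a true reward function that lies outside the class.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(true_f, function_class, prior, T=2000, noise_sd=0.5):
    """Thompson Sampling over a finite function class on a finite action set.

    function_class: array (num_functions, num_actions), each row a candidate reward function.
    prior: array (num_functions,), prior probabilities over the candidates.
    Returns cumulative regret against the best action of true_f.
    """
    log_post = np.log(prior)          # unnormalized log posterior over candidates
    best_reward = true_f.max()
    regret = 0.0
    for _ in range(T):
        # sample a candidate reward function from the current posterior
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        f_sample = function_class[rng.choice(len(prior), p=post)]
        a = int(np.argmax(f_sample))  # act greedily w.r.t. the sampled function
        r = true_f[a] + noise_sd * rng.normal()
        # Gaussian-likelihood update for every candidate function
        log_post += -0.5 * ((r - function_class[:, a]) / noise_sd) ** 2
        regret += best_reward - true_f[a]
    return regret

# A toy function class: quadratic reward curves on a grid of actions.
actions = np.linspace(0.0, 1.0, 20)
function_class = np.array([1.0 - (actions - peak) ** 2 for peak in np.linspace(0.0, 1.0, 30)])
prior = np.full(len(function_class), 1.0 / len(function_class))

# Well-specified: the true reward function is a member of the class.
well_specified_f = function_class[12]
# Mis-specified: the true reward function lies outside the class.
mis_specified_f = 1.0 - np.abs(actions - 0.4)

print("regret (well-specified):", thompson_sampling(well_specified_f, function_class, prior))
print("regret (mis-specified): ", thompson_sampling(mis_specified_f, function_class, prior))
```

Comparing the two printed regrets gives a rough sense of what the mis-specification costs in this toy example; nothing here is meant to reproduce the paper's analysis.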
References:
Grant, J., & Leslie, D. (2020). On Thompson Sampling for Smoother-than-Lipschitz Bandits. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:2612-2622.
Russo, D., & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4), 1221-1243.