## SRN – Bayesian regret of Thompson Sampling

Hello 2021! This is my second Sunday Reading Notes post of the year. This week I have the chance of reading about Thompson Sampling, a topic I have been interested in since 2018. The paper “On Thompson Sampling for Smoother-than-Lipschitz Bandits” by Grant and Leslie, from Lancaster University, concerns the bounds on the regret for Thompson Sampling for continuum bandits. The regret bound is in terms of eluder dimension and ball-width function of the function class $\mathcal{F}$ of the reward function $f$.

Both eluder dimension and ball-width function are new concepts to me. I was able to read more about them in Russo and Roy (2014) and was intrigued by the concept of $\epsilon$-eluder dimension, which is a measure of complexity of a set of functions. An action $a$ is independent of actions $\{a_1,\ldots,a_n\}$ with respect to $\mathcal{F}$ if we can find two functions in $\mathcal{F}$ can have similar function values at $\{a_1,\ldots,a_n\}$ but very different function values at $a$. Based on this notion of $\epsilon$-independence, we can define the notion of $\epsilon$-eluder dimension in a similar fashion to how we define dimension based on linear independence in linear algebra.

A key assumption of this framework is that the model is well-specified. The true reward function $f$ is not only included in the function class $\mathcal{F}$, but also a sample from the posterior distribution $p_0$. Being a Bayesian statistician, I am naturally curious about the ‘miss-specified’ case. Will the regret bound decompose in terms of distance from $f$ to $\mathcal{F}$ and complexity of $\mathcal{F}$. Is the analysis feasible and what techniques should we use?

Reference:

Grant, J. & Leslie, D.. (2020). On Thompson Sampling for Smoother-than-Lipschitz Bandits. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:2612-2622

Russo, D., & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research39(4), 1221-1243.

## SRN – Racing Thompson by Zhou et al.

While browsing the accepted papers list of ICML 2018, I discovered this paper ‘Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors‘ by Zhou, Zhu, and Zhuo. Thompson sampling is a popular algorithm for exploration-exploitation tradeoff problems and is also known as Bayesian bandits. I decided to write my Sunday Reading Notes post on this paper because have been interested in the exploration-exploitation tradeoff for a while and explored this topic through Bayesian optimization and my WSDM’19 paper on sequential A/B testing.

Suppose we want to identify the best arm among $K$ arms and we have some prior knowledge about their rewards $\mu \sim \pi.$ Thompson sampling (TS) balances exploring unexplored arms getting rewards from arms already yielding high rewards by choosing the $k$th arm according to its the posterior probability of being the optimal arm $P_{it} = \pi\left( \mu_i = \max_j \mu_j \right).$ The computational challenge is to compute the probabilities $P_{it}$. Because TS is often used as an online algorithm, efficient calculation of the posterior probabilities is very important. In the conjugate prior case, this calculation is done in $O(K)$. With non-conjugate priors, I have seen in the literature that people using Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). The authors recognize this probability as an expectation and propose to use an Importance Sampling (IS) step combined with Gumbel-Max trick, which transforms the sampling problem to an optimization problem, to sample $k$ at time $k$ according to probabilities $\pi\left( \mu_i = \max_j \mu_j \right) = \mathbb{E}_{\mu\sim \pi(\cdot|X(1:t))}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \right] = \mathbb{E}_{\mu\sim B_t}\left[\mathbb{I}[\mu_k = \max_j {\mu_j}] \frac{\pi(\mu|X(1:t)}{B_t(\mu}\right].$

The benefits of this IS step comes from flexibility to choose $B_t$ at each time step and also the authors leveraged the stopping rule of racing algorithms to deterime the number of IS samples needed to approximate the expectation.

The resulting algorithm, which combines benefits from Importance Sampling, Gumble-Max trick, and the racing algorithm, is proved to be $(\delta,\sigma)-$PAC, which is asymptotic good in the sense the total variance distance between the true value $P_{it}$ and its estimate converges to zero.

What I find very interesting from the regret analysis section is the fact that the racing TS in this paper can provide much lower regret compared to Thompson sampling and prior-swapping (PS) even though it uses much few particles than SMC and PS. It is not intuitive to me why this should happen. But upon a little further investigation, I found that the priors used for TS and PS & Racing are different in both plots. For ease of implementation, the authors have chosen a conjugate prior for TS. This leaves me wondering what the results would be if we were to use MCMC or SMC with more particles as the baseline for the regret analysis.

References:

• Zhou, Y., Zhu, J. & Zhuo, J.. (2018). Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors. Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:6000-6008

## SRN – A near-optimal exploration-exploitation approach for assortment selection by Agrawal et al.

I am very interested in exploration-exploitation trade-off. In a previous Sunday Reading Notes post, I discussed bayesian optimization for learning optimal hyper-parameters for machine learning algorithms, as an example for this trade-off. Today I study another exploration-exploitation algorithm for learning the best assortment: A Near-Optimal Exploration-Exploitation Approach for Assortment Selection by Agrawal et al. It has applications in online advertisement display and recommendation systems for e-commerce.

If we take the recommendation system for e-commerce websites as an example. Suppose this website has $N$ products in total, and we can only recommend a subset $S$ to customers and this subset is at most size $K$.  This subset $S$ is an assortment and $K$ is a display constraint. When an assortment $S$ is offered to a customer, the probability of the customer purchasing the $i$-th item in the assortment is $p_i(S) \propto \nu_i$ where $i=0,...,|S|$ and $p_o(S)$ is the probability of not purchasing any item from $S$. Further more if the price of each item in this assortment is $r_i$ then the expected revenue from assortment $S$ is $R(S) = \sum_{i\in S}$. If the probabilities are known, then we will have a static assortment optimization problem of $S^\star = \max_{s \in S} R(S)$. But if the preferences are not known or if they change over time, then we would want to learn user preference among assortments (explore) and recommend the optimal assortment to them (exploitation), and this is called a (dynamic) assortment optimization problem.

In the optimization, we want to minimize the cumulative regret until time $T$ which is $reg(T) = \sum_{t=1}^T R(S^\star) - \mathbb{E}\left[R(S_t) \right]$. In this paper, the user preferences which are described through probabilities $p_i(S)$ are modeled with Multinomial Logistic model (MNL). With the MNL model, we have to estimate the model parameter $\nu_i$ with some multi-armed bandit algorithm to balance exploration and exploitation. Therefore the authors refer to this problem as bandit-MNL. The main contribution of this paper is Theorem 4.1, in which they give a regret bound of $O(\sqrt{NT} \log T + N \log^3T)$ with the proposed Algorithm 1.

Section 4 gives the proof of the main theorem and it is in my opinion very well-written. Proof outline is given in Section 4.1 and the authors marked the 3 steps of the proof, which are detailed later in Sections 4.2 – 4.4.

The exploration-exportation algorithm for bandit-MNL algorithm in Algorithm I iterations between learning the current best possible recommendation with using static assortment optimization methods. After recommending the calculated assortment to a customer, they observe purchasing decisions of customers until a no-purchase happens. Using information form user purchases, we can learn upper confidence bounds of model parameters $later \nu_{i}.$ The UCB estimates are used to find the assortment for the next iteration. This process of going back between offering assortments and observing purchase decisions continuous until $T$ steps set before starting the experiment. In  Section 4.2, the authors should that the UCBs converge to the true value with high probability. (Be careful this is not a convergence in probability!).

In my opinion, this paper lies on the theoretical end of the spectrum and their is no simulation results or real-data examples accompanying the theoretical results. It would be nice to see some visualization of the cumulative regret with this algorithm and how it compare to the theoretical lower bound presented in Section 5.

I have two idea of how this problems can be extended to accommodate for complex applied problems. For examples, 1)  item prices are assumed to be known for now. It would be interesting extension for this paper to see if we can treat $r_i$‘s as random variables or something we can learn to optimize. 2) In this algorithm we have to assume customers are homogenous. Personalized assortment optimization would be an interesting direction to explore.

References:

• Agrawal, Shipra, et al. “A near-optimal exploration-exploitation approach for assortment selection.” Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 2016.