Category Archives: Exploration-Exploitation

SRN – Racing Thompson by Zhou et al.

While browsing the accepted papers list of ICML 2018, I discovered the paper ‘Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors‘ by Zhou, Zhu, and Zhuo. Thompson sampling is a popular algorithm for exploration-exploitation tradeoff problems and is also known as Bayesian bandits. I decided to write my Sunday Reading Notes post on this paper because I have been interested in the exploration-exploitation tradeoff for a while and have explored this topic through Bayesian optimization and my WSDM’19 paper on sequential A/B testing.

Suppose we want to identify the best arm among K arms and we have some prior knowledge about their rewards, \mu \sim \pi. Thompson sampling (TS) balances exploring unexplored arms against collecting rewards from arms already yielding high rewards by choosing the i-th arm at time t according to its posterior probability of being the optimal arm given the data X(1:t) observed so far, P_{it} = \pi\left( \mu_i = \max_j \mu_j | X(1:t) \right). The computational challenge is to compute the probabilities P_{it}. Because TS is often used as an online algorithm, efficient calculation of the posterior probabilities is very important. In the conjugate prior case, this calculation is done in O(K). With non-conjugate priors, I have seen people in the literature use Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC). The authors recognize this probability as an expectation and propose to use an Importance Sampling (IS) step with a proposal distribution B_t, combined with the Gumbel-Max trick, which transforms the sampling problem into an optimization problem, to sample arm i at time t according to the probabilities P_{it} = \mathbb{E}_{\mu\sim \pi(\cdot|X(1:t))}\left[\mathbb{I}[\mu_i = \max_j {\mu_j}] \right] = \mathbb{E}_{\mu\sim B_t}\left[\mathbb{I}[\mu_i = \max_j {\mu_j}] \frac{\pi(\mu|X(1:t))}{B_t(\mu)}\right].
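To make the importance sampling identity concrete, here is a minimal Python sketch (mine, not the authors' code). It assumes a toy 3-armed Bernoulli bandit whose posterior is treated as known only up to a normalizing constant, uses a uniform proposal in the role of B_t, and estimates the probabilities with self-normalized importance weights. The function names, the toy counts, and the fixed sample size are all my assumptions; the paper picks the number of samples adaptively with a racing stopping rule, which this sketch does not attempt.

```python
import numpy as np

def optimal_arm_probs_is(log_post_unnorm, sample_proposal, log_proposal, n_samples, rng):
    """Self-normalized IS estimate of P(arm i is optimal) for every arm i."""
    mu = sample_proposal(n_samples, rng)             # shape (n_samples, K)
    log_w = log_post_unnorm(mu) - log_proposal(mu)   # log importance weights
    w = np.exp(log_w - log_w.max())                  # subtract the max for stability
    w /= w.sum()                                     # self-normalize the weights
    best = mu.argmax(axis=1)                         # index of the best arm per sample
    return np.bincount(best, weights=w, minlength=mu.shape[1])

# Toy posterior (assumption): independent arms with density proportional to
# mu^successes * (1 - mu)^failures, used here only through its unnormalized form.
successes = np.array([2.0, 4.0, 1.0])
failures = np.array([4.0, 2.0, 1.0])

def log_post_unnorm(mu):
    return (successes * np.log(mu) + failures * np.log1p(-mu)).sum(axis=1)

def sample_uniform(n, rng):
    return rng.uniform(size=(n, len(successes)))     # proposal: uniform on [0, 1]^K

def log_uniform(mu):
    return np.zeros(mu.shape[0])                     # log density of the uniform proposal

rng = np.random.default_rng(0)
probs = optimal_arm_probs_is(log_post_unnorm, sample_uniform, log_uniform, 200_000, rng)
print(probs)  # estimated P(arm i is optimal); the entries sum to 1
```

Because the weights are self-normalized, only an unnormalized posterior density is needed, which is exactly the situation with a non-conjugate prior.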

The benefit of this IS step comes from the flexibility to choose the proposal B_t at each time step. The authors also leverage the stopping rule of racing algorithms to determine the number of IS samples needed to approximate the expectation.
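The Gumbel-Max trick itself is easy to check numerically. The sketch below (my own illustration, assuming NumPy) draws an index with probability proportional to given weights by adding independent Gumbel(0, 1) noise to the log-weights and taking the argmax; the weights do not need to be normalized. How the paper feeds its IS estimates into this argmax and races between arms is not reproduced here.

```python
import numpy as np

def gumbel_max_sample(log_weights, n_draws, rng):
    """Draw indices with probability proportional to exp(log_weights)."""
    log_weights = np.asarray(log_weights, dtype=float)
    noise = rng.gumbel(size=(n_draws, log_weights.size))  # i.i.d. Gumbel(0, 1) noise
    return np.argmax(log_weights + noise, axis=1)          # sampling becomes an argmax

rng = np.random.default_rng(0)
target = np.array([0.2, 0.5, 0.3])        # toy target probabilities (assumption)
log_weights = np.log(target) + 3.0        # shifted on purpose: unnormalized weights
draws = gumbel_max_sample(log_weights, 100_000, rng)
freq = np.bincount(draws, minlength=target.size) / draws.size
print(freq)  # empirical frequencies, close to `target`
```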

The resulting algorithm, which combines the benefits of Importance Sampling, the Gumbel-Max trick, and racing algorithms, is proved to be (\delta,\sigma)-PAC, which is asymptotically good in the sense that the total variation distance between the true value P_{it} and its estimate converges to zero.

The regret of bandits with (b) Bernoulli bandits & non-conjugate prior and (c) Gaussian bandits & non-conjugate prior. Source: Figure 2 from Zhou et al.

What I find very interesting in the regret analysis section is that the racing TS in this paper can achieve much lower regret than Thompson sampling and prior swapping (PS), even though it uses far fewer particles than SMC and PS. It was not intuitive to me why this should happen. But upon a little further investigation, I found that the prior used for TS differs from the one used for PS and Racing in both plots. For ease of implementation, the authors have chosen a conjugate prior for TS. This leaves me wondering what the results would be if we were to use MCMC or SMC with more particles as the baseline for the regret analysis.

References:

  • Zhou, Y., Zhu, J. & Zhuo, J. (2018). Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors. Proceedings of the 35th International Conference on Machine Learning, in PMLR 80:6000-6008.

SRN – A near-optimal exploration-exploitation approach for assortment selection by Agrawal et al.

I am very interested in the exploration-exploitation trade-off. In a previous Sunday Reading Notes post, I discussed Bayesian optimization for learning optimal hyper-parameters for machine learning algorithms as an example of this trade-off. Today I study another exploration-exploitation algorithm, this time for learning the best assortment: A Near-Optimal Exploration-Exploitation Approach for Assortment Selection by Agrawal et al. It has applications in online advertisement display and recommendation systems for e-commerce.

Take the recommendation system of an e-commerce website as an example. Suppose the website has N products in total, and we can only recommend to customers a subset S of size at most K. This subset S is an assortment and K is a display constraint. When an assortment S is offered to a customer, the probability of the customer purchasing the i-th item in the assortment is p_i(S) \propto \nu_i for i \in S \cup \{0\}, where p_0(S) is the probability of not purchasing any item from S. Furthermore, if the price of item i in this assortment is r_i, then the expected revenue from assortment S is R(S) = \sum_{i\in S} r_i p_i(S). If the probabilities are known, then we have a static assortment optimization problem, S^\star = \arg\max_{S: |S| \le K} R(S). But if the preferences are not known or if they change over time, then we want to learn user preferences among assortments (exploration) and recommend the optimal assortment to them (exploitation); this is called a (dynamic) assortment optimization problem.
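As a small illustration of the static problem, the Python sketch below computes choice probabilities, the expected revenue R(S), and the best assortment by brute force. It assumes the standard MNL form p_i(S) = \nu_i / (\nu_0 + \sum_{j \in S} \nu_j) with the no-purchase weight \nu_0 set to 1; the function names and the toy numbers are my own, and enumerating all subsets is only feasible for tiny N (it is not the efficient static method the paper relies on).

```python
from itertools import combinations

def mnl_choice_probs(S, v, v0=1.0):
    """Return ({item: purchase probability}, no-purchase probability) for assortment S."""
    denom = v0 + sum(v[i] for i in S)
    return {i: v[i] / denom for i in S}, v0 / denom

def expected_revenue(S, v, r, v0=1.0):
    """R(S) = sum over i in S of r_i * p_i(S)."""
    probs, _ = mnl_choice_probs(S, v, v0)
    return sum(r[i] * p for i, p in probs.items())

def best_assortment(v, r, K, v0=1.0):
    """Brute-force search over all assortments with at most K items."""
    items = range(len(v))
    candidates = (S for k in range(1, K + 1) for S in combinations(items, k))
    return max(candidates, key=lambda S: expected_revenue(S, v, r, v0))

# Toy instance (assumption): 4 products, at most 2 can be displayed.
v = [0.8, 0.5, 1.0, 0.3]   # MNL preference weights nu_i
r = [1.0, 2.0, 0.5, 3.0]   # prices r_i
S_star = best_assortment(v, r, K=2)
print(S_star, expected_revenue(S_star, v, r))
```

Note that the best assortment is not simply the K most popular or the K most expensive items: a popular but cheap item can cannibalize purchases of more profitable ones.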

In the optimization, we want to minimize the cumulative regret until time T, which is reg(T) = \sum_{t=1}^T \left( R(S^\star) - \mathbb{E}\left[R(S_t) \right] \right). In this paper, the user preferences, described through the probabilities p_i(S), are modeled with a multinomial logit (MNL) model. With the MNL model, we have to estimate the model parameters \nu_i with some multi-armed bandit algorithm to balance exploration and exploitation. Therefore the authors refer to this problem as bandit-MNL. The main contribution of this paper is Theorem 4.1, in which they give a regret bound of O(\sqrt{NT} \log T + N \log^3T) for the proposed Algorithm 1.

Section 4 gives the proof of the main theorem and it is, in my opinion, very well written. A proof outline is given in Section 4.1, where the authors mark the 3 steps of the proof, which are detailed later in Sections 4.2 – 4.4.

The exploration-exploitation algorithm for bandit-MNL, Algorithm 1, alternates between computing the current best assortment with static assortment optimization methods and observing customers. After offering the computed assortment, the authors observe the purchasing decisions of customers until a no-purchase happens. Using the information from these purchases, we can compute upper confidence bounds (UCBs) for the model parameters \nu_i. The UCB estimates are used to find the assortment for the next iteration. This process of going back and forth between offering assortments and observing purchase decisions continues until the horizon of T steps, which is set before starting the experiment. In Section 4.2, the authors show that the UCBs converge to the true values with high probability. (Be careful: this is not convergence in probability!)
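The loop described above can be sketched in a few lines of Python. This is my own simplified simulation, not the authors' Algorithm 1: it assumes the no-purchase weight is normalized to 1, starts from the optimistic guess \nu_i = 1, uses a brute-force static optimizer, and uses a generic confidence bonus with a made-up constant c rather than the constants from the paper. All function and variable names are hypothetical.

```python
import math
from itertools import combinations
import numpy as np

def revenue(S, v, r):
    """Expected revenue of assortment S under MNL with no-purchase weight 1."""
    return sum(r[i] * v[i] for i in S) / (1.0 + sum(v[i] for i in S))

def best_assortment(v, r, K):
    """Brute-force static assortment optimization (tiny N only)."""
    items = range(len(v))
    cands = [S for k in range(1, K + 1) for S in combinations(items, k)]
    return max(cands, key=lambda S: revenue(S, v, r))

def run_epoch(S, v_true, rng):
    """Offer S to successive customers until one of them buys nothing."""
    S = list(S)
    weights = np.array([1.0] + [v_true[i] for i in S])  # index 0 = no purchase
    probs = weights / weights.sum()
    counts = {i: 0 for i in S}
    n_customers = 0
    while True:
        n_customers += 1
        choice = rng.choice(len(probs), p=probs)
        if choice == 0:
            return counts, n_customers
        counts[S[choice - 1]] += 1

def ucb_mnl_bandit(v_true, r, K, T, rng, c=1.0):
    """Epoch-based UCB loop; c is an illustrative constant, not the paper's."""
    N = len(v_true)
    purchases = np.zeros(N)        # purchases of item i over epochs where it was offered
    epochs_offered = np.zeros(N)   # number of epochs in which item i was offered
    v_ucb = np.ones(N)             # optimistic start (assumes nu_i <= 1)
    opt = revenue(best_assortment(v_true, r, K), v_true, r)
    t, epoch, regret = 0, 0, 0.0
    while t < T:
        epoch += 1
        S = best_assortment(v_ucb, r, K)             # exploit current UCBs
        counts, n_customers = run_epoch(S, v_true, rng)
        served = min(n_customers, T - t)             # do not count customers past T
        regret += served * (opt - revenue(S, v_true, r))
        t += served
        for i in S:
            purchases[i] += counts[i]
            epochs_offered[i] += 1
        seen = epochs_offered > 0
        v_bar = purchases[seen] / epochs_offered[seen]   # average purchases per epoch
        bonus = c * math.log(epoch + 1) / epochs_offered[seen]
        v_ucb[seen] = v_bar + np.sqrt(v_bar * bonus) + bonus
    return regret, v_ucb

rng = np.random.default_rng(0)
v_true = [0.8, 0.5, 1.0, 0.3]   # toy preference weights (assumption)
r = [1.0, 2.0, 0.5, 3.0]        # toy prices (assumption)
regret, v_ucb = ucb_mnl_bandit(v_true, r, K=2, T=2000, rng=rng)
print(regret / 2000, v_ucb)     # average regret per customer and final UCBs
```

The reason for observing customers until a no-purchase is that, with the no-purchase weight fixed at 1, the number of times item i is bought within one epoch has mean \nu_i, so averaging these counts over epochs gives a simple estimate to build the confidence bounds around.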

In my opinion, this paper lies on the theoretical end of the spectrum, and there are no simulation results or real-data examples accompanying the theoretical results. It would be nice to see some visualization of the cumulative regret of this algorithm and how it compares to the theoretical lower bound presented in Section 5.

I have two ideas on how this problem could be extended to accommodate more complex applied problems. 1) Item prices are assumed to be known for now. It would be an interesting extension of this paper to treat the r_i's as random variables or as something we can learn to optimize. 2) The algorithm assumes that customers are homogeneous. Personalized assortment optimization would be an interesting direction to explore.


References:

  • Agrawal, Shipra, et al. “A near-optimal exploration-exploitation approach for assortment selection.” Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 2016.