I am very interested in exploration-exploitation trade-off. In a previous Sunday Reading Notes post, I discussed bayesian optimization for learning optimal hyper-parameters for machine learning algorithms, as an example for this trade-off. Today I study another exploration-exploitation algorithm for learning the best assortment: A Near-Optimal Exploration-Exploitation Approach for Assortment Selection by Agrawal et al. It has applications in online advertisement display and recommendation systems for e-commerce.

If we take the recommendation system for e-commerce websites as an example. Suppose this website has products in total, and we can only recommend a subset to customers and this subset is at most size . This subset is an assortment and is a display constraint. When an assortment is offered to a customer, the probability of the customer purchasing the -th item in the assortment is where and is the probability of not purchasing any item from . Further more if the price of each item in this assortment is then the expected revenue from assortment is . If the probabilities are known, then we will have a static assortment optimization problem of . But if the preferences are not known or if they change over time, then we would want to learn user preference among assortments (explore) and recommend the optimal assortment to them (exploitation), and this is called a *(dynamic) assortment optimization problem. *

In the optimization, we want to minimize the cumulative regret until time which is . In this paper, the user preferences which are described through probabilities are modeled with Multinomial Logistic model (MNL). With the MNL model, we have to estimate the model parameter with some multi-armed bandit algorithm to balance exploration and exploitation. Therefore the authors refer to this problem as *bandit-MNL*. The main contribution of this paper is Theorem 4.1, in which they give a regret bound of with the proposed Algorithm 1.

Section 4 gives the proof of the main theorem and it is in my opinion very well-written. Proof outline is given in Section 4.1 and the authors marked the 3 steps of the proof, which are detailed later in Sections 4.2 – 4.4.

The exploration-exportation algorithm for bandit-MNL algorithm in Algorithm I iterations between learning the current best possible recommendation with using static assortment optimization methods. After recommending the calculated assortment to a customer, they observe purchasing decisions of customers until a no-purchase happens. Using information form user purchases, we can learn upper confidence bounds of model parameters $later \nu_{i}.$ The UCB estimates are used to find the assortment for the next iteration. This process of going back between offering assortments and observing purchase decisions continuous until steps set before starting the experiment. In Section 4.2, the authors should that the UCBs converge to the true value with high probability. (Be careful this is not a convergence in probability!).

In my opinion, this paper lies on the theoretical end of the spectrum and their is no simulation results or real-data examples accompanying the theoretical results. It would be nice to see some visualization of the cumulative regret with this algorithm and how it compare to the theoretical lower bound presented in Section 5.

I have two idea of how this problems can be extended to accommodate for complex applied problems. For examples, 1) item prices are assumed to be known for now. It would be interesting extension for this paper to see if we can treat ‘s as random variables or something we can learn to optimize. 2) In this algorithm we have to assume customers are homogenous. Personalized assortment optimization would be an interesting direction to explore.

References:

- Agrawal, Shipra, et al. “A near-optimal exploration-exploitation approach for assortment selection.”
*Proceedings of the 2016 ACM Conference on Economics and Computation*. ACM, 2016.