## SRN – A near-optimal exploration-exploitation approach for assortment selection by Agrawal et al.

I am very interested in the exploration-exploitation trade-off. In a previous Sunday Reading Notes post, I discussed Bayesian optimization for learning optimal hyper-parameters for machine learning algorithms, as an example of this trade-off. Today I study another exploration-exploitation algorithm, this time for learning the best assortment: A Near-Optimal Exploration-Exploitation Approach for Assortment Selection by Agrawal et al. It has applications in online advertisement display and recommendation systems for e-commerce.

Take the recommendation system of an e-commerce website as an example. Suppose the website has $N$ products in total, and we can recommend only a subset $S$ of at most $K$ of them to each customer. This subset $S$ is an assortment and $K$ is a display constraint. When an assortment $S$ is offered to a customer, the probability that the customer purchases item $i \in S$ is $p_i(S) \propto \nu_i$, and $p_0(S)$ is the probability of not purchasing any item from $S$. Furthermore, if the price of item $i$ is $r_i$, then the expected revenue from assortment $S$ is $R(S) = \sum_{i\in S} r_i p_i(S)$. If the probabilities are known, we have a static assortment optimization problem, $S^\star = \arg\max_{S: |S| \leq K} R(S)$. But if the preferences are not known, or if they change over time, then we want to learn user preferences among assortments (exploration) and recommend the optimal assortment (exploitation); this is called a (dynamic) assortment optimization problem.
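To make the static problem concrete, here is a minimal Python sketch. It assumes the usual MNL normalization in which the no-purchase option has weight $\nu_0 = 1$, and it finds $S^\star$ by brute-force enumeration, which is only feasible for small $N$ and $K$. The function names and the numbers in `v` and `r` are made up for illustration.

```python
from itertools import combinations


def choice_probs(S, v, v0=1.0):
    """MNL purchase probabilities for items in S, plus the no-purchase probability.

    Assumption: the no-purchase option has weight v0 (normalized to 1 here).
    """
    denom = v0 + sum(v[i] for i in S)
    probs = {i: v[i] / denom for i in S}
    return probs, v0 / denom


def expected_revenue(S, v, r, v0=1.0):
    """R(S) = sum over i in S of r_i * p_i(S)."""
    probs, _ = choice_probs(S, v, v0)
    return sum(r[i] * p for i, p in probs.items())


def best_assortment(v, r, K, v0=1.0):
    """Brute-force search over all non-empty assortments with |S| <= K."""
    items = range(len(v))
    candidates = (S for k in range(1, K + 1) for S in combinations(items, k))
    return max(candidates, key=lambda S: expected_revenue(S, v, r, v0))


# Illustrative (made-up) preference weights and prices for N = 4 items.
v = [1.0, 0.8, 0.5, 0.3]
r = [1.0, 1.2, 2.0, 3.0]
S_star = best_assortment(v, r, K=2)
print(S_star, expected_revenue(S_star, v, r))
```

In this toy instance the best pair is the two most expensive items, even though they are the least popular ones: the revenue objective trades off price against purchase probability.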

In this optimization, we want to minimize the cumulative regret up to time $T$, which is $reg(T) = \sum_{t=1}^T \left( R(S^\star) - \mathbb{E}\left[R(S_t) \right] \right)$. In this paper, the user preferences, which are described through the probabilities $p_i(S)$, are modeled with a multinomial logit (MNL) model. With the MNL model, we have to estimate the model parameters $\nu_i$ with some multi-armed bandit algorithm to balance exploration and exploitation. Therefore the authors refer to this problem as bandit-MNL. The main contribution of this paper is Theorem 4.1, in which they give a regret bound of $O(\sqrt{NT} \log T + N \log^3T)$ for the proposed Algorithm 1.

Section 4 gives the proof of the main theorem and it is, in my opinion, very well written. A proof outline is given in Section 4.1, where the authors mark the 3 steps of the proof, which are detailed later in Sections 4.2 – 4.4.

The exploration-exploitation algorithm for bandit-MNL, Algorithm 1, alternates between computing the current best assortment with static assortment optimization methods and learning from customer responses. After offering the calculated assortment, they observe the purchasing decisions of customers until a no-purchase happens. Using the information from these purchases, we can compute upper confidence bounds (UCBs) on the model parameters $\nu_{i}$. The UCB estimates are used to find the assortment for the next iteration. This process of going back and forth between offering assortments and observing purchase decisions continues until time $T$, which is set before starting the experiment. In Section 4.2, the authors show that the UCBs converge to the true values with high probability. (Be careful: this is not convergence in probability!)
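The loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's Algorithm 1: the confidence-width formula and its constant `c`, the cap of the UCBs at 1, the brute-force static step, and all numbers are my own assumptions. The no-purchase weight is again normalized to 1, and $\nu_i$ is estimated by the average number of purchases of item $i$ per epoch in which it was offered.

```python
import itertools
import math
import random


def revenue(S, v, r):
    """Expected revenue of assortment S under MNL with no-purchase weight 1."""
    return sum(r[i] * v[i] for i in S) / (1.0 + sum(v[i] for i in S))


def best_assortment(v, r, K):
    """Static step: brute-force search over all assortments of size <= K."""
    items = range(len(v))
    cands = (S for k in range(1, K + 1) for S in itertools.combinations(items, k))
    return max(cands, key=lambda S: revenue(S, v, r))


def sample_choice(S, v, rng):
    """Simulate one customer: return the purchased item, or None for no purchase."""
    weights = [1.0] + [v[i] for i in S]
    return rng.choices([None, *S], weights=weights)[0]


def run_bandit(v_true, r, K, T, c=1.0, seed=0):
    """Return the cumulative (pseudo-)regret after T customers."""
    rng = random.Random(seed)
    N = len(v_true)
    picks = [0] * N   # total purchases of each item
    epochs = [0] * N  # number of epochs in which each item was offered
    opt = revenue(best_assortment(v_true, r, K), v_true, r)
    regret, t, epoch = 0.0, 0, 0
    while t < T:
        epoch += 1
        ucb = []
        for i in range(N):
            if epochs[i] == 0:
                ucb.append(1.0)  # optimistic start (assumes v_i <= 1)
            else:
                v_bar = picks[i] / epochs[i]
                width = c * math.log(epoch + 1) / epochs[i]  # illustrative bonus
                ucb.append(min(1.0, v_bar + math.sqrt(v_bar * width) + width))
        S = best_assortment(ucb, r, K)
        for i in S:
            epochs[i] += 1
        # Offer the same assortment until a customer buys nothing.
        while t < T:
            t += 1
            regret += opt - revenue(S, v_true, r)
            choice = sample_choice(S, v_true, rng)
            if choice is None:
                break
            picks[choice] += 1
    return regret


v_true = [0.9, 0.7, 0.5, 0.3, 0.2]  # made-up preference weights
prices = [1.0, 1.5, 2.0, 2.5, 3.0]  # made-up prices
total_regret = run_bandit(v_true, prices, K=2, T=2000)
print(total_regret)
```

Keeping the assortment fixed within an epoch is what makes the simple per-epoch average a sensible estimate of $\nu_i$ under this normalization.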

In my opinion, this paper lies on the theoretical end of the spectrum, and there are no simulation results or real-data examples accompanying the theoretical results. It would be nice to see some visualization of the cumulative regret of this algorithm and how it compares to the theoretical lower bound presented in Section 5.

I have two ideas for how this problem could be extended to accommodate more complex applied problems. 1) Item prices are assumed to be known for now. It would be an interesting extension of this paper to treat the $r_i$'s as random variables, or as something we can learn to optimize. 2) In this algorithm we have to assume customers are homogeneous. Personalized assortment optimization would be an interesting direction to explore.

References:

• Agrawal, Shipra, et al. “A near-optimal exploration-exploitation approach for assortment selection.” Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 2016.

## SRN – applications of embeddings in search ranking and recommendations

In this Sunday’s Sunday Reading Notes (which actually is posted on Monday), I am venturing into the applied machine learning world and discussing two blog posts about the application of embeddings at AirBnB and Etsy.

It is a fascinating read for me because I am deeply attracted by the versatility of the algorithm. Neither blog post focuses on the details of training the embeddings. Instead, they are written to help readers understand the intuition behind the algorithms and how to adapt the loss function to each website's specific needs.

Semantic embeddings were developed in the field of natural language processing (NLP) to learn continuous low-dimensional representations of high-dimensional sparse vectors. Needless to say, working in a lower dimension makes computations faster. These neural-network-based algorithms are trained on large text data sets and are based on the intuition that words that often appear together are related. AirBnB and Etsy used embeddings to model user behavior. Using the analogy that Nishan provided in the Etsy article, each user session is a sentence and the sequence of actions by the user are the words.

The AirBnB article provides more details on negative sampling, the algorithm used to train the embeddings.

A diagram in the AirBnB post illustrates the training process of the embeddings. The 'booked listing' (in purple) is what the AirBnB team added to the algorithm. The booked listing serves as a global context token, aiding the prediction of the eventual booking in the embedding training process. In addition to training the central listing with context listings and the booked listing, because travelers often search only within the same market, AirBnB engineers also added a randomly selected listing from the same market as the central listing. As a result, listings within the same market should be closer in the embedding space. Indeed, the embeddings do encode geographical information, as is evident from a plot in the AirBnB post.

What the Etsy engineers do differently is add some 'implicit contextual information' to the loss function:

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

This is quite interesting because they take the ordering of items (words/clicks) into account, so both co-occurrence and ordering are reflected in the loss function.
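To see where such modifications enter, here is a minimal NumPy sketch of a skip-gram loss with negative sampling for a single center item. It is my own illustration, not code from either post: the function name `skipgram_loss`, the two embedding tables, and the item ids are all hypothetical. Extra negatives, such as an item that was shown but skipped, are simply appended to the randomly sampled negatives; an extra positive, such as a booked listing used as global context, would be appended to `positives` in the same way.

```python
import numpy as np


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -np.logaddexp(0.0, -x)


def skipgram_loss(emb_in, emb_out, center, positives, random_negs, implicit_negs=()):
    """Negative-sampling loss for one center item.

    positives:     ids of context items (pulled towards the center item)
    random_negs:   ids of randomly sampled negatives (pushed away)
    implicit_negs: ids of extra negatives, e.g. items seen but not clicked
    """
    v = emb_in[center]
    pos_ids = np.asarray(list(positives), dtype=int)
    neg_ids = np.asarray(list(random_negs) + list(implicit_negs), dtype=int)
    pos_scores = emb_out[pos_ids] @ v
    neg_scores = emb_out[neg_ids] @ v
    return -(log_sigmoid(pos_scores).sum() + log_sigmoid(-neg_scores).sum())


rng = np.random.default_rng(0)
n_items, dim = 20, 8  # toy sizes
emb_in = rng.normal(scale=0.1, size=(n_items, dim))
emb_out = rng.normal(scale=0.1, size=(n_items, dim))

# The user clicked item 3; items 2 and 4 are neighbours in the session,
# and item 0 was ranked above item 3 in the results but skipped.
base = skipgram_loss(emb_in, emb_out, center=3, positives=[2, 4], random_negs=[11, 17])
with_implicit = skipgram_loss(emb_in, emb_out, center=3, positives=[2, 4],
                              random_negs=[11, 17], implicit_negs=[0])
print(base, with_implicit)
```

Each additional negative contributes one more positive term to the loss, so the gradient pushes the skipped item away from the clicked one.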

Both articles give concrete examples of how embeddings improved performance on search ranking and similar-item recommendations. I highly encourage interested readers to check them out, because I do not want to spoil the fun with my post.

What I find interesting in these posts is how engineers from both firms adopted this model from natural language processing and used it to serve their own customers. I have been so narrow-minded about using NLP algorithms for NLP problems only. More than that, they modified the loss function according to their respective needs. I really enjoyed thinking through why they needed these adaptations.

## Sunday Reading Notes – Bayesian Optimization

For this week’s Sunday Reading Notes, I am switching topics towards Bayesian computation and machine learning. This week’s paper is ‘Practical Bayesian Optimization of Machine Learning Algorithms’ by Jasper Snoek, Hugo Larochelle and Ryan Adams, which appeared at NIPS 2012.

At a high level, Bayesian optimization is about fitting a Gaussian process (GP) regression to the data observed so far about some black-box function $f$, and using the result of the GP regression to choose the next point $x$ at which to evaluate $f(x)$. The premise of such a procedure is that the black-box function $f$ that we want to minimize is very expensive to evaluate. In this case, it makes sense to choose where to evaluate the function smartly, based on current information. This is an interesting case of the exploration-exploitation trade-off.

When I first read the paper, I was asking myself: why do we need Bayesian optimization? How does it compare to other optimization methods we have learned, for example gradient descent? Wouldn’t a grid search on the domain give a much better optimization result? I think to answer these questions, we have to keep reminding ourselves that we are optimizing some black-box function whose gradient is unknown. What’s more, this function is very expensive to evaluate, so we cannot afford to perform a grid search on it. As an example, think about tuning the hyper-parameters of a neural network. Admittedly, Bayesian optimization can be expensive as well, so we would choose Bayesian optimization over grid search when all the integration and maximization involved in choosing $x_{n+1}$ is much less expensive than evaluating $f(x_{n+1}).$

We start by putting a GP prior on $f$ and assume each observation is a noisy realization of the true function value: $y_n \sim \mathcal{N}(f(x_n),\nu).$ The posterior distribution $f|\{x_n,y_n\},\theta$ is fully characterized by a predictive mean function $\mu(x; \{x_n,y_n\},\theta)$ and a predictive variance function $\sigma^2(x; \{x_n,y_n\},\theta)$: for every $x$ in the domain of $f$, $f(x) \sim \mathcal{N}(\mu(x;\{x_n,y_n\},\theta), \sigma^2(x;\{x_n,y_n\},\theta))$. For more details about Gaussian process regression, you can read the book Gaussian Processes for Machine Learning by Rasmussen and Williams.
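As a concrete illustration, the predictive mean and variance can be computed with a few lines of NumPy. This is a minimal sketch that assumes a zero-mean GP prior and a squared exponential kernel with fixed hyper-parameters; the function names are mine, and `noise` plays the role of the observation variance $\nu$.

```python
import numpy as np


def sq_exp_kernel(A, B, length_scale=1.0, amplitude=1.0):
    """Squared exponential covariance between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amplitude * np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X, y, X_new, kernel, noise=1e-6):
    """Predictive mean and variance of a zero-mean GP at the rows of X_new."""
    K = kernel(X, X) + noise * np.eye(len(X))
    K_s = kernel(X, X_new)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    mu = K_s.T @ alpha
    V = np.linalg.solve(L, K_s)
    var = np.diag(kernel(X_new, X_new)) - (V**2).sum(axis=0)
    return mu, np.maximum(var, 0.0)  # clip tiny negative round-off


X = np.array([[0.0], [1.0], [2.0]])  # three observed inputs
y = np.sin(X[:, 0])                  # toy observations
X_new = np.linspace(-1.0, 3.0, 9)[:, None]
mu, var = gp_posterior(X, y, X_new, sq_exp_kernel)
print(mu.round(3), var.round(3))
```

The predictive variance shrinks near the observed points and grows back towards the prior variance far from them, which is exactly the information the acquisition function exploits.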

If the current best point is $x_{min} = \arg\min_{x_i}f(x_i)$, then for every $x$ in the domain, the probability that $f(x)$ is smaller than $f(x_{min})$ is $\alpha_{PI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x))$, where $\gamma(x) = \frac{f(x_{min}) - \mu(x; \{x_n,y_n\},\theta)}{\sigma(x;\{x_n,y_n\},\theta)}$ and $\Phi$ is the standard normal CDF. In Bayesian optimization, we call this probability of improvement an acquisition function. There are other acquisition functions, such as expected improvement (EI) and lower confidence bounds (LCB). We optimize the acquisition function to choose the next point: $x_{n+1} \gets \arg\max_x \alpha_{PI}(x).$
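For reference, here is a small sketch of these acquisition functions in the minimization convention used above. The closed form used for EI, $\sigma(x)\left(\gamma(x)\Phi(\gamma(x)) + \phi(\gamma(x))\right)$ with $\phi$ the standard normal density, is the standard one under a Gaussian predictive distribution. The function names and the trade-off parameter `kappa` in the LCB are my own choices.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def norm_cdf(z):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)


def gamma(mu, sigma, f_best):
    """Standardized improvement over the current best value (minimization)."""
    return (f_best - mu) / sigma


def prob_improvement(mu, sigma, f_best):
    return norm_cdf(gamma(mu, sigma, f_best))


def expected_improvement(mu, sigma, f_best):
    g = gamma(mu, sigma, f_best)
    return sigma * (g * norm_cdf(g) + norm_pdf(g))


def lower_confidence_bound(mu, sigma, kappa=2.0):
    # To be minimized; kappa is a hypothetical exploration weight.
    return mu - kappa * sigma


mu = np.array([0.0, 1.0, -0.5])   # toy predictive means
sigma = np.array([1.0, 1.0, 0.2])  # toy predictive standard deviations
f_best = 0.0
print(prob_improvement(mu, sigma, f_best))
print(expected_improvement(mu, sigma, f_best))
```

Unlike PI, EI also weighs how large the improvement is expected to be, so it gives more credit to uncertain points with a chance of a big gain.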

The most interesting section in this paper is Section 3, on practical considerations, where the authors go through

1. choice of covariance function;
2. treatment of hyper-parameters;
3. modeling cost;
4. parallelization.

The first two issues also appear in the GPML book. In this paper the authors recommend the Matérn 5/2 kernel because it only assumes twice-differentiability. The squared exponential kernel is the default choice for GP regression, but its smoothness assumption is unrealistic for most machine learning algorithms.
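For comparison, both kernels can be written in a few lines. The sketch below uses the standard form of the Matérn 5/2 covariance with one length scale per input dimension; the function names and the parameter names `length_scales` and `amplitude` are my own.

```python
import numpy as np


def scaled_sq_dist(A, B, length_scales):
    """r^2 = sum_d (a_d - b_d)^2 / l_d^2 for every pair of rows of A and B."""
    ls = np.asarray(length_scales, dtype=float)
    diff = (A[:, None, :] - B[None, :, :]) / ls
    return (diff**2).sum(-1)


def matern52(A, B, length_scales, amplitude=1.0):
    """Matern 5/2: amplitude * (1 + sqrt(5 r^2) + 5 r^2 / 3) * exp(-sqrt(5 r^2))."""
    r2 = scaled_sq_dist(A, B, length_scales)
    s = np.sqrt(5.0 * r2)
    return amplitude * (1.0 + s + 5.0 * r2 / 3.0) * np.exp(-s)


def sq_exp(A, B, length_scales, amplitude=1.0):
    """Squared exponential: amplitude * exp(-r^2 / 2)."""
    return amplitude * np.exp(-0.5 * scaled_sq_dist(A, B, length_scales))


X = np.array([[0.0], [0.5], [1.0], [3.0]])  # toy 1-D inputs
print(matern52(X, X, [1.0]).round(3))
print(sq_exp(X, X, [1.0]).round(3))
```

Both functions return a symmetric covariance matrix with `amplitude` on the diagonal; they differ in how quickly the correlation decays with distance and in how smooth the resulting sample functions are.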

For choosing hyper-parameters, we can pick the values that maximize the marginal likelihood. For a fully Bayesian treatment, we have to marginalize over the hyper-parameters and work with the integrated acquisition function $\hat{\alpha}(x;\{x_n,y_n\}) = \int \alpha(x;\{x_n,y_n\},\theta) p(\theta|\{x_n,y_n\}) d\theta.$ To approximate this integral, the authors used a slice sampler with step-in and step-out procedures. The details of this slice sampler are described in Slice sampling covariance hyperparameters of latent Gaussian models by Murray and Adams. There are some tricks, like operating on a long MCMC chain so that we do not waste too many samples, but this can still be an expensive computation. However, the cost is justified because it is still lower than the cost of evaluating $f$:
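Given posterior samples $\theta^{(1)}, \ldots, \theta^{(M)}$ of the hyper-parameters, the integral is approximated by the Monte Carlo average $\frac{1}{M}\sum_{m} \alpha(x;\{x_n,y_n\},\theta^{(m)})$. The sketch below shows only this averaging step. The acquisition function and the hyper-parameter samples are toy stand-ins that I made up; in practice the samples would come from the slice sampler and the acquisition from the GP posterior.

```python
import numpy as np


def integrated_acquisition(acquisition, x_candidates, theta_samples):
    """Monte Carlo estimate of the acquisition integrated over hyper-parameters."""
    values = np.stack([acquisition(x_candidates, theta) for theta in theta_samples])
    return values.mean(axis=0)


def toy_acquisition(x, theta):
    # Hypothetical stand-in for alpha(x; data, theta); theta acts like a length scale.
    return np.exp(-0.5 * (x / theta) ** 2)


x_candidates = np.linspace(-2.0, 2.0, 5)
theta_samples = [0.5, 1.0, 2.0]  # pretend these are MCMC draws of theta
alpha_hat = integrated_acquisition(toy_acquisition, x_candidates, theta_samples)
x_next = x_candidates[np.argmax(alpha_hat)]
print(alpha_hat.round(3), x_next)
```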

As both optimization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving an N-dimensional linear system (and our function evaluations are assumed to be much more expensive anyway), the fully-Bayesian treatment is sensible and our empirical evaluations bear this out.

I have played with this algorithm myself on some simple functions like the Branin-Hoo function. I’d like to see how this algorithm works on more complicated problems, like online LDA, Latent Structured SVM, and convolutional neural nets.
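For anyone who wants to try the same thing, the Branin-Hoo function is a standard two-dimensional test function. Here is a sketch with the commonly quoted constants and search domain; it has three global minima with value about 0.397887.

```python
import math


def branin(x1, x2):
    """Branin-Hoo test function, usually studied on -5 <= x1 <= 10, 0 <= x2 <= 15."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi**2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s


# Value at one of the three global minimizers.
print(branin(math.pi, 2.275))
```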

Lastly, to get some sense of how expensive the computations are, I want to show Figure 4 in the paper.

If we look at the time axis of (4b), the unit is days! In terms of function evaluations, the Bayesian optimization algorithms dominate random grid search from a very early stage. In (4a), GP EI MCMC, which uses the least parallelization, is the most efficient in terms of function evaluations, but in (4b) we see that it can take longer in terms of time (measured in days!).