SRN – Winner’s Curse by Lee and Shen from Airbnb

Online controlled experiments (A/B tests) have been my reading theme for this summer. This weekend I decided to read

  • Minyong R. Lee and Milan Shen. 2018. Winner’s Curse: Bias Estimation for Total Effects of Features in Online Controlled Experiments. In KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. (link)

This is the conference paper following the illuminating Medium post

  • Milan Shen and Minyong Lee. Selection Bias in Online Experimentation: Thinking through a method for the Winner’s Curse in A/B testing. (link)

Today, websites test many incremental changes to their webpages at the same time on large-scale experimentation platforms. The average treatment effect of each incremental change is estimated without bias, and experiments with statistically significant improvements to some metric are launched to users. But the total effect of aggregating these incremental changes is over-estimated if we simply add up the individual estimates. This is the winner’s curse the authors describe, and they quantify the bias in estimating the total effect in this paper.

Suppose after running N experiments, the set of chosen experiments is A (which is a random set!) with true total effect T_A and estimated total effect S_A. Then it can be shown that \beta = \mathbb{E}\left[S_A - T_A\right] > 0.
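To see the curse concretely, here is a minimal simulation sketch. This is my own illustrative setup, not the paper’s: normal estimates with known standard errors and a one-sided z-test threshold for selection.

```python
# A minimal simulation sketch of the winner's curse (illustrative setup):
# N experiments with fixed true effects a_i, unbiased noisy estimates
# X_i ~ N(a_i, sigma_i^2), and selection of the "winners" by a one-sided
# z-test at threshold sigma_i * b_i.
import numpy as np

rng = np.random.default_rng(0)
n, n_reps = 100, 2000
a = rng.normal(0.0, 0.5, size=n)       # true effects, fixed across replications
sigma = np.full(n, 1.0)                # known standard errors
b = np.full(n, 1.64)                   # analyst-chosen selection thresholds

gaps = []
for _ in range(n_reps):
    X = rng.normal(a, sigma)           # unbiased estimate of each effect
    selected = X > sigma * b           # the random "winning" set A
    S_A = X[selected].sum()            # estimated total effect of the winners
    T_A = a[selected].sum()            # true total effect of the winners
    gaps.append(S_A - T_A)

print("empirical bias E[S_A - T_A] ≈", np.mean(gaps))   # positive: the curse
```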

The authors provide a closed-form expression for the bias in the simple case where each estimate follows a normal distribution with known standard deviation, X_i \sim \mathcal{N}(a_i,\sigma^2_i). Let b_i be some value chosen by the analysts running the experiments to select the set A. The authors show that \beta = \sum_{i=1}^N \sigma_i \phi\left(\frac{\sigma_i b_i - a_i}{\sigma_i}\right). The point of this equation is that all the experiments contribute to the bias, not just those selected through experimentation, because the sum is over all the experiments! As the Medium post pointed out:

If the set of experiments being aggregated A is fixed from the beginning, then no bias would be introduced. In the end, the bias exists because of the process of selecting the successful ones among many experiments we run, and we do this every day in a large scale experimentation platform.

The authors also provide a bootstrap confidence interval for the bias-corrected total effect estimator.
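As a rough sketch of how the correction might be computed in practice, one can plug the observed estimates X_i into the bias formula in place of the unknown a_i, and wrap the corrected estimator in a parametric bootstrap. The plug-in step and the bootstrap scheme below are stand-ins; the paper’s actual procedure may differ in details.

```python
import numpy as np
from scipy.stats import norm

def bias_estimate(X, sigma, b):
    """Plug-in estimate of the selection bias: use the observed X_i in place of
    the unknown a_i in beta = sum_i sigma_i * phi((sigma_i*b_i - a_i)/sigma_i)."""
    return np.sum(sigma * norm.pdf((sigma * b - X) / sigma))

def corrected_total_effect(X, sigma, b):
    """Sum of the selected estimates minus the estimated selection bias."""
    selected = X > sigma * b
    return X[selected].sum() - bias_estimate(X, sigma, b)

def bootstrap_ci(X, sigma, b, n_boot=2000, level=0.95, seed=1):
    """Parametric bootstrap: resample X*_i ~ N(X_i, sigma_i^2), recompute the
    corrected estimator, and take quantiles. (A rough stand-in for the paper's
    bootstrap confidence interval.)"""
    rng = np.random.default_rng(seed)
    reps = [corrected_total_effect(rng.normal(X, sigma), sigma, b)
            for _ in range(n_boot)]
    return np.quantile(reps, [(1 - level) / 2, (1 + level) / 2])
```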

Fig: comparison between (left) adding up the estimated effects of 6 selected experiments, (middle) the bias-adjusted total effect estimate, and (right) the total effect estimated from a separate hold-out. Source: Medium post from Airbnb Engineering and Data Science. This is also Figure 6 from the KDD paper.

I did not have time to go through Section 5 (applications at Airbnb) this time, and I’d like to revisit it another time.

I really like the authors’ way of formulating and solving this selective inference problem. Their de-biasing method requires very few assumptions and is straightforward to calculate. I have not given much thought to it, but I am wondering how a Bayesian would approach this problem. Would that be setting up a hierarchical model by pooling all the incremental changes together? I would certainly meditate on this question during running or yoga!

SRN – A near-optimal exploration-exploitation approach for assortment selection by Agrawal et al.

I am very interested in the exploration-exploitation trade-off. In a previous Sunday Reading Notes post, I discussed Bayesian optimization for learning optimal hyper-parameters of machine learning algorithms as an example of this trade-off. Today I study another exploration-exploitation algorithm, this time for learning the best assortment: A Near-Optimal Exploration-Exploitation Approach for Assortment Selection by Agrawal et al. It has applications in online advertisement display and recommendation systems for e-commerce.

Take the recommendation system of an e-commerce website as an example. Suppose the website has N products in total, and we can only recommend a subset S of at most size K to customers. This subset S is an assortment and K is a display constraint. When an assortment S is offered to a customer, the probability of the customer purchasing the i-th item in the assortment is p_i(S) \propto \nu_i for i = 1, \dots, |S|, and p_0(S) is the probability of not purchasing any item from S. Furthermore, if the price of each item in the assortment is r_i, then the expected revenue from assortment S is R(S) = \sum_{i\in S} r_i p_i(S). If the probabilities are known, we have a static assortment optimization problem S^\star = \arg\max_{S} R(S). But if the preferences are not known, or if they change over time, then we want to learn user preferences among assortments (exploration) while recommending the optimal assortment (exploitation), and this is called a (dynamic) assortment optimization problem.
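To make the MNL notation concrete, here is a small sketch of the choice probabilities and the expected revenue, using the common convention that the no-purchase option has weight 1; the brute-force search over assortments is only meant for tiny N.

```python
from itertools import combinations

def choice_probabilities(S, v):
    """MNL choice probabilities for an assortment S (an iterable of item indices):
    p_i(S) = v_i / (1 + sum_{j in S} v_j), with the no-purchase weight fixed at 1."""
    denom = 1.0 + sum(v[i] for i in S)
    return {i: v[i] / denom for i in S}, 1.0 / denom   # purchase probs, no-purchase prob

def expected_revenue(S, v, r):
    """R(S) = sum_{i in S} r_i * p_i(S)."""
    probs, _ = choice_probabilities(S, v)
    return sum(r[i] * probs[i] for i in S)

def best_assortment(v, r, K):
    """Static optimization by brute force over all assortments of size <= K
    (exponential in N; only meant to illustrate the objective)."""
    N = len(v)
    candidates = (S for k in range(1, K + 1) for S in combinations(range(N), k))
    return max(candidates, key=lambda S: expected_revenue(S, v, r))
```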

In the optimization, we want to minimize the cumulative regret up to time T, which is reg(T) = \sum_{t=1}^T R(S^\star) - \mathbb{E}\left[R(S_t) \right]. In this paper, the user preferences, described through the probabilities p_i(S), are modeled with the multinomial logit (MNL) model. With the MNL model, we have to estimate the model parameters \nu_i with some multi-armed bandit algorithm to balance exploration and exploitation, so the authors refer to this problem as bandit-MNL. The main contribution of this paper is Theorem 4.1, in which they give a regret bound of O(\sqrt{NT} \log T + N \log^3 T) for the proposed Algorithm 1.

Section 4 gives the proof of the main theorem, and it is in my opinion very well written. A proof outline is given in Section 4.1, where the authors mark the three steps of the proof, which are detailed in Sections 4.2–4.4.

The exploration-exploitation algorithm for bandit-MNL (Algorithm 1) iterates between learning the current best possible recommendation using static assortment optimization methods and collecting new data. After recommending the computed assortment to customers, we observe their purchasing decisions until a no-purchase happens. Using the information from these purchases, we can update upper confidence bounds (UCBs) of the model parameters \nu_i. The UCB estimates are used to find the assortment for the next iteration. This process of going back and forth between offering assortments and observing purchase decisions continues until T steps, set before starting the experiment. In Section 4.2, the authors show that the UCBs converge to the true values with high probability. (Be careful: this is not convergence in probability!)
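Below is a rough sketch of this epoch-based loop, reusing the choice_probabilities and best_assortment helpers from the sketch above. The optimism bonus is a generic UCB-style radius of my own choosing, standing in for the paper’s carefully derived confidence bound, so this only conveys the structure of Algorithm 1 rather than reproducing it.

```python
import numpy as np

def mnl_bandit_sketch(true_v, r, K, T, seed=0):
    """Epoch-based explore-exploit loop: offer the assortment that is best under
    optimistic estimates, observe purchases until a no-purchase ends the epoch,
    then update the estimates. Not the paper's exact Algorithm 1."""
    rng = np.random.default_rng(seed)
    N = len(true_v)
    epochs = np.zeros(N)          # T_i: number of epochs in which item i was offered
    purchases = np.zeros(N)       # total purchases of item i across those epochs
    t = 0
    while t < T:
        v_bar = np.divide(purchases, epochs, out=np.ones(N), where=epochs > 0)
        # Generic optimism bonus standing in for the paper's confidence radius.
        v_ucb = v_bar + np.sqrt(np.log(T + 1.0) / np.maximum(epochs, 1.0))
        S = best_assortment(v_ucb, r, K)          # helper from the sketch above
        while t < T:                              # one epoch: offer S until no-purchase
            t += 1
            probs, p0 = choice_probabilities(S, true_v)
            items = list(S)
            pick = rng.choice(len(items) + 1, p=[probs[i] for i in items] + [p0])
            if pick == len(items):                # no-purchase ends the epoch
                break
            purchases[items[pick]] += 1
        for i in S:
            epochs[i] += 1
    return v_bar                                  # final (non-optimistic) estimates
```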

In my opinion, this paper lies on the theoretical end of the spectrum, and there are no simulation results or real-data examples accompanying the theoretical results. It would be nice to see some visualization of the cumulative regret of this algorithm and how it compares to the theoretical lower bound presented in Section 5.

I have two ideas for how this problem could be extended to more complex applied settings. 1) Item prices are assumed to be known for now. It would be an interesting extension to treat the r_i‘s as random variables or as something we can learn to optimize. 2) In this algorithm we have to assume customers are homogeneous. Personalized assortment optimization would be an interesting direction to explore.


References:

  • Agrawal, Shipra, et al. “A near-optimal exploration-exploitation approach for assortment selection.” Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 2016.


SRN – applications of embeddings in search ranking and recommendations

In this Sunday’s Sunday Reading Notes (which is actually posted on Monday), I am venturing into the applied machine learning world and discussing two blog posts about the application of embeddings at AirBnB and Etsy.

It was a fascinating read for me because I am deeply attracted by the versatility of the algorithm. Neither blog post focuses on the details of training the embeddings. Instead, they are written to help readers understand the intuition behind the algorithm and how to adapt the loss function to each website’s specific needs.

Semantic embeddings were invented in natural language processing to learn continuous low-dimensional representations of high-dimensional sparse vectors. Needless to say, working in a lower dimension makes computations faster. These neural-network-based algorithms are trained on large text data sets and are built on the intuition that words that frequently appear together are related. AirBnB and Etsy use embeddings to model user behavior. Using the analogy Nishan provided in the Etsy article, each user session is a sentence and the sequence of actions by the user are the words.

The AirBnB article provides more details on negative sampling, the technique used to train the embeddings.

The diagram from the AirBnB post illustrates the training process of the embeddings. The ‘booked listing’ (in purple) is what the AirBnB team added to the algorithm. The booked listing serves as a global context token, aiding the prediction of eventual bookings in the embedding training process. In addition to training the central listing with context listings and the booked listing, because travelers often only search within the same market, AirBnB engineers also added a randomly selected listing from the same market as the central listing. As a result, listings within the same market should be closer in the embedding space. Indeed, the embeddings encode geographical information, as is evident from the plot in the AirBnB post.
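To make the adaptation concrete, here is a sketch of a single skip-gram negative-sampling update with the two AirBnB additions bolted on. The variable names (booked, market_neg) and the treatment of the same-market listing as an extra negative sample are my own reading of the post, not its exact formulation.

```python
import numpy as np

def sgns_step(emb_in, emb_out, center, context, negatives, booked, market_neg, lr=0.025):
    """One SGD step of skip-gram with negative sampling on a session, with the
    booked listing treated as an extra positive ("global context") for every
    center listing, and one extra negative drawn from the center's market."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    v = emb_in[center].copy()
    positives = list(context) + [booked]          # booked listing always in context
    negs = list(negatives) + [market_neg]         # random negatives + same-market negative
    grad_v = np.zeros_like(v)
    for p in positives:                           # pull positives toward the center
        g = sigmoid(v @ emb_out[p]) - 1.0
        grad_v += g * emb_out[p]
        emb_out[p] -= lr * g * v
    for q in negs:                                # push negatives away from the center
        g = sigmoid(v @ emb_out[q])
        grad_v += g * emb_out[q]
        emb_out[q] -= lr * g * v
    emb_in[center] -= lr * grad_v
```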

What the Etsy engineers do differently is that they added some ‘implicit contextual information’ into the loss function:

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we have found to be indicative of user preference in other contexts. For example, if a user clicks on the second item for a search query, the user most likely saw, but did not like, the first item that showed up in the search results. We extend the Skip-gram loss function by appending these implicit negative signals to the Skip-gram loss directly.

This is quite interesting because they considered the ordering of items (words/clicks), and thus both co-occurrence and ordering are encoded in the loss function.

Both articles give concrete examples of how embeddings improved performance on search ranking and similar-item recommendations. I highly encourage interested readers to check them out, because you do not want to spoil your fun with my post.

What I find interesting in these posts is how engineers from both firms adopted this model from natural language processing and used it to serve their own customers. I have been so narrow-minded about using NLP algorithms for NLP problems only. More than that, they modified the loss function to suit their respective needs. I really enjoyed thinking through why they needed these adaptations.


Sunday Reading Notes – Bayesian Optimization

For this week’s Sunday Reading Notes, I am switching topics towards Bayesian computation and machine learning. This week’s paper is ‘Practical Bayesian Optimization of Machine Learning Algorithms‘ by Jasper Snoek, Hugo Larochelle and Ryan Adams, which appeared at NIPS 2012.

At a high level, Bayesian optimization is about fitting a Gaussian process (GP) regression on the data currently observed about some black-box function f, and choosing the next point x at which to evaluate f based on the result of the GP regression. The premise of such a procedure is that the black-box function f that we want to maximize is very expensive to evaluate. In this case, it makes sense to choose where to evaluate the function smartly based on current information. This is an interesting case of the exploration-exploitation trade-off.

When I first read the paper, I asked myself: why do we need Bayesian optimization? How does it compare to other optimization methods we learned, for example gradient descent? Wouldn’t a grid search on the domain give a much better optimization result? To answer these questions, we have to keep reminding ourselves that we are optimizing some black-box function whose gradient is unknown. What’s more, this function is very expensive to evaluate, so we cannot afford to perform a grid search on it. As an example, think about tuning the hyper-parameters of a neural network. Admittedly, Bayesian optimization can be expensive as well; we would choose Bayesian optimization over grid search when all the integration and maximization involved in choosing x_{n+1} is much less expensive than evaluating f(x_{n+1}).

We start by putting a GP prior on f and assume each observation is a noisy realization of the true function value: y_n \sim \mathcal{N}(f(x_n),\nu). The posterior distribution f|\{x_n,y_n\},\theta is fully characterized by a predictive mean function \mu(x; \{x_n,y_n\},\theta) and a predictive variance function \sigma^2(x; \{x_n,y_n\},\theta) for every x in the domain of f: f(x) \sim \mathcal{N}(\mu(x;\{x_n,y_n\},\theta), \sigma^2(x;\{x_n,y_n\},\theta)). For more details about Gaussian process regression, you can read the book Gaussian Processes for Machine Learning by Rasmussen and Williams.

If the current best value is at x_{min} = \arg\min_{x_i}f(x_i), then for every x in the domain, the probability of f(x) being smaller than f(x_{min}) is \alpha_{PI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x)), where \gamma(x) = \frac{f(x_{min}) - \mu(x; \{x_n,y_n\},\theta)}{\sigma(x;\{x_n,y_n\},\theta)}. In Bayesian optimization, this probability of improvement is called an acquisition function. There are other acquisition functions, such as expected improvement (EI) and lower confidence bound (LCB). We optimize the acquisition function to choose the next point x_{n+1} \gets \arg\max_x \alpha_{PI}(x).
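Here is a bare-bones sketch of the whole loop for a 1-D minimization problem, using scikit-learn’s GP regression with a Matern 5/2 kernel and the probability-of-improvement acquisition. Maximizing the acquisition over a dense grid is an illustrative shortcut of mine, not what the paper does.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt_min(f, bounds, n_init=5, n_iter=20, seed=0):
    """Minimize an expensive black-box f on an interval [lo, hi] by alternating
    GP regression with probability-of-improvement maximization over a grid."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_init, 1))            # initial design
    y = np.array([f(x[0]) for x in X])
    grid = np.linspace(lo, hi, 1000).reshape(-1, 1)       # cheap stand-in for the domain
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                      normalize_y=True).fit(X, y)
        mu, sd = gp.predict(grid, return_std=True)
        pi = norm.cdf((y.min() - mu) / np.maximum(sd, 1e-12))   # P(f(x) < current best)
        x_next = grid[np.argmax(pi)]                      # point with best chance to improve
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y)], y.min()

# e.g. a toy stand-in for an expensive training run:
# x_best, f_best = bayes_opt_min(lambda x: (x - 2.0)**2 + 0.1*np.sin(5*x), (0.0, 5.0))
```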

The most interesting section in this paper is Section 3, practical considerations, where the authors go through

  1. choice of covariance function;
  2. treatment of hyper-parameters;
  3. modeling cost;
  4. parallelization.

The first two issues also appear in the GPML book. In this paper the authors recommend the Matern 5/2 kernel because it only assumes twice-differentiability. The squared exponential kernel is the default choice for GP regression, but its smoothness assumption is unrealistic for most machine learning algorithms.

For choosing hyper-parameters, we can pick the values that maximize the marginal likelihood. For a full Bayesian treatment, we have to marginalize over the hyper-parameters and work with the integrated acquisition function \hat{\alpha}(x;\{x_n,y_n\}) = \int \alpha(x;\{x_n,y_n\},\theta) p(\theta|\{x_n,y_n\}) d\theta. To approximate this integral, the authors use a slice sampler with step-in and step-out procedures. The details of this slice sampler are described in Slice sampling covariance hyperparameters of latent Gaussian models by Murray and Adams. There are tricks, like operating on a long MCMC chain so that we do not waste too many samples, but this could still be an expensive computation. However, the cost is justified because it is still lower than the cost of evaluating f:

As both optimization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving an N-dimensional linear system (and our function evaluations are assumed to be much more expensive anyway), the fully-Bayesian treatment is sensible and our empirical evaluations bear this out.
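A minimal sketch of the integrated acquisition: average the probability of improvement over posterior samples of the kernel hyper-parameters. Here I simply pass in a list of length-scale draws as a stand-in for the output of the slice sampler; obtaining those draws properly is the hard part the paper addresses.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def integrated_pi(X, y, grid, length_scale_samples):
    """Monte Carlo version of the integrated acquisition: average the probability
    of improvement over posterior draws of the kernel hyper-parameters. The
    length_scale_samples argument stands in for draws from the slice sampler."""
    acq = np.zeros(len(grid))
    for ell in length_scale_samples:
        gp = GaussianProcessRegressor(kernel=Matern(length_scale=ell, nu=2.5),
                                      optimizer=None,      # keep the sampled value fixed
                                      normalize_y=True).fit(X, y)
        mu, sd = gp.predict(grid, return_std=True)
        acq += norm.cdf((y.min() - mu) / np.maximum(sd, 1e-12))
    return acq / len(length_scale_samples)
```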

I have played with this algorithm myself on some simple functions like the Branin-Hoo function. I’d like to see how it works on more complicated problems, like online LDA, latent structured SVMs, and convolutional neural nets.

Lastly, to get some sense of how expensive the computations are, I want to show Figure 4 in the paper.


If we look at the time axis in (4b), the unit is days! In terms of function evaluations, the Bayesian optimization algorithm dominates a random grid search from a very early stage. In (4a), GP EI MCMC, which uses the least parallelization, is the most efficient in terms of function evaluations, but when we look at (4b) it can take a longer wall-clock time (measured in days!).

Sunday Reading Notes – Continuous Monitoring of AB tests without Pain by Deng et al.

In this week’s ‘Sunday Reading Notes’, I am writing about ‘Continuous monitoring of A/B tests without pain: optional stopping in Bayesian testing‘ by Alex Deng, Jiannan Lu and Shouyuan Chen at Microsoft. In Theorem 1, the authors formally prove the validity of Bayesian testing under a proper stopping rule when we have a genuine prior, through an elegant use of change of measure.

Under the Bayesian model comparison framework, the posterior odds equal the prior odds times the Bayes factor. The Bayes factor is the ratio of the likelihoods under model 1 and model 0, which can be intractable if there are hidden variables in the model. There has been research dedicated to unbiasedly estimating the Bayes factor (like bridge sampling in Gelman and Meng 1998). The posterior odds summarize all the information about an experiment, and more importantly, as the authors point out:

Conditioning on a posterior odds K, rejecting H_0 will expose us to a risk of false rejection/discovery with probability P(H_0|data) = \frac{1}{K+1}.

The authors refer to this as the ‘Bayesian promise’: \frac{\mathbb{P}(H_1|PostOdds)}{\mathbb{P}(H_0|PostOdds)} = PostOdds. The main result of this paper says that we still have the Bayesian promise with optional stopping. In other words, with a proper stopping time \tau (one that does not depend on any future event), we always have \frac{\mathbb{P}(H_1|PostOdds_\tau)}{\mathbb{P}(H_0|PostOdds_\tau)} = PostOdds_\tau.
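To convince myself, here is a small simulation sketch of the promise in terms of FDR, under an illustrative conjugate setup of my own choosing: normal data with known variance, a genuine 50-50 prior over hypotheses, and H_1: \theta \sim \mathcal{N}(0, \tau^2), which gives a closed-form Bayes factor.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, max_n, n_runs = 1.0, 500, 20000
discoveries, false_discoveries = 0, 0

for _ in range(n_runs):
    is_h1 = rng.random() < 0.5                     # genuine 50-50 prior over hypotheses
    theta = rng.normal(0.0, tau) if is_h1 else 0.0
    s = 0.0
    for n in range(1, max_n + 1):
        s += rng.normal(theta, 1.0)
        xbar = s / n
        # Closed-form Bayes factor for H1: theta ~ N(0, tau^2) vs H0: theta = 0.
        log_bf = n**2 * tau**2 * xbar**2 / (2 * (1 + n * tau**2)) - 0.5 * np.log(1 + n * tau**2)
        if log_bf > np.log(9.0):                   # optional stopping: reject H0 at BF > 9
            discoveries += 1
            false_discoveries += (not is_h1)
            break

print("FDR under optional stopping ≈", false_discoveries / max(discoveries, 1))
# stays below 1/(1+9) = 0.1, as the Bayesian promise implies
```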

I really like the simulations in this paper because they explain the concept of the ‘Bayesian promise’ really well.

Figure 1 from the paper illustrates the Bayesian promise in a fixed-horizon test setting. The prior odds are set to 1, data are simulated from a 50-50 mixture of H_1 and H_0, and we look at the histogram of the Bayes factors across the 100K runs. If we look at the bin with BF = 2.1, there are about 4,000 runs from H_1 and 2,000 runs from H_0. In this figure, the top margin, which shows the ratio between counts from the alternative and the null, is very close to the x-axis values, which are the Bayes factors. Conditioning on the Bayes factor, the probability that the data came from H_1 is BF times the probability that the data came from H_0.

This Bayesian promise is kept whatever proper stopping rule we decide to use:


These two figures are Figures 2 and 3 from the paper, where the stopping rules are (top left) BF > 9, (bottom left) BF > 9 or BF < 1/9, and (right) p-value < 0.1 and sample size larger than 10.

The result of this paper should be reassuring to those running Bayesian A/B tests. Despite the theoretical guarantees, I feel a lot of people are still not convinced that peeking at Bayesian A/B tests is okay. For example, I have read the post Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson many times. Although I feel there is still some connection I have not worked out yet, I believe these two pieces are talking about different things. In Robinson’s post, the focus is Type I error control, which is not what the Bayesian promise is about. In this paper by Deng et al., the ‘Bayesian promise’ gives false discovery rate (FDR) control, and I think the authors try to make this clear in Table 1.

Table 1 from the paper.

Although the FDR is inflated compared with the fixed-horizon test, it is nevertheless still controlled below the promised 10%.

In my opinion, what might hurt the Deng et al. paper is not that it inflates the Type I error or the FDR, but how people could misuse optional stopping in Bayesian A/B tests in practice. I want to summarize some bad practices for people to avoid here:

  1. Not every prior promises FDR control with optional stopping! The prior in Theorem 1 and in the simulations is a genuine prior. It is, in other words, the true prior, the prior from which the data are generated. If you do not have this true prior or something very close to it, do not use optional stopping. Alex Deng has a paper on learning objective priors from historical data, which I plan to discuss in the future.
  2. Only proper stopping rules work! Do not peek into the future; only ‘peek’ at the present and the past. These cases are covered in Examples 2 and 3 in the paper.
  3. Do not test until winning. You will fall into the trap of multiple testing. If we run 5 replications of the same experiment and they all fail to be ‘significant’, but after running it a 6th time we see a ‘significant’ result, we cannot ignore the previous runs and declare victory.
  4. Do not expect to unbiasedly measure the average treatment effect (ATE) with a Bayesian A/B test under optional stopping. I believe we can confirm the existence of an effect with a Bayesian A/B test, but estimate the ATE? I believe we are introducing some bias into the estimation with optional stopping. I hope the authors discuss more of this in the paper, or maybe I should study this myself.

All in all, reading this paper has been an enjoyable and thought-provoking journey for me. Every time I read it, I understand more of it. I asked many questions while reading it; some I answered myself by re-reading, some I had to ask others, and some I still do not understand.

Sunday Reading Notes – Chapter 2 of Sequential Analysis by Abraham Wald

I have decided to start a series called ‘Sunday Reading Notes’, posting about papers, book chapters and anything statistics-related that I have read recently.


Opening remarks:

This week I will write about Chapter 2 of Sequential Analysis by Abraham Wald. Unlike the statistical null hypothesis testing we all learned from textbooks, sequential analysis is a method of inference where the number of observations is not determined in advance of the experiment. At every stage of the experiment, the experimenter makes a decision to accept, reject, or continue data collection. In Chapter 2, “general discussion”, Wald formally sets up the framework of sequential tests in preparation for Chapter 3 on the sequential probability ratio test.

If you do not enjoy technical details, skip the rest and jump to the closing paragraph for some high-level summary. 


Reading this chapter was an eye-opening experience for me. As a statistics PhD student with a somewhat traditional training, I am so used to the common problem setup: given a sequence of data X_{1:n} \overset{iid}{\sim} F(\theta), where F is some distribution parametrized by \theta, consider some null hypothesis H_0 and some alternative hypothesis H_1 and determine the rejection region of the most powerful level-\alpha test; or, given a model, a null, an alternative and a level \alpha, find the sample size to achieve at least 1- \beta = 0.95 power.

Wald sets up the sequential test framework differently. He starts with a hypothesis H_0 and any decision rule that terminates in finite time. The decision rule is defined by mutually exclusive sets R_m^0, R_m^1, R_m (accept after m steps, reject after m steps, and make more observations) for all values of m.

Given such a decision rule, for every value of the parameter \theta we can define the function L(\theta) to be the probability of accepting H_0 if \theta were the data-generating value (and yes, I think we are in the well-specified model case). This function is called the operating characteristic (OC) function.

One more piece of notation before jumping to the principles of a good sequential test: the average sample number (ASN) function. This function, denoted \mathbb{E}_\theta(n), is the expected number of observations required for the test to terminate.
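Since both functions are defined for an arbitrary decision rule, they are easy to estimate by Monte Carlo. Here is a sketch for a toy (non-Wald) rule based on a truncated cumulative sum, just to make the definitions of L(\theta) and the ASN concrete.

```python
import numpy as np

def oc_and_asn(theta, lower=-5.0, upper=5.0, max_m=200, n_sims=5000, seed=0):
    """Monte Carlo estimates of the OC function L(theta) (probability of accepting
    H_0) and the ASN function E_theta(n) for a toy truncated decision rule on
    N(theta, 1) data: accept H_0 once the cumulative sum drops below `lower`,
    reject once it exceeds `upper`, otherwise keep sampling (forced decision at
    max_m). The boundaries are arbitrary illustrative choices, not Wald's."""
    rng = np.random.default_rng(seed)
    accepts, sample_sizes = 0, []
    for _ in range(n_sims):
        s = 0.0
        for m in range(1, max_m + 1):
            s += rng.normal(theta, 1.0)
            if s <= lower or s >= upper or m == max_m:
                accepts += (s < upper)          # at truncation, accept unless above `upper`
                sample_sizes.append(m)
                break
    return accepts / n_sims, float(np.mean(sample_sizes))

# e.g. oc_and_asn(0.0) vs oc_and_asn(0.5): L(theta) decreases and the ASN shifts with theta.
```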

At this point the alternative hypothesis is not part of the picture yet, but we are ready to discuss some principles of a sequential test.

Wald goes on to talk about the degree of preference for acceptance (yes! the familiar notion of a loss function in decision theory). He partitions the parameter space into three sets: a region with preference for accepting H_0, a region with preference for rejecting H_0, and a region of indifference; these regions are characterized by loss functions w_0(\theta) and w_1(\theta). We want the zone of preference for acceptance to be a subset of H_0: \theta \in \omega, because acceptance of H_0 should be preferred whenever \theta \in \omega, though this preference would not be so strong if \theta is close to the boundary of \omega. Likewise, the experimenter needs to make the zone of preference for rejection a subset of \omega^c, the complement of \omega.

At first, this was news to me, because I never thought about hypothesis testing this way. Who talks about partitioning the parameter space into three sets? This is a decision-theoretic way to approach a statistical inference problem. Decision theory was indeed part of Stat 210 at Harvard, but only after all the discussions about hypothesis testing (including nonparametric tests and asymptotic tests) were done. Here we do not even have an alternative hypothesis.

Another interesting notion is the idea of ‘acceptance’ here. We learn from textbooks that we can only ‘reject the null with some confidence’, and that ‘not reject’ is not the same as ‘accept’. Why? One, because we have another option: get more data. Two, because ‘not reject’ might consist of two regions: the zone of preference for acceptance and the zone of indifference. I believe the three-zones-of-preference view would really clear up the discussion about “isn’t not rejecting the same as accepting?”.

Ideally we want L(\theta) = 1 \ \forall \theta \in \omega and L(\theta) = 0 \ \forall \theta \not\in \omega. But because of randomness in the observations, this is not achievable with finite samples. Instead we can allow some slack and require that 1 - L(\theta) \leq \alpha for every \theta in the zone of preference for acceptance and L(\theta) \leq \beta for every \theta in the zone of preference for rejection. These already look like power and level from our textbooks, except that we do not have an H_1 yet! A test satisfying the two inequalities above is called admissible.

Admissibility (which I believe is the same admissibility as in decision theory) is defined using the OC function, which describes how well the test makes correct decisions. In order to define efficiency, we need to examine the ASN function. Recall that for every test S and parameter \theta, the expected number of observations required is \mathbb{E}_\theta(n|S). If there is an admissible test S_0 such that \mathbb{E}_\theta (n|S_0) = \min_{S} \mathbb{E}_\theta(n|S) for every \theta, then S_0 is the ‘uniformly best test’. In general this is too good to be true, because the equation has to hold for every value of \theta.

But we can define an ‘optimum test’ if we have simple hypotheses H_0: \theta = \theta_0 and H_1: \theta = \theta_1 \neq \theta_0. We can use the Type I and II errors (\alpha, \beta) to define the ‘strength’ of a test, and if a test S_0 exists such that \mathbb{E}_{\theta_0}(n|S_0) \leq \mathbb{E}_{\theta_0}(n|S) and \mathbb{E}_{\theta_1}(n|S_0) \leq \mathbb{E}_{\theta_1}(n|S) for every test S of the same strength, then S_0 is an optimum test.


Concluding remarks:

Unlike the treatment of fixed-sample-size NHST, where we control Type I and II errors and use the Neyman-Pearson lemma to get the ‘most powerful level-\alpha test’, in a sequential test we look for tests achieving the greatest efficiency (measured by the average number of samples required) under a Type I and II error control (characterized by the (\alpha,\beta)-strength).

Instead of straightforwardly presenting the framework to us, Wald carefully walks the reader through the development of the theory, motivated by decision-theoretic considerations like: 1) what is the loss of a bad decision? 2) what are some good properties of a sequential test? This chapter offers intuition rather than extensive mathematical derivations, and I would recommend it to anyone interested in statistical inference, and sequential testing in particular. The reader does need a good understanding of notions like power and significance level to appreciate it!


References:

Wald, Abraham. Sequential analysis. Courier Corporation, 1973.