Sunday Reading Notes – Bayesian Optimization

For this week’s Sunday Reading Notes, I am switching topics towards bayesian computations and machine learning. This week’s paper  is ‘Practice Bayesian Optimization of Machine Learning Algorithms‘ by Jasper Snoek, Hugo Larochelle and Ryan Adams, and it appeared on NIPS 2012.

On the high level, Bayesian optimization is about fitting Gaussian Process(GP) regression on data currently observed about some black-box function $f$, and choosing the next point $x$ to get $f(x)$ with result of the GP regression. The premise of such a procedure is that the black-box function $f$ that we want to maximize is very expensive to evaluate. In this case, it make sense that we choose where to evaluate this function smartly based on current information. This is an interesting case for the exploration-exploitation trade-off.

When I first read the paper, I was asking myself: why do we need Bayesian optimization? How does it compare to other optimization methods we learned, for example Gradient Descent? Wouldn’t a grid search on the domain give a much better optimization result? I think to answer these questions, we have to keep remind ourselves that we are optimizing some black-box function, whose gradient is unknown. And what’s more, this function is very expensive to evaluate and we can not afford to perform a grid-search on it. As an example, think about tuning parameters for a neural network. Admittedly Bayesian optimization can be expensive as well, therefore we would choose to use Bayesian optimization over grid-search when all the integration and maximization involved in choosing $x_{n+1}$ is much less expensive then evaluating $f(x_{n+1}).$

We start with putting on GP prior on $f$ and assume each observation is some noisy realization of the true function value: $y_n \sim \mathcal{N}(f(x_n),\nu).$ The posterior distribution $f|\{x_n,y_n\},\theta$ is fully characterized by a predictive mean function $\nu(x; \{x_n,y_n\},\theta)$ and the predictive variance function $\sigma^2(x; \{x_n,y_n\},\theta)$for every $x$ on the domain of $f$: $x\sim \mathcal{N}(\mu(x;\{x_n,y_n\},\theta), \sigma^2(x;\{x_n,y_n\},\theta)$. For more details about Gaussian Process regression, you can read the book Gaussian Processes for Machine Learning by Rasmussen and Williams.

If the current best value is $x_{min} = \arg\min_{x_i}f(x_i)$, then for every $x$ on the domain, the probability of $f(x)$ smaller then $f(x_{min})$ is $\alpha_{PI}(x;\{x_n,y_n\},\theta) = \Phi(\gamma(x))$ where $\gamma(x) = \frac{f(x_{min} - \mu(x; \{x_n,y_n\},\theta)}{\sigma(x;\{x_n,y_n\},\theta)}$. In Bayesian optimization, we call this probability of improvement an acquisition function. There are other acquisition functions such as expectation of improvement (EI) and lower confidence bounds (LCB). We optimize the acquisition function to choose the next point $x_{n+1} \gets argmax \ \alpha_{PI}.$

The most interesting section in this paper is Section 3: practical considerations, where the authors goes through

1. choice of covariance function;
2. treatment of hyper-parameter;
3. modeling cost;
4. parallelization.

The first two issues also appear in the GPML book. Basically in this paper the authors  recommend the Matern 5/2 kernel because it only assumes twice-twice-differentiabiltiy. The squared exponential  kernel is the default choice for GP regression, but its smoothness assumption is unrealistic more most machine learning algorithms.

For choosing hyper-parameters, we can choose hyper-parameters that maximize the marginal likelihood function. For a full Bayesian treatment, we have to marginalize over the hyper-parameters and work with the integrated acquisition function $\hat{\alpha}(x;\{x_n,y_n\}) = \int \alpha(x;\{x_n,y_n\},\theta) p(\theta|\{x_n,y_n\} d\theta.$ To approximate this integral, the authors used a slice-sampler with step-in and step-out procedure. The details of this slice-sampler is described in Slice sampling covariance hyperparameters of latent Gaussian models by Murray and Adams. There are some tricks like operating on a long MCMC chain so that we do not waste too many samples, but this could still be an expensive computation. However, the cost is justified by lower cost comparing to evaluating $f$:

As both opti- mization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving an N -dimensional linear system (and our function evaluations are assumed to be much more expen- sive anyway), the fully-Bayesian treatment is sensible and our empirical evaluations bear this out.

I have played with this algorithm myself on some simple functions like Brainin-Hoo function. I’d like to try how this function works on more complicated functions, like online LDA, Latent Structured SVM, and convolutional neural nets.

Lastly, to get some sense of how expensive the computations are, I want to show Figure 4 in the paper.

If we look at (4b) on the axis, the unit of time is days! In terms of Function evaluations, the Bayesian optimization algorithm dominates a random grid search from a very early stage. In (4a) GP EI MCMC, which uses the least parallelizations is the most efficient in terms of function evaluations, but when we look at (4b) it could take longer time (measured by days!).