SRN – The Bayesian Lasso by Park and Casella

‘The Bayesian Lasso’ paper by Park and Casella has been on my desk for a while. Thanks to the Sunday Reading Notes (SRN) series, I finally sit down to read and think about this paper.

Lasso (Least absolute shrinkage and selection operator) is a regression method where comparing to ordinary least squares some estimators shrinks to zero while others are set to zero. Lasso is able to do both variable selection (done with step-wise regression before Lasso) and shrinkage to avoid overfitting (for example with ridge regression) at the same time. The objective function for Lasso is \min_{\beta} (y-X\beta)^T(y-X\beta) + \lambda \sum_{j=1}^p |\beta_j|. In the 1996 paper, Tibshirani points out that Lasso estimates can be interpreted as posterior mode estimates (MAP) when we put i.i.d. Lapalace priors (\pi(\beta) \propto \prod_{j=1}^p \exp(-\lambda |\beta_j|).

When I took the Bayesian Data Analysis class (STAT 220 at Harvard), our textbook BDA3 contains an exercise on Lasso regularization and it claims that ‘a fully Bayesian lasso is the same model as lasso but assigning a hyperprior to \lambda. ‘ But ‘The Bayesian Lasso’ paper sets up the hierarchy with a different route:

graphicalmodel

The full conditional distributions on \beta, \sigma^2 and \vec{\tau^2} have closed-form solutions and are provided in Section 2 of the paper. In Section 3, the authors discuss ways of choosing the Lasso parameter: 1) Empirical Bayes by marginal maximum likelihood: iteratively update \lambda^{(k)} = \sqrt{\frac{2p}{\sum_{j=1}^p \mathbb{E}_{\lambda^{(k-1)}} [\tau_j^2|y] }}; 2) hyper-priors on \lambda^2.

The examples in this paper come from the diabetes dataset from the Tibshirani 1996 paper and can be found in the ‘lars‘ R package. On this dataset the results from a Bayesian lasso are very similar to a regular Lasso, as seen in Figure 1 & 2 from the paper or the plot below where I borrowed code from the Rdocumentation of the ‘monomvn‘ R package.

BayesianLassoVisual

In the plot above, we can see the shrinkage effect and the variable selection happening. Compared to ordinary least squares (green), the both the lasso estimates and samples from Bayesian lasso shrink towards zero (except for b.3) and the lasso estimate fixes some parameters (b.1,2,5,6,8,10) at zero. Comparing to Lasso where we do not have uncertainty estimates unless with bootstrap, the Bayesian lasso is providing these uncertainty estimates.

References:

Hastie, T., & Efron, B. (2013). lars: Least angle regression, lasso and forward stagewise, 2013.

Gramacy, R. B., & Gramacy, M. R. B. (2013). Package ‘monomvn’.

Park, T., & Casella, G. (2008). The bayesian lasso. Journal of the American Statistical Association103(482), 681-686.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s