An introduction to the lasso in Stata

Why is the lasso interesting?

The least absolute shrinkage and selection operator (lasso) estimates model coefficients and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters. In this post, we provide an introduction to the lasso and discuss using the lasso for prediction. In the next post, we discuss using the lasso for inference about causal parameters.

The lasso is most useful when a few out of many potential covariates affect the outcome and it is important to include only the covariates that have an effect. “Few” and “many” are defined relative to the sample size. In the example discussed below, we observe the most recent health-inspection scores for 600 restaurants, and we have 100 covariates that could potentially affect each one’s score. We have too many potential covariates because we cannot reliably estimate 100 coefficients from 600 observations. We believe that only about 10 of the covariates are important, and we feel that 10 covariates are “a few” relative to 600 observations.

Given that only a few of the many covariates affect the outcome, the problem is now that we don’t know which covariates are important and which are not. The lasso produces estimates of the coefficients and solves this covariate-selection problem.

There are technical terms for our example situation. A model that has more covariates than can be reliably estimated from the available sample size is known as a high-dimensional model. The assumption that the number of nonzero coefficients in the true model is small relative to the sample size is known as a sparsity assumption. More realistically, the approximate sparsity assumption requires that the number of nonzero coefficients in the model that best approximates the real world be small relative to the sample size.

In these technical terms, the lasso is most useful when estimating the coefficients in a high-dimensional, approximately sparse, model.

High-dimensional models are nearly ubiquitous in prediction problems and models that use flexible functional forms. In many cases, the many potential covariates are created from polynomials, splines, or other functions of the original covariates. In other cases, the many potential covariates come from administrative data, social media, or other sources that naturally produce huge numbers of potential covariates.
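
For example, in Stata, factor-variable notation can expand a handful of raw variables into a large set of candidate terms before they are handed to the lasso. The line below is a purely hypothetical sketch; y, x1, x2, x3, and region are made-up names rather than variables from our data.

. lasso linear y c.(x1 x2 x3)##c.(x1 x2 x3) i.region#c.(x1 x2 x3), rseed(12345)

Here c.(x1 x2 x3)##c.(x1 x2 x3) generates the main effects, squared terms, and pairwise interactions of x1, x2, and x3, and i.region#c.(x1 x2 x3) adds region-specific slopes, so a few raw variables become many potential covariates.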

Predicting restaurant inspection scores

We use a series of examples to make our discussion of the lasso more accessible. These examples use simulated data for the following problem. A health inspector in a small U.S. city wants to use social-media reviews to predict the health-inspection scores of restaurants. The inspector plans to use our predictions to schedule surprise inspections of the restaurants with the lowest predicted health scores.

hsafety2.dta has 1 observation for each of 600 restaurants, and the score from the most recent inspection is in score. The percentage of a restaurant’s social-media reviews that contain a word like “dirty” could predict the inspection score. We identified 50 words, 30 word pairs, and 20 phrases whose occurrence percentages in reviews written in the three months prior to an inspection could predict the inspection score. The occurrence percentages of the 50 words are in word1-word50, those of the 30 word pairs are in wpair1-wpair30, and those of the 20 phrases are in phrase1-phrase20.

Researchers widely use the following steps to find the best predictor.

  1. Divide the sample into training and validation subsamples.
  2. Use the training data to estimate the model parameters of each of the competing estimators.
  3. Use the validation data to estimate the out-of-sample mean squared error (MSE) of the predictions produced by each competing estimator.
  4. The best predictor is the estimator that produces the smallest out-of-sample MSE.

The ordinary least-squares (OLS) estimator is frequently included as a benchmark estimator when it is feasible. We begin by splitting the sample and computing the OLS estimates.

In the output below, we read the data into memory and use splitsample with the option split(.75 .25) to generate the variable sample, which is 1 for 75% of the sample and 2 for the remaining 25%. The assignment of each observation to sample 1 or 2 is random, but the rseed option makes the random assignment reproducible.

. use hsafety2

. splitsample , generate(sample) split(.75 .25) rseed(12345)

. label define slabel 1 "Training" 2 "Validation"

. label values sample slabel

. tabulate sample

     sample |      Freq.     Percent        Cum.
------------+-----------------------------------
   Training |        450       75.00       75.00
 Validation |        150       25.00      100.00
------------+-----------------------------------
      Total |        600      100.00

The one-way tabulation of sample produced by tabulate verifies that sample contains the requested 75%–25% division.

Next, we compute the OLS estimates using the data in the training sample and store the results in memory as ols.

. quietly regress score word1-word50 wpair1-wpair30 phrase1-phrase20 
>         if sample==1

. estimates store ols

Now, we use lassogof with option over(sample) to compute the in-sample (Training) and out-of-sample (Validation) estimates of the MSE.

. lassogof ols, over(sample)

Penalized coefficients
-------------------------------------------------------------
Name             sample |         MSE    R-squared        Obs
------------------------+------------------------------------
ols                     |
               Training |    24.43515       0.5430        450
             Validation |    35.53149       0.2997        150
-------------------------------------------------------------

As expected, the estimated MSE is much smaller in the Training subsample than in the Validation subsample. The out-of-sample estimate of the MSE is the more reliable estimate of the prediction error; see, for example, chapters 1, 2, and 3 in Hastie, Tibshirani, and Friedman (2009).

In the next section, we introduce the lasso and compare its estimated out-of-sample MSE to the one produced by OLS.

What’s a lasso?
The lasso is an estimator of the coefficients in a model. What makes the lasso special is that some of the coefficient estimates are exactly zero, while others are not. The lasso selects covariates by excluding the covariates whose estimated coefficients are zero and by including the covariates whose estimates are not zero. There are no standard errors for the lasso estimates: the lasso’s ability to work as a covariate-selection method makes it a nonstandard estimator and prevents the estimation of standard errors. In the next post, we discuss how the lasso can nonetheless be used to answer inferential questions.

Tibshirani (1996) derived the lasso, and Hastie, Tibshirani, and Wainwright (2015) provide a textbook introduction.

The remainder of this section provides some details about the mechanics of how the lasso produces its coefficient estimates. There are different versions of the lasso for linear and nonlinear models. Versions of the lasso for linear models, logistic models, and Poisson models are available in Stata 16. We discuss only the lasso for the linear model, but the points we make generalize to the lasso for nonlinear models.

Like many estimators, the lasso for linear models solves an optimization problem. Specifically, the linear lasso point estimates \(\widehat{\boldsymbol{\beta}}\) are given by

$$
\widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
\left\{
\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2
+\lambda\sum_{j=1}^p\omega_j\vert\beta_j\vert
\right\}
$$

where

  • \(\lambda>0\) is the lasso penalty parameter,
  • \(y\) is the outcome variable,
  • \({\bf x}\) contains the \(p\) potential covariates,
  • \(\boldsymbol{\beta}\) is the vector of coefficients on \({\bf x}\),
  • \(\beta_j\) is the \(j\)th element of \(\boldsymbol{\beta}\),
  • the \(\omega_j\) are parameter-level weights known as penalty loadings, and
  • \(n\) is the sample size.

There are two terms in this optimization problem, the least-squares fit measure
$$\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2$$
and the penalty term
$$\lambda\sum_{j=1}^p\omega_j\vert\beta_j\vert.$$
The parameters \(\lambda\) and the \(\omega_j\) are called “tuning” parameters. They specify the weight applied to the penalty term. When \(\lambda=0\), the linear lasso reduces to the OLS estimator. As \(\lambda\) increases, the magnitude of all the estimated coefficients is shrunk toward zero. This shrinkage occurs because the cost of including each nonzero \(\widehat{\beta}_j\) in the penalty term increases as \(\lambda\) increases.

The penalty term includes the absolute value of each \(\beta_j\). The absolute value function has a kink, sometimes called a check, at zero. The kink in the contribution of each coefficient to the penalty term causes some of the estimated coefficients to be exactly zero at the optimal solution. See section 2.2 of Hastie, Tibshirani, and Wainwright (2015) for more details.
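
To see why the kink produces exact zeros, consider a minimal one-covariate sketch (an illustration we add here, not part of the restaurant example): suppose there is a single covariate standardized so that \((1/n)\sum_{i=1}^n x_i^2=1\), with \(\omega_1=1\). The lasso solution is then the soft-thresholded OLS estimate

$$
\widehat{\beta} = {\rm sign}(\widetilde{\beta})\,\max\left(\vert\widetilde{\beta}\vert-\lambda,\ 0\right),
\qquad \widetilde{\beta}=\frac{1}{n}\sum_{i=1}^n x_i y_i ,
$$

so any coefficient whose unpenalized estimate is smaller in magnitude than \(\lambda\) is set exactly to zero, and larger estimates are shrunk toward zero by \(\lambda\).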

There is a value \(\lambda_{\rm max}\) for which all the estimated coefficients are exactly zero. As \(\lambda\) decreases from \(\lambda_{\rm max}\), the number of nonzero coefficient estimates increases. For \(\lambda\in(0,\lambda_{\rm max})\), some of the estimated coefficients are exactly zero and some of them are not zero. When you use the lasso for covariate selection, covariates with estimated coefficients of zero are excluded, and covariates with estimated coefficients that are not zero are included.

That the number of potential covariates \(p\) can be greater than the sample size \(n\) is a much discussed advantage of the lasso. It is important to remember that the approximate sparsity assumption requires that the number of covariates that belong in the model (\(s\)) must be small relative to \(n\).

Selecting the lasso tuning parameters

The tuning parameters must be selected before using the lasso for prediction or model selection. The most frequent methods used to select the tuning parameters are cross-validation (CV), the adaptive lasso, and plug-in methods. In addition, \(\lambda\) is sometimes set by hand in a sensitivity analysis.

CV finds the \(\lambda\) that minimizes the out-of-sample MSE of the predictions. The mechanics of CV mimic the split-sample process used above to find the best out-of-sample predictor. The details are presented in the appendix.

CV is the default method of selecting the tuning parameters in the lasso command. In the output below, we use lasso to estimate the coefficients in the model for score, using the training sample. We specified the option rseed() to make our CV results reproducible.

. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>         if sample==1, nolog rseed(12345)

Lasso linear model                          No. of obs        =        450
                                            No. of covariates =        100
Selection: Cross-validation                 No. of CV folds   =         10

--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |    first lambda    3.271123         0       0.0022       53.589
      25 |   lambda before    .3507518        22       0.3916     32.53111
    * 26 | selected lambda     .319592        25       0.3917     32.52679
      27 |    lambda after    .2912003        26       0.3914     32.53946
      30 |     last lambda    .2202824        30       0.3794     33.18254
--------------------------------------------------------------------------
* lambda selected by cross-validation.

. estimates store cv

We specified the option nolog to suppress the CV log over the candidate values of \(\lambda\). The output reveals that CV selected a \(\lambda\) for which 25 of the 100 covariates have nonzero coefficients. We used estimates store to store these results under the name cv in memory.

We use cvplot to plot the CV function.

. cvplot, minmax

[Graph omitted: cross-validation function plotted by cvplot]

The CV function appears somewhat flat near the optimal \(\lambda\), which implies that nearby values of \(\lambda\) would produce similar out-of-sample MSEs.
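
If we also want to see how the individual penalized coefficients shrink toward zero as \(\lambda\) increases, the coefpath postestimation command plots each coefficient path over the \(\lambda\) grid for the lasso results currently in memory; typing the command below would produce that graph (not shown here).

. coefpath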

The number of included covariates can vary substantially over the flat part of the CV function. We can investigate the variation in the number of selected covariates using a lasso knot table. In the jargon of the lasso, a knot is a value of \(\lambda\) at which a covariate is added to or removed from the set of covariates with nonzero coefficients. We use lassoknots to display the table of knots.

. lassoknots

-------------------------------------------------------------------------------------
       |              No. of   CV mean |
       |             nonzero     pred. |        Variables (A)dded, (R)emoved,    
    ID |   lambda      coef.     error |             or left (U)nchanged         
-------+-------------------------------+---------------------------------------------
     2 | 2.980526          2   52.2861 | A phrase3   phrase4                     
     3 | 2.715744          3  50.48463 | A phrase5                               
     4 | 2.474485          4  48.55981 | A word3                                 
     6 | 2.054361          5  44.51782 | A phrase6                               
     9 | 1.554049          6  40.23385 | A wpair3                                
    10 | 1.415991          8  39.04494 | A wpair2    phrase2                     
    12 | 1.175581          9    36.983 | A word2                                 
    14 | .9759878         10  35.42697 | A word31                                
    16 | .8102822         11   34.2115 | A word19                                
    17 |  .738299         12  33.75501 | A word4                                 
    21 | .5088809         14  32.74808 | A word14    phrase7                     
    22 | .4636733         17  32.64679 | A word32    wpair19   wpair26           
    23 | .4224818         19  32.56572 | A wpair15   wpair25                     
    24 | .3849497         22  32.53301 | A wpair24   phrase13  phrase14          
  * 26 |  .319592         25  32.52679 | A word25    word30    phrase8           
    27 | .2912003         26  32.53946 | A wpair11                               
    29 | .2417596         27  32.86193 | A wpair17                               
    30 | .2202824         30  33.18254 | A word23    word38    wpair4            
-------------------------------------------------------------------------------------
* lambda selected by cross-validation.

The CV function is minimized at the \(\lambda\) with ID=26, and the lasso includes 25 covariates at this \(\lambda\) value. The flat part of the CV function includes the \(\lambda\) values with ID \(\in\{21,22,23,24,26,27\}\). Only 14 covariates are included by the lasso using the \(\lambda\) at ID=21. We will explore this observation using sensitivity analysis below.

CV tends to include extra covariates whose coefficients are zero in the model that best approximates the process that generated the data. This overselection can affect the prediction performance of the CV-based lasso, and it can affect the performance of inferential methods that use a CV-based lasso for model selection. The adaptive lasso is a multistep version of the CV-based lasso that was designed to exclude some of these extra covariates.

The first step of the adaptive lasso is CV. The second step does CV among the covariates selected in the first step. In this second step, the penalty loadings are \(\omega_j=1/\vert\widehat{\beta}_j\vert\), where \(\widehat{\beta}_j\) are the penalized estimates from the first step. Covariates with smaller-magnitude coefficients are more likely to be excluded in the second step. See Zou (2006) and Bühlmann and Van de Geer (2011) for more about the adaptive lasso and the tendency of the CV-based lasso to overselect. Also see Chetverikov, Liao, and Chernozhukov (2019) for formal results for the CV-based lasso and results that could explain this overselection tendency.
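
Written in the notation of the lasso objective above, the second step solves a problem of the form below, where \(S_1\) is the set of covariates selected in the first step and \(\widehat{\beta}_j^{(1)}\) are the first-step penalized estimates. This display is a sketch implied by the description above, not a statement of the exact implementation.

$$
\widehat{\boldsymbol{\beta}}^{(2)} = \arg\min_{\boldsymbol{\beta}}
\left\{
\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2
+\lambda\sum_{j\in S_1}\frac{\vert\beta_j\vert}{\vert\widehat{\beta}_j^{(1)}\vert}
\right\}
$$

Because a small first-step estimate produces a large penalty loading, covariates with small first-step coefficients are penalized more heavily in the second step and are more likely to be excluded.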

We specify the option selection(adaptive) below to cause lasso to use the adaptive lasso instead of CV to select the tuning parameters. We used estimates store to store the results under the name adaptive.

. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>         if sample==1, nolog rseed(12345) selection(adaptive)

Lasso linear model                         No. of obs         =        450
                                           No. of covariates  =        100
Selection: Adaptive                        No. of lasso steps =          2

Final adaptive step results
--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
      31 |    first lambda    124.1879         0       0.0037     53.66569
      77 |   lambda before    1.719861        12       0.4238     30.81155
    * 78 | selected lambda    1.567073        12       0.4239      30.8054
      79 |    lambda after    1.427859        14       0.4237     30.81533
     128 |     last lambda    .0149585        22       0.4102     31.53511
--------------------------------------------------------------------------
* lambda selected by cross-validation in final adaptive step.

. estimates store adaptive

We see that the adaptive lasso included 12 instead of 25 covariates.

Plug-in methods tend to be even more parsimonious than the adaptive lasso. They find the value of \(\lambda\) that is just large enough to dominate the estimation noise. The plug-in method chooses the \(\omega_j\) to normalize the scores of the (unpenalized) fit measure for each parameter. Given the normalized scores, it chooses a value for \(\lambda\) that is greater than the largest normalized score with a probability that is close to 1.
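
To give a sense of what such a rule looks like, plug-in formulas in the spirit of Belloni et al. (2012), rescaled to match the objective function written above, take a form like

$$
\lambda = \frac{c}{\sqrt{n}}\,\Phi^{-1}\left(1-\frac{\gamma}{2p}\right)
$$

where \(\Phi^{-1}(\cdot)\) is the standard normal quantile function, \(c\) is a constant slightly greater than 1, and \(\gamma\) is a small probability. This display is only a sketch of the idea; see the Stata 16 LASSO manual for the exact formulas and penalty loadings that selection(plugin) uses.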

The plug-in-based lasso is much faster than the CV-based lasso and the adaptive lasso. In practice, the plug-in-based lasso tends to include the important covariates and it is really good at not including covariates that do not belong in the model that best approximates the data. The plug-in-based lasso has a risk of missing some covariates with large coefficients and finding only some covariates with small coefficients. See Belloni, Chernozhukov, and Wei (2016) and Belloni, et al. (2012) for details and formal results.

We specify the option selection(plugin) below to cause lasso to use the plug-in method to select the tuning parameters. We used estimates store to store the results under the name plugin.

. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20  
>         if sample==1, selection(plugin)

Computing plugin lambda ...
Iteration 1:     lambda = .1954567   no. of nonzero coef. =       8
Iteration 2:     lambda = .1954567   no. of nonzero coef. =       9
Iteration 3:     lambda = .1954567   no. of nonzero coef. =       9

Lasso linear model                          No. of obs        =        450
                                            No. of covariates =        100
Selection: Plugin heteroskedastic

--------------------------------------------------------------------------
         |                                No. of
         |                               nonzero    In-sample
      ID |     Description      lambda     coef.    R-squared          BIC
---------+----------------------------------------------------------------
     * 1 | selected lambda    .1954567         9       0.3524     2933.203
--------------------------------------------------------------------------
* lambda selected by plugin formula assuming heteroskedastic.

. estimates store plugin

The plug-in-based lasso included 9 of the 100 covariates, which is far fewer than included by the CV-based lasso or the adaptive lasso.

Comparing the predictors

We now have four different predictors for score: OLS, the CV-based lasso, the adaptive lasso, and the plug-in-based lasso. The three lasso methods could predict score using the penalized coefficients estimated by lasso, or they could predict score using the unpenalized coefficients estimated by OLS on only the covariates selected by the lasso. The predictions that use the penalized lasso estimates are known as the lasso predictions, and the predictions that use the unpenalized coefficients are known as the postselection predictions, or the postlasso predictions.
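
For a concrete sense of the two kinds of predictions, the following do-file style sketch computes both by hand for the CV-based lasso in the validation sample; it assumes the penalized and postselection options of predict after lasso. The lassogof command used below automates this comparison.

// restore the CV-based lasso results and form both types of predictions
estimates restore cv
predict double yhat_lasso if sample==2, penalized       // lasso predictions
predict double yhat_post  if sample==2, postselection   // postselection predictions

// squared errors of the postselection predictions; their mean estimates the MSE
generate double sqerr_post = (score - yhat_post)^2
summarize sqerr_post if sample==2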

For linear models, Belloni and Chernozhukov (2013) present conditions in which the postselection predictions perform at least as well as the lasso predictions. Heuristically, one expects the lasso predictions from a CV-based lasso to perform better than the postselection predictions because CV chooses \(\lambda\) to make the best lasso predictions. Analogously, one expects the postselection predictions for the plug-in-based lasso to perform better than the lasso predictions because the plug-in tends to select a set of covariates close to those that best approximate the process that generated the data.

In practice, we estimate the out-of-sample MSE of the predictions for all estimators using both the lasso predictions and the postselection predictions. We select the one that produces the lowest out-of-sample MSE of the predictions.

In the output below, we use lassogof to compare the out-of-sample prediction performance of OLS and the lasso predictions from the three lasso methods.

. lassogof ols cv adaptive plugin if sample==2

Penalized coefficients
-------------------------------------------------
       Name |         MSE    R-squared        Obs
------------+------------------------------------
        ols |    35.53149       0.2997        150
         cv |    27.83779       0.4513        150
   adaptive |    27.83465       0.4514        150
     plugin |    32.29911       0.3634        150
-------------------------------------------------

For these data, the lasso predictions using the adaptive lasso performed a little bit better than the lasso predictions from the CV-based lasso.

In the output below, we compare the out-of-sample prediction performance of OLS and the lasso predictions from the three lasso methods using the postselection coefficient estimates.

. lassogof ols cv adaptive plugin if sample==2, postselection

Postselection coefficients
-------------------------------------------------
       Name |         MSE    R-squared        Obs
------------+------------------------------------
        ols |    35.53149       0.2997        150
         cv |    27.87639       0.4506        150
   adaptive |    27.79562       0.4522        150
     plugin |    26.50811       0.4775        150
-------------------------------------------------

It is not surprising that the plug-in-based lasso produces the smallest out-of-sample MSE. The plug-in method tends to select covariates whose postselection estimates do a good job of approximating the process that generated the data.

The real competition tends to be between the best of the penalized lasso predictions and the postselection predictions from the plug-in-based lasso. In this case, the postselection estimates from the plug-in-based lasso produced the better out-of-sample predictions, and we would use these results to predict score.

The elastic net and ridge regression

The elastic net extends the lasso by using a more general penalty term. The elastic net was originally motivated as a method that would produce better predictions and model selection when the covariates were highly correlated. See Zou and Hastie (2005) for details.

The linear elastic net solves

$$
\widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
\left\{
\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2
+\lambda\left[
\alpha\sum_{j=1}^p\vert\beta_j\vert
+ \frac{(1-\alpha)}{2}
\sum_{j=1}^p\beta_j^2
\right]
\right\}
$$
where \(\alpha\) is the elastic-net penalty parameter. Setting \(\alpha=0\) produces ridge regression, and setting \(\alpha=1\) produces the lasso.

The elasticnet command selects \(\alpha\) and \(\lambda\) by CV. The option alpha() specifies the candidate values for \(\alpha\).

. elasticnet linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>         if sample==1, alpha(.25 .5 .75) nolog rseed(12345)

Elastic net linear model                         No. of obs        =        450
                                                 No. of covariates =        100
Selection: Cross-validation                      No. of CV folds   =         10

-------------------------------------------------------------------------------
               |                               No. of      Out-of-      CV mean
               |                              nonzero       sample   prediction
alpha       ID |     Description      lambda    coef.    R-squared        error
---------------+---------------------------------------------------------------
0.750          |
             1 |    first lambda    13.08449        0       0.0062     53.79915
            39 |   lambda before    .4261227       24       0.3918     32.52101
          * 40 | selected lambda    .3882671       25       0.3922     32.49847
            41 |    lambda after    .3537745       27       0.3917     32.52821
            44 |     last lambda    .2676175       34       0.3788     33.21631
---------------+---------------------------------------------------------------
0.500          |
            45 |    first lambda    13.08449        0       0.0062     53.79915
            84 |     last lambda    .3882671       34       0.3823     33.02645
---------------+---------------------------------------------------------------
0.250          |
            85 |    first lambda    13.08449        0       0.0058     53.77755
           120 |     last lambda    .5633091       54       0.3759       33.373
-------------------------------------------------------------------------------
* alpha and lambda selected by cross-validation.

. estimates store enet

We see that the elastic net selected 25 of the 100 covariates.

For comparison, we also use elasticnet to perform ridge regression, with the penalty parameter selected by CV.

. elasticnet linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>         if sample==1, alpha(0) nolog rseed(12345)

Elastic net linear model                         No. of obs        =        450
                                                 No. of covariates =        100
Selection: Cross-validation                      No. of CV folds   =         10

-------------------------------------------------------------------------------
               |                               No. of      Out-of-      CV mean
               |                              nonzero       sample   prediction
alpha       ID |     Description      lambda    coef.    R-squared        error
---------------+---------------------------------------------------------------
0.000          |
             1 |    first lambda    3271.123      100       0.0062     53.79914
            90 |   lambda before     .829349      100       0.3617     34.12734
          * 91 | selected lambda    .7556719      100       0.3621      34.1095
            92 |    lambda after    .6885401      100       0.3620     34.11367
           100 |     last lambda    .3271123      100       0.3480     34.86129
-------------------------------------------------------------------------------
* alpha and lambda selected by cross-validation.

. estimates store ridge

Ridge regression does not perform model selection and thus includes all the covariates.

We now compare the out-of-sample predictive ability of the CV-based lasso, the elastic net, ridge regression, and the plug-in-based lasso using the lasso predictions. (For elastic net and ridge regression, the “lasso predictions” are made using the coefficient estimates produced by the penalized estimator.)

. lassogof cv adaptive enet ridge plugin if sample==2

Penalized coefficients
-------------------------------------------------
       Name |         MSE    R-squared        Obs
------------+------------------------------------
         cv |    27.83779       0.4513        150
   adaptive |    27.83465       0.4514        150
       enet |    27.77314       0.4526        150
      ridge |    29.47745       0.4190        150
     plugin |    32.29911       0.3634        150
-------------------------------------------------

In this case, the penalized elastic-net coefficient estimates predict best out of sample among the lasso estimates. The postselection predictions produced by the plug-in-based lasso perform best overall. This can be seen by comparing the above output with the output below.

. lassogof cv adaptive enet plugin if sample==2, postselection

Postselection coefficients
-------------------------------------------------
       Name |         MSE    R-squared        Obs
------------+------------------------------------
         cv |    27.87639       0.4506        150
   adaptive |    27.79562       0.4522        150
       enet |    27.87639       0.4506        150
     plugin |    26.50811       0.4775        150
-------------------------------------------------

So we would use these postselection coefficient estimates from the plug-in-based lasso to predict score.

Sensitivity analysis

Sensitivity analysis is sometimes performed to see whether a small change in the tuning parameters leads to a large change in prediction performance. When looking at the lassoknots output produced by the CV-based lasso, we noted that a small increase in the CV function could be accompanied by a large reduction in the number of selected covariates. Restoring the cv results and repeating the lassoknots output, we see the following.

. estimates restore cv
(results cv are active now)

. lassoknots

-------------------------------------------------------------------------------------
       |              No. of   CV mean |
       |             nonzero     pred. |        Variables (A)dded, (R)emoved,    
    ID |   lambda      coef.     error |             or left (U)nchanged         
-------+-------------------------------+---------------------------------------------
     2 | 2.980526          2   52.2861 | A phrase3   phrase4                     
     3 | 2.715744          3  50.48463 | A phrase5                               
     4 | 2.474485          4  48.55981 | A word3                                 
     6 | 2.054361          5  44.51782 | A phrase6                               
     9 | 1.554049          6  40.23385 | A wpair3                                
    10 | 1.415991          8  39.04494 | A wpair2    phrase2                     
    12 | 1.175581          9    36.983 | A word2                                 
    14 | .9759878         10  35.42697 | A word31                                
    16 | .8102822         11   34.2115 | A word19                                
    17 |  .738299         12  33.75501 | A word4                                 
    21 | .5088809         14  32.74808 | A word14    phrase7                     
    22 | .4636733         17  32.64679 | A word32    wpair19   wpair26           
    23 | .4224818         19  32.56572 | A wpair15   wpair25                     
    24 | .3849497         22  32.53301 | A wpair24   phrase13  phrase14          
  * 26 |  .319592         25  32.52679 | A word25    word30    phrase8           
    27 | .2912003         26  32.53946 | A wpair11                               
    29 | .2417596         27  32.86193 | A wpair17                               
    30 | .2202824         30  33.18254 | A word23    word38    wpair4            
-------------------------------------------------------------------------------------
* lambda selected by cross-validation.

lasso selected the \(\lambda\) with ID=26 and 25 covariates. We now use lassoselect to specify that the \(\lambda\) with ID=21 be the selected \(\lambda\) and store the results under the name hand.

. lassoselect id = 21
ID = 21  lambda = .5088809 selected

. estimates store hand

We now compute the out-of-sample MSE produced by the postselection estimates of the lasso whose \(\lambda\) has ID=21. The results are not wildly different, and we would stick with those produced by the postselection estimates from the plug-in-based lasso.

. lassogof hand plugin if sample==2, postselection

Postselection coefficients
-------------------------------------------------
       Name |         MSE    R-squared        Obs
------------+------------------------------------
       hand |    27.71925       0.4537        150
     plugin |    26.50811       0.4775        150
-------------------------------------------------

Conclusion

This post has presented an introduction to the lasso and to the elastic net, and it has illustrated how to use them for prediction. There is much more information available in the Stata 16 LASSO manual. The next post will discuss using the lasso for inference about causal parameters.

References

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80: 2369–2429.

Belloni, A., and V. Chernozhukov. 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19: 521–547.

Belloni, A., V. Chernozhukov, and Y. Wei. 2016. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics 34: 606–619.

Bühlmann, P., and S. Van de Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Berlin: Springer.

Chetverikov, D., Z. Liao, and V. Chernozhukov. 2019. On cross-validated Lasso. arXiv Working Paper No. arXiv:1605.02214. http://arxiv.org/abs/1605.02214.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

Hastie, T., R. Tibshirani, and M. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, FL: CRC Press.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58: 267–288.

Zou, H. 2006. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101: 1418–1429.

Zou, H., and T. Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67: 301–320.

Appendix: K-fold cross-validation

Cross-validation finds the value for \(\lambda\) in a grid of candidate values \(\{\lambda_1, \lambda_2, \ldots, \lambda_Q\}\) that minimizes the MSE of the out-of-sample predictions. Cross-validation sets the \(\omega_j\) to 1 or to user-specified values.

After you specify the grid, the sample is partitioned into \(K\) nonoverlapping subsets. For each grid value \(\lambda_q\), the out-of-sample squared errors are computed using the following steps.

  • For each \(k\in\{1,2,\ldots, K\}\),
    • using the data not in partition \(k\), estimate the penalized coefficients \(\widehat{\boldsymbol{\beta}}\) with \(\lambda=\lambda_q\); and
    • using the data in partition \(k\), compute the squared errors of the out-of-sample predictions.

The mean of these out-of-sample squared errors estimates the out-of-sample MSE of the predictions. The cross-validation function traces the values of these out-of-sample MSEs over the grid of candidate values for \(\lambda\). The \(\lambda_q\) that produces the smallest estimated out-of-sample MSE minimizes the cross-validation function, and it is selected.
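
To make these steps concrete, the following do-file style sketch computes 5-fold cross-validation by hand for a single candidate model. It uses OLS on all 100 covariates as a stand-in for the penalized fit at one value of \(\lambda\), and it assumes splitsample's nsplit() option for creating the folds; lasso with selection(cv) automates this calculation over the whole grid of candidate \(\lambda\) values.

// split the training sample into 5 folds of roughly equal size
splitsample if sample==1, generate(fold) nsplit(5) rseed(12345)
generate double sqerr = .

forvalues k = 1/5 {
    // fit on the four folds that exclude fold `k'
    quietly regress score word1-word50 wpair1-wpair30 phrase1-phrase20 ///
        if sample==1 & fold != `k'
    // record squared prediction errors in the held-out fold
    quietly predict double yhat if sample==1 & fold == `k'
    quietly replace sqerr = (score - yhat)^2 if sample==1 & fold == `k'
    drop yhat
}

// the mean of the held-out squared errors estimates the out-of-sample MSE
summarize sqerr if sample==1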