
Prediction intervals with gradient boosting machine

Introduction
Machine learning methods, such as ensemble decision trees, are widely used to predict outcomes based on data. However, these methods often focus on providing point predictions, which limits their ability to quantify prediction uncertainty. In many applications, such as healthcare and finance, the goal is not only to predict accurately but also to assess the reliability of those predictions. Prediction intervals, which provide lower and upper bounds such that the true response lies within them with high probability, are a reliable tool for quantifying prediction uncertainty. An ideal prediction interval should meet several criteria: it should offer valid coverage (defined below) without relying on strong distributional assumptions, be informative by being as narrow as possible for each observation, and be adaptive, providing wider intervals for observations that are “difficult” to predict and narrower intervals for “easy” ones.

You may wonder whether it’s possible to construct a statistically valid prediction interval using any machine learning method, without any distributional assumption such as Gaussianity, while the above criteria are satisfied. Wait and see.

In this post, I demonstrate how to use Stata’s h2oml suite of commands to construct prediction intervals by using the conformalized quantile regression (CQR) approach introduced in Romano, Patterson, and Candes (2019). The structure of the post is as follows: First, I provide a brief introduction to conformal prediction, with a focus on CQR, and then I show how to construct prediction intervals in Stata by using gradient boosting quantile regression.

Conformal prediction
Conformal prediction (Papadopoulos et al. 2002; Vovk, Gammerman, and Shafer 2005; Lei et al. 2018; Angelopoulos and Bates 2023), also known as conformal inference, is a general methodology designed to complement any machine learning prediction by providing prediction intervals with guaranteed distribution-free statistical coverage. At a conceptual level, conformal prediction starts with a machine learning model (for example, a gradient boosting machine) pretrained on exchangeable or independent and identically distributed data. It then uses held-out validation data from the same data-generating distribution, referred to as calibration data, to evaluate a score function \(S(\hat y, y)\), which assigns larger scores the greater the discrepancy between the predicted value \(\hat y\) and the true response \(y\). These scores are subsequently used to construct prediction intervals for new, unseen observations \({\bf X}_{\text{new}}\), where \({\bf X}_{\text{new}}\) is a random vector of predictors.
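
As a concrete illustration, consider the simplest split-conformal score, the absolute residual \(S(\hat y, y) = |y - \hat y|\), for a generic point predictor \(\hat f\) with \(\hat y = \hat f({\bf x})\); this is a textbook example (Angelopoulos and Bates 2023), not the method used later in this post. The resulting prediction interval is

\[
\mathcal{C}({\bf X}_{\text{new}}) = \Big[\hat f({\bf X}_{\text{new}}) - \hat q, \ \hat f({\bf X}_{\text{new}}) + \hat q\Big]
\]
where \(\hat q\) is a suitably inflated \((1-\alpha)\) empirical quantile of the calibration scores. Because the width \(2\hat q\) is the same for every observation, this score is valid but not adaptive, which is one motivation for the quantile-based score used by CQR below.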

It can be shown that conformal prediction \(\mathcal{C}({\bf X}_{\text{new}})\) provides valid prediction interval coverage (Lei et al. 2018; Angelopoulos and Bates 2023) in the sense that

\[
P\{Y_{\text{new}} \in \mathcal{C}({\bf X}_{\text{new}})\} \geq 1 - \alpha \tag{1}\label{eq1}
\]
where \(\alpha \in (0,1)\) is a user-defined miscoverage or error rate. This property is called marginal coverage because the probability is averaged over the randomness of the calibration data and the new, unseen testing data.

Although the conformal prediction approach guarantees valid coverage \eqref{eq1} with minimal distributional assumptions for any machine learning method, our focus here is on CQR (Romano, Patterson, and Candes 2019), one of the most widely used and recommended approaches for constructing prediction intervals (Romano, Patterson, and Candes 2019; Angelopoulos and Bates 2023).

CQR
The exposition in this section closely follows Romano, Patterson, and Candes (2019) and Angelopoulos and Bates (2023). Consider a quantile regression that estimates a conditional quantile function \(q_{\alpha}(\cdot)\) of \(Y_{\text{new}}\) given \({\bf X}_{\text{new}} = {\bf x}\) for each possible realization of \({\bf x}\). We can use any quantile regression estimation method, such as a gradient boosting machine with the quantile (“pinball”) loss, to obtain \(\widehat q_{\alpha}(\cdot)\). By definition, \(Y_{\text{new}}|{\bf X}_{\text{new}} = {\bf x}\) falls below \(q_{\alpha/2}({\bf x})\) with probability \(\alpha/2\) and above \(q_{1 - \alpha/2}({\bf x})\) with probability \(\alpha/2\), so the estimated prediction interval \([\widehat q_{\alpha/2}(\cdot), \widehat q_{1 - \alpha/2}(\cdot)]\) should have approximately \(100(1-\alpha)\%\) coverage. Unfortunately, because the estimated quantiles can be inaccurate, such coverage is not guaranteed. Thus, we need to conformalize them to obtain valid coverage \eqref{eq1}. The CQR steps can be summarized as follows:

  • Step 1. Split the data \(\mathcal{D}\) into a training set \(\mathcal{D}_1\) and a calibration set \(\mathcal{D}_2\), and let \(\mathcal{D}_3\) be the new, unseen testing data.
  • Step 2. Use \(\mathcal{D}_1\) to train any quantile regression estimation method \(f\) to estimate two conditional quantile functions \(\hat q_{\alpha_1}(\cdot)\) and \(\hat q_{\alpha_2}(\cdot)\), for \(\alpha_1 = \alpha/2\) and \(\alpha_2 = 1 - \alpha/2\), respectively. For example, when the miscoverage rate \(\alpha = 0.1\), we obtain \(\hat q_{0.05}(\cdot)\) and \(\hat q_{0.95}(\cdot)\).
  • Step 3. Use the calibration data \(\mathcal{D}_2\) to compute conformity scores \(S_i\), for each \(i \in \mathcal{D}_2\), that quantify the error made by the interval \([\hat q_{\alpha_1}({\bf x}_i), \hat q_{\alpha_2}({\bf x}_i)]\):
    \[
    S_i = \max \{\hat q_{\alpha_1}({\bf x}_i) - Y_i, Y_i - \hat q_{\alpha_2}({\bf x}_i)\}
    \]
  • Step 4. Given a new, unseen observation \({\bf X}_{\text{new}} \in \mathcal{D}_3\), construct the prediction interval for \(Y_{\text{new}}\),
    \[
    \mathcal{C}({\bf X}_{\text{new}}) = \Big[\hat q_{\alpha_1}({\bf X}_{\text{new}}) - q_{1 - \alpha}(S_i, \mathcal{D}_2), \hat q_{\alpha_2}({\bf X}_{\text{new}}) + q_{1 - \alpha}(S_i, \mathcal{D}_2)\Big]
    \]
    where
    \[ \label{eq:empquantile}
    q_{1 - \alpha}(S_i, \mathcal{D}_2) = \text{the } \frac{\lceil (|\mathcal{D}_2|+1)(1 - \alpha) \rceil}{|\mathcal{D}_2|} \text{ empirical quantile of } \{S_i : i \in \mathcal{D}_2\} \tag{2}
    \]
    Here, \(|\mathcal{D}_2|\) is the number of observations in the calibration data, and \(\lceil \cdot \rceil\) is the ceiling function. (A small numerical example follows this list.)
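
As a small numerical illustration of the empirical quantile in \eqref{eq:empquantile}, consider hypothetical numbers (the actual calibration size in the application below differs): if \(|\mathcal{D}_2| = 500\) and \(\alpha = 0.1\), then

\[
\frac{\lceil (500+1)(1 - 0.1) \rceil}{500} = \frac{\lceil 450.9 \rceil}{500} = \frac{451}{500} = 0.902
\]
so \(q_{1-\alpha}(S_i, \mathcal{D}_2)\) is the 90.2nd empirical percentile of the 500 calibration scores, slightly above the plain 90th percentile; this small inflation from the \(+1\) term is what yields the finite-sample guarantee.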

The intuition behind the conformity score, computed in step 3, is the following:

  • If \(Y_i < \hat q_{\alpha_1}({\bf x}_i)\) or \(Y_i > \hat q_{\alpha_2}({\bf x}_i)\), the score, given by \(S_i = |Y_i - \hat q_{\alpha_1}({\bf x}_i)|\) or \(S_i = |Y_i - \hat q_{\alpha_2}({\bf x}_i)|\), respectively, represents the magnitude of the error incurred by this mistake.
  • On the other hand, if \(\hat q_{\alpha_1}({\bf x}_i) \leq Y_i \leq \hat q_{\alpha_2}({\bf x}_i)\), the computed score is always nonpositive.

This way, the conformity score accounts for both undercoverage and overcoverage.
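
For example, with hypothetical numbers: if the estimated interval for a calibration observation is \([\hat q_{\alpha_1}({\bf x}_i), \hat q_{\alpha_2}({\bf x}_i)] = [11.5, 12.0]\) and the observed response is \(Y_i = 12.2\), then \(S_i = \max\{11.5 - 12.2,\ 12.2 - 12.0\} = 0.2\), the amount by which the interval missed; if instead \(Y_i = 11.8\), then \(S_i = \max\{-0.3, -0.2\} = -0.2\), a negative score indicating that the interval covered the response with \(0.2\) to spare.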

Romano, Patterson, and Candes (2019) showed that under the exchangeability assumption, steps 1–4 guarantee the valid marginal coverage \eqref{eq1}.

Implementation in Stata
In this section, we use the h2oml suite of commands in Stata to construct prediction intervals by using conformalized gradient boosting quantile regression. H2O is a scalable, distributed platform for machine learning and predictive analytics. For details, see the Stata 19 Machine Learning in Stata Using H2O: Ensemble Decision Trees Reference Manual.

We consider the Ames housing dataset (De Cock 2011), ameshouses.dta, also used in a Kaggle competition, which describes residential houses sold in Ames, Iowa, between 2006 and 2010. It contains about 80 housing (and related) characteristics, such as home size, amenities, and location. This dataset is often used for building predictive models for home sale price, saleprice. Before putting the dataset into an H2O frame, we perform some data manipulation in Stata. Because saleprice is right-skewed (for example, type histogram saleprice), we use its log. We also generate a variable, houseage, that contains the age of the house at the time of the sales transaction.

. webuse ameshouses

. generate logsaleprice = log(saleprice)

. generate houseage = yrsold - yearbuilt

. drop saleprice yrsold yearbuilt

Next, we initialize a cluster and put the data into an H2O frame. Then, to perform step 1, let’s split the data into training (50%), calibration (40%), and testing (10%) frames, where the testing frame serves as a proxy for new, unseen data.

. h2o init
(output omitted)

. _h2oframe put, into(house) 

Progress (%): 0 100

. _h2oframe split house, into(train calib test) split(0.5 0.4 0.1)
> rseed(1)

. _h2oframe change train

Our goal is to construct a prediction interval with 90% coverage. We define three local macros in Stata to store the miscoverage rate \(\alpha = 1 - 0.9 = 0.1\) and the lower and upper quantile levels, \(0.05\) and \(0.95\), respectively. Let’s also create a global macro, predictors, that contains the names of our predictors.

. local alpha = 0.1

. local lower = 0.05

. local upper = 0.95

. global predictors overallqual grlivarea exterqual houseage garagecars 
> totalbsmtsf stflrsf garagearea kitchenqual bsmtqual

We perform step 2 by training gradient boosting quantile regression by using the h2oml gbregress command with the loss(quantile) option. For illustration, I tune only the number of trees (ntrees()) and maximum tree depth (maxdepth()) hyperparameters, and I store the estimation results by using the h2omlest store command.
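
For context, here is the quantile (“pinball”) loss mentioned earlier, written for a generic quantile level \(\tau\) and residual \(u = y - \hat q_{\tau}({\bf x})\):

\[
\rho_{\tau}(u) =
\begin{cases}
\tau\, u, & u \geq 0 \\
(\tau - 1)\, u, & u < 0
\end{cases}
\]
Minimizing this loss targets the \(\tau\)th conditional quantile: for \(\tau = 0.95\), a response above the fitted quantile costs 0.95 per unit versus 0.05 per unit for a response below it, which pulls the fit toward the upper tail; the reverse holds for \(\tau = 0.05\).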

. h2oml gbregress logsaleprice $predictors, h2orseed(19) cv(3, modulo) 
> loss(quantile, alpha(`lower')) ntrees(20(20)80) maxdepth(6(2)12) 
> tune(grid(cartesian))

Progress (%): 0 100

Gradient boosting regression using H2O

Response: logsaleprice
Loss:     Quantile .05
Frame:                                 Number of observations:
  Training: train                                 Training =    737
                                          Cross-validation =    737
Cross-validation: Modulo               Number of folds     =      3

Tuning information for hyperparameters

Method: Cartesian
Metric: Deviance

-------------------------------------------------------------------
                 |                      Grid values
 Hyperparameters |         Minimum          Maximum        Selected
-----------------+-------------------------------------------------
 Number of trees |              20               80              40
 Max. tree depth |               6               12               8
-------------------------------------------------------------------

Model parameters

Number of trees      =  40             Learning rate       =     .1
              actual =  40             Learning rate decay =      1
Tree depth:                            Pred. sampling rate =      1
           Input max =   8             Sampling rate       =      1
                 min =   6             No. of bins cat.    =  1,024
                 avg = 7.5             No. of bins root    =  1,024
                 max =   8             No. of bins cont.   =     20
Min. obs. leaf split =  10             Min. split thresh.  = .00001

Metric summary
-----------------------------------
           |                 Cross-
    Metric |   Training  validation
-----------+-----------------------
  Deviance |   .0138451    .0259728
       MSE |   .1168036    .1325075
      RMSE |   .3417654    .3640158
     RMSLE |   .0259833    .0278047
       MAE |   .2636412    .2926809
 R-squared |   .3117896    .2192615
-----------------------------------

. h2omlest store q_lower

The best model selected by the grid search has 40 trees and a maximum tree depth of 8. I use this model to obtain predicted lower quantiles on the calibration frame by using the h2omlpredict command with the frame() option. We will use those predicted values to compute conformity scores in step 3.

. h2omlpredict q_lower, frame(calib)

Progress (%): 0 100

For simplicity, I use the above hyperparameters to run gradient boosting quantile regression for the upper quantile; in practice, we would tune the hyperparameters for this model as well (a sketch of such a call follows). As before, I store the model and predict the upper quantiles on the calibration frame.
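
If we did want to tune the upper-quantile model, the call would mirror the earlier grid search, with alpha(`upper') in place of alpha(`lower'); here is a sketch only, not run in this session:

. h2oml gbregress logsaleprice $predictors, h2orseed(19) cv(3, modulo) 
> loss(quantile, alpha(`upper')) ntrees(20(20)80) maxdepth(6(2)12) 
> tune(grid(cartesian))

Instead, I rerun the model below directly with the previously selected values ntrees(40) and maxdepth(8).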

. h2oml gbregress logsaleprice $predictors, h2orseed(19) cv(3, modulo) 
> loss(quantile, alpha(`upper')) ntrees(40) maxdepth(8)

Progress (%): 0 1.2 100

Gradient boosting regression using H2O

Response: logsaleprice
Loss:     Quantile .95
Frame:                                 Number of observations:
  Training: train                                 Training =    737
                                          Cross-validation =    737
Cross-validation: Modulo               Number of folds     =      3

Model parameters

Number of trees      =  40             Learning rate       =     .1
              actual =  40             Learning rate decay =      1
Tree depth:                            Pred. sampling rate =      1
           Input max =   8             Sampling rate       =      1
                 min =   5             No. of bins cat.    =  1,024
                 avg = 7.2             No. of bins root    =  1,024
                 max =   8             No. of bins cont.   =     20
Min. obs. leaf split =  10             Min. split thresh.  = .00001

Metric summary
-----------------------------------
           |                 Cross-
    Metric |   Training  validation
-----------+-----------------------
  Deviance |   .0132103    .0218716
       MSE |   .1190689    .1365112
      RMSE |   .3450637    .3694742
     RMSLE |   .0268811    .0287084
       MAE |   .2567844    .2841911
 R-squared |   .2984421    .1956718
-----------------------------------

. h2omlest store q_upper

. h2omlpredict q_upper, frame(calib)

Progress (%): 0 100

To compute conformity scores as in step 3, let’s use the _h2oframe get command to load the estimated quantiles and the logarithm of the sales price from the calibration frame calib into Stata. Because the data in Stata’s memory have been modified, I also use the clear option.

. _h2oframe get logsaleprice q_lower q_upper using calib, clear

Then, we use those variables to generate a new variable, conf_scores, that contains the computed conformity scores.

. generate double conf_scores = 
> max(q_lower - logsaleprice, logsaleprice - q_upper)

To construct the conformalized prediction intervals in step 4, we need to compute the empirical quantile of the conformity scores \eqref{eq:empquantile}, which can be done in Stata by using the _pctile command. Figure 1a shows the distribution of the conformity scores, and the red vertical line indicates the empirical quantile \eqref{eq:empquantile}, which is equal to about 0.031.

. local emp_quantile = ceil((1 - 0.1) * (_N + 1))/ _N * 100

. _pctile conf_scores, percentiles(`emp_quantile')

. local q = r(r1)

. di `q'
.03104496

Next, I restore both models that estimated lower and upper quantiles and obtain their predictions on the testing frame test.

. h2omlest restore q_lower
(results q_lower are active now)

. h2omlpredict q_lower, frame(test)

Progress (%): 0 100

. h2omlest restore q_upper
(results q_upper are active now)

. h2omlpredict q_upper, frame(test)

Progress (%): 0 100

I then load those predictions from the testing frame into Stata and generate lower and upper bounds for the prediction intervals. I also load the logsaleprice and houseage variables, which we will use for illustration purposes.

. _h2oframe get logsaleprice houseage q_lower q_upper using test, clear

. generate double conf_lower = q_lower - `q'

. generate double conf_upper = q_upper + `q'

Figure 1b displays the prediction intervals for each observation in the testing frame. We can see that the computed intervals are adaptive: for “difficult” observations, such as outliers, the interval is wider, and for “easy” observations it is narrower.

Figure 1: (a) Histogram of conformity scores; (b) CQR-based prediction intervals for the testing dataset

Below, I list a small sample of the prediction intervals together with the true log sale price.

. list logsaleprice conf_lower conf_upper in 1/10

     +----------------------------------+
     | logsal~e   conf_lo~r   conf_up~r |
     |----------------------------------|
  1. | 11.67844   11.470275   12.052485 |
  2. |  12.6925   12.002838   12.773621 |
  3. | 11.31447   11.332932   12.058372 |
  4. | 11.64833   11.527202   12.099679 |
  5. | 12.10349   11.640463   12.401743 |
     |----------------------------------|
  6. |  11.8494   11.621721   12.231588 |
  7. |  11.8838   11.631179   12.045645 |
  8. | 12.06681   11.931026   12.204194 |
  9. | 12.12811   11.852887   12.375453 |
 10. | 11.65269   11.569664   12.066292 |
     +----------------------------------+

For 9 out of the 10 listed observations, the response falls within its prediction interval. We can compute the actual average coverage in the testing frame, which I do next by generating a new variable, in_interval, that indicates whether the response logsaleprice lies within its prediction interval.

. generate byte in_interval = 0

. replace in_interval = 1 if inrange(logsaleprice, conf_lower, conf_upper)
(124 real changes made)

. summarize in_interval, meanonly

. local avg_coverage = r(mean)

. di `avg_coverage' * 100
90.510949

We can see that the actual average coverage is about 90.5%, in line with the nominal 90% level.
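
Coverage tells only part of the story; we also care about how informative (narrow) the intervals are. Below is a minimal sketch, not run in this session, that summarizes the interval widths in the testing frame using the variables already created above (the new variable name width is arbitrary).

. generate double width = conf_upper - conf_lower

. summarize width

At the same coverage level, a smaller average width indicates a more informative interval, so comparing interval widths is a common way to compare conformal prediction methods.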

References
Angelopoulos, A. N., and S. Bates. 2023. Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning 16: 494–591. https://doi.org/10.1561/2200000101.

De Cock, D. 2011. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19: 3. https://doi.org/10.1080/10691898.2011.11889627.

Lei, J., M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. 2018. Distribution-free predictive inference for regression. Journal of the American Statistical Association 113: 1094–1111. https://doi.org/10.1080/01621459.2017.1307116.

Papadopoulos, H., K. Proedrou, V. Vovk, and A. Gammerman. 2002. “Inductive confidence machines for regression”. In Machine Learning: ECML 2002, Lecture Notes in Computer Science, vol. 2430, edited by T. Elomaa, H. Mannila, and H. Toivonen, 345–356. Berlin: Springer. https://doi.org/10.1007/3-540-36755-1_29.

Romano, Y., E. Patterson, and E. Candes. 2019. “Conformalized quantile regression”. In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, vol. 32. Red Hook, NY: Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/5103c3584b063c431bd1268e9b5e76fb-Paper.pdf.

Vovk, V., A. Gammerman, and G. Shafer. 2005. Algorithmic Learning in a Random World. New York: Springer. https://doi.org/10.1007/b106715.