Approximate statistical tests for comparing binary classifier error rates using H2OML
Motivation
You have just trained a gradient boosting machine (GBM) and a random forest (RF) classifier on your data using Stata’s new h2oml command suite. Your GBM model achieves 87% accuracy on the testing data, and your RF model, 85%. It looks as if GBM is the preferred classifier, right? Not so fast.
Why accuracy alone isn’t enough
Accuracy, area under the curve, and root mean squared error are popular metrics, but they provide only point estimates. These numbers reflect how well a model performed on one specific testing sample, but they don’t account for the variability that can arise from sample to sample. In other words, they don’t answer this key question: Will the difference in performance between these methods hold at the population level, or could it have occurred by chance only in this particular testing dataset?
When comparing methods like GBM and RF, a few percentage points in performance might not be compelling on their own. Without considering how much the results might vary across different samples, it’s hard to tell whether one method consistently outperforms the other or whether the observed difference is just a product of random variation in the data. Statistical tests are essential in this regard, as they provide a framework for assessing whether the observed differences are likely to persist in the population.
Introduction
A common practice in machine learning for comparing classifiers is to split the dataset into either a three-way holdout (training, validation, and testing sets) or a two-way holdout (training and testing sets). The validation set (for three-way splits) or cross-validation (for two-way splits) is used to tune the model, while the testing set evaluates the final performance. For details, see Model selection in machine learning in [H2OML] Intro.
However, a subtle but critical drawback of relying on a single test set is random variation in the selection of the testing data. In particular, even if two classifiers perform identically on the entire population, one may appear superior because of chance fluctuations in the sampled testing data. This is especially problematic with small testing sets.
To address this, statistical tests are recommended in the literature (Dietterich 1998; Alpaydin 1998; Raschka 2018). In this post, we explore the following question: Given two machine learning methods and a training set, how can we test whether the classifiers exhibit the same error rate on unseen data?
We focus on two tests: the McNemar test (McNemar 1947) and the combined \(5 \times 2\) cross-validated (\(5 \times 2\) CV) F test (Alpaydin 1998). Using Stata and its h2oml suite, we’ll demonstrate their application. The post is structured as follows: First, we introduce both tests conceptually; then, we transition to practical implementation in Stata.
Statistical tests
In binary classification, the performance of a model can be evaluated using the misclassification error rate, which is the proportion of incorrect predictions among all predictions. Let true positives (TP) and true negatives (TN) represent the number of correctly classified positive and negative cases, respectively. Let false positives (FP) and false negatives (FN) represent the number of misclassified positive and negative cases. The misclassification error rate is defined as
\[
e = \frac{\text{FP} + \text{FN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{1}\label{eq:errrate}
\]
Conversely, the accuracy of the model, which measures the proportion of correct predictions, is given by
\[
\text{acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = 1 - e \tag{2}\label{eq:accuracy}
\]
For details, see [H2OML] metric_option. These metrics are fundamental for assessing the quality of predictions made by methods such as RFs or GBMs.
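As a quick illustration of \eqref{eq:errrate} and \eqref{eq:accuracy}, the following minimal Stata sketch computes both metrics from made-up counts (the counts and scalar names below are purely illustrative, not from our data):

. scalar TP = 40                       // illustrative counts
. scalar TN = 45
. scalar FP = 5
. scalar FN = 10
. scalar total = TP + TN + FP + FN
. display "error rate = " (FP + FN)/total "   accuracy = " (TP + TN)/total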
McNemar’s test
McNemar’s test is a nonparametric test for paired comparisons that can be used to assess whether two classification methods differ in performance on the same testing set.
Let \(n_{ij}\) denote the number of instances for which classifier A’s (for example, GBM) prediction was \(i\) (\(i=1\) for correct prediction or \(i=0\) for incorrect prediction) and classifier B’s (for example, RF) prediction was \(j\) (\(j=1\) for correct prediction or \(j=0\) for incorrect prediction). The \(2 \times 2\) contingency table is
Table 1: Information needed to conduct McNemar’s test for comparing two binary classifiers’ error rates
|  | B incorrect | B correct |
|---|---|---|
| A incorrect | \(n_{00}\) | \(n_{01}\) |
| A correct | \(n_{10}\) | \(n_{11}\) |
We are interested in the off-diagonal elements: \(n_{01}\) (A is incorrect, B is correct) and \(n_{10}\) (A is correct, B is incorrect). These values represent the disagreements between classifiers.
The null hypothesis \(H_0\) is that the two classifiers have the same error rate:
\[
H_0 : P(\text{A incorrect, B correct}) = P(\text{A correct, B incorrect})
\quad \text{or} \quad n_{01} = n_{10}
\]
Under the null hypothesis, each of the \(n_{01} + n_{10}\) disagreements is equally likely to fall in either off-diagonal cell; that is, conditional on the total number of disagreements, \(n_{01}\) follows a binomial distribution with success probability 0.5. When the number of disagreements is large, this binomial distribution can be approximated by a chi-squared distribution with 1 degree of freedom.
The McNemar test statistic is
\[
\chi^2 = \frac{(n_{01} - n_{10})^2}{n_{01} + n_{10}}
\]
This statistic is approximately chi-squared distributed with 1 degree of freedom under the null hypothesis. See Unstratified matched case–control data (mcc and mcci) in [R] epitab for more details.
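To make the computation concrete, here is a minimal Stata sketch that evaluates the statistic and its approximate p-value, together with an exact binomial p-value, from made-up disagreement counts (the counts and scalar names are purely illustrative; chi2tail() and binomialtail() are Stata’s upper-tail probability functions for the chi-squared and binomial distributions):

. scalar n01 = 30                      // illustrative: A incorrect, B correct
. scalar n10 = 20                      // illustrative: A correct, B incorrect
. scalar chi2 = (n01 - n10)^2 / (n01 + n10)
. display "chi2(1) = " chi2 "   p-value = " chi2tail(1, chi2)
. // Exact version: twice the binomial probability of a split at least this
. // extreme, capped at 1
. scalar p_exact = min(1, 2*binomialtail(n01 + n10, max(n01, n10), 0.5))
. display "exact p-value = " p_exact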
Combined 5 x 2 CV F test
The \( 5\times 2\) CV F test is a statistical method for comparing the performance of two supervised classification methods. It is designed to test the null hypothesis
\[
H_0: \text{The two classifiers have equal generalization error}
\]
and is built upon Dietterich’s \(5\times 2\) CV paired t test (Dietterich 1998). Alpaydin (1998) identified instability in the original test due to the arbitrary choice of one of 10 possible test statistics and proposed a combined F test that aggregates over all of them for robustness.
We perform 5 replications of 2-fold cross-validation, yielding 10 distinct test sets. Let \(p_i^{(j)}\) denote the difference in error rates between the two classifiers on fold \(j = 1, 2\) of replication \(i = 1, \dots, 5\). That is,
\[
p_i^{(j)} = e_{i,A}^{(j)} - e_{i,B}^{(j)} = \text{acc}_{i,B}^{(j)} - \text{acc}_{i,A}^{(j)} \tag{3} \label{eq:pij}
\]
where \( e_{i,A}^{(j)} \) and \( e_{i,B}^{(j)} \) are the misclassification error rates of classifiers A and B, respectively, on the \(j\)th fold of the \(i\)th replication [as defined in \eqref{eq:errrate}] and \( \text{acc}_{i,A}^{(j)} \) and \( \text{acc}_{i,B}^{(j)} \) are the corresponding accuracy values [as defined in \eqref{eq:accuracy}].
For each replication \(i\), we compute the average,
\[
\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}
\]
and the estimate of the variance:
\[
s_i^2 = (p_i^{(1)} - \bar{p}_i)^2 + (p_i^{(2)} - \bar{p}_i)^2 = \frac{(p_i^{(1)} - p_i^{(2)})^2 }{2} \tag{4} \label{eq:var}
\]
Original 5 x 2 CV t test (for reference)
Dietterich (1998) proposed the t statistic:
\[
t = \frac{p_1^{(1)}}{\sqrt{ \frac{1}{5} \sum_{i=1}^{5} s_i^2 }}
\]
This uses only 1 of the 10 possible \(p_i^{(j)}\) values, which introduces randomness based on the choice of fold order.
Combined 5 x 2 CV F-test derivation
To improve robustness, the combined F test aggregates the squares of all 10 differences \(p_i^{(j)}\) and all five variance estimates \(s_i^2\).
Define
\[
N = \sum_{i=1}^{5} \sum_{j=1}^{2} \left( p_i^{(j)} \right)^2
\quad\text{and}\quad
M = \sum_{i=1}^{5} s_i^2 \tag{5} \label{eq:NandM}
\]
Under the null hypothesis, and assuming (approximate) independence among these quantities, we have
\[
F = \frac{N / 10}{M / 5} = \frac{ \sum_{i=1}^{5} \sum_{j=1}^{2} \left( p_i^{(j)} \right)^2}{2 \sum_{i=1}^{5} s_i^2} \tag{6} \label{eq:Fstat}
\]
This statistic is approximately F-distributed with \((10, 5)\) degrees of freedom.
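The following minimal sketch shows how \(N\), \(M\), and the F statistic in \eqref{eq:Fstat} could be computed from a set of fold differences; the numbers in the matrix p below are made up for illustration, and Ftail() is Stata’s upper-tail probability function for the F distribution:

. // Illustrative fold differences p_i^(j): rows are replications i, columns are folds j
. matrix p = (0.02, 0.03 \ 0.01, 0.04 \ 0.03, 0.02 \ 0.00, 0.03 \ 0.02, 0.01)
. scalar N = 0
. scalar M = 0
. forvalues i = 1/5 {
        scalar N = N + p[`i',1]^2 + p[`i',2]^2           // equation (5)
        scalar M = M + (p[`i',1] - p[`i',2])^2 / 2       // equations (4) and (5)
  }
. scalar Fstat = N / (2*M)                               // equation (6)
. display "F(10,5) = " Fstat "   p-value = " Ftail(10, 5, Fstat)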
In summary, the combined \(5\times 2\) CV F test improves upon Dietterich’s original t test by
- using all 10 of the fold differences instead of just 1,
- reducing sensitivity to the order of folds or replications, and
- providing better control of type I error and improved statistical power.
Implementation in Stata
We begin our analysis by loading attrition.dta and generating a new variable, logincome, that stores the log of monthly income. This is a common transformation used to normalize skewed variables before modeling.
. use https://www.stata.com/users/assaad_dallakyan/attrition, clear

. gen logincome = log(monthlyincome)
We then initialize the H2O cluster using h2o init, put the current dataset into an H2O frame named attrition, and make it the current H2O frame.
. h2o init

. _h2oframe put, into(attrition) current
We split the attrition frame into training (70%) and testing (30%) frames, using random seed 19 for reproducibility. Then we set train as the current working frame for model training.
. _h2oframe split attrition, into(train test) split(0.7 0.3) rseed(19) replace

. _h2oframe change train
For convenience, we define a global macro, predictors, that includes the complete set of predictors for the model. These cover a wide range of personal and job-related features, such as education, job satisfaction, work-life balance, and demographic details.
. global predictors age education employeenumber environmentsat
>        jobinvolvement jobsatisfaction logincome numcompaniesworked
>        performance relationshipsat totalworkingyears worklifebalance
>        yearsatcompany yearsincurrentrole yearswithcurrmanager
>        businesstravel gender jobrole maritalstatus
McNemar’s test
We first train a GBM classifier using the training dataset. Once the model is trained, we specify that the test frame should be used for subsequent postestimation commands, display the confusion matrix, and generate predictions. The predicted classes are stored in the variable attrition_gbm in the testing frame test, and the model is saved under the name gbm for later comparison. For simplicity, for both the GBM and RF classifiers, we use the default values for all hyperparameters and do not perform tuning. In real-world applications, however, we would more likely want to compare the best models obtained after hyperparameter tuning; see Hyperparameter tuning in [H2OML] Intro for more details about tuning.
. h2oml gbbinclass attrition $predictors, h2orseed(19)
  (output omitted)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlestat confmatrix

Confusion matrix using H2O
Testing frame: test

           |  Predicted attrition  |
           |        No        Yes  |      Total      Error       Rate
-----------+-----------------------+----------------------------------
        No |       318         33  |        351         33       .094
       Yes |        48         32  |         80         48         .6
-----------+-----------------------+----------------------------------
     Total |       366         65  |        431         81       .188

Note: Probability threshold .254 that maximizes F1 metric used for classification.

. h2omlpredict attrition_gbm, class
Progress (%): 0 100

. h2omlest store gbm
Across all 431 observations in the testing dataset, there were 81 misclassifications, giving an overall error rate of 0.188.
We repeat the same procedure for an RF classifier. The predictions are stored in the variable attrition_rf, and the model is saved as rf.
. h2oml rfbinclass attrition $predictors, h2orseed(19)
  (output omitted)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlestat confmatrix

Confusion matrix using H2O
Testing frame: test

           |  Predicted attrition  |
           |        No        Yes  |      Total      Error       Rate
-----------+-----------------------+----------------------------------
        No |       276         75  |        351         75       .214
       Yes |        29         51  |         80         29       .362
-----------+-----------------------+----------------------------------
     Total |       305        126  |        431        104       .241

Note: Probability threshold .21 that maximizes F1 metric used for classification.

. h2omlpredict attrition_rf, class
Progress (%): 0 100

. h2omlest store rf
Across all 431 observations in the testing dataset, there were 104 misclassifications, giving an overall error rate of 0.241. At first glance, it appears that GBM outperforms RF in terms of predictive accuracy (0.188 versus 0.241 error rates). However, this difference may not be indicative of a difference in the population. This highlights the importance of supplementing accuracy metrics with proper statistical testing, as we do next with McNemar’s test and the 5×2 CV F test.
To perform McNemar’s test, we bring the test data and predictions back into Stata (via _h2oframe get) for further statistical analysis. We encode the string-valued categorical predictions and outcome into numeric variables and drop the original string versions.
. clear

. _h2oframe get attrition attrition_gbm attrition_rf using test

. encode attrition, gen(nattrition)

. encode attrition_gbm, gen(nattrition_gbm)

. encode attrition_rf, gen(nattrition_rf)

. drop attrition attrition_gbm attrition_rf
The next step is to produce a three-way table that cross-tabulates true values with both model predictions. From the results, we identify the counts needed (shown in table 1) for McNemar’s test and store them in local macros.
. table (nattrition_gbm) (nattrition nattrition_rf), nototal

---------------------------------------------------
                |            nattrition
                |       No                Yes
                |  nattrition_rf     nattrition_rf
                |    No     Yes        No     Yes
----------------+----------------------------------
 nattrition_gbm |
             No |   303      17        41       8
            Yes |     9      22         5      26
---------------------------------------------------

. local n00 = 22 + 41     // Nb. of obs. misclassified by both GBM and RF

. local n01 = 17 + 5      // Nb. of obs. misclassified by RF but not by GBM

. local n10 = 9 + 8       // Nb. of obs. misclassified by GBM but not by RF

. local n11 = 303 + 26    // Nb. of obs. classified correctly by both GBM and RF
We then run mcci to compute the McNemar statistic using these frequencies.
. mcci `n00' `n01' `n10' `n11'

                 |        Controls        |
           Cases |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
         Exposed |        63          22  |         85
       Unexposed |        17         329  |        346
-----------------+------------------------+-----------
           Total |        80         351  |        431

McNemar's chi2(1) =      0.64    Prob > chi2 = 0.4233
Exact McNemar significance probability       = 0.5224
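As a sanity check, we can also compute the McNemar statistic directly from its formula using the counts stored in the local macros above (a minimal sketch; chi2tail() returns the upper-tail probability of the chi-squared distribution):

. scalar chi2 = (`n01' - `n10')^2 / (`n01' + `n10')
. display "McNemar's chi2(1) = " chi2 "   Prob > chi2 = " chi2tail(1, chi2)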
With a p-value of 0.4233 (0.5224 for the exact test), we find no evidence against the null hypothesis of equal error rates: the observed difference between the GBM and RF classifiers could plausibly be due to chance.
For models that are computationally expensive to train, Dietterich (1998) recommended McNemar’s test as the method of choice. For models that can be trained multiple times (for example, 10 times), he recommended the \(5\times 2\) CV \(t\) test because it is slightly more powerful than McNemar’s test. Next, we describe how to implement the \(5\times 2\) CV \(F\) test in Stata, which is an improved version of the \(5 \times 2\) CV \(t\) test.
Combined 5 x 2 CV F test
We start by switching to the frame that contains the entire dataset (attrition). We then initialize the scalars that will accumulate \(N\) and \(M\) [see \eqref{eq:NandM}], which are used to compute the F statistic in \eqref{eq:Fstat}.
. _h2oframe change attrition

. scalar N = 0

. scalar M = 0
We then perform 5 replications, where in each replication we randomly split the dataset into two equal halves, train and test. To ensure reproducibility, we first set a seed in Stata and then generate a pseudorandom integer with runiformint(), which we pass as the seed to H2O’s pseudo-random-number generator via the rseed() option of the _h2oframe split command. Although this may resemble the practice advised against in the [R] set seed entry, it differs in an important way: because H2O’s pseudo-random-number generator is unrelated to Stata’s, there is no risk of the generator converging to a cycle. In each replication, we train GBM and RF on each half and evaluate them on the other, recording their accuracy (computed via the h2omlestat threshmetric command). We compute the difference in performance for each fold (\(p_i^{(j)}, j = 1, 2\)) and store it in the scalars pi1 and pi2, calculate the variance, and accumulate the squared differences and variances across all replications. These accumulated values are then used to calculate the F statistic.
. set seed 19

. forvalues i = 1(1)5 {
        local split_seed = runiformint(1, 50000)
        _h2oframe split attrition, into(train test) split(0.5 0.5) rseed(`split_seed') replace
        quietly {
            _h2oframe change train
            h2oml gbbinclass attrition $predictors, h2orseed(19) validframe(test)
            h2omlestat threshmetric
            scalar accA_1 = r(threshmetric)[4,1]   // Accuracy of A (GBM) on 1st fold
            h2oml rfbinclass attrition $predictors, h2orseed(19) validframe(test)
            h2omlestat threshmetric
            scalar accB_1 = r(threshmetric)[4,1]   // Accuracy of B (RF) on 1st fold
            _h2oframe change test
            h2oml gbbinclass attrition $predictors, h2orseed(19) validframe(train)
            h2omlestat threshmetric
            scalar accA_2 = r(threshmetric)[4,1]   // Accuracy of A (GBM) on 2nd fold
            h2oml rfbinclass attrition $predictors, h2orseed(19) validframe(train)
            h2omlestat threshmetric
            scalar accB_2 = r(threshmetric)[4,1]   // Accuracy of B (RF) on 2nd fold
            // Compute the per-fold differences in performance
            // (only their squares are used, so the sign is immaterial)
            scalar pi1 = accA_1 - accB_1           // Equation (3)
            scalar pi2 = accA_2 - accB_2
            scalar variance = (pi1 - pi2)^2 / 2    // Equation (4)
            scalar N = N + pi1^2 + pi2^2           // Equation (5)
            scalar M = M + variance                // Equation (5)
        }
  }

. scalar f_stat = N / (2 * M)                      // Equation (6)

. scalar p_value = Ftail(10, 5, f_stat)

. di p_value
.19382379
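If we also want to report the F statistic itself alongside its p-value, we can display the stored scalars (a small optional addition to the code above):

. display "F(10, 5) = " f_stat "   p-value = " p_value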
The result of this test corroborates the result of McNemar’s test: with a p-value of 0.19, there is no evidence to suggest that the two methods perform differently.
References
Alpaydin, E. 1998. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. https://api.semanticscholar.org/CorpusID:6872443.
Dietterich, T. G. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10: 1895–1923. https://doi.org/10.1162/089976698300017197.
McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12: 153–157. https://doi.org/10.1007/BF02295996.
Raschka, S. 2018. Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808 [cs.LG]. https://doi.org/10.48550/arXiv.1811.12808.