Hypothesis Testing for Cross-Validation
Yves Grandvalet CNRS, Heudiasyc UMR 6599 UTC, G? enie Informatique B.P. 20529, 60205 Compi` egne, France Yves.Grandvalet@utc.fr Yoshua Bengio Dept. IRO, Universit? e de Montr? eal P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7, QC, Canada bengioy@iro.umontreal.ca Technical Report 1285 D? epartement d’Informatique et Recherche Op? erationnelle August 29th, 2006
Abstract K-fold cross-validation produces variable estimates, whose variance cannot be estimated unbiasedly. However, in practice, one would like to provide a ?gure related to the variability of this estimate. The ?rst part of this paper lists a series of restrictive assumptions (on the distribution of cross-validation residuals) that allow to derive unbiased estimates. We exhibit three such estimates, corresponding to di?ering assumptions. Their similar expressions however entail almost identical empirical behaviors. Then, we look for a conservative test for detecting signi?cant di?erences in performances between two algorithms. Our proposal is based on the derivation of the form of a t-statistic parametrized by the correlation of residuals between each validation set. Its calibration is compared to the usual t-test. While the latter is overcon?dent in concluding that di?erences are indeed signi?cant, our test is bound to be more skeptical, with smaller type-I error.
1
Introduction
The empirical assessment of performance of a new algorithm is one of the main steps in its introduction. The most convincing approach consists in benchmarking the new algorithm on a series of real data sets, comparing it with one or more baseline algorithms. For real data sets, the underlying distribution is 1
unknown. Hence, the generalization error cannot be measured. It has to be estimated on hold-out test sets or by means of resampling schemes such as K-fold cross-validation. Drawing statistically signi?cant conclusions thus requires to evaluate the uncertainty of these estimates. While K-fold cross-validation is recognized as the mehod of choice for estimating the generalization error for small sample sizes [1, 2], it su?ers from one major drawback: variance. Bengio and Grandvalet show that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation [3]. Here, we derive a set of variance estimates, which are unbiased under restrictive assumptions on the distribution of cross-validation residuals. Then, we derive a t-statistic parametrized by the between-block correlation and based on it, we propose a conservative estimate of the uncertainty in the cross-validation estimation of performance di?erence between algorithms.
2
2.1
Framework and background
Algorithm evaluation
We dispose of a data set D = {z1 , . . . , zn }, with zi ∈ Z , assumed independently sampled from an unknown distribution P . We have a learning algorithm A : Z ? → F , which maps data sets of (almost) arbitrary size to predictors. 1 Let f = A(D) be the function returned by algorithm A on the training set D. The discrepancy between a prediction and an observation z is measured by a loss functional L : F ×Z → R. For example one may take L(f, (x, y )) = (f (x) ? y )2 in regression , and L(f, (x, y )) = 1f (x)=y in classi?cation. In application-based evaluations, the goal of learning is usually stated as the minimization of the expected loss of f = A(D) on future test examples: PE(D) = E [L(f, z)] , (1)
where the expectation is taken with respect to z ? P . In algorithm-based evaluations, we are not really interested in the performance on a speci?c training set. We want to compare algorithms on a more general basis [2], such as the expected performance of learning algorithm A over di?erent training sets: EPE(n) = E [L(A(D), z)] , (2)
where the expectation is taken with respect to D × z independently sampled from P n × P . When P is unknown, EPE is estimated, and the most widely accepted estimator is cross-validation.
2.2
K-fold cross-validation estimation
In K-fold cross-validation, the data set D is ?rst chunked into K disjoint subsets (or blocks) of the same size (to simplify the analysis below we assume that n is a multiple of K ), with m = n/K . Following [3], we write Tk for the k -th such block, and Dk the training set obtained by removing the elements in Tk from D. The cross-validation estimator is CV = 1 K
K
k=1
1 m
L(A(Dk ), zi ) ,
zi ∈Tk
(3)
1 Here we consider symmetric algorithms, which are insensitive to the ordering of examples in the data set.
2
which is an unbiased estimate of EPE(n ? m). A version of cross-validation is speci?cally dedicated to comparing algorithms, using matched pairs 1 ?CV = K
K
k=1
1 m
L(A1 (Dk ), zi ) ? L(A2 (Dk ), zi ) .
zi ∈Tk
(4)
In what follows, CV and ?CV will generically be denoted by ? ?: ? ? = 1 n 1 ei = K i=1
n K
?k ,
k=1
?k =
1 m
ei ,
i∈Tk
where, slightly abusing notation, i ∈ Tk means zi ∈ Tk and ?i ∈ Tk , ei = L(A(Dk ), zi ) L(A1 (Dk ), zi ) ? L(A2 (Dk ), zi ) for ? ? = CV , for ? ? = ?CV .
2.3
Variance of K-fold cross-validation
It is crucial to assess the uncertainty of ? ? to derive trustworthy conclusions. We ?rst recall three theoretical results of [3] that will be useful in the following Sections. (Lemma 1 and 2, and Theorem 1). Lemma 1 The variance of the cross-validation estimator is a linear combination of three moments: θ= 1 n2 Cov(ei , ej ) =
i,j
n?m 1 2 m?1 σ + ω+ γ, n n n
(5)
where Cov(ei , ej ) = E [ei ej ] ? E [ei ]E [ej ] is the covariance between variables ei and ej ; σ 2 is the variance 2 of ei , i = 1, . . . , n; ω is the covariance for residuals from the same test block ((i, j ) ∈ T k , j = i); γ is the covariance for residuals from di?erent test blocks ((i, j ) ∈ Tk × T , = k ). ? be any quadratic estimate of Var[? Lemma 2 Let θ ?], its expectation is a linear combination of three terms ?] = a(σ 2 + ?2 ) + b(ω + ?2 ) + c(γ + ?2 ) , E [θ and a representer of estimators with this expected value is ? = as1 + bs2 + cs3 , θ where (s1 , s2 , s3 ) are de?ned from e = (e1 , . . . , en )T as follows: ? ? n 1 2 ? ? ? s1 = n i=1 ei , ? K 1 s2 = n(m k=1 i∈Tk j ∈Tk :j =i ei ej , ?1) ? ? ? K ? s = 1 3 k=1 =k i∈Tk j ∈T ei ej . n(n?m) (7) (6)
(8)
Theorem 1 There exists no universally unbiased estimator of Var[? ?]. The proof of the above Theorem shows that the bias of any quadratic estimator is a linear combination of ?2 , σ 2 , ω and γ . However, the following new proposition shows that one may ignore the covariances ω and γ for large sample sizes. It motivates, in the asymptotic limit, some of the variance estimators presented in the forthcoming Section. 3
Proposition 1 The covariances ω and γ go to zero as n goes to in?nity provided the learning algorithm A is consistent and the loss function L is bounded. Proof ?k is the mean loss on a test sample of size m = n/K , which is independant from the training sample of size n(1 ? K ). As both training and test sample sizes go to in?nity as n goes to in?nity, the consistency of A implies that, for every ε > 0
n→∞
lim P (|?k ? EPE(n ? m)| > ε) = 0 .
This convergence, together with the fact that L is bounded, implies lim n→∞ Var[?k ] = 0. By the CauchySchwarz inequality, we have Var[? ?] ≤ Var[?k ] and thus limn→∞ Var[? ?] = 0. K 2 ?] = It then follows from the developments of Var[?k ] and Var[? ?] (Var[?k ] = K n σ + (1 ? n )ω and Var[? 1 2 K K ?1 σ + (1 ? ) ω + γ ) that we have lim ω = lim γ = 0 Q.E.D. n→∞ n→∞ n n K
3
Estimators built from cross-validation residuals
The statement that there exists no universal unbiased estimator of Var[? ?] means that, for any estimator, there are distributions on cross-validation residuals ei , such that the estimation is biased. However, by itself, this negative result does not prohibit the quest for sensible estimators. One way to derive estimates of Var[? ?] consists in stating restrictive assumptions for which an unbiased estimator can be built. We consider the series of simplifying assumptions listed on the left-hand side of Table 1, each one providing a unique unbiased estimate of variance of ? ? expressed as a combination of s 1 , s2 and s3 . Unbiasedness applies to the whole set of distributions satisfying the assumption and is not guaranteed elsewhere. Hence, we also report the universal bias, computed thanks to Lemma 2, in the last column of Table 1. Table 1: Non-universal unbiased variance estimates Assumption Estimate Bias 1 ?1 m ?=0 θ1 = n s1 + mn s2 + n? s ?2 3 n n+1?m n?m 1 ?ω ω=0 θ2 = n s1 ? n s2 + n s3 1 ?1 γ=0 θ3 = n s1 + mn s2 ? m s ?γ 3 n 1 1 m m m ω θ = s ? s ? ω ? n? γ = ? n? 4 2 m n 1 n n n γ The ?rst assumption, ? = 0, corresponds to the null hypothesis when comparing the expected prediction error of two learning algorithms. All the other assumptions aim at simplifying the structure of the covariance matrix of e, either regarding its entries (ω = 0 and γ = 0) or regarding its eigendecomposition m (γ = 0, γ = ? n? m ω ) [3]. These simpli?cations push the ei toward independence: some correlations are assumed to vanish, or the ei are “sphered” by equalizing the eigenvalues of the covariance matrix). Supposing that ei are uncorrelated, i.e. ω = γ = 0, is more restrictive than the assumptions listed in Table 1. In this situation, many variance estimates are unbiased. However, in the absence of correlation,
4
the standard unbiased estimator of the variance of ? ? is θ5 = =
?1 whose universal bias is ? m n?1 ω ?
1 n(n ? 1)
(ei ? ? ? )2
i
m?1 n?m 1 s1 ? s2 ? s3 , n (n ? 1)n (n ? 1)n
n?m n?1 γ .
The variance estimates θ1 , θ2 , θ3 , θ4 and θ5 are discussed in more details in the following subsections. Note that Proposition 1 implies that θ2 –θ5 are asymptotically unbiased. A better understanding of some of these estimators is provided by rephrasing them as functions of ? ?, ? k and ei .
3.1
Assuming null mean
The ?rst variance estimate, θ1 , is an unbiased estimate of the variance of ? ?, when ? is assumed to be zero. This assumption is used to de?ne the distribution of ? ? = ?CV (4), under the null hypothesis that two algorithms A1 and A2 have an identical expected prediction error. This estimate is expressed more simply as θ1 = ? ?2 . Obviously, θ1 cannot be used to compute the usual pivotal t-statistic for testing the null hypothesis, since t = ? ?/ θ1 = sign(? ?). The distribution
?)(1 ? ?/? ?) explicitly depends on ?, hence this statistic is not pivotal and not of (? ? ? ?)/ θ1 = sign(? appropriate for hypothesis testing. Thus, in spite of the relevance of the assumption ? = 0 in algorithm comparison, θ1 is useless.
3.2
Assuming no correlation within test blocks
The estimator θ2 is based on the assumption that there is no correlation within test blocks. The following rewriting, K 1 θ2 = (ei ? ? ? )2 + (ei ? ? ?)(ej ? ? ?) , n(m ? 1)
k=1 i∈Tk =k j ∈T
shows that θ2 is proportional to a truncated sample covariance. It can be shown that this truncation does not preserve the positivity of the estimator. This major defect rules θ2 out.
3.3
Assuming no correlation between test blocks
= k , Cov(?k , ? ) = 0. One
The estimator θ3 assumes that the between block covariance γ is null: for can rewrite θ3 in the compact form θ3 = 1 1 K K ?1
K
(?k ? ? ? )2 ,
k=1
where one recognizes the sample variance of ?k , divided by K to account for that we are interested in 1 estimating the variance of the mean ? ?= K k ?k and not the variance of ?k themselves.
5
Since the training examples zi are independent, ?k are uncorrelated when A(D1 ), . . . , A(DK ) are identical. The present assumption should thus provide a good approximation of variance when the algorithm is stable. When the answers of the algorithm vary a lot, e.g. when the training samples D are small compared to the capacity of A, or when they are expected to contain some outliers, γ may reach high values, and θ3 will be consequently biased.
3.4
Assuming canceling correlations
The estimate θ4 assumes a linear dependence between ω and γ which implies that the within-block and between-block covariances approximately cancel out. It results that e can be decomposed in n ? K + 1 uncorrelated compounds that allow to estimate θ. One can show that θ4 = 1 1 nK
K
k=1
1 m?1
(ei ? ?k )2 ,
i∈Tk
where the inner averages are the within-block sample variances. These variances, known to be identical for all blocks, are averaged over the K test blocks, and the factor n?1 transforms the variance of ei into the variance of ? ?. This estimator considers that the problem of estimating the variance of ? ? and ? k are orthogonal. As a result, there is no term in s3 , i.e. no ei ej terms with i ∈ Tk , j ∈ T , = k . The assumed linear dependence between ω and γ is unlikely, but it can be a good approximation to reality when ω and γ are small. When the correlations are larger, there is little chance that θ4 will perform well.
3.5
Assuming no correlation
As already mentioned, many unbiased estimates of variance exist when assuming a total absence of correlation ω = γ = 0. The last estimator is motivated by the usual sample variance for uncorrelated examples n 1 1 θ5 = (ei ? ? ? )2 , n n ? 1 i=1 where the factor the factor n?1 transforms the variance of ei into the variance of ? ?. Again, the assumption of absence of correlation between errors is only realistic when the algorithm is stable. The range of possible bias for θ5 is similar to the one of θ3 and θ4 .
4
4.1
A t-statistic for K-fold cross-validation
Motivation
The variability of ? ? is due to the variance and the covariance of cross-validation residuals. The previous section illustrates that obtaining unbiased estimates of θ requires assumptions on the covariance matrix which are unsupprted for small sample sizes. The major concern is that γ may be positive; the consequences of this positivity have been motivating the present thread of research, starting from [2]. When γ is negative, θ3 over-estimates variance; this is much less problematic than under-estimating variance since it leads to a provable conservative test. Theory
6
does not preclude negative γ values, but, in a modi?ed cross-validation setup, where the assignment of examples to the training or the test set is performed independently on each fold, Nadeau and Bengio prooved that γ is non-negative [4]. In our simulations, negative γ values are atypical, but they were observed in regression problems with gross outliers. To go beyond speculations on γ , we should estimate its value, but the estimation process from a single cross-validation trial is clearly hopeless. We observed the result of cross-validation on a single training set, and we are willing to estimate a covariance re?ecting the variability with respect to changes in the training set. Circumventing this problem requires additional information, such as the one provided by several cross-validations operating on independent training sets. Dietterich showed experimentally that Student’s t-test (uncorrected for correlations) is not reliable in the context of K-fold cross-validation [2]. He then proposed the 5×2 cross-validation procedure, leading to the 5×2 cross-validation t-test, which was later modi?ed by Alpaydin’s 5×2 cross-validation F -test [5]. Besides assumptions of the kind of the ones leading to the variance estimates given in Table 1, these tests are based on independance assumptions which are in con?ict with the de?nitions of the variables (in the form of “X is independant of X + Y ”). Nadeau and Bengio proposed another procedure providing an unbiased estimate of variance, but achieve this goal at the expense of a 10-fold increase in computation, without variance reduction [4]. Here, we take a di?erent stance. Despite the negative results of [3], we aim at providing some guidance for concluding at the signi?cance of the observed di?erences in performance between two algorithms. As in the 5×2 cross-validation procedure, only 10 trainings are necessary, but unlike the latter, the standard 10-fold cross-validation is performed. Also, the derived statistic is, up to an unknown factor, proven to really follow a t-distribution.
4.2
Derivation
The t-test is designed to answer questions about the mean of a random sample of independent observations from a normal distribution when the variance is unknown. In the context of algorithm comparison via K-fold cross-validation, it is routinely though improperly used to test whether two algorithms perform equally, i.e. if the expected value of ?CV (4) is zero. The analysis presented in [3] can be used to analyze how the between-block covariances a?ect the t-statistic. Proposition 2 Under the assumption that ?k are normally distributed 2 , tK ?1 = K (K ? 1)(1 ? ρ) ? ???
K k=1 2
,
(9)
(?k ? ?)
1 2 ?1 Let C = m σ + mm ω ? γ I + γ 11T denote the covariance matrix of ? = (?1 , . . . , ?K )T . The eigendecomposition of C yields C = PT ΛP, where P = (Γ, 1/K 1) is an orthonormal matrix and Λ =
2 This assumption is rather weak since ? c k are averages of identically distributed variables. It is thus a good approximation of the true distribution even for moderately large n/K .
where ρ = γ θ is the between-block correlation, follows a Student t-distribution with K ? 1 degrees of freedom. Proof We ?rst recall that, ? k , ? = k ? E [?k ] = ? ? 1 2 ?1 Var[?k ] = m σ + mm ω ? Cov(?k , ? ) = γ .
7
1 2 ?1 diag(ν1 , . . . , ν1 , ν2 ), with ν1 = m σ + mm ω ? γ appearing (K ? 1) times and ν2 = ?1/2 We de?ne ? ?=Λ P?, which has the following mean and variance
1 2 m?1 m σ + m ω +(K ? 1)γ .
E [? ? ] = 0,
K ? ν2
T
, Var[? ?? ?T ] = I .
K ?1
2 Under the assumption that ?k are normally distributed, k=1 ? ?2 k is distributed according to a χ distribution with K ? 1 degrees of freedom. This sum can be computed as follows: K ?1
? ?2 k
k=1
= ? ?T ? ??? ?2 K = = ?T C?1 ? ? ?T 1 ?T 11T ? Kν2 1 1 ν1 ? ν 2 I+ ? ν1 Kν1 ν2 Kν2
11T
?
K ?1
? ?2 k
k=1
=
1 ν1
?T ? ?
K
1 T T ? 11 ? K
K 2
=
? 1 ? ν1 1 ν1
k=1
1 2 ?k ? K
2
?k
k=1
? ?
K
=
K ν2 ? K ? ν2 (?
(?k ? ?)
k=1
.
Furthermore, ? ?K ? hence
=
? ?) is normally distributed N (0, 1) and independent of ? ?k , k = K , K (K ? 1)ν1 ν2 ? ???
K k=1
(?k ? ?)
2
is t-distributed with K ? 1 degrees of freedom. The proposition follows as ν1 /ν2 = 1 ? ρ. Q.E.D. Note that Proposition 2 does not provide a means to build a principled t-test, because the correlation ρ is unknown. We however believe that it is worth being stated because of its consequences. First, approaches based on calibrating t-tests should be targeted at ?nding a substitute for ρ instead of adjusting the number of degrees of freedom of the test statistic [6]. 1 Second, constructing a conservative t-test requires to upper-bound (1 ? ρ) 2 by some positive constant. Overall, in our experiments, the maximum observed ρ value was 0.7. With this value, when comparing two algorithms with identical performances, assuming ρ = 0 will result in a type-I error of range 26% at an alleged 5% signi?cance level. In other words, by using the usual t-statistic, we will conclude in 26% of cases that the algorithms di?er while we would have like to limit the frequency of this type of erroneous judgment to 5%. Looking at the other side of the same problem, an alleged p-value of 0.2% in fact represents a true p-value of 5%. 8
Finally, one cannot compute tK ?1 (9) since ρ is unknown, but for a given signi?cance level α, we can compute the maximum value ρα such that the null hypothesis would be accepted. This ?gure is not exactly what we would ideally like (a p-value), but it is objective, does not rely on wrong nor unveri?able assumptions, and can be used to quantify the con?dence in the rejection of the hypothesis test, since it is a monotonically decreasing function of the true p-value. More rigorously, if one believes that correlations above 0.7 are extremely unlikely in 10-fold cross-validation, then it is sensible to compute t K ?1 with this ρ value to have a conservative test. Then, rejecting the null hypothesis amounts to state “Provided that ρ ≤ 0.7, the di?erence in ?CV is statistically signi?cant at the 5% level.”
5
Experimental results
We have conducted experiments for classi?cation and regression problems, for arti?cial and real data, using trees and linear predictors. In all experiments, the variance estimates θ3 , θ4 and θ5 behaved similarly. This is due to the fact that covariances ω and γ are alike. We report here the results for the t-test based on θ3 , for which the Gaussian assumption is more grounded than for θ4 or θ5 . For lack of space, we only report results with trees for the Letter and Forest data. The Letter data set comprises 20 000 examples described by 16 numeric features. The Forest data set comprises 581 012 examples described by 54 numeric features. Both problems were transformed in approximately balanced binary classi?cation problems to obtain sensible results for small sample sizes. Accurate estimates of the true expected cross-validation error (and its variance) require many independent training samples. This was achieved by considering the whole data set to be the population, from which 10 000 independent training samples of size n were drawn by uniform sampling with replacement. Results obtained for di?erent training set sizes are summarized in Table 2. Table 2: Experimental results over 10 000 independent trainings, for the significance level α = 5% Letter ρ Type-I error, ρ ?= 0 Type-I error, ρ ? = 0.7 Forest ρ Type-I error, ρ ?= 0 Type-I error, ρ ? = 0.7 n (%) (%) (%) n (%) (%) (%) 20 52.45 16.4 3.1 20 53.22 16.3 3.1 40 43.43 12.8 1.5 40 49.85 14.1 2.3 80 42.16 12.4 1.3 80 51.94 15.9 2.4 160 34.04 9.9 1.0 160 47.15 14 1.7 400 28.23 8.8 0.7 400 45.44 12.9 1.5 800 25.69 8.1 0.7 800 42.61 12.3 1.3 2000 22.68 7.8 0.5 2000 40.79 11.6 1.1
?From the de?nition of θ3 , the correlation ρ is the downward relative bias of the estimator, that is b [θ 3] . The ?gures obtained for θ4 and θ5 are very similar, thus not represented. One sees that, for ρ = θ?E θ small sample sizes, the variance estimates can be less than half of the total variance. This underestimation of variance causes the usual t-test, computed with ρ ? = 0, to erroneously reject the null hypothesis in more than 16% of cases, while this type-I error should normally be limited to α = 5%. To build our t-test, we chose ρ = 0.7, which is the maximum value we experimentally observed (including results not shown here). The test is conservative (as it should be, since ρ is always below 0.53),
9
but is highly skeptical for large sample sizes. Obtaining a better upper-bound for ρ without additional computation is still an open issue.
6
Conclusions
K-fold cross-validation is known to be variable, and its variance cannot be unbiasedly estimated. Although many estimators of variance are asymptotically unbiased (such as θ3 , θ4 and θ5 ), for small sample sizes, these estimates are almost systematically down-biased, leading to tests which reject unduly the null hypothesis. Based on the derivation of the form of a t-statistic parametrized by the between-block correlation, we propose a test which is consistently conservative. The t-statistic can also provide an interpretable con?dence score. For calibrated tests, this role is played by the p-value, but here, the latter is meaningless. On the other hand, knowing the maximum correlation for which the test is accepted is likely to be more interpretable. If extra computation is permitted, the unbiased variance estimate of Nadeau and Bengio [4] could be used to estimate the between-block correlation on training sets of size approximately n/2. In our experiments, this correlation decreases smoothly with sample size, so that a good estimate at n/2 would provide a tight upper-bound for sample size n.
References
[1] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. of the Fourteenth International Joint Conference on Arti?cial Intelligence, pages 1137–1143, 1995. [2] T. G. Dietterich. Approximate statistical tests for comparing supervised classi?cation learning algorithms. Neural Computation, 10(7):1895–1924, 1998. [3] Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research, 5:1089–1105, 2004. [4] C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003. [5] E. Alpaydin. Combined 5 × 2 cv F test for comparing supervised classi?cation learning algorithms. Neural Computation, 11(8):1885–1892, 1999. [6] R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In ICML 2003: Proc. of the Twentieth International Conference on Machine Learning, 2003.
10