The Degrees of Freedom of Partial Least Squares Regression
Dr. Nicole Krämer, TU München
5th ESSEC-SUPELEC Research Workshop, May 20, 2011
My talk is about… the statistical analysis of Partial Least Squares Regression:
1. the intrinsic complexity of PLS
2. comparison of regression methods
3. model selection
4. variable selection based on confidence intervals
The Degrees of Freedom of PLS, Dr. Nicole Krämer (TUM)
Example: Near Infrared Spectroscopy
Predict Y, the percentage of water in meat, based on X, its near infrared spectrum.
Unknown linear relationship:
f(x) = β_0 + ⟨β, x⟩ = β_0 + Σ_{j=1}^p β_j x^(j)
We observe y_i ≈ f(x_i), i = 1, …, n.
Centered data: X = (x_1, …, x_n)ᵀ ∈ ℝ^{n×p} and y = (y_1, …, y_n)ᵀ ∈ ℝ^n.
Partial Least Squares in one Slide
Partial Least Squares (PLS) = supervised dimensionality reduction + least squares regression.
[Diagram: X (n × p) → supervised dimensionality reduction → T (n × m) → least squares regression on y.]
The PLS components T have maximal covariance with the response variable y.
The m ≤ p components T are used as new predictor variables in a least squares fit.
Partial Least Squares Algorithm
Partial Least Squares (PLS) = supervised dimensionality reduction + least squares regression; the PLS components T have maximal covariance with y.
Algorithm (NIPALS)
Set X_1 = X. For i = 1, …, m:
1. w_i = X_iᵀ y / ‖X_iᵀ y‖  (model parameter; maximizes the covariance cov(X_i w, y) over ‖w‖ = 1)
2. t_i = X_i w_i  (latent component)
3. X_{i+1} = X_i − t_i t_iᵀ X_i / ‖t_i‖²  (deflation: enforce orthogonality)
Return T = (t_1, …, t_m) and W = (w_1, …, w_m); the regression coefficients are
β̂_m = W (Tᵀ X W)^{-1} Tᵀ y.
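The algorithm above fits in a few lines of code. This is a Python/NumPy sketch under the slide's conventions (centered data, no intercept); the function name is ours:

```python
import numpy as np

def nipals_pls(X, y, m):
    """NIPALS-style PLS: build m latent components with maximal
    covariance to y, then regress y on them by least squares."""
    Xi = X.copy()
    W, T = [], []
    for _ in range(m):
        w = Xi.T @ y                                # direction maximizing cov(X_i w, y)
        w /= np.linalg.norm(w)
        t = Xi @ w                                  # latent component
        Xi = Xi - np.outer(t, t) @ Xi / (t @ t)     # deflation: enforce orthogonality
        W.append(w)
        T.append(t)
    W, T = np.column_stack(W), np.column_stack(T)
    beta = W @ np.linalg.solve(T.T @ X @ W, T.T @ y)  # beta_m = W (T' X W)^{-1} T' y
    return T, W, beta
```

With m equal to the rank of X, the PLS coefficients coincide with ordinary least squares; the components in T come out mutually orthogonal, as the deflation step intends.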
Degrees of Freedom: Why are they important?
Degrees of Freedom (DoF)
1. capture the intrinsic complexity of a regression method in the model Y_i = f(x_i) + ε_i, ε_i ~ N(0, σ²);
2. are used for model selection: test error ≈ training error + complexity (DoF).
Example: Bayesian Information Criterion
BIC = ‖y − ŷ‖²/n + log(n) · var̂(ε) · DoF/n
Also: Akaike Information Criterion, Minimum Description Length, …
Definition: Degrees of Freedom
Assumption: the regression method is linear, i.e. ŷ = Hy, with H independent of y.
Degrees of Freedom
The Degrees of Freedom of a linear fitting method are DoF = trace(H).
Examples: Principal Components Regression with m components has DoF(m) = 1 + m; Ridge Regression, smoothing splines, …
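For a linear method, the trace can be computed directly from the hat matrix. A minimal NumPy sketch for ridge regression (no intercept; function name ours):

```python
import numpy as np

def ridge_dof(X, lam):
    """Degrees of Freedom of ridge regression without intercept:
    the trace of the hat matrix H = X (X'X + lam*I)^{-1} X'."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return float(np.trace(H))
```

With lam = 0 this recovers least squares (DoF = rank of X); the DoF shrink toward 0 as lam grows, which is exactly why trace(H) is a useful complexity measure.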
Naive Degrees of Freedom for PLS?
Recall: PLS is not linear in y:
ŷ_m = ȳ·1_n + T (TᵀT)^{-1} Tᵀ y,  where H̃ := T (TᵀT)^{-1} Tᵀ depends on y.
If we ignore this nonlinearity, we obtain
DoF_naive(m) = 1 + trace(H̃) = 1 + m.
Degrees of Freedom for PLS (K. & Braun, 2007; K. & Sugiyama, 2011)
The generalized Degrees of Freedom of a regression method are
DoF = E_Y[ trace( ∂ŷ/∂y ) ].
This coincides with the previous definition if the method is linear.
Proposition
An unbiased estimate of the Degrees of Freedom of PLS is
DoF̂(m) = trace( ∂ŷ_m/∂y ).
We need to compute (the trace of) the first derivative.
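The trace of the first derivative can always be checked numerically by perturbing each y_i in turn. A sketch (function names ours), with the linear case as a sanity check and a one-component PLS fit as the nonlinear example:

```python
import numpy as np

def dof_estimate(fit, X, y, eps=1e-6):
    """Numerical estimate of trace(d yhat / d y) via central differences.
    `fit(X, y)` returns the fitted values yhat; for a linear method this
    recovers trace(H) up to rounding."""
    tr = 0.0
    for i in range(len(y)):
        yp, ym = y.copy(), y.copy()
        yp[i] += eps
        ym[i] -= eps
        tr += (fit(X, yp)[i] - fit(X, ym)[i]) / (2 * eps)
    return tr

def ols_fit(X, y):
    """Least squares fitted values (linear in y)."""
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

def pls1_fit(X, y):
    """One-component PLS fit yhat_1 = P_{t1} y with t1 = X X'y (non-linear in y)."""
    t = X @ (X.T @ y)
    return t * (t @ y) / (t @ t)
```

For OLS with p predictors the numerical trace is p; for PLS the same perturbation gives the data-dependent DoF estimate of the proposition.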
Computational Details I (K. & Braun, 2007)
Case m = 1: we compute the derivative along the lines of the PLS algorithm.
1. w_1 = Xᵀ y  ⇒  ∂w_1/∂y = Xᵀ
2. t_1 = X w_1  ⇒  ∂t_1/∂y = X ∂w_1/∂y = XXᵀ
3. ŷ_1 = P_{t_1} y  ⇒  ∂ŷ_1/∂y = (1/‖t_1‖²) ((t_1ᵀy) I_n + t_1 yᵀ)(I_n − P_{t_1}) ∂t_1/∂y + P_{t_1}
Case m > 1: we rearrange the algorithm in terms of projections P onto the vectors t_i.
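The closed-form derivative for m = 1 can be written down directly and verified against finite differences. A NumPy sketch (no intercept, unnormalized w_1 as in step 1; function name ours):

```python
import numpy as np

def pls1_jacobian(X, y):
    """Jacobian d yhat_1 / d y for one PLS component, via the chain rule:
    d t1/d y = X X', and
    d yhat1/d y = ((t1'y) I + t1 y') (I - P_t1) (X X') / ||t1||^2 + P_t1."""
    n = len(y)
    t = X @ (X.T @ y)                 # t1 with unnormalized w1 = X'y
    P = np.outer(t, t) / (t @ t)      # projection P_t1 onto t1
    I = np.eye(n)
    return ((t @ y) * I + np.outer(t, y)) @ (I - P) @ (X @ X.T) / (t @ t) + P
```

Note that yhat_1 only depends on the direction of t_1, so skipping the normalization of w_1 leaves the Jacobian unchanged.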
Computational Details II (K., Sugiyama, & Braun, 2009)
Define the empirical scatter matrices s = Xᵀy and S = XᵀX.
w_1, …, w_m is an orthogonal basis of the Krylov space
K_m(S, s) = span(s, Ss, …, S^{m−1}s),
i.e. PLS computes the orthogonal Gram-Schmidt basis of s, Ss, …, S^{m−1}s.
Minimization property: with K_m := (s, Ss, …, S^{m−1}s),
β̂_m = argmin_{β ∈ K_m(S,s)} ‖y − Xβ‖² = K_m (K_mᵀ S K_m)^{-1} K_mᵀ s.
This yields an explicit formula for the trace of ∂ŷ_m/∂y (but not for the derivative itself).
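The minimization property gives a second, NIPALS-free route to the coefficients. A NumPy sketch (function name ours; the Krylov columns are orthonormalized for numerical stability, which leaves β̂_m unchanged since the formula only depends on the column span):

```python
import numpy as np

def pls_beta_krylov(X, y, m):
    """PLS coefficients via the Krylov-space formula:
    beta_m = K (K'SK)^{-1} K's with s = X'y, S = X'X and K an
    (orthonormalized) basis of K_m(S, s) = span(s, Ss, ..., S^{m-1}s)."""
    s = X.T @ y
    S = X.T @ X
    cols = [s]
    for _ in range(m - 1):
        cols.append(S @ cols[-1])                  # next Krylov vector
    K = np.linalg.qr(np.column_stack(cols))[0]     # orthonormal basis of K_m(S, s)
    return K @ np.linalg.solve(K.T @ S @ K, K.T @ s)
```

For m = 1 this reduces to β̂_1 = s‖s‖²/(sᵀSs), and for m = rank(X) the Krylov space generically fills ℝ^p, so the formula recovers ordinary least squares.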
Summary: Computational Details
Two equivalent algorithms are implemented in the R package plsdof:
my.pls <- pls.model(X, y, compute.DoF=TRUE, compute.jacobian=TRUE)
Option: the runtime scales either in p (number of variables) or in n (number of observations).
- Iterative computation of ŷ_m: also yields the Jacobian A = ∂β̂_m/∂y, and hence confidence intervals via côv(β̂_m) = σ² AAᵀ.
- Projection on Krylov subspaces: no Jacobian, hence no confidence intervals for β̂_m.
The runtime comparison and further details can be found in the paper.
Shape of the DoF Curve
Benchmark data with different correlation structures:

data set  | description             | variables | examples | training examples | correlation
kin (fh)  | dynamics of a robot arm | 32        | 8192     | 60                | low
boston    | census data for housing | 13        | 506      | 50                | medium
cookie    | near infrared spectra   | 700       | 70       | 39                | high

pls.object <- pls.model(X, y, compute.DoF=TRUE)
[Figure: estimated Degrees of Freedom versus number of components for kin (fh), boston, and cookie; the estimates typically lie above the naive line DoF(m) = m + 1.]
A Lower Bound for the Degrees of Freedom
The lower the collinearity, the higher the Degrees of Freedom.
Theorem
If the largest eigenvalue λ_max of the sample correlation matrix S fulfills 2λ_max ≤ trace(S), then
DoF(m = 1) ≥ 1 + trace(S)/λ_max.
Shape of the DoF Curve (with lower bound)
Same benchmark data (kin (fh), boston, cookie) and the same call pls.object <- pls.model(X, y, compute.DoF=TRUE).
[Figure: estimated Degrees of Freedom versus number of components for the three data sets, together with the lower bound and the naive line DoF(m) = m + 1.]
Comparison of Regression Methods: ozone
L.A. ozone pollution data: p = 12 variables, n = 203 observations, n_train = 50 training observations.
Comparison of
1. Partial Least Squares (PLS)
2. Principal Components Regression (PCR)
3. Ridge Regression: β̂_ridge = argmin_β { ‖y − Xβ‖² + λ‖β‖² }
Comparison of Regression Methods: ozone
[Figure: boxplots of the mean squared error (PLS, PCR, Ridge), the selected number of components (PLS, PCR), and the selected Degrees of Freedom (PLS, PCR, Ridge).]
There is no difference with respect to the mean squared error.
A direct comparison of the model parameters (number of components versus λ) is not possible.
Degrees of Freedom enable a fair model comparison between PLS and PCR.
Comparison of Regression Methods: ozone
[Figure: training error versus number of components and versus Degrees of Freedom, for PLS and PCR.]
"PLS fits closer than PCR." (de Jong)
However, there is no clear difference with respect to the Degrees of Freedom: PLS puts its focus on more complex models.
Variable Selection for PLS
PLS does not select variables.
Extensions to sparse PLS:
1. thresholding of the weight vectors (Saigo, K., & Tsuda, 2008)
2. sparsity constraints on the weight vectors (Le Cao et al., 2008; Chun & Keles, 2010)
3. shrinkage (Kondylis & Whittaker, 2007)
4. …
Classical approaches:
5. bootstrapping (R packages pls and ppls)
6. hypothesis testing based on the distribution of β̂
Approximate Distribution of β̂
Recall: the regression coefficients are a non-linear function of y.
First-order Taylor approximation:
β̂ ≈ (∂β̂/∂y) y =: A y
Approximate covariance matrix:
côv(β̂) = σ̂² AAᵀ
The noise level can be estimated via
σ̂² = ‖y − ŷ‖² / (n − DoF).
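In code, the approximation amounts to a few lines once the Jacobian A is available. A sketch (function name ours; plain normal quantile, not the multiplicity-corrected intervals of multcomp):

```python
import numpy as np

def approx_confints(A, y, yhat, dof, z=1.96):
    """Approximate coefficient confidence intervals from the first-order
    Taylor approximation beta_hat ~ A y with cov(beta_hat) ~ sigma^2 A A'.
    z = 1.96 is the two-sided 95% normal quantile (an assumption of this
    sketch); sigma^2 is estimated as ||y - yhat||^2 / (n - DoF)."""
    n = len(y)
    sigma2 = np.sum((y - yhat) ** 2) / (n - dof)   # noise level estimate
    beta = A @ y
    se = np.sqrt(sigma2 * np.sum(A * A, axis=1))   # sqrt(sigma^2 * diag(A A'))
    return np.column_stack([beta - z * se, beta + z * se])
```

As a sanity check: for least squares, A = (XᵀX)^{-1}Xᵀ and DoF = p, so the intervals reduce to the classical ones with standard errors σ̂² diag((XᵀX)^{-1}).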
Confidence Intervals for PLS: ozone and tecator
Implemented in the R packages plsdof and multcomp:
cv.object <- pls.cv(X, y, compute.covariance=TRUE)
my.multcomp <- glht(cv.object, ...)
[Figure: 95% family-wise confidence intervals for the coefficients of the ozone data (12 variables) and the tecator data (100 variables).]
The function corrects for multiple comparisons.
The computational cost is high for a large number of variables.
Model Selection
Comparison of
1. 10-fold cross-validation (gold standard):
   cv.object <- pls.cv(X, y, k=10)
2. Bayesian Information Criterion with our DoF estimate:
   bic.object <- pls.ic(X, y, criterion="bic")
3. Bayesian Information Criterion with the naive estimate DoF = m + 1:
   naive.object <- pls.ic(X, y, criterion="bic", naive=TRUE)
Akaike Information Criterion and Minimum Description Length are also available in the R package.
Data sets: kin (fh), boston, cookie
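The BIC variants above amount to a grid search over the number of components, using the criterion from the earlier slide. A minimal sketch (names ours; the naive variant simply plugs in DoF = m + 1):

```python
import numpy as np

def select_m_bic(y, fits, dofs, sigma2):
    """Choose the number of PLS components by minimizing
    BIC(m) = ||y - yhat_m||^2 / n + log(n) * sigma2 * DoF(m) / n.
    `fits[k]` / `dofs[k]` are the fitted values and DoF estimate of the
    k-th candidate model."""
    n = len(y)
    bic = [np.sum((y - f) ** 2) / n + np.log(n) * sigma2 * d / n
           for f, d in zip(fits, dofs)]
    return int(np.argmin(bic)), bic
```

Because the naive DoF = m + 1 understates the true complexity, the penalty term is too small and the criterion tends to pick larger m, which is exactly the effect compared on the next slides.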
Prediction Accuracy
[Figure: boxplots of the test error for 10-fold cross-validation (CV), BIC with our DoF estimate, and BIC with the naive estimate DoF = m + 1, on kin (fh), boston, and cookie.]
1. All three approaches obtain similar accuracy.
2. There is no clear difference between BIC and naive BIC.
The plots look similar for the selected Degrees of Freedom.
Model Complexity (selected components)
[Figure: boxplots of the selected number of components for 10-fold cross-validation (CV), BIC with our DoF estimate, and BIC with the naive estimate DoF = m + 1, on kin (fh), boston, and cookie.]
1. BIC selects less complex models than naive BIC.
2. There is no clear difference between BIC and CV.
The plots look similar for the selected Degrees of Freedom.
Why Not Use the Naive Approach?
The naive approach selects more components, but the mean squared error is not higher. No overfitting?
The test error curve can be flat or steep around the optimum.
[Figure: scaled test error versus number of components for two settings (d = 50 and d = 210): one curve is flat around its optimum, the other steep.]
Depending on the form of the curve, the selection of too complex models does lead to overfitting. More details in the paper.
Summary
Partial Least Squares…
- typically consumes more than one Degree of Freedom for each component;
- DoF̂(m) = trace(∂ŷ_m/∂y) is a precise estimate of its intrinsic complexity.
Its Degrees of Freedom…
- allow us to compare different regression methods;
- select less complex models than the naive estimate (when combined with information criteria).
Variables can be selected…
- by constructing approximate confidence intervals.
References
Krämer, N. and Sugiyama, M. (2011). The Degrees of Freedom of Partial Least Squares Regression. Journal of the American Statistical Association, in press.
Krämer, N. and Braun, M. L. (2010). plsdof: Degrees of Freedom and Confidence Intervals for Partial Least Squares. R package version 0.2-2.
Krämer, N., Sugiyama, M., and Braun, M. L. (2009). Lanczos Approximations for the Speedup of Kernel Partial Least Squares Regression. Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS).
Krämer, N. and Braun, M. L. (2007). Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection. 24th International Conference on Machine Learning (ICML).