The Degrees of Freedom of Partial Least Squares Regression


The Degrees of Freedom of Partial Least Squares Regression
Dr. Nicole Krämer, TU München
5th ESSEC-SUPELEC Research Workshop, May 20, 2011

My talk is about... the statistical analysis of Partial Least Squares Regression:
1. intrinsic complexity of PLS
2. comparison of regression methods
3. model selection
4. variable selection based on confidence intervals

Example: Near Infrared Spectroscopy
Predict Y, the percentage of water in meat, based on X, its near-infrared spectrum. There is an unknown linear relationship
f(x) = \beta_0 + \langle \beta, x \rangle = \beta_0 + \sum_{j=1}^{p} \beta_j x^{(j)}.
We observe y_i \approx f(x_i), i = 1, \ldots, n, with centered data
X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}, \quad y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n.

Partial Least Squares in one Slide
Partial Least Squares (PLS) = supervised dimensionality reduction + least squares regression.
[Diagram: the n x p matrix X is reduced to n x m components T via supervised dimensionality reduction; T is then regressed on y.]
The PLS components T have maximal covariance with the response variable y. The m \le p components T are used as new predictor variables in a least-squares fit.

Partial Least Squares Algorithm
Recall: PLS = supervised dimensionality reduction + least squares regression, where the components T have maximal covariance with y.
Algorithm (NIPALS). Set X_1 = X. For i = 1, \ldots, m (m is the model parameter):
1. w_i = X_i^\top y / \|X_i^\top y\|  (maximizes the covariance \mathrm{cov}(X_i w, y) over \|w\| = 1)
2. t_i = X_i w_i  (latent component)
3. X_{i+1} = X_i - t_i t_i^\top X_i / \|t_i\|^2  (enforce orthogonality)
Return T = (t_1, \ldots, t_m) and W = (w_1, \ldots, w_m); the regression coefficients are
\beta_m = W \left( T^\top X W \right)^{-1} T^\top y.
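The NIPALS steps above can be sketched in a few lines. This is a hedged Python/NumPy translation for illustration (the talk's own software is the R package plsdof; the function name `nipals_pls` is invented here):

```python
import numpy as np

def nipals_pls(X, y, m):
    """Minimal NIPALS sketch: supervised dimensionality reduction,
    then a least-squares fit via beta_m = W (T' X W)^{-1} T' y."""
    Xi = X - X.mean(axis=0)               # centered predictors X_1
    yc = y - y.mean()                     # centered response
    T, W = [], []
    for _ in range(m):
        w = Xi.T @ yc                     # weight vector, proportional to cov with y
        w = w / np.linalg.norm(w)
        t = Xi @ w                        # latent component
        Xi = Xi - np.outer(t, t @ Xi) / (t @ t)   # deflation: enforce orthogonality
        T.append(t); W.append(w)
    T, W = np.column_stack(T), np.column_stack(W)
    Xc = X - X.mean(axis=0)
    beta = W @ np.linalg.solve(T.T @ Xc @ W, T.T @ yc)
    return T, W, beta

# small demonstration on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5)); y = rng.standard_normal(30)
T, W, beta = nipals_pls(X, y, 3)
print(np.round(T.T @ T, 8))   # off-diagonals are (numerically) zero: components orthogonal
```

With m equal to the number of variables, this sketch reduces to ordinary least squares on the centered data, which is a convenient sanity check.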

Degrees of Freedom: Why are they important?
Degrees of Freedom (DoF):
1. capture the intrinsic complexity of a regression method in the model Y_i = f(x_i) + \varepsilon_i, \varepsilon_i \sim N(0, \sigma^2);
2. are used for model selection: test error \approx training error + complexity(DoF).
Example: Bayesian Information Criterion
BIC = \frac{\|y - \hat{y}\|^2}{n} + \log(n) \, \mathrm{var}(\varepsilon) \, \frac{DoF}{n}
Further examples: Akaike Information Criterion, Minimum Description Length, ...

Definition: Degrees of Freedom
Assumption: the regression method is linear, i.e. \hat{y} = H y, with H independent of y.
The Degrees of Freedom of a linear fitting method are DoF = \mathrm{trace}(H).
Examples: Principal Components Regression with m components has DoF(m) = 1 + m; further examples are Ridge Regression, smoothing splines, ...
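For a linear method, DoF = trace(H) can be checked directly. A Python sketch for Principal Components Regression on synthetic data (all names illustrative), recovering DoF(m) = 1 + m, where the extra 1 comes from the intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 50, 10, 3
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)

# PCR: project onto the first m principal components, then fit least squares;
# the intercept (mean of y) contributes one further degree of freedom.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :m] * s[:m]                                  # principal component scores
H = np.ones((n, n)) / n + T @ np.linalg.solve(T.T @ T, T.T)   # hat matrix
print(np.trace(H))   # trace(H) = 1 + m
```

Since the trace is linear, the intercept part contributes exactly 1 and the rank-m projection contributes exactly m, matching the slide's DoF(m) = 1 + m.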

Naive Degrees of Freedom for PLS?
Recall: PLS is not linear in y:
\hat{y}_m = \bar{y} \mathbf{1} + \underbrace{T (T^\top T)^{-1} T^\top}_{=: \widetilde{H}} y, \quad \text{where } \widetilde{H} \text{ depends on } y.
If we ignore the nonlinearity, 1 + \mathrm{trace}(\widetilde{H}) = 1 + m, and we obtain
DoF_{naive}(m) = 1 + m.

Degrees of Freedom for PLS (K. & Braun, 2007; K. & Sugiyama, 2011)
The generalized Degrees of Freedom of a regression method are
DoF = E_Y \left[ \mathrm{trace} \left( \frac{\partial \hat{y}}{\partial y} \right) \right].
This coincides with the previous definition if the method is linear.
Proposition: An unbiased estimate of the Degrees of Freedom of PLS is
\widehat{DoF}(m) = \mathrm{trace} \left( \frac{\partial \hat{y}_m}{\partial y} \right).
We need to compute (the trace of) the first derivative.
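The trace of the Jacobian \partial \hat{y}_m / \partial y can be approximated numerically, which makes the definition concrete. A Python sketch for one PLS component (finite differences are only a sanity check here, not the paper's analytic computation; all names are illustrative):

```python
import numpy as np

def pls1_fit(X, y):
    """Fitted values of one-component PLS on centered data."""
    Xc = X - X.mean(axis=0); yc = y - y.mean()
    w = Xc.T @ yc                       # first PLS weight (up to scaling)
    t = Xc @ w                          # first latent component
    return y.mean() + t * (t @ yc) / (t @ t)

rng = np.random.default_rng(2)
n, p = 40, 6
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# trace of d y_hat / d y via central finite differences
eps, dof = 1e-5, 0.0
for i in range(n):
    e = np.zeros(n); e[i] = eps
    dof += (pls1_fit(X, y + e)[i] - pls1_fit(X, y - e)[i]) / (2 * eps)
print(dof)   # for low-collinearity data, larger than the naive value 1 + m = 2
```

That the estimate exceeds 2 for nearly uncorrelated predictors anticipates the lower bound given later in the talk.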

Computational Details I (K. & Braun, 2007)
Case m = 1: we compute the derivative along the lines of the PLS algorithm:
1. w_1 = X^\top y, hence \partial w_1 / \partial y = X^\top
2. t_1 = X w_1, hence \partial t_1 / \partial y = X \, \partial w_1 / \partial y = X X^\top
3. \hat{y}_1 = P_{t_1} y, hence
\frac{\partial \hat{y}_1}{\partial y} = \frac{1}{\|t_1\|^2} \left( (t_1^\top y) I_n + t_1 y^\top \right) (I_n - P_{t_1}) \, \frac{\partial t_1}{\partial y} + P_{t_1}
Case m > 1: we rearrange the algorithm in terms of projections P onto the vectors t_i.

Computational Details II (K., Sugiyama, & Braun, 2009)
Define the empirical scatter matrices s = X^\top y and S = X^\top X.
Then w_1, \ldots, w_m is an orthogonal basis of the Krylov space
K_m(S, s) = \mathrm{span} \left( s, S s, \ldots, S^{m-1} s \right);
that is, PLS computes an orthogonal Gram-Schmidt basis of s, S s, \ldots, S^{m-1} s. With the Krylov matrix K_m = (s, S s, \ldots, S^{m-1} s), PLS has the minimization property
\beta_m = \arg\min_{\beta \in K_m(S,s)} \|y - X\beta\|^2 = K_m \left( K_m^\top S K_m \right)^{-1} K_m^\top s.
This yields an explicit formula for the trace of \partial \hat{y}_m / \partial y (but not for the derivative itself).
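The minimization property can be verified numerically. A hedged Python sketch (the Krylov columns are normalized for numerical stability, which leaves their span, and hence beta_m, unchanged):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 30, 5, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

s, S = X.T @ y, X.T @ X               # empirical scatter matrices
# Krylov matrix K_m = (s, S s, ..., S^{m-1} s), columns normalized
cols, v = [], s
for _ in range(m):
    cols.append(v / np.linalg.norm(v))
    v = S @ v
K = np.column_stack(cols)

# beta_m = K_m (K_m' S K_m)^{-1} K_m' s: least squares restricted to K_m(S, s)
beta = K @ np.linalg.solve(K.T @ S @ K, K.T @ s)
grad = K.T @ (s - S @ beta)           # normal equations restricted to the Krylov space
print(np.abs(grad).max())             # ~ 0: beta minimizes ||y - X beta||^2 over K_m(S, s)
```

The vanishing restricted gradient is exactly the first-order condition of the minimization property on the slide.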

Summary: Computational Details
Two equivalent algorithms are implemented in the R package plsdof:
my.pls <- pls.model(X, y, compute.DoF=TRUE, compute.jacobian=TRUE)
Option: the runtime scales either in p (number of variables) or in n (number of observations).
- Iterative computation of \partial \hat{y}_m / \partial y: also yields \beta_m = A y and confidence intervals for \beta_m via \widehat{\mathrm{cov}}(\beta_m) = \sigma^2 A A^\top, at a higher run time.
- Projection on Krylov subspaces: faster, but no confidence intervals.
The details can be found in the paper.

Shape of the DoF-Curve
Benchmark data with different correlation structures:

data set   description               variables   observations   training observations   correlation
kin (fh)   dynamics of a robot arm   32          8192           60                      low
boston     census data for housing   13          506            50                      medium
cookie     near infrared spectra     700         70              39                      high

pls.object <- pls.model(X, y, compute.DoF=TRUE)
[Figure: estimated Degrees of Freedom vs. number of components for kin (fh), boston, and cookie, each compared to the naive line DoF(m) = m + 1.]

A Lower Bound for the Degrees of Freedom
The lower the collinearity, the higher the Degrees of Freedom.
Theorem: If the largest eigenvalue \lambda_{max} of the sample correlation matrix S fulfills
2 \lambda_{max} \le \mathrm{trace}(S),
then
DoF(m = 1) \ge 1 + \frac{\mathrm{trace}(S)}{\lambda_{max}}.

Shape of the DoF-Curve (continued)
[Figure: the same three DoF curves for kin (fh), boston, and cookie as before, now with the lower bound from the theorem added alongside the naive line DoF(m) = m + 1.]

Comparison of Regression Methods: ozone
L.A. ozone pollution data: p = 12 variables, n = 203 observations, n_{train} = 50 training observations.
Comparison of:
1. Partial Least Squares (PLS)
2. Principal Components Regression (PCR)
3. Ridge Regression: \beta_{ridge} = \arg\min_{\beta} \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|^2 \right\}

Comparison of Regression Methods: ozone
[Figure: boxplots of the mean-squared error (PLS, PCR, Ridge), the number of components (PLS, PCR), and the Degrees of Freedom (PLS, PCR, Ridge).]
There is no difference with respect to mean-squared error. A direct comparison of the model parameters (number of components and \lambda) is not possible; the Degrees of Freedom enable a fair model comparison between PLS and PCR.

Comparison of Regression Methods: ozone
[Figure: training error of PLS and PCR, plotted against the number of components (left) and against the Degrees of Freedom (right).]
"PLS fits closer than PCR" (de Jong). However, there is no clear difference with respect to the Degrees of Freedom: PLS puts its focus on more complex models.

Variable Selection for PLS
PLS does not select variables.
Extensions to sparse PLS:
1. thresholding of the weight vectors (Saigo, K., & Tsuda, 2008)
2. sparsity constraints on the weight vectors (Lê Cao et al., 2008; Chun & Keleş, 2010)
3. shrinkage (Kondylis & Whittaker, 2007)
4. ...
Classical approaches:
5. bootstrapping (R packages pls and ppls)
6. hypothesis testing based on the distribution of \hat{\beta}

Approximate Distribution of \hat{\beta}
Recall: the regression coefficients are a non-linear function of y. First order Taylor approximation:
\hat{\beta} \approx \underbrace{\frac{\partial \hat{\beta}}{\partial y}}_{=: A} \, y
Approximate covariance matrix:
\widehat{\mathrm{cov}}(\hat{\beta}) = \sigma^2 A A^\top
The noise level can be estimated via
\hat{\sigma}^2 = \frac{\|y - \hat{y}\|^2}{n - DoF}.

Confidence Intervals for PLS: ozone and tecator
Implemented in the R packages plsdof and multcomp:
cv.object <- pls.cv(X, y, compute.covariance=TRUE)
my.multcomp <- glht(cv.object, ...)
[Figure: 95% family-wise confidence intervals for the regression coefficients of the ozone data (12 variables, left) and the tecator data (100 variables, right).]
The function corrects for multiple comparisons. The computational cost is high for a large number of variables.

Model Selection
Comparison of:
1. 10-fold cross-validation (gold standard): cv.object <- pls.cv(X, y, k=10)
2. Bayesian Information Criterion with our DoF estimate: bic.object <- pls.ic(X, y, criterion="bic")
3. Bayesian Information Criterion with the naive estimate DoF = m + 1: naive.object <- pls.ic(X, y, criterion="bic", naive=TRUE)
Akaike Information Criterion and Minimum Description Length are also available in the R package.
Data sets: kin (fh), boston, cookie.

Prediction Accuracy
[Figure: boxplots of the test error on kin (fh), boston, and cookie for 10-fold cross-validation, BIC with our DoF estimate, and BIC with the naive estimate DoF = m + 1.]
1. All three approaches obtain similar accuracy.
2. There is no clear difference between BIC and naive BIC.
The plots look similar for the selected Degrees of Freedom.

Model Complexity (selected components)
[Figure: boxplots of the number of selected components on kin (fh), boston, and cookie for 10-fold cross-validation, BIC with our DoF estimate, and BIC with the naive estimate DoF = m + 1.]
1. BIC selects less complex models than naive BIC.
2. There is no clear difference between BIC and CV.
The plots look similar for the selected Degrees of Freedom.

Why not use the naive approach?
The naive approach selects more components, yet the mean-squared error is not higher. No overfitting? The test error curve can be flat or steep around the optimum.
[Figure: scaled test error vs. number of components for two settings (d = 50 and d = 210); one curve is flat around its optimum, the other steep.]
Depending on the form of the curve, the selection of too complex models does lead to overfitting. More details in the paper.

Summary
Partial Least Squares typically consumes more than one Degree of Freedom for each component. We obtain a precise estimate of its intrinsic complexity:
\widehat{DoF}(m) = \mathrm{trace} \left( \frac{\partial \hat{y}_m}{\partial y} \right)
Its Degrees of Freedom allow us to compare different regression methods, and they select less complex models than the naive estimate (when combined with information criteria). Variables can be selected by constructing approximate confidence intervals.

References
Krämer, N. and Sugiyama, M. (2011). The Degrees of Freedom of Partial Least Squares Regression. Journal of the American Statistical Association, in press.
Krämer, N. and Braun, M. L. (2010). plsdof: Degrees of Freedom and Confidence Intervals for Partial Least Squares. R package version 0.2-2.
Krämer, N., Sugiyama, M., and Braun, M. L. (2009). Lanczos Approximations for the Speedup of Kernel Partial Least Squares Regression. Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS).
Krämer, N. and Braun, M. L. (2007). Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection. 24th International Conference on Machine Learning (ICML).