Getting Started with Correlated Component Regression (CCR) in XLSTAT-CCR


Dataset for running Correlated Component Regression

This tutorial (Tutorial 1) is based on data provided by Michel Tenenhaus and used in Magidson (2011), Correlated Component Regression: A Sparse Alternative to PLS Regression, 5th ESSEC-SUPELEC Statistical Workshop on PLS (Partial Least Squares) Developments.

The data consist of N=24 car models, the dependent variable PRICE (price of a car), and P=6 explanatory variables (predictors), each of which has a positive correlation with PRICE:

Explanatory Variable                                Correlation with PRICE
CYLINDER (engine measured in cubic centimeters)     .85
POWER (horsepower)                                  .89
SPEED (top speed in kilometers/hour)                .72
WEIGHT (kilograms)                                  .81
LENGTH (centimeters)                                .75
WIDTH (centimeters)                                 .61

but each predictor also has a moderate correlation with the other predictor variables:

Predictor   CYLINDER   POWER   SPEED   WEIGHT   LENGTH
CYLINDER    1
POWER       .86        1
SPEED       .69        .89     1
WEIGHT      .90        .75     .49     1
LENGTH      .86        .69     .53     .92      1
WIDTH       .71        .55     .36     .79      .86

An Excel sheet containing both the data and the results for use in this tutorial can be downloaded by clicking here.

Note: To reproduce the results shown in this tutorial exactly, you will need to fix the seed to 123456789. To fix the seed in XLSTAT, go to Options, then click on the Advanced tab. Check the box to activate the option 'Fix the seed to:', and change the seed to 123456789.
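As a quick check on the predictor correlations tabulated above, the following Python sketch assembles the lower-triangular values into a full symmetric matrix and summarizes them. It is only an illustration: NumPy and all variable names are assumptions of this sketch and are not part of XLSTAT-CCR.

```python
import numpy as np

names = ["CYLINDER", "POWER", "SPEED", "WEIGHT", "LENGTH", "WIDTH"]

# Lower triangle of the predictor correlation table, row by row.
lower = [
    [],
    [0.86],
    [0.69, 0.89],
    [0.90, 0.75, 0.49],
    [0.86, 0.69, 0.53, 0.92],
    [0.71, 0.55, 0.36, 0.79, 0.86],
]

# Fill a symmetric matrix with ones on the diagonal.
R = np.eye(len(names))
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

# Summarize the 15 distinct off-diagonal correlations.
off_diag = R[np.tril_indices(len(names), k=-1)]
mean_r = off_diag.mean()

# Locate the most strongly correlated pair of predictors.
i_max, j_max = np.unravel_index(np.argmax(R - np.eye(len(names))), R.shape)
strongest_pair = (names[i_max], names[j_max])

print(f"mean off-diagonal correlation: {mean_r:.2f}")
print(f"strongest pair: {strongest_pair}, r = {R[i_max, j_max]:.2f}")
```

The summary makes the degree of overlap among the predictors explicit, which is the condition under which the tutorial expects OLS coefficients to become unstable.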

Goal of CCR for this example

CCR applies the proper amount of regularization to reduce the confounding effects of high predictor correlation. Compared with traditional OLS regression, this yields more interpretable regression coefficients and better predictions, and allows more significant predictors to be included in the model.

The OLS regression solution maximizes R² in the training sample, yielding R² = .85. However, since this solution is based on a relatively small sample (N=24) and correlated predictors, it is likely that this model overfits the data and that .85 is an overly optimistic estimate of the true population R². Consistent with an overfit model, Table 1 shows that the OLS solution yields large standard errors and unrealistic negative coefficient estimates for the predictors CYLINDER, SPEED, and WIDTH.

OLS Regression
              Unstandardized Coefficients    Standardized Coefficients
              B            Std. Error        Beta        t        Sig.
(Constant)    12070.41     194786.56                     .06      .95
CYLINDER      -1.94        33.62             -.02        -.06     .95
POWER         1315.91      613.51            .89         2.14     .05
SPEED         -472.51      740.32            -.21        -.64     .53
WEIGHT        45.92        100.05            .18         .46      .65
LENGTH        209.65       504.15            .15         .42      .68
WIDTH         -505.43      1501.59           -.07        -.34     .74

Table 1: Results from traditional OLS regression: CV-R² = 0.63

Moreover, POWER is the only predictor that achieves statistical significance (p=.05) according to the traditional t-test.

CCR utilizes the cross-validated R² (CV-R²) as its criterion for determining the proper amount of regularization (the number of components, K) to use in a regression model. Fig. 1 shows that substantial decay in CV-R² occurs for K>2. Thus, a substantial amount of regularization is required (K<3) to obtain a reliable result. Since OLS regression applies no regularization at all (K=P=6), this plot indicates that the OLS model is overfit, and that the CCR model (with K=2) should predict PRICE better than OLS regression when applied out-of-sample to new data. The results based on all 6 predictors: CV-R² = .75 for CCR compared to .63 for OLS regression.

Fig. 1. Cross-Validation Component (CV-R²) Plot showing deterioration for K>2

Also, in contrast to OLS regression, which yields some negative coefficient estimates, CCR yields more reasonable positive coefficients for all 6 predictors, as shown below.

Predictor     B          Beta
CYLINDER      20.9       0.19
POWER         545.5      0.37
SPEED         445.7      0.20
WEIGHT        43.4       0.17
LENGTH        32.6       0.02
WIDTH         343.6      0.05
(Constant)    -177941

Table 2. CCR solution with K=2 components.

The first part of this tutorial shows how to use XLSTAT-CCR to obtain these results. The second part (see Activating the Step-down Algorithm) shows how to activate the CCR step-down procedure to eliminate extraneous predictors and obtain even better results (CV-R² = .77), as indicated in the following table.

CV-R² = 0.77

Predictor     B          Beta
POWER         673.3      0.45
SPEED         222.9      0.10
WEIGHT        110.9      0.44
(Constant)    -115044

Table 3. Results from CCR with step-down algorithm

Setting up a Correlated Component Regression

To activate the Correlated Component Regression dialog box, first start XLSTAT by clicking on the XLSTAT button in the Excel toolbar, then select the XLSTAT / Modeling data / Correlated Component Regression command in the Excel menu, or click the corresponding button on the Modeling data toolbar.

Once you have clicked the button, the Correlated Component Regression dialog box is displayed with Method=CCR.LM (linear regression model) selected by default.

Fig. 2: General Tab

In the Y / Dependent variables field, use your mouse to select the variable PRICE (see the tutorial on Selecting data for more information on this topic). The prices are the "Ys" of the model, as we want to predict these prices as a linear function of the other car attributes. In the X / Predictors field, select the other 6 car attributes. The names of the car models (MODEL) are selected as Observation labels.

To obtain the OLS regression solution, fix the number of components at 6, so that it equals the number of predictors. To do this, in the Options tab set Number of components to 6 and uncheck Automatic. Make sure that the remaining settings in the Options tab are as shown below.

Fig. 3: Options Tab

The computations are fast and start when you click OK.

Interpreting CCR Model Output

Following the basic statistics output section, the coefficients (unstandardized and standardized) are presented. For example, Table 3A presents the unstandardized coefficients. Comparing Table 3A to Table 1, we see that the results match the OLS regression coefficients.

Unstandardized coefficients:
Variable      Coefficient
Intercept     12070.645
CYLINDER      -1.936
POWER         1315.907
SPEED         -472.509
WEIGHT        45.923
LENGTH        209.654
WIDTH         -505.431

Table 3A. Unstandardized coefficient estimates obtained from the 6-component (saturated) CCR model

These coefficients can be decomposed into parts associated with each of the 6 components, using the component weights provided in Table 3B and the component coefficients (loadings) provided in Table 3C.

Number of components: 6
Unstandardized component weights:
Component     Value
1             0.006
2             0.124
3             0.804
4             0.627
5             0.422
6             0.167

Table 3B. Unstandardized component weights

Unstandardized loadings:
Variable \ Component    1           2            3           4          5           6
CYLINDER                92.774      1.381        -3.728      -11.016    15.190      5.053
POWER                   1320.804    728.560      1228.528    472.999    -198.632    107.935
SPEED                   1642.054    239.324      -818.476    512.133    -367.926    -119.735
WEIGHT                  203.058     -4.066       88.336      -37.350    -3.215      -5.772
LENGTH                  1038.776    -563.277     52.095      283.260    154.288     -67.762
WIDTH                   4588.211    -1915.940    -191.909    -29.443    -343.730    136.449

Table 3C. Unstandardized loadings

For example, the coefficient -1.94 for CYLINDER can be decomposed as follows (up to rounding of the displayed weights):

-1.94 = .006*(92.774) + .124*(1.381) + .804*(-3.728) + .627*(-11.016) + .422*(15.190) + .167*(5.053)

Activating the Automatic and M-fold Cross-validation options

Re-open the CCR dialog box by selecting the Modeling data / Correlated Component Regression command in the Excel menu, or click the corresponding button on the Modeling data toolbar.

Since N is relatively small (N=24) and the correlation between the predictors is fairly high, this saturated regression model overfits these data. We will now activate the M-fold cross-validation (CV) option to show that this model is overfit, and that eliminating CCR components 3-6 provides the proper amount of regularization to produce more reliable results.

To allow CV to assess all possible degrees of regularization, we will estimate all 6 CCR models (K ≤ 6). We do this by activating the Automatic option in the Options tab. The number of folds M is generally taken to be between 5 and 10; we select M=6, which divides evenly into 24 so that each fold contains 4 cases. In the Validation tab we activate Cross-validation and request 100 rounds of 6 folds. By requesting more than 1 round, we obtain a standard error for the CV-R².
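The repeated M-fold procedure described above can be sketched in Python as follows. This is a minimal illustration for a plain OLS model, not XLSTAT's implementation: the function name, the synthetic data, and the definition of CV-R² as 1 minus the out-of-fold squared error divided by the total sum of squares are all assumptions of this sketch. Reusing the seed 123456789 here will not reproduce XLSTAT's fold assignments, because the random number generators differ.

```python
import numpy as np

def repeated_kfold_cv_r2(X, y, n_folds=6, n_rounds=100, seed=123456789):
    """Return the mean and standard deviation of CV-R^2 over several rounds.

    Hypothetical helper: each round shuffles the cases, splits them into
    n_folds folds, fits OLS on the training folds and scores the held-out fold.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    scores = np.empty(n_rounds)
    for r in range(n_rounds):
        folds = np.array_split(rng.permutation(n), n_folds)
        press = 0.0                            # out-of-fold squared error
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False
            A_train = np.column_stack([np.ones(train.sum()), X[train]])
            coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
            A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
            press += np.sum((y[test_idx] - A_test @ coef) ** 2)
        scores[r] = 1.0 - press / sst
    return scores.mean(), scores.std(ddof=1)

# Synthetic stand-in for the car data: 24 cases, 6 correlated predictors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(24, 1))
X = latent + 0.5 * rng.normal(size=(24, 6))
y = X.sum(axis=1) + rng.normal(size=24)

cv_mean, cv_sd = repeated_kfold_cv_r2(X, y)
print(f"CV-R^2 = {cv_mean:.3f} (std. dev. over rounds = {cv_sd:.3f})")
```

With a single round there is only one CV-R² value and no spread to report, which is why the tutorial requests multiple rounds.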

Fig. 4: Validation Tab

Note that activating the Automatic option also requests the Cross-Validation Component Plot shown earlier in Fig. 1 (this is checked in the Charts tab). Click OK to perform these analyses.

The Goodness of Fit Statistics show that the resulting model has K=2 components. For this model, the CV-R² increases to .750 with a standard error of only .014, providing a significant improvement over the OLS regression CV-R² = .63.

Unstandardized coefficients:
Variable      Coefficient
Intercept     -177941.121
CYLINDER      20.944
POWER         545.463
SPEED         445.654
WEIGHT        43.368
LENGTH        32.618
WIDTH         343.616

Table 4A. Coefficients obtained from the 2-component model

Number of components: 2
Unstandardized component weights:
Component     Value
1             0.221
2             0.349

Table 4B. Component weights obtained from the 2-component model

Unstandardized loadings:
Variable \ Component    1           2
CYLINDER                92.774      1.381
POWER                   1320.804    728.560
SPEED                   1642.054    239.324
WEIGHT                  203.058     -4.066
LENGTH                  1038.776    -563.277
WIDTH                   4588.211    -1915.940

Table 4C. Loadings obtained from the 2-component model

From the Coefficients Output in Tables 4A, 4B and 4C we see how the coefficients are now constructed based on only 2 components. For example, the coefficient for CYLINDER can be decomposed as follows (up to rounding of the displayed weights):

20.944 = .221*92.774 + .349*1.381
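The same decomposition can be applied to all six predictors at once as a matrix-vector product of the loadings (Table 4C) and the component weights (Table 4B). The Python sketch below is only a check of that arithmetic; NumPy and the variable names are assumptions of the sketch, and the small discrepancies it prints presumably come from the weights being displayed to three decimals.

```python
import numpy as np

names = ["CYLINDER", "POWER", "SPEED", "WEIGHT", "LENGTH", "WIDTH"]

# Table 4B: unstandardized component weights for the 2-component model.
weights = np.array([0.221, 0.349])

# Table 4C: unstandardized loadings (rows = predictors, columns = components).
loadings = np.array([
    [92.774, 1.381],
    [1320.804, 728.560],
    [1642.054, 239.324],
    [203.058, -4.066],
    [1038.776, -563.277],
    [4588.211, -1915.940],
])

# Table 4A: coefficients reported for the 2-component model.
reported = np.array([20.944, 545.463, 445.654, 43.368, 32.618, 343.616])

# Each coefficient is the weighted sum of that predictor's loadings.
rebuilt = loadings @ weights

for name, b_rebuilt, b_reported in zip(names, rebuilt, reported):
    print(f"{name:9s} rebuilt = {b_rebuilt:9.3f}   reported = {b_reported:9.3f}")
```

Swapping in the six weights of Table 3B and the six-column loadings of Table 3C rebuilds the saturated (OLS) coefficients in the same way.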

Activating the Step-down Algorithm

Re-open the CCR dialog box by selecting the Modeling data / Correlated Component Regression command in the Excel menu, or click the corresponding button on the Modeling data toolbar. To eliminate extraneous and weak predictors, we now activate the step-down algorithm in the Options tab, as shown below:

Figure 5. Options Tab

Activating the step-down option automatically requests the step-down predictor selection plot in the Charts tab and the Predictor Count table in the Output tab. Click OK to estimate the model. The predictor selection plot suggests that including 3 predictors in the model is optimal.

Figure 6. Cross-validation Step-down Plot

The Cross-validation Predictor Count table suggests that POWER and WEIGHT are the most important predictors, being included in 600 and 584, respectively, of the 600 cross-validated regressions (100 rounds of 6 folds). The grand total of 1800 counts predictor inclusions across those regressions.

Cross-validation predictor count table:
Predictor    Round_1   Round_2   Round_3   Round_4   Round_5   ...   Round_100   Total
POWER        6         6         6         6         6         ...   6           600
WEIGHT       6         6         6         6         6         ...   6           584
SPEED        1         2         2         3         0         ...   4           260
CYLINDER     4         4         3         3         0         ...   2           242
LENGTH       0         0         1         0         0         ...   0           69
WIDTH        1         0         0         0         0         ...   0           45
Total        18        18        18        18        12        ...   18          1800

The final model has CV-R² = .77 and includes the predictors POWER, SPEED and WEIGHT:

Goodness of fit statistics:
Statistic                 Value    Cross-validation    Std. dev. (CV)
Number of observations    24
Sum of weights            24
R²                        0.836    0.766               0.021

Predictors retained in the model: POWER, SPEED, WEIGHT

General Discussion and Additional Tutorials

Key driver regression attempts to ascertain the importance of several key explanatory variables (predictors) X1, X2, ..., XP that influence a dependent variable. For example, a typical dependent variable in key driver regression is Customer Satisfaction. Traditional OLS regression methods have difficulty with such derived importance tasks because the predictors usually have moderate to high correlation with each other. The resulting confounding makes parameter estimates unstable and thus unusable as measures of importance.

Correlated Component Regression (CCR) is designed to handle such problems, and as shown in Tutorial 2 it even works with high-dimensional data where there are more predictors than cases. Parameter estimates become more interpretable, and cross-validation is used to avoid over-fitting, thus producing better out-of-sample predictions.

CCR Tutorial 2: Using Correlated Component Regression with a Dichotomous Y and Many Correlated Predictors
CCR Tutorial 3: Obtaining Predictions from a 2-class Regression

Copyright 2011 Statistical Innovations Inc. All rights reserved.