Regression Models Course Project, 2016


Venkat Batchu
July 13, 2016

Executive Summary

In this report, the mtcars data set is analyzed for the relationship between the outcome variable mpg (miles per gallon) and a set of predictor and confounder variables. The main objectives of the study are as follows:

Is an automatic or manual transmission better for miles per gallon (MPG)?
How different is the MPG between automatic and manual transmissions?

Simple linear regression tells us that manual transmission cars give 7.25 more miles per gallon on average than automatic transmission cars. However, once the confounder variables weight, number of cylinders, and horsepower are taken into account, manual transmission cars give only 1.8 more miles per gallon.

    library(datasets)
    data(mtcars)
    dim(mtcars)
    [1] 32 11
    sapply(mtcars, class)
          mpg       cyl      disp        hp      drat        wt      qsec
    "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
           vs        am      gear      carb
    "numeric" "numeric" "numeric" "numeric"

We see that all variables, including our predictor variable am, are of class numeric. We convert the dichotomous predictor variable am to a factor and label its levels automatic and manual for better readability. We also convert some of the other confounder variables to factors.

    mtcars$am <- factor(mtcars$am, labels = c('automatic', 'manual'))
    mtcars$cyl <- factor(mtcars$cyl)
    mtcars$vs <- factor(mtcars$vs)
    mtcars$gear <- factor(mtcars$gear)
    mtcars$carb <- factor(mtcars$carb)

Exploratory Data Analysis
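A quick numeric look at the two transmission groups can complement the plots in this section. The following is a minimal sketch, not part of the original analysis: it works on a fresh copy of the data in a hypothetical variable named cars (so the factor conversions above are left untouched) and reports the group sizes and the mean mpg per group.

```r
# Sketch only: 'cars', 'am_f', 'counts' and 'group_means' are illustrative names.
cars <- datasets::mtcars

# Recode transmission as a labelled factor (0 = automatic, 1 = manual).
am_f <- factor(cars$am, labels = c("automatic", "manual"))

# Number of cars in each transmission group.
counts <- table(am_f)

# Mean mpg within each transmission group.
group_means <- tapply(cars$mpg, am_f, mean)

print(counts)
print(round(group_means, 2))

# Unadjusted difference in mean mpg (manual minus automatic).
print(round(diff(group_means), 2))
```

The unadjusted difference printed on the last line corresponds to the 7.25 miles per gallon quoted in the Executive Summary.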

As we are going to build a linear regression model, we need to make sure that the data meet its assumptions. Plot the outcome variable mpg to check its distribution:

    par(mfrow = c(1, 2))
    # Histogram with normal curve
    x <- mtcars$mpg
    h <- hist(x, breaks = 10, col = "blue", xlab = "Miles Per Gallon", main = "Histogram of Miles per Gallon")
    xfit <- seq(min(x), max(x), length.out = 40)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
    # Scale the density to the histogram's count axis
    yfit <- yfit * diff(h$mids[1:2]) * length(x)
    lines(xfit, yfit, col = "blue", lwd = 2)
    # Kernel density plot
    d <- density(mtcars$mpg)
    plot(d, xlab = "MPG", main = "Density Plot of MPG")

The distribution of mpg is close to normal and there are no outliers that skew our data. The boxplot of mpg by transmission type (see Appendix) also shows no outliers, and it suggests that manual transmission cars achieve higher MPG than automatic transmission cars.

Regression Analysis

In this section, we build linear regression models based on different variables and try to identify the best model. We then compare it against the base model using ANOVA and analyze the residuals of the final model.

Model Building and Selection

To identify predictors for our model, we look at how mpg correlates with all the other variables. Based on the correlation matrix (see Appendix), several variables appear to be highly correlated with mpg. We build an initial model with all the variables as predictors and perform stepwise model selection to choose the predictors for the final model. The step function fits a sequence of multiple regression models iteratively, adding and dropping variables (forward selection and backward elimination) and using AIC as the selection criterion.

    init_model <- lm(mpg ~ ., data = mtcars)
    best_model <- step(init_model, direction = "both")

The best model obtained from the above computations consists of the confounder variables cyl, wt and hp along with the predictor variable am.

    summary(best_model)

    Call:
    lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.9387 -1.2560 -0.4013  1.1253  5.0513

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
    cyl6        -3.03134    1.40728  -2.154  0.04068 *
    cyl8        -2.16368    2.28425  -0.947  0.35225
    hp          -0.03211    0.01369  -2.345  0.02693 *
    wt          -2.49683    0.88559  -2.819  0.00908 **
    ammanual     1.80921    1.39630   1.296  0.20646
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.41 on 26 degrees of freedom
    Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401
    F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

The adjusted R-squared is 0.84, so this model explains about 84% of the variability in mpg once the selected confounder variables are included. We compare the base model, which has am as its only predictor, against our best model.

    base_model <- lm(mpg ~ am, data = mtcars)
    anova(base_model, best_model)

    Analysis of Variance Table

    Model 1: mpg ~ am
    Model 2: mpg ~ cyl + hp + wt + am
      Res.Df    RSS Df Sum of Sq      F    Pr(>F)
    1     30 720.90
    2     26 151.03  4    569.87 24.527 1.688e-08 ***
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is significant, we reject the null hypothesis that the confounder variables do not improve the model.

Residuals and Diagnostics

In this section, we study the residual plots of our regression model and compute some regression diagnostics to identify notable outliers in the data set.

    par(mfrow = c(2, 2))
    plot(best_model, which = 1:4)

From the above plots, we can draw the following conclusions:

The points on the Residuals vs. Fitted plot appear to be randomly scattered, which is consistent with the independence assumption.
The points on the Normal Q-Q plot lie close to the line, which indicates that the residuals are approximately normally distributed.
The Scale-Location plot shows no clear trend in the spread of the residuals, which supports the constant variance assumption.
The fourth plot shows Cook's distance, which is a measure of the influence of each observation on the regression coefficients.

We now compute regression diagnostics of our model to find influential points and leverage points.

    influence <- dfbetas(best_model)
    tail(sort(influence[, 6]), 3)
    Chrysler Imperial          Fiat 128     Toyota Corona
            0.3507458         0.4292043         0.7305402
    leverage <- hatvalues(best_model)
    tail(sort(leverage), 3)
          Toyota Corona Lincoln Continental       Maserati Bora
              0.2777872           0.2936819           0.4713671

Inference

    t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

        Two Sample t-test

    data:  mpg by am
    t = -4.1061, df = 30, p-value = 0.000285
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -10.84837  -3.64151
    sample estimates:
    mean in group automatic    mean in group manual
                   17.14737                24.39231

This test shows that we should reject the null hypothesis and conclude that the mean MPG differs between manual transmission cars and automatic transmission cars.

Conclusions

Based on the observations from our best-fit model, we can draw the following conclusions:

Cars with manual transmission get about 1.8 more miles per gallon than cars with automatic transmission, adjusted for hp, cyl, and wt. This adjusted difference is not statistically significant in the model (p = 0.21).
mpg decreases by about 2.5 for every 1000 lb increase in wt (adjusted for hp, cyl, and am).
Compared with 4-cylinder cars, mpg is lower by about 3 for 6-cylinder cars and by about 2.2 for 8-cylinder cars (adjusted for hp, wt, and am).
mpg decreases only slightly with hp, by about 0.03 per unit of horsepower (adjusted for cyl, wt, and am).

Appendix

Explore how mpg varies by automatic versus manual transmission:

    boxplot(mpg ~ am, data = mtcars, col = c("red", "blue"), xlab = "Transmission", ylab = "Miles per Gallon", main = "MPG by Transmission Type")
    par(mar = c(1, 1, 1, 1))
    pairs(~ mpg + ., data = mtcars, panel = panel.smooth, main = "Pairs Plot for mtcars dataset")
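The two headline estimates in this report (the unadjusted and the adjusted effect of manual transmission) can be reproduced side by side. The sketch below is an illustration under stated assumptions, not the author's code: it refits both models on a fresh copy of the data, and the names cars, base_fit, adj_fit and ci_adj are hypothetical. It also adds a 95% confidence interval for the adjusted effect, which the report itself does not compute.

```r
# Sketch only: refit the base and selected models on a fresh copy of mtcars.
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

# Base model: transmission only.
base_fit <- lm(mpg ~ am, data = cars)

# Selected model from the report: cyl, hp and wt added as confounders.
adj_fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Estimated mpg difference (manual minus automatic) in each model.
unadjusted <- coef(base_fit)[["ammanual"]]
adjusted   <- coef(adj_fit)[["ammanual"]]
print(round(c(unadjusted = unadjusted, adjusted = adjusted), 2))

# 95% confidence interval for the adjusted effect.
ci_adj <- confint(adj_fit, "ammanual", level = 0.95)
print(round(ci_adj, 2))
```

Because the p-value for ammanual in the selected model is 0.206, the 95% interval for the adjusted effect includes zero.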

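Correlation of mpg with the other variables (the data are reloaded so that all columns are numeric again):

    data(mtcars)
    sort(cor(mtcars)[1, ])
            wt        cyl       disp         hp       carb       qsec
    -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840
          gear         am         vs       drat        mpg
     0.4802848  0.5998324  0.6640389  0.6811719  1.0000000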
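Sorting by signed value puts strong negative and strong positive correlations at opposite ends of the list. As a minimal sketch, assuming the aim is to rank predictors by strength of association regardless of direction, the correlations can be ordered by absolute value instead. The names cars, mpg_cor and ranked are hypothetical.

```r
# Sketch only: rank predictors by the absolute size of their correlation with mpg.
cars <- datasets::mtcars

# Correlation of every variable with mpg, dropping mpg's correlation with itself.
mpg_cor <- cor(cars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Order from strongest to weakest association, keeping the original signs.
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Pairwise correlations only describe marginal associations, so this ranking is a screening aid and not a substitute for the stepwise selection used in the report.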