Regression Models Course Project, 2016

Venkat Batchu

July 13, 2016

Executive Summary

In this report, the mtcars data set is analyzed for the relationship between the outcome variable mpg (miles per gallon) and a set of predictor and confounder variables. The main objectives of the study are as follows:

Is an automatic or manual transmission better for miles per gallon (MPG)?
How different is the MPG between automatic and manual transmissions?

Simple linear regression tells us that manual transmission cars give 7.25 more miles per gallon on average than automatic transmission cars. However, once we adjust for the confounders weight, number of cylinders and horsepower, manual transmission cars give only 1.8 more miles per gallon.

library(datasets)
data(mtcars)
dim(mtcars)

[1] 32 11

sapply(mtcars, class)

      mpg       cyl      disp        hp      drat        wt      qsec 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       vs        am      gear      carb 
"numeric" "numeric" "numeric" "numeric" 

All variables, including our predictor variable am, are of class numeric. We convert the dichotomous predictor am to a factor and label its levels automatic and manual for better readability. We also convert some of the other confounder variables to factors.

mtcars$am <- factor(mtcars$am, labels = c('automatic', 'manual'))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)

Exploratory Data Analysis
As we are going to build a linear regression model, we need to make sure that the data meet its assumptions. We plot the outcome variable mpg to check its distribution:

par(mfrow = c(1, 2))
# Histogram with normal curve
x <- mtcars$mpg
h <- hist(x, breaks = 10, col = "blue", xlab = "Miles per Gallon", main = "Histogram of Miles per Gallon")
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale the normal density to the histogram's count scale (bin width times n)
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
# Kernel density plot
d <- density(mtcars$mpg)
plot(d, xlab = "MPG", main = "Density Plot of MPG")

The distribution of mpg is close to normal, and there are no outliers that skew our data. The boxplot of mpg by transmission type (see Appendix) likewise shows no outliers, and it suggests that manual transmission cars achieve higher mpg than automatic transmission cars.

Regression Analysis
In this section, we build linear regression models based on different variables and try to identify the best one. We then compare it against the base model using ANOVA and, once the model is finalized, analyze its residuals.

Model Building and Selection

To identify predictors for our model, we look at how mpg correlates with all other variables (see Appendix). Based on the correlation matrix, several variables appear to be highly correlated with mpg. We build an initial model with all the variables as predictors and perform stepwise model selection to choose the predictors for the final model. The step function fits a sequence of multiple regression models, adding and dropping variables (forward selection and backward elimination), and keeps the model with the lowest AIC.

init_model <- lm(mpg ~ ., data = mtcars)
best_model <- step(init_model, direction = "both")

The best model obtained from the above computations consists of the confounder variables cyl, wt and hp along with the predictor variable am.

summary(best_model)

Call:
lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9387 -1.2560 -0.4013  1.1253  5.0513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
cyl6        -3.03134    1.40728  -2.154  0.04068 *  
cyl8        -2.16368    2.28425  -0.947  0.35225    
hp          -0.03211    0.01369  -2.345  0.02693 *  
wt          -2.49683    0.88559  -2.819  0.00908 ** 
ammanual     1.80921    1.39630   1.296  0.20646    

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.41 on 26 degrees of freedom
Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

Note that the adjusted R-squared is 0.84, so after including the appropriate confounder variables this model explains about 84% of the variability in mpg. We now compare the base model, which has am as its only predictor, against our best model.
base_model <- lm(mpg ~ am, data = mtcars)
anova(base_model, best_model)

Analysis of Variance Table

Model 1: mpg ~ am
Model 2: mpg ~ cyl + hp + wt + am
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     30 720.90                                  
2     26 151.03  4    569.87 24.527 1.688e-08 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (1.688e-08) is significant, we reject the null hypothesis that the confounder variables cyl, hp and wt do not improve the model.

Residuals and Diagnostics

In this section, we study the residual plots of our regression model and compute some regression diagnostics to identify influential or high-leverage observations in the data set.

par(mfrow = c(2, 2))
plot(best_model, which = 1:4)
From the above plots, we can draw the following conclusions:

The points on the Residuals vs. Fitted plot appear randomly scattered, which is consistent with the independence assumption.
The points on the Normal Q-Q plot lie close to the line, which indicates that the residuals are approximately normally distributed.
The Scale-Location plot shows the points spread in a roughly constant band, which supports the constant variance assumption.
The fourth plot shows Cook's distance, a measure of the influence of each observation on the regression coefficients.

We now compute regression diagnostics for our model to find influential points and leverage points.

influence <- dfbetas(best_model)
# Column 6 holds the dfbetas for the ammanual coefficient
tail(sort(influence[, 6]), 3)

Chrysler Imperial          Fiat 128     Toyota Corona 
        0.3507458         0.4292043         0.7305402 

leverage <- hatvalues(best_model)
tail(sort(leverage), 3)

      Toyota Corona Lincoln Continental       Maserati Bora 
          0.2777872           0.2936819           0.4713671 

Inference

t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

Two Sample t-test

data:  mpg by am
t = -4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.84837  -3.64151
sample estimates:
mean in group Automatic    mean in group Manual 
               17.14737                24.39231 

This test shows that we should reject the null hypothesis and conclude that the mean MPG differs between manual and automatic transmission cars.

Conclusions
Based on our best-fit model, we can draw the following conclusions:

Cars with manual transmission give better mpg than cars with automatic transmission: 1.8 mpg more, adjusted for hp, cyl and wt. This adjusted difference is not statistically significant (p = 0.21).
mpg decreases by 2.5 for every 1000 lb increase in wt (adjusted for hp, cyl and am).
When the number of cylinders increases from 4 to 6 or to 8, mpg decreases by 3 and 2.2 respectively (adjusted for hp, wt and am).
mpg decreases negligibly with increasing hp (about 0.03 mpg per unit of hp).

Appendix

Explore how mpg varies by automatic versus manual transmission

boxplot(mpg ~ am, data = mtcars, col = c("red", "blue"), xlab = "Transmission", ylab = "Miles per Gallon", main = "MPG by Transmission Type")

par(mar = c(1, 1, 1, 1))
pairs(~ mpg + ., data = mtcars, panel = panel.smooth, main = "Pairs Plot for mtcars dataset")
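As a numeric companion to the boxplot of mpg by transmission, the following minimal sketch (not part of the original analysis) tabulates the mean and median mpg per transmission group. It works on a fresh copy of the data under the hypothetical name mt, so the mtcars object used elsewhere in the report is left untouched.

```r
library(datasets)

# Fresh copy of the data (hypothetical name `mt`), coded like the report's factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))

# Mean and median mpg within each transmission group
group_mean <- tapply(mt$mpg, mt$am, mean)
group_median <- tapply(mt$mpg, mt$am, median)
print(round(rbind(mean = group_mean, median = group_median), 2))
```

The group means should agree with the sample estimates reported by the t test in the Inference section (17.15 and 24.39).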
# Reload the original numeric data, since cor() cannot be applied to the factor columns
data(mtcars)
sort(cor(mtcars)[1, ])

        wt        cyl       disp         hp       carb       qsec 
-0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840 
      gear         am         vs       drat        mpg 
 0.4802848  0.5998324  0.6640389  0.6811719  1.0000000 
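The two headline estimates from the Executive Summary can be reproduced directly. The following is a minimal sketch, assuming the final model mpg ~ cyl + hp + wt + am reported in the Regression Analysis section. The object names mt, unadjusted, adjusted, diff_means and ci are hypothetical and introduced only for illustration.

```r
library(datasets)

# Fresh copy of the data (hypothetical name `mt`), with the report's factor coding
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))
mt$cyl <- factor(mt$cyl)

# Unadjusted difference: with a single two-level factor as predictor,
# the slope equals the difference between the two group means
unadjusted <- lm(mpg ~ am, data = mt)
diff_means <- unname(diff(tapply(mt$mpg, mt$am, mean)))

# Adjusted difference: same predictor, controlling for cyl, hp and wt
adjusted <- lm(mpg ~ cyl + hp + wt + am, data = mt)

print(c(unadjusted = unname(coef(unadjusted)["ammanual"]),
        group_mean_diff = diff_means,
        adjusted = unname(coef(adjusted)["ammanual"])))

# 95% confidence interval for the adjusted transmission effect
ci <- confint(adjusted, "ammanual")
print(ci)
```

Because the adjusted am coefficient has a p-value of about 0.21 in the model summary, its 95% confidence interval includes zero, which is why the 1.8 mpg advantage should be read as an estimate rather than an established effect.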