Review of Linear Regression

Statistics 211 - Statistical Methods II
Presented January 9, 2018

Dan Gillen
Department of Statistics
University of California, Irvine
Review of Linear Regression

Our plan:
1. Start with OLS and see where we get
2. Identify which assumptions are not fulfilled with categorical response data (e.g., binary or Poisson)
3. Fix up OLS to satisfy those assumptions and obtain valid inference
Review of Linear Regression

Linear Regression

Definition: By a classical (ordinary least squares) linear regression model, we mean a model in which we assume that
1. $E[Y_i \mid X_i] = X_i^T \beta$
2. $\epsilon_i = Y_i - X_i^T \beta$, and note that the model demands $E[\epsilon_i] = 0$
3. the $\epsilon_i$'s are independent
4. $\mathrm{var}(\epsilon_i) = \sigma^2$ for all $i = 1, \ldots, n$
5. the $\epsilon_i$'s are identically distributed
6. $\epsilon_i \sim N(0, \sigma^2)$
Review of Linear Regression

Goal

Construct a model for the dependence of a response $Y$ on predictors $X_1, X_2, \ldots, X_{p-1}$

Two components to the model:
1. The systematic component (mean model)
   $\mu_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1}$
2. The random component (error term)
   $Y_i = \mu_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$

Note: We can write the above model using matrix notation in which the $i$th row of the design matrix $X$ is $X_i^T$, the response vector is $Y^T = (Y_1, \ldots, Y_n)$, and the error vector $\epsilon^T = (\epsilon_1, \ldots, \epsilon_n)$ obeys
$Y = X\beta + \epsilon$
Estimation of $\beta$: Least Squares

We consider parameter estimates that minimize the sum of squared errors
$\sum_{i=1}^n (Y_i - \mu_i)^2 = \sum_{i=1}^n (Y_i - X_i^T \beta)^2 = (Y - X\beta)^T (Y - X\beta)$
where $X_i$ is the $i$th row of the design matrix (the row vector of covariate values corresponding to the $i$th observation) and $\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})^T$

Why focus on the sum of squared errors?
- It leads to the score estimating equation under the classical OLS model
- It is reasonable and mathematically convenient!
Estimation of $\beta$: Least Squares

Proposition: Assume $\mathrm{rank}(X^T X) = p$ (i.e., the number of observations $n$ is greater than the number of parameters $p$, and no predictor is constant or a linear combination of the other predictors). Then the least squares estimate is given by
$\hat{\beta} = (X^T X)^{-1} X^T Y$
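As a quick numerical sketch (not part of the original slides), the closed-form estimate $\hat{\beta} = (X^T X)^{-1} X^T Y$ can be computed directly from a design matrix and response vector. The data below are simulated purely for illustration:

```python
import numpy as np

# Simulated data for illustration: n = 100 observations, p = 3 parameters
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T Y.
# Solving the normal equations is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

With a small error variance the estimate lands close to the true coefficients; `np.linalg.lstsq(X, Y, rcond=None)` computes the same quantity more stably via an orthogonal factorization.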
Estimation of $\beta$

Proof:
Mean and Variance of the OLS Estimate

Mean of the OLS estimate

Proposition: $\hat{\beta}$ is unbiased for $\beta$ (i.e., $E[\hat{\beta}] = \beta$)

Proof:
Mean and Variance of the OLS Estimate

Variance of the OLS estimate

Proposition: The variance of the ordinary least squares estimate is
$\mathrm{var}(\hat{\beta}) = (X^T X)^{-1} X^T \Sigma X (X^T X)^{-1}$
where $\Sigma = \mathrm{var}(Y)$. When $\Sigma = \sigma^2 I_n$ (i.e., the $Y_i$'s are uncorrelated and have equal variance; assumptions 3-4), this reduces to
$\mathrm{var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$

Proof: Follows directly from $\mathrm{var}(AY) = A\,\mathrm{var}(Y)\,A^T$.
Mean and Variance of the OLS Estimate

Estimation of $\mathrm{Var}[\hat{\beta}]$

Note: In practice, we estimate $\sigma^2$ with
$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^n (Y_i - \hat{\mu}_i)^2$

It can "easily" be shown that $\hat{\sigma}^2$ is an unbiased and consistent estimator of $\sigma^2$ using the methods of Stat 120B/200B!
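A brief simulation sketch of the plug-in step above (the data and seed are illustrative): estimate $\sigma^2$ from the residuals with the $n - p$ divisor, then plug it into $\hat{\sigma}^2 (X^T X)^{-1}$ to estimate $\mathrm{Var}[\hat{\beta}]$.

```python
import numpy as np

# Illustrative simulation: estimating sigma^2 and Var[beta_hat]
rng = np.random.default_rng(1)
n, p = 200, 2
sigma_true = 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.5, 1.0]) + rng.normal(scale=sigma_true, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
mu_hat = X @ beta_hat                                # fitted values
sigma2_hat = np.sum((Y - mu_hat) ** 2) / (n - p)     # divide by n - p, not n
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated Var[beta_hat]
```

The $n - p$ divisor (rather than $n$) is what makes $\hat{\sigma}^2$ unbiased; with $n = 200$ the estimate sits near the true value $\sigma^2 = 2.25$.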
The OLS estimate is "optimal" under the normality assumption

If the $\epsilon_i$ are independent and distributed $N(0, \sigma^2)$, then the OLS estimate is the MLE

This means that the OLS estimate is:
1. Consistent,
2. Asymptotically normally distributed,
3. Asymptotically efficient (achieves the Cramer-Rao lower bound)
The OLS estimate under non-normal errors: Gauss-Markov

If we do not assume normality, we may appeal to the Gauss-Markov theorem...

Proposition: (Gauss-Markov Thm) Suppose $\mathrm{Var}(Y) = \sigma^2 I_n$. Let $\tilde{\beta} = CY$ be an unbiased estimate of $\beta$. Then the variance of any linear function of $\tilde{\beta}$ is at least as great as the variance of the same linear function of $\hat{\beta}$ (that is, the ordinary least squares estimate is the best linear unbiased estimate (BLUE) of $\beta$).
Gauss-Markov Thm

Proof:
The OLS estimate under non-normal errors: Gauss-Markov

Note: Now suppose that $\mathrm{Var}(Y) = \Sigma$ is arbitrary. For a positive definite symmetric matrix $\Sigma$ we can find a nonsingular symmetric matrix $A$ such that $\Sigma = AA$. In that case, $Z = A^{-1} Y$ has expectation $A^{-1} X \beta$ and variance $A^{-1} \Sigma A^{-1} = I_n$. Letting $W = A^{-1} X$ in this transformed model, the ordinary least squares estimate for $\beta$ would be
$\hat{\beta} = (W^T W)^{-1} W^T Z$
The OLS estimate under non-normal errors: Gauss-Markov

In terms of the original response $Y$ and predictors $X$, this yields the generalized least squares estimate
$\hat{\beta} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} Y$
which is unbiased for $\beta$ and has variance $(X^T \Sigma^{-1} X)^{-1}$. Note that by the Gauss-Markov Thm, this is the best linear unbiased estimate of $\beta$ in this general setting.

Note: Generalized least squares can obviously handle the case of correlated $Y_i$'s. In this class, we do not consider such settings. We do, however, consider the setting in which the $Y_i$'s are uncorrelated but do not have equal variance.
The OLS estimate under non-normal errors: Gauss-Markov

Definition: Consider a linear regression model in which $\mathrm{Var}(Y_i) = \sigma_i^2$ and $\mathrm{Cov}(Y_i, Y_j) = 0$ for $i \neq j$, so that $\Sigma = \mathrm{Var}(Y) = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$. The weighted least squares estimate of $\beta$ is given by the generalized least squares estimate using the above definition of $\Sigma$.

Note: The above optimality (BLUE) of the ordinary, weighted, and generalized least squares estimates is not dependent upon any particular distribution of the $Y_i$'s beyond their first two moments. However, if we want to make inference after an analysis, we need to know the distribution of the estimates, which in turn requires some assumptions on the regression model.
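A minimal weighted least squares sketch (simulated data; the variance function $\sigma_i = 0.5\,x_i$ is an assumed form chosen purely for the simulation). With diagonal $\Sigma$, the generalized least squares formula reduces to weighting each observation by $1/\sigma_i^2$:

```python
import numpy as np

# Illustrative WLS: uncorrelated Y_i with unequal variances,
# Sigma = diag(sigma_1^2, ..., sigma_n^2)
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 3, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])
sigma_i = 0.5 * x                        # assumed: SD grows with the predictor
Y = X @ beta_true + rng.normal(size=n) * sigma_i

# Weighted least squares: beta = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} Y
w = 1.0 / sigma_i ** 2                   # Sigma^{-1} = diag(w)
XtWX = X.T @ (w[:, None] * X)
XtWY = X.T @ (w * Y)
beta_wls = np.linalg.solve(XtWX, XtWY)
var_wls = np.linalg.inv(XtWX)            # Var(beta_wls) = (X^T Sigma^{-1} X)^{-1}
```

Exploiting the diagonal structure via elementwise weights avoids forming the $n \times n$ matrix $\Sigma^{-1}$ explicitly.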
The OLS under normality

Proposition: Suppose the $Y_i$'s are jointly normally distributed and are uncorrelated (hence independent; assumptions 1-6). Then the ordinary (weighted, generalized) least squares estimates are multivariate normally distributed. Thus, in the case of constant variance,
$\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$

Proof: This follows from linear transformations of multivariate normals.
The OLS under normality

Consider testing the null hypothesis
$H_0: \beta_k = \beta_{k,0}$ vs $H_1: \beta_k \neq \beta_{k,0}$

In Stat 210 you found that for the Wald test statistic we have:
$T = \frac{\hat{\beta}_k - \beta_{k,0}}{\widehat{se}(\hat{\beta}_k)} \stackrel{H_0}{\sim} t_{n-p}$
where $\widehat{se}(\hat{\beta}_k)$ is given by the square root of the $k$th diagonal element of $\widehat{\mathrm{Var}}[\hat{\beta}] = \hat{\sigma}^2 (X^T X)^{-1}$ with
$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^n (y_i - \hat{\mu}_i)^2$

A $100(1-\alpha)\%$ CI for $\beta_k$ is given by computing
$\hat{\beta}_k \pm t_{n-p,\,1-\alpha/2}\, \widehat{se}(\hat{\beta}_k)$
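The Wald test and confidence interval above can be sketched numerically (simulated data; the true slope of 1 is chosen for illustration):

```python
import numpy as np
from scipy import stats

# Illustrative Wald test of H0: beta_1 = 0, plus a 95% CI for beta_1
rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.0, 1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

k = 1                                       # test the slope coefficient
T = (beta_hat[k] - 0.0) / se[k]             # Wald statistic
p_value = 2 * stats.t.sf(abs(T), df=n - p)  # two-sided p-value from t_{n-p}

t_crit = stats.t.ppf(0.975, df=n - p)       # t_{n-p, 1 - alpha/2}
ci = (beta_hat[k] - t_crit * se[k], beta_hat[k] + t_crit * se[k])
```

Since the true slope is 1 and its standard error is roughly $1/\sqrt{n}$, the test decisively rejects $H_0$ here.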
Asymptotic normality of OLS

Question: What happens when the normality assumption is not satisfied?

Answer: Like most (useful) estimators, we can approximate the sampling distribution in large samples! To do this, we must appeal to the Lindeberg-Feller Central Limit Theorem...
Lindeberg-Feller Central Limit Theorem

Proposition: (Lindeberg-Feller Central Limit Theorem) Let $Y_1, Y_2, \ldots$ be independent random variables with $E[Y_i] = 0$ and $\mathrm{var}(Y_i) = \sigma_i^2$. Define $S_n = \sum_{i=1}^n Y_i$ and $\sigma_{(n)}^2 = \sum_{i=1}^n \sigma_i^2$. Then both
1. $S_n / \sigma_{(n)} \stackrel{d}{\to} N(0, 1)$, and
2. $\lim_{n \to \infty} \max\{\sigma_i^2 / \sigma_{(n)}^2,\ 1 \leq i \leq n\} = 0$
if and only if (the Lindeberg condition) for all $\epsilon > 0$,
$\lim_{n \to \infty} \frac{1}{\sigma_{(n)}^2} \sum_{i=1}^n E\left[ Y_i^2\, 1_{[|Y_i| \geq \epsilon \sigma_{(n)}]} \right] = 0$
Asymptotic normality of OLS

Proposition: Consider simple linear regression in which $(Y_i, X_i)$ are pairs of response random variables and known predictors, and the $Y_i$'s are independently distributed $Y_i \sim (\mu_i, \sigma^2)$ with $\sigma^2 < \infty$ known. In particular, we will consider a regression model of the form
$\mu_i = \beta_0 + \beta_1 (X_i - \bar{X})$
and assume $(Y_i - \mu_i) \stackrel{iid}{\sim} (0, \sigma^2)$. Further, let $X = (\mathbf{1}\ \ X - \bar{X})$, $\beta = (\beta_0\ \beta_1)^T$, and consider the OLSE $\hat{\beta} = (X^T X)^{-1} X^T Y$. Then
$Z_n = (X^T X)^{1/2} (\hat{\beta} - \beta) = \begin{pmatrix} \sqrt{n}\,(\hat{\beta}_0 - \beta_0) \\ \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,(\hat{\beta}_1 - \beta_1) \end{pmatrix} \stackrel{d}{\to} N_2(0, \sigma^2 I_2)$
Asymptotic normality of OLS

Proof:
Asymptotic normality of OLS

Conclusion: Even if we do not assume normality, but simply have independence between the errors, the ordinary least squares estimate will be asymptotically normally distributed as long as
$\max_i \left\{ \frac{(X_i - \bar{X})^2}{\sum_{j=1}^n (X_j - \bar{X})^2} \right\} \to 0 \quad \text{as } n \to \infty$
by the Lindeberg-Feller CLT. In particular, in the case of constant variance we have, approximately,
$\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$
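This conclusion can be illustrated by simulation (not from the slides; the exponential error distribution and sample sizes are arbitrary choices): with skewed, decidedly non-normal mean-zero errors, the standardized OLS slope still behaves like a standard normal in large samples.

```python
import numpy as np

# Simulation sketch: OLS slope with skewed (exponential, mean-zero) errors
rng = np.random.default_rng(4)
n, n_sim = 500, 2000
x = rng.normal(size=n)                # fixed design, shared across simulations
x_c = x - x.mean()
sxx = np.sum(x_c ** 2)
beta1_true = 2.0

slopes = np.empty(n_sim)
for s in range(n_sim):
    eps = rng.exponential(scale=1.0, size=n) - 1.0  # mean 0, variance 1, skewed
    y = 1.0 + beta1_true * x + eps
    slopes[s] = np.sum(x_c * y) / sxx               # OLS slope estimate

# Standardize with the known error variance sigma^2 = 1:
# sqrt(sxx) * (beta1_hat - beta1) should be approximately N(0, 1)
z = (slopes - beta1_true) * np.sqrt(sxx)
```

Across the 2000 replications, `z` has mean near 0 and standard deviation near 1, as the CLT predicts.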
Consider the regression model $Y_i = \mu_i + \epsilon_i$

Varying degrees of assumptions:
1. $\epsilon_i \sim N(0, \sigma^2)$ for all $i$
2. $\epsilon_i$ independent and identically distributed with mean zero
3. $\epsilon_i$ independent with constant variance and mean zero
4. $\epsilon_i$ independent with mean zero
5. $\epsilon_i$ has mean zero
Consider the regression model $Y_i = \mu_i + \epsilon_i$

Weaker assumptions lead to weaker properties for the OLS estimate:
1. OLS is optimal (consistent, unbiased, most efficient)
2. OLS is consistent and is the best linear unbiased estimate (BLUE)
3. OLS is consistent and is the best linear unbiased estimate (BLUE)
4. OLS is consistent and asymptotically Normal
5. No guarantees (OLS is consistent and asymptotically Normal under additional assumptions)
What is the effect of changing the error distribution?

Changing the error distribution could...
1. Change $\mathrm{Var}[\hat{\beta}]$: in repeated experimentation, $\hat{\beta}$ varies more than it would if $\epsilon \sim N(0, \sigma^2)$
2. Affect the efficiency of $\hat{\beta}$: in repeated experimentation, $\hat{\beta}$ varies more than some other estimator of $\beta$
3. Make $\hat{\sigma}^2 (X^T X)^{-1}$ a bad estimate of $\mathrm{Var}[\hat{\beta}]$: in repeated experimentation, the variability of $\hat{\beta}$ is greater (or less) than $\hat{\sigma}^2 (X^T X)^{-1}$ suggests
What is the effect of changing the error distribution?

These results are distinct... the above effects of changing the error distribution are all different phenomena.

Items (1) and (2) mean that another estimator may be more efficient (smaller variability) than the OLS estimate.

Item (3) means that if we estimate $\mathrm{Var}[\hat{\beta}]$ by $\hat{\sigma}^2 (X^T X)^{-1}$, then our inference for $\hat{\beta}$ will be wrong:
- The Type I error rate of hypothesis tests will be higher (lower) than the nominal level
- Confidence intervals will not have the correct coverage probability
(3) occurs when the variance of the error terms is not constant

Why does this matter to us?
1. Suppose that our response is a binary outcome variable $Y$
   - $Y_i \sim \mathrm{Binom}(\mu_i, 1)$
   - Standard linear regression mean model: $E[Y_i] = \mu_i = X_i^T \beta$
   - Error distribution: $\mathrm{Var}[Y_i] = \mu_i (1 - \mu_i)$
2. Suppose that our response $Y$ counts the number of events over a specified interval
   - Might assume $Y_i \sim \mathrm{Poisson}(\mu_i)$
   - Standard linear regression mean model: $E[Y_i] = \mu_i = X_i^T \beta$
   - Error distribution: $\mathrm{Var}[Y_i] = \mu_i$

*Note: Nonconstant variance can also cause (1) and (2)
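The mean-variance relationships above can be checked empirically (the particular values of $\mu_i$ and the Poisson rate below are illustrative): because the variance is a deterministic function of the mean, constant variance cannot hold whenever the means differ.

```python
import numpy as np

# Empirical check of the binary and Poisson mean-variance relationships
rng = np.random.default_rng(5)
mu = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Bernoulli (Binomial with 1 trial): Var[Y_i] = mu_i (1 - mu_i)
y_bin = rng.binomial(n=1, p=mu[2], size=100_000)
print(y_bin.var(), mu[2] * (1 - mu[2]))   # empirical variance near mu(1 - mu)

# Poisson: Var[Y_i] = mu_i
y_pois = rng.poisson(lam=3.0, size=100_000)
print(y_pois.var(), 3.0)                  # empirical variance near the mean

# Different mu_i imply different variances: constant variance cannot hold
print(mu * (1 - mu))
```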
Bottom line

Because of the mean-variance relationship in these (and many other) outcome distributions, we cannot fulfill the constant variance assumption!
- $\hat{\sigma}^2 (X^T X)^{-1}$ is a bad estimate of $\mathrm{Var}[\hat{\beta}]$
- Invalid inference

Much of our class will be devoted to deriving a general class of estimators for regression models where a mean-variance relationship exists...