Q8. This question involves the use of simple linear regression on the Auto data set.
a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:
library(ISLR)
data("Auto")
lm.fit <- lm(mpg ~ horsepower, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictor and the response?
Yes. The p-value for the horsepower coefficient is essentially zero, so there is strong evidence of a relationship between horsepower and mpg.
ii. How strong is the relationship between the predictor and the response?
The relationship is reasonably strong: \(R^2\) is approximately 0.6, meaning about 60% of the variance in mpg is explained by this simple model.
iii. Is the relationship between the predictor and the response positive or negative?
Negative, since the coefficient has a negative value.
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
predict(lm.fit, data.frame("horsepower"=98), interval="confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame("horsepower"=98), interval="prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
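As a quick sanity check, the point estimate can be reproduced by hand from the fitted coefficients (a minimal sketch; it should agree with the value of about 24.47 above):
# point prediction by hand: beta0_hat + beta1_hat * 98
coef(lm.fit)[1] + coef(lm.fit)[2]*98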
b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(Auto$horsepower, Auto$mpg)
abline(lm.fit, lwd=3, col="red")
c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(lm.fit)
We’ll go through the common problems one by one:
Non-linearity of the response-predictor relationship
The Residuals vs Fitted plot shows a mild U-shape and, as the scatterplot in (b) suggests, the relationship between the predictor and the response is not quite linear.
Non-constant variance of error terms
The Residuals vs Fitted plot does not show strong heteroscedasticity; the magnitude of the residuals does not clearly increase with the fitted values.
Outliers
The Scale-Location plot flags a few possible outliers, although they do not look extreme in the plot. Following the ISLR convention, I compute the studentized residuals and look for values greater than 3.
which(rstudent(lm.fit)>3)
## 323 330
## 321 328
High leverage points
The Residuals vs Leverage plot shows several high-leverage points, but the highest-leverage points are not the outliers detected above.
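To quantify the leverage, a small sketch: compare the hat values with the average leverage \((p+1)/n\) (the two-times-average cutoff is just a common rule of thumb).
# observations whose leverage exceeds twice the average (p + 1)/n, with p = 1 predictor
which(hatvalues(lm.fit) > 2*(1+1)/nrow(Auto))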
Q9. This question involves the use of multiple linear regression on the Auto data set.
a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable which is qualitative.
cor(Auto[, !(names(Auto)=="name")])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
lm.fit <- lm(mpg ~ .-name, data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes, the F-statistic is highly significant with a very small p-value.
ii. Which predictors appear to have a statistically significant relationship to the response?
Looking at the p-values, displacement, weight, year, and origin have a statistically significant relationship to the response; cylinders, horsepower, and acceleration do not.
iii. What does the coefficient for the year variable suggest?
It suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 miles per gallon.
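As a quick check of how precisely this effect is estimated, one could look at a 95% confidence interval for the year coefficient (a small sketch):
# 95% confidence interval for the year coefficient in the multiple regression
confint(lm.fit)["year", ]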
d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm.fit)
Non-linearity of the response-predictor relationship
The Residuals vs Fitted plot does not show a clear pattern, so there is no strong evidence of non-linearity.
Non-constant variance of error terms
The Residuals vs Fitted plot has a slight funnel shape, so there is some heteroscedasticity.
Outliers
In the Scale-Location plot some observations look like potential outliers; let's check with the studentized residuals.
rstudent(lm.fit)[which(rstudent(lm.fit)>3)]
## 245 323 326 327
## 3.390068 4.029537 3.494823 3.690246
Observation 323 is the most likely outlier.
High leverage points
Observation 14 is a high-leverage point, as shown in the Residuals vs Leverage plot.
Collinearity
require(car)
## Loading required package: car
vif(lm.fit)
## cylinders displacement horsepower weight acceleration
## 10.737535 21.836792 9.943693 10.831260 2.625806
## year origin
## 1.244952 1.772386
VIF values that exceed 5 or 10 indicate a problematic amount of collinearity and, as already suggested by the cor() table, many of the variables are highly correlated. Four VIFs are greater than 5 and three of them are greater than 10. The problem with collinearity is that it reduces the accuracy of the estimates of the regression coefficients, causing the standard error of \(\hat\beta_j\) to grow.
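A hedged sketch of how one might see this in practice: refit without displacement, the predictor with the largest VIF, and recompute the VIFs (lm.fit.nodisp is just an illustrative name); the remaining values would be expected to shrink.
lm.fit.nodisp <- lm(mpg ~ . - name - displacement, data=Auto)
vif(lm.fit.nodisp)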
e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.fit.inter = lm(mpg ~ (.-name)*(.-name), data=Auto)
summary(lm.fit.inter)
##
## Call:
## lm(formula = mpg ~ (. - name) * (. - name), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
Overall the model's \(R^2\) improved from 0.82 to almost 0.89, although part of that gain may be overfitting. The most significant interaction term is acceleration:origin, which has a sizeable coefficient compared with the main effects and a small p-value; displacement:year and acceleration:year are also significant at the 5% level.
f) Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
First, I look at the range of each variable to get a sense of which transformation might be interesting for it.
apply(Auto,2,range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] "10.0" "3" "101.0" "100" "1613" "10.0" "70"
## [2,] " 9.0" "8" " 98.0" " 98" "5140" " 9.5" "82"
## origin name
## [1,] "1" "amc ambassador brougham"
## [2,] "3" "vw rabbit custom"
Somewhat arbitrarily, I squared some of the variables and took the square root of others; I did not transform origin because it is a qualitative variable.
lm.fit.1 <- lm(I(mpg^2) ~ cylinders + I(displacement^2) + I(horsepower^2) + sqrt(weight) + I(acceleration^2) + sqrt(year) + origin, data=Auto)
summary(lm.fit.1)
##
## Call:
## lm(formula = I(mpg^2) ~ cylinders + I(displacement^2) + I(horsepower^2) +
## sqrt(weight) + I(acceleration^2) + sqrt(year) + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -509.60 -125.72 -14.61 92.29 1010.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.970e+03 4.642e+02 -8.552 2.92e-16 ***
## cylinders -1.025e+01 1.640e+01 -0.625 0.532
## I(displacement^2) 3.296e-03 7.292e-04 4.520 8.25e-06 ***
## I(horsepower^2) 1.599e-03 2.751e-03 0.581 0.562
## sqrt(weight) -4.569e+01 3.346e+00 -13.654 < 2e-16 ***
## I(acceleration^2) 5.829e-01 1.458e-01 3.998 7.66e-05 ***
## sqrt(year) 7.657e+02 5.167e+01 14.820 < 2e-16 ***
## origin 6.808e+01 1.572e+01 4.332 1.89e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 194.4 on 384 degrees of freedom
## Multiple R-squared: 0.7676, Adjusted R-squared: 0.7633
## F-statistic: 181.1 on 7 and 384 DF, p-value: < 2.2e-16
Compared with the original multiple regression, acceleration became a significant variable and displacement's p-value also decreased, although the \(R^2\) of this model is lower. The predictors year and weight remain the most important ones.
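Another transformation worth trying (a small sketch; lm.fit.log is an illustrative name) is log(horsepower), which may help with the curvature seen in Q8 (b):
lm.fit.log <- lm(mpg ~ log(horsepower), data=Auto)
summary(lm.fit.log)$r.squared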
Q10. This question should be answered using the Carseats data set.
data("Carseats")
a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.fit.a <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lm.fit.a)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b) Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative!
Let’s check which variables are qualitative and quantitative from the predictors.
attach(Carseats)
str(data.frame(Price, Urban, US))
## 'data.frame': 400 obs. of 3 variables:
## $ Price: num 120 83 80 97 128 72 108 120 124 124 ...
## $ Urban: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Urban and US are qualitative. Let's see the coding that R uses for the dummy variables.
contrasts(Urban)
## Yes
## No 0
## Yes 1
contrasts(US)
## Yes
## No 0
## Yes 1
Now, analyzing the coefficients: UrbanYes has a very high p-value, so there is no evidence that being in an urban location affects Sales. USYes is strongly significant and indicates that stores in the US sell about 1,200 more units on average, holding the other predictors fixed. The Price coefficient is negative: each one-dollar increase in price is associated with a decrease of roughly 54 units sold.
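To make the dummy coding concrete, here is a small sketch (newdata is an illustrative name) of the fitted Sales at a fixed price of 100 for the four Urban/US combinations:
newdata <- expand.grid(Price=100, Urban=c("No","Yes"), US=c("No","Yes"))
cbind(newdata, predicted.Sales=predict(lm.fit.a, newdata))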
c) Write out the model in the equation form, being careful to handle the qualitative variables properly.
Sales are in thousand units.
\(Sales = 13.043 - 0.054 \cdot Price - 0.022 \cdot UrbanYes + 1.201 \cdot USYes\)
d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j=0\)?
Price and US. For both, the p-values are far below 0.05; for Urban the p-value is about 0.94, so we cannot reject the null hypothesis for it.
e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit.e <- lm(Sales ~ Price + US, data=Carseats)
summary(lm.fit.e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f) How well do the models in (a) and (e) fit the data?
anova(lm.fit.a, lm.fit.e)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
Both models fit the data almost identically (\(R^2 \approx 0.24\) in each case), and the p-value of the F-statistic gives no evidence that the larger model is better.
g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm.fit.e)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(lm.fit.e)
The Scale-Location plot does not show any clear outlier. The Residuals vs Leverage plot does flag one observation with very high leverage, namely:
hatvalues(lm.fit.e)[order(hatvalues(lm.fit.e), decreasing = T)][1]
## 43
## 0.04333766
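For reference, the average leverage is \((p+1)/n\) with \(p = 2\) predictors (a one-line sketch); the value of about 0.043 above is several times 3/400 = 0.0075, which supports calling this a high-leverage observation.
# average leverage for comparison
(2 + 1)/nrow(Carseats)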
Q11. In this problem we will investigate the t-statistic for the null hypothesis \(H_0 : \beta=0\) in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.
set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)
a) Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate \(\hat\beta\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_0 : \beta=0\). Comment on these results. (You can perform regression without an intercept using the command lm(y~x+0).)
lm.fit.a <- lm(y ~ x+0)
summary(lm.fit.a)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## x 1.993876 0.1064767 18.72593 2.642197e-34
This p-value indicates strong evidence of a relationship between x and y.
b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-values associated with the null hypothesis \(H_0 : \beta=0\). Comment on these results.
lm.fit.b <- lm(x ~ y+0)
summary(lm.fit.b)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## y 0.3911145 0.02088625 18.72593 2.642197e-34
This p-value indicates strong evidence of a relationship between x and y.
c) What is the relationship between the results obtained in (a) and (b)?
The t-statistic (and hence the p-value) is exactly the same in both regressions, even though the coefficient estimates differ.
d) For the regression of \(Y\) onto \(X\) without an intercept, the t-statistic for \(H_0 : \beta=0\) takes the form \(\hat\beta/SE(\hat\beta)\), where \(\hat\beta\) is given by (3.38), and where \[SE(\hat\beta) = \sqrt{\frac{\sum\limits_{i=1}^n(y_i-x_i\hat\beta)^2}{(n-1)\sum\limits_{{i'}=1}^n{x_{i'}}^2}} \] (These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the t-statistic can be written as \[\frac{(\sqrt{n-1})\sum\limits_{i=1}^nx_iy_i}{\sqrt{(\sum\limits_{i=1}^n{x_i}^2)(\sum\limits_{{i'}=1}^n{y_{i'}}^2) - (\sum\limits_{{i'}=1}^nx_{i'}y_{i'})^2}} .\]
The \(\hat\beta\) given in 3.38 is: (I) \(\hat\beta = (\sum\limits_{i=1}^nx_iy_i)/(\sum\limits_{{i'}=1}^{n}x_{i'}^2)\)
Since all summations run from \(i=1\) to \(n\), I will omit the limits to keep the equations uncluttered.
\[t = \frac{\hat\beta}{SE(\hat\beta)}\]
Expand the standard error:
\[t = \frac{\hat\beta}{\sqrt{\frac{\sum(y_i-x_i\hat\beta)^2}{(n-1)\sum{x_i}^2}}}\]
Square both sides:
\[t^2 = \frac{\hat\beta^2}{\frac{\sum(y_i-x_i\hat\beta)^2}{(n-1)\sum{x_i}^2}}\]
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum(y_i-x_i\hat\beta)^2}\]
Expand the square in the denominator:
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum({y_i}^2-2y_ix_i\hat\beta+{x_i}^2\hat\beta^2)}\]
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum{{y_i}^2}-2\hat\beta\sum{y_ix_i}+\hat\beta^2\sum{{x_i}^2}}\]
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum{{y_i}^2} + \hat\beta\Big(\hat\beta\sum{{x_i}^2} - 2\sum{y_ix_i}\Big)}\]
Substitute (I) for the rightmost \(\hat\beta\) in the denominator:
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum{{y_i}^2} + \hat\beta\Big[\Big(\frac{\sum{x_iy_i}}{\sum{{x_i}^2}}\Big)\sum{{x_i}^2} - 2\sum{y_ix_i}\Big]}\]
Simplify the denominator:
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum{{y_i}^2} + \hat\beta\Big[\sum{x_iy_i} - 2\sum{y_ix_i}\Big]}\]
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum{{y_i}^2} - \hat\beta\sum{x_iy_i}}\]
Substitute (I) for the remaining \(\hat\beta\) in the denominator:
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\sum{{y_i}^2} - \Big(\frac{\sum{x_iy_i}}{\sum{{x_i}^2}}\Big)\sum{x_iy_i}}\]
\[t^2 = (n-1)\sum{x_i}^2\frac{\hat\beta^2}{\frac{\sum{{y_i}^2}\sum{{x_i}^2} - (\sum{x_iy_i})^2}{\sum{{x_i}^2}}}\]
\[t^2 = \hat\beta^2\frac{(n-1)(\sum{x_i}^2)^2}{\sum{{y_i}^2}\sum{{x_i}^2} - (\sum{x_iy_i})^2}\]
Now substitute (I) for the \(\hat\beta\) in the numerator:
\[t^2 = \Big(\frac{\sum{x_iy_i}}{\sum{{x_i}^2}}\Big)^2\frac{(n-1)(\sum{x_i}^2)^2}{\sum{{y_i}^2}\sum{{x_i}^2} - (\sum{x_iy_i})^2}\]
Cancel the \((\sum{{x_i}^2})^2\) factor from the numerator and denominator:
\[t^2 = \frac{(n-1)(\sum{x_iy_i})^2}{\sum{{y_i}^2}\sum{{x_i}^2} - (\sum{x_iy_i})^2}\]
Taking the square root gives the t-statistic in the form proposed by the question:
\[t = \frac{\sqrt{(n-1)}\sum{x_iy_i}}{\sqrt{\sum{{y_i}^2}\sum{{x_i}^2} - (\sum{x_iy_i})^2}}\]
Now let's confirm this numerically in R.
t_value <- (sqrt(length(x)-1)*sum(x*y)) / sqrt(sum(y^2)*sum(x^2) - (sum(x*y))^2)
t_value
The computed value agrees with the t-statistic reported by the summary() function above (about 18.73).
e) Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.
The expression derived in (d) is completely symmetric in \(x\) and \(y\): swapping them leaves it unchanged. Hence the t-statistic is the same whether we regress y onto x or x onto y.
f) In R, show that when regression is performed with an intercept, the t-statistic for \(H_0 : \beta=0\) is the same for the regression of y onto x as it is for the regression of x onto y.
Regression of x onto y:
lm.fit.f1 <- lm(x ~ y)
summary(lm.fit.f1)$coefficients[2,3]
## [1] 18.5556
Regression of y onto x:
lm.fit.f2 <- lm(y ~ x)
summary(lm.fit.f2)$coefficients[2,3]
## [1] 18.5556
Both t-statistics are equal, as expected.
Q12. This problem involves simple linear regression without an intercept.
a) Recall that the coefficient estimate \(\hat\beta\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?
The numerator of (3.38) is the same for both regressions. For the coefficient estimates to be equal, the denominators must be equal too, i.e. \[\sum\limits_{i=1}^{n}x_i^2 = \sum\limits_{i=1}^{n}y_i^2\]
b) Generate an example in R with \(n=100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).
x = rnorm(100, mean=10, sd=10)
y = 2*x + rnorm(100, mean=5, sd=20)
lm.fit.b1 <- lm(y ~ x + 0)
lm.fit.b2 <- lm(x ~ y + 0)
coef(lm.fit.b1)
## x
## 2.399236
coef(lm.fit.b2)
## y
## 0.3154295
plot(x,y)
c) Generate an example in R with \(n=100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).
x = rnorm(100, mean=10, sd=10)
# construct y so that the sum of y^2 equals the sum of x^2
sum.X.e2 = sum(x^2)           # target sum of squares
y.ult.e2 = sum.X.e2/50        # last term of an arithmetic sequence of 100 values
r = y.ult.e2/(100-1)          # common difference
y.e2 = seq(0, y.ult.e2, r)    # the sequence has mean y.ult.e2/2, so its sum is sum.X.e2
y = sqrt(y.e2)
lm.fit.c1 <- lm(y ~ x + 0)
lm.fit.c2 <- lm(x ~ y + 0)
coef(lm.fit.c1)
## x
## 0.5981553
coef(lm.fit.c2)
## y
## 0.5981553
plot(x,y)
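A simpler construction is also possible (a hedged sketch): any permutation of x has the same sum of squares as x, so the two no-intercept regressions give identical coefficient estimates.
y <- sample(x)      # y is just a reordering of x, so sum(y^2) == sum(x^2)
coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0))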
Q13. In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
set.seed(1)
a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, \(X\).
x <- rnorm(100)
b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a \(N(0,0.25)\) distribution, i.e. a normal distribution with mean zero and variance 0.25.
eps <- rnorm(100, 0, 0.25)
c) Using x and eps, generate a vector y according to the model \[Y = -1 + 0.5X + \epsilon\] What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
y = -1 + .5*x + eps
The length of y is 100. \(\beta_0 = -1\) and \(\beta_1 = 0.5\)
d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(x,y)
There is a clear linear relationship between x and y, with scatter around the line coming from the noise term \(\epsilon\), whose variance is 0.25.
e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat\beta_0\) and \(\hat\beta_1\) compare to \(\beta_0\) and \(\beta_1\)?
lm.fit.e <- lm(y ~ x)
summary(lm.fit.e)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0094232 0.02424682 -41.63115 4.106694e-64
## x 0.4997349 0.02693176 18.55560 7.723851e-34
The estimated coefficients are very close to the true ones, so the fitted line is very close to the population regression line.
f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(x,y)
abline(lm.fit.e, col="red", lwd=2)      # least squares line
abline(a=-1, b=0.5, col="blue", lwd=2)  # population regression line
legend("topleft", legend=c("Least squares", "Population"), col=c("red","blue"), lwd=2)
g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm.fit.g <- lm(y ~ poly(x,2))
summary(lm.fit.g)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4913 -0.1563 -0.0322 0.1451 0.5675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.95501 0.02395 -39.874 <2e-16 ***
## poly(x, 2)1 4.46612 0.23951 18.647 <2e-16 ***
## poly(x, 2)2 -0.33602 0.23951 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared: 0.7828, Adjusted R-squared: 0.7784
## F-statistic: 174.8 on 2 and 97 DF, p-value: < 2.2e-16
anova(lm.fit.e, lm.fit.g)
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ poly(x, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 5.6772
## 2 97 5.5643 1 0.11291 1.9682 0.1638
Since the p-value is about 0.16, there is no strong evidence that the quadratic term improves the model.
h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
x <- rnorm(100)
eps <- rnorm(100, 0, .05)
y = -1 + .5*x + eps
plot(x,y)
lm.fit.h <- lm(y ~ x)
summary(lm.fit.h)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9975775 0.004954993 -201.3277 4.384015e-130
## x 0.5053109 0.004812968 104.9895 1.646414e-102
abline(lm.fit.h, col="red", lwd=2)
legend(x=-2, y=.25, legend = "Linear Regression Model")
The estimated coefficients are again very close to the true values, but the standard errors, \(SE(\hat\beta)\), are much smaller than in the original model, hence the larger t-statistics.
i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
x <- rnorm(100)
eps <- rnorm(100, 0, .5)
y = -1 + .5*x + eps
plot(x,y)
lm.fit.i <- lm(y ~ x)
summary(lm.fit.i)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0237262 0.04838194 -21.15926 2.546997e-38
## x 0.4625256 0.04155070 11.13159 4.320800e-19
abline(lm.fit.i, col="red", lwd=2)
legend(x=-2, y=.25, legend = "Linear Regression Model")
Again the estimated coefficients are close to the true values, but this time the standard errors, \(SE(\hat\beta)\), are much larger than in the previous models, hence the smaller t-statistics.
j) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
# original
confint(lm.fit.e)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
# less noisy
confint(lm.fit.h)
## 2.5 % 97.5 %
## (Intercept) -1.0074105 -0.9877445
## x 0.4957597 0.5148621
# more noisy
confint(lm.fit.i)
## 2.5 % 97.5 %
## (Intercept) -1.1197386 -0.9277138
## x 0.3800695 0.5449816
As expected, the noisier the data set, the wider the confidence intervals.
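A small sketch that summarizes this: the width of the confidence interval for the slope in each fit.
sapply(list(original=lm.fit.e, less.noise=lm.fit.h, more.noise=lm.fit.i),
       function(m) diff(confint(m)["x", ]))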
Q14. This problem focuses on the collinearity problem.
a) Perform the following commands in R:
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2*rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
The model has the form \(Y = 2 + 2X_1 + 0.3X_2 + \epsilon\), so \(\beta_0 = 2\), \(\beta_1 = 2\) and \(\beta_2 = 0.3\).
b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)
c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\)? How do these relate to the true \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)? How about the null hypothesis \(H_0 : \beta_2 = 0\)
lm.fit.c <- lm(y ~ x1 + x2)
summary(lm.fit.c)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.00213431 0.02141954 93.4723230 7.575302e-97
## x1 1.98242901 0.06661731 29.7584647 1.440146e-50
## x2 0.03304309 0.10472504 0.3155223 7.530430e-01
The estimates of \(\beta_0\) and \(\beta_1\) are close to the true values, but for \(\beta_2\) we cannot reject the null hypothesis \(H_0 : \beta_2 = 0\); judging from this fit alone, \(\beta_2\) could be zero.
d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
lm.fit.d <- lm(y ~ x1)
coefficients(summary(lm.fit.d))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.001542 0.02123880 94.23985 5.876338e-98
## x1 1.999983 0.03647518 54.83133 2.369746e-75
Yes, by the p-value we can reject the null hypothesis.
e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
lm.fit.e <- lm(y ~ x2)
coefficients(summary(lm.fit.e))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.359425 0.05616653 42.00768 1.779990e-64
## x2 2.635662 0.18240349 14.44962 5.029358e-26
Yes, using only \(x_2\), we can reject the null hypothesis.
f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.
At first sight they seem to contradict each other: in the joint model \(x_2\) appears to have no effect on the response, yet on its own it clearly does. This is explained by the collinearity between \(x_1\) and \(x_2\), which greatly inflates the standard error of the \(x_2\) coefficient in the joint fit.
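A hedged sketch of how one might quantify this directly, using the variance inflation factors (the car package was already loaded for Q9):
vif(lm.fit.c)   # large values would confirm the collinearity between x1 and x2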
g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
lm.fit.g.c <- lm(y ~ x1 + x2)
summary(lm.fit.g.c)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1265261 0.06615935 32.142486 7.452021e-54
## x1 0.8184356 0.16934535 4.832938 4.980157e-06
## x2 1.9791129 0.25670492 7.709680 1.050503e-11
lm.fit.g.d <- lm(y ~ x1)
coefficients(summary(lm.fit.g.d))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.150323 0.08334065 25.80162 9.615471e-46
## x1 1.783571 0.14382053 12.40136 7.165059e-22
lm.fit.g.e <- lm(y ~ x2)
coefficients(summary(lm.fit.g.e))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.306185 0.06059344 38.05998 6.508047e-61
## x2 2.896228 0.19140723 15.13124 1.699816e-27
In the model with both predictors we can now reject the null hypothesis for every coefficient; the additional observation seems to have weakened the collinearity. Let's check the correlation.
cor(x1,x2)
## [1] 0.7392279
The correlation drops from the original 0.835 to about 0.739. Now let's plot the two predictors against each other.
plot(x1,x2)
text(x=0.1, y=0.8, labels="new-obs", pos=4, cex=.7, col="blue")
Looking at the diagnostic plots:
par(mfrow=c(2,2))
plot(lm.fit.g.c)
According to the Scale-Location and Residuals vs Leverage plots, the new observation, 101, appears to be both an outlier and a high-leverage point.
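To back this up numerically, a small sketch using the joint model from (c): the studentized residual and leverage of observation 101, which can be compared with the usual cutoff of 3 and with the average leverage \((p+1)/n = 3/101 \approx 0.03\).
c(rstudent(lm.fit.g.c)[101], hatvalues(lm.fit.g.c)[101])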
Q15. This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
library(MASS)
data(Boston)
a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
coefs <- data.frame("predictor"=character(0), "Estimate"=numeric(0), "Std.Error"=numeric(0), "t.value"=numeric(0), "Pr.t"=numeric(0), "r.squared"=numeric(0), stringsAsFactors = FALSE)
j <- 1
for(i in names(Boston)){
if(i != "crim"){
summ.lm.fit <- summary(lm(crim ~ eval(parse(text=i)), data=Boston))  # simple regression of crim on predictor i
coefs[j,] = c(i, summ.lm.fit$coefficients[2,], summ.lm.fit$r.squared)
j <- j+1
}
}
coefs[,-1] <- lapply(coefs[,-1], FUN=function(x) as.numeric(x))
coefs <- coefs[order(coefs$r.squared, decreasing = T),]
print(coefs)
## predictor Estimate Std.Error t.value Pr.t r.squared
## 8 rad 0.61791093 0.034331820 17.998199 2.693844e-56 0.391256687
## 9 tax 0.02974225 0.001847415 16.099388 2.357127e-47 0.339614243
## 12 lstat 0.54880478 0.047760971 11.490654 2.654277e-27 0.207590933
## 4 nox 31.24853120 2.999190381 10.418989 3.751739e-23 0.177217182
## 2 indus 0.50977633 0.051024332 9.990848 1.450349e-21 0.165310070
## 13 medv -0.36315992 0.038390175 -9.459710 1.173987e-19 0.150780469
## 11 black -0.03627964 0.003873154 -9.366951 2.487274e-19 0.148274239
## 7 dis -1.55090168 0.168330031 -9.213458 8.519949e-19 0.144149375
## 6 age 0.10778623 0.012736436 8.462825 2.854869e-16 0.124421452
## 10 ptratio 1.15198279 0.169373609 6.801430 2.942922e-11 0.084068439
## 5 rm -2.68405122 0.532041083 -5.044819 6.346703e-07 0.048069117
## 1 zn -0.07393498 0.016094596 -4.593776 5.506472e-06 0.040187908
## 3 chas -1.89277655 1.506115484 -1.256727 2.094345e-01 0.003123869
Judging by the p-values, every predictor except chas has a statistically significant association with the response; for chas the p-value is about 0.21, so we cannot reject the null hypothesis. In terms of \(R^2\), the proportion of the response variance explained by the predictor, rad is the most meaningful variable and also has the largest t-statistic. The tax variable is also strongly associated with the response and has the second-largest \(R^2\).
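Since the question also asks for plots to back up these assertions, here is a minimal sketch for the two predictors with the largest \(R^2\) (rad and tax):
par(mfrow=c(1,2))
plot(Boston$rad, Boston$crim, xlab="rad", ylab="crim")
plot(Boston$tax, Boston$crim, xlab="tax", ylab="crim")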
b) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0 : \beta_j = 0\)?
lm.fit.b <- lm(crim ~ ., data=Boston)
summary(lm.fit.b)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
At the 5% level we can reject the null hypothesis for zn, dis, rad, black, and medv; nox (p ≈ 0.051) and lstat (p ≈ 0.096) are borderline. That is 5 of the 13 predictors (7 if the borderline ones are included).
c) How do your results (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
df = data.frame("mult"=summary(lm.fit.b)$coefficients[-1,1])
df$simple <- NA
for(i in row.names(df)){
df[row.names(df)==i, "simple"] = coefs[coefs[,1]==i, "Estimate"]
}
plot(df$simple, df$mult, xlab="Coef for Simple Linear Regression", ylab="Coef for Multiple Linear Regression")
text(x=df$simple, y=df$mult, labels=row.names(df), cex=.7, col="blue", pos=4)
The nox coefficient lies far away from the rest, which squashes the plot, so I remove nox to improve the visualization.
df.clean = df[!(row.names(df)%in%"nox"),]
plot(df.clean$simple, df.clean$mult, xlab="Coef for Simple Linear Regression", ylab="Coef for Multiple Linear Regression")
text(x=df.clean$simple, y=df.clean$mult, labels=row.names(df.clean), cex=.7, col="blue", pos=4)
d) Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor \(X\), fit a model of the form \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\].
For this analysis I leave out the chas variable, because it is qualitative and raising it to polynomial powers does not make sense.
coefs.poly <- data.frame("predictor"=character(0), "Estimate"=numeric(0), "Std.Error"=numeric(0), "t.value"=numeric(0), "Pr.t"=numeric(0), "r.squared"=numeric(0), stringsAsFactors = FALSE)
j <- 1
for(i in names(Boston)){
if(!(i %in% c("crim", "chas"))){
summ.lm.fit <- summary(lm(crim ~ poly(eval(parse(text=i)),3), data=Boston))
coefs.poly[j,] = c(i, summ.lm.fit$coefficients[2,], summ.lm.fit$r.squared)
j <- j+1
}
}
coefs.poly[,-1] <- lapply(coefs.poly[,-1], FUN=function(x) as.numeric(x))
coefs.poly <- coefs.poly[order(coefs.poly$r.squared, decreasing = T),]
print(coefs.poly)
## predictor Estimate Std.Error t.value Pr.t r.squared
## 12 medv -75.05761 6.569152 -11.425768 4.930818e-27 0.42020026
## 7 rad 120.90745 6.682402 18.093412 1.053211e-56 0.40003687
## 8 tax 112.64583 6.853707 16.435751 6.976314e-49 0.36888208
## 3 nox 81.37202 7.233605 11.249165 2.457491e-26 0.29697790
## 6 dis -73.38859 7.331479 -10.010066 1.253249e-21 0.27782477
## 2 indus 78.59082 7.423121 10.587301 8.854243e-24 0.25965786
## 11 lstat 88.06967 7.629436 11.543404 1.678072e-27 0.21793243
## 5 age 68.18201 7.839703 8.697015 4.878803e-17 0.17423099
## 10 black -74.43120 7.954643 -9.356951 2.730082e-19 0.14983983
## 9 ptratio 56.04523 8.121583 6.900777 1.565484e-11 0.11378158
## 4 rm -42.37944 8.329676 -5.087766 5.128048e-07 0.06778606
## 1 zn -38.74984 8.372207 -4.628389 4.697806e-06 0.05824197
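Note that the table above keeps only the linear-term coefficient of each cubic fit; to judge non-linearity directly one would also look at the p-values of the quadratic and cubic terms, e.g. (a sketch for a single predictor):
summary(lm(crim ~ poly(medv, 3), data=Boston))$coefficients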
To compare, I plot the linear-term coefficients from the simple linear fits against those from the cubic polynomial fits.
df = data.frame("simple"=coefs[,2])
row.names(df) <- coefs[, 1]
df$poly <- NA
for(i in coefs.poly[,1]){
df[row.names(df)==i, "poly"] <- coefs.poly[coefs.poly[,1]==i, "Estimate"]
}
plot(df$simple, df$poly, xlab="Coef for Simple Linear Regression", ylab="Coef for Poly Linear Regression")
text(x=df$simple, y=df$poly, labels=row.names(df), cex=.7, col="blue", pos=4)
Again the nox coefficient is far away from the rest, so I remove it.
df.clean = df[!(row.names(df)%in%"nox"),]
plot(df.clean$simple, df.clean$poly, xlab="Coef for Simple Linear Regression", ylab="Coef for Poly Linear Regression")
text(x=df.clean$simple, y=df.clean$poly, labels=row.names(df.clean), cex=.7, col="blue", pos=4)