---
title: "ISLR, Chapter 3 - Linear Regression"
output: github_document
---
```{r import libraries, message=FALSE, warning=FALSE}
library(MASS)
library(ISLR)
library(tidyverse)
```
First, we will use the `Boston` dataset from the `MASS` package. Let's look at it:
```{r look at dataset}
names(Boston)
# ?Boston
```
# Simple Linear Regression
Look at the median value of homes (`medv`) as a function of the percentage of lower-status population (`lstat`):
```{r plot1}
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point(size = 3, alpha = .3)
```
## Fit a linear model
```{r fit1}
fit1 = lm(medv ~ lstat, data = Boston) ## ~ = "is modelled as"
fit1 ## brief summary
summary(fit1) ## more detailed summary
```
`lstat` is significant.
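As an optional aside, the fitted `lm` object stores its pieces in named components, so you can pull out just what you need:
```{r fit1 components}
names(fit1) ## components stored in the lm object
coef(fit1)  ## just the estimated coefficients
```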
## Add the linear model fit to the plot
```{r plot2}
plot(medv ~ lstat, Boston) ## base R scatterplot
abline(fit1, col = "red")  ## add the linear model fit to the plot
## the same plot in ggplot2
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point(size = 3, alpha = .3) +
  geom_smooth(method = 'lm', formula = y ~ x) ## adds the linear model fit
```
## Get the confidence interval for the coefficients
```{r CI1}
confint(fit1) ## default is 95% CI
```
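The level can be changed with the `level` argument; for example, a wider 99% interval:
```{r CI2}
confint(fit1, level = 0.99) ## 99% CI for the coefficients
```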
## Predict `medv` for 3 new values of `lstat` & get the CI
```{r predict1}
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "confidence")
```
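For predicting individual observations, a **prediction interval** is the more appropriate (and wider) choice, since it also accounts for the variability of single observations around the regression line; just switch the `interval` argument:
```{r predict2}
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "prediction") ## wider than the confidence intervals
```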
# Multiple Linear Regression
```{r fit2}
fit2 = lm(medv ~ lstat + age, Boston)
summary(fit2)
```
`age` is also significant.
Now, let's try to use all of the variables in `Boston` as predictors:
```{r fit3}
fit3 = lm(medv ~ ., Boston)
summary(fit3)
```
Now, `age` is no longer significant, and neither is `indus`. This suggests that other predictors are correlated with `age`, and in their presence `age` is no longer needed in the model.
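As a quick sanity check on that claim, we can look at how `age` correlates with the other (all numeric) columns of `Boston`:
```{r age correlations}
sort(cor(Boston)[, "age"]) ## age is fairly strongly correlated with several other predictors
```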
## You can also plot linear models
```{r}
par(mfrow = c(2,2))
plot(fit3)
```
The curved line in the Residuals vs Fitted plot indicates that the model is not quite capturing everything that is going on and there seems to be some non-linearity. The Scale-Location plot in the bottom left checks whether the variance is changing with the mean (i.e. with the fitted values).
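If you want one diagnostic at a time, `plot()` on an `lm` object takes a `which` argument; for example, the Scale-Location plot on its own:
```{r scale location}
plot(fit3, which = 3) ## Scale-Location: does the variance change with the fitted values?
```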
You can use `update()` to update the model:
```{r fit4}
fit4 = update(fit3, ~ . - age - indus) ## remove age and indus
summary(fit4)
```
Now everything left in the model is significant.
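Since `fit4` is nested inside `fit3`, an F-test is one way to confirm that dropping `age` and `indus` doesn't significantly hurt the fit:
```{r anova fit4 fit3}
anova(fit4, fit3) ## F-test comparing the reduced model to the full model
```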
# Non-linear Terms and Interactions
Let's make an interaction between `lstat` and `age`:
```{r fit5}
fit5 = lm(medv ~ lstat * age, Boston) ## '*' automatically includes the main effects as well
summary(fit5) ## the interaction is indicated by the ':'
```
The main effect for `age` is still not significant, but the interaction is somewhat significant.
Since we saw that there is a non-linear looking scatter plot between `medv` and `lstat`, we can try a **quadratic term**:
```{r fit6}
fit6 = lm(medv ~ lstat + I(lstat^2), Boston) ## now we have to explicitly put in the main effect
summary(fit6)
```
Both coefficients are strongly significant.
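An F-test comparing the two nested models is another way to see how much the quadratic term adds over the purely linear fit:
```{r anova quadratic}
anova(fit1, fit6) ## linear fit vs. fit with the quadratic term
```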
## Let's plot the quadratic fit:
```{r plot3}
attach(Boston)
plot(medv ~ lstat)
points(lstat, fitted(fit6), col = "red", pch = 20) ## overlay fitted values from our model; pch = plotting character
## the same plot in ggplot2
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point(size = 3, alpha = .3) +
  geom_point(aes(x = lstat, y = fitted(fit6)), color = "blue") ## fitted values from the quadratic model
```
There's actually an easier way of fitting polynomials using the `poly()` function:
```{r fit7}
fit7 = lm(medv ~ poly(lstat, 4)) ## uses the attached Boston data
plot(medv ~ lstat)
points(lstat, fitted(fit6), col = "red", pch = 20)  ## quadratic fit
points(lstat, fitted(fit7), col = "blue", pch = 20) ## fourth-degree polynomial fit
```
The fourth degree polynomial seems to over-fit the data a bit.
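Note that `poly()` uses orthogonal polynomials by default, so its coefficients are not directly comparable to those of `fit6`; passing `raw = TRUE` fits the raw powers instead (the fitted values are the same either way). A quick check, using `fit6b` as a throwaway name:
```{r poly raw}
fit6b = lm(medv ~ poly(lstat, 2, raw = TRUE), Boston)
coef(fit6b) ## same parameterisation as medv ~ lstat + I(lstat^2)
```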
# Qualitative Predictors
We'll use the `Carseats` dataset for this:
```{r carseats}
summary(Carseats)
```
For the model, we'll add interactions between `Income` & `Advertising` and between `Age` & `Price`. The `:` operator adds only the interaction term, whereas `*` also adds the main effects:
```{r fit8}
fit8 = lm(Sales ~ . + Income:Advertising + Age:Price, Carseats)
summary(fit8)
```
The `Income:Advertising` interaction is significant, but `Age:Price` is not.
`ShelveLoc` is a **qualitative variable**; you can use `contrasts()` to see how R coded it for the linear model:
```{r qualitative predictors}
contrasts(Carseats$ShelveLoc)
```
Since it is a factor with three levels, R coded it as two dummy variables, 'Good' and 'Medium' (with 'Bad' as the baseline level).
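The level without its own dummy variable ('Bad' here) acts as the baseline. If you'd rather use a different baseline, one option is to `relevel()` the factor before fitting:
```{r relevel shelveloc}
contrasts(relevel(Carseats$ShelveLoc, ref = "Medium")) ## 'Medium' becomes the baseline level
```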
# Making a regression function
```{r function1, message=FALSE, warning=FALSE}
regplot = function(x, y) {
  fit = lm(y ~ x)
  plot(x, y)
  abline(fit, col = "red")
}
attach(Carseats)
regplot(Price,Sales)
```
Let's add `...` to the function so that we can pass through any number of additional plotting arguments:
```{r function 2}
regplot = function(x, y, ...) {
  fit = lm(y ~ x)
  plot(x, y, ...)
  abline(fit, col = "red")
}
regplot(Price, Sales, xlab = "Price", ylab = "Sales", col = "blue", pch = 20)
```
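Since the function just takes two vectors (plus any extra plotting arguments), it works with other data too; for example, the `Boston` variables from earlier:
```{r regplot boston}
regplot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20)
```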