---
title: "ISLR, Chapter 3 - Linear Regression"
output: github_document
---
```{r import libraries, message=FALSE, warning=FALSE}
library(MASS)
library(ISLR)
library(tidyverse)
```
First, we will use the `Boston` dataset from the `MASS` package. Let's look at it:
```{r look at dataset}
names(Boston)
# ?Boston
```
# Simple Linear Regression
Look at the median value of homes (`medv`) as a function of the percentage of lower-status population (`lstat`):
```{r plot1}
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point(size = 3, alpha = .3)
```
## Fit a linear model
```{r fit1}
fit1 = lm(medv ~ lstat, data = Boston) ## ~ = "is modelled as"
fit1 ## brief summary
summary(fit1) ## more detailed summary
```
`lstat` is significant.
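As an optional aside, the fitted `lm` object stores its pieces in named components, so you can pull out just what you need:
```{r fit1 components}
names(fit1) ## components stored in the lm object
coef(fit1)  ## just the estimated coefficients
```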
## Add the linear model fit to the plot
```{r plot2}
plot(medv ~ lstat, Boston) ## base R scatterplot
abline(fit1, col = "red")  ## add the linear model fit to the plot
## the same plot in ggplot2
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point(size = 3, alpha = .3) +
  geom_smooth(method = 'lm', formula = y ~ x) ## adds the linear model fit
```
## Get the confidence interval for the coefficients
```{r CI1}
confint(fit1) ## default is 95% CI
```
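The level can be changed with the `level` argument; for example, a wider 99% interval:
```{r CI2}
confint(fit1, level = 0.99) ## 99% CI for the coefficients
```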
## Predict `medv` for 3 new values of `lstat` & get the CI
```{r predict1}
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "confidence")
```
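For predicting individual observations, a **prediction interval** is the more appropriate (and wider) choice, since it also accounts for the variability of single observations around the regression line; just switch the `interval` argument:
```{r predict2}
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "prediction") ## wider than the confidence intervals
```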
# Multiple Linear Regression
```{r fit2}
fit2 = lm(medv ~ lstat + age, Boston)
summary(fit2)
```
`age` is also significant.
Now, let's try to use all of the variables in `Boston` as predictors:
```{r fit3}
fit3 = lm(medv ~ ., Boston)
summary(fit3)
```
Now, `age` is no longer significant, and neither is `indus`. This suggests that other predictors are correlated with `age`, and in their presence `age` is no longer needed in the model.
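As a quick sanity check on that claim, we can look at how `age` correlates with the other (all numeric) columns of `Boston`:
```{r age correlations}
sort(cor(Boston)[, "age"]) ## age is fairly strongly correlated with several other predictors
```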
## You can also plot linear models
```{r}
par(mfrow = c(2,2))
plot(fit3)
```
The curved line in the Residuals vs Fitted plot indicates that the model is not quite capturing everything that is going on and there seems to be some non-linearity. The Scale-Location plot in the bottom left checks whether the variance is changing with the mean (i.e. with the fitted values).
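If you want one diagnostic at a time, `plot()` on an `lm` object takes a `which` argument; for example, the Scale-Location plot on its own:
```{r scale location}
plot(fit3, which = 3) ## Scale-Location: does the variance change with the fitted values?
```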
You can use `update()` to update the model:
```{r fit4}
fit4 = update(fit3, ~ . - age - indus) ## remove age and indus
summary(fit4)
```
Now everything left in the model is significant.
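Since `fit4` is nested inside `fit3`, an F-test is one way to confirm that dropping `age` and `indus` doesn't significantly hurt the fit:
```{r anova fit4 fit3}
anova(fit4, fit3) ## F-test comparing the reduced model to the full model
```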
# Non-linear Terms and Interactions
Let's make an interaction between `lstat` and `age`:
```{r fit5}
fit5 = lm(medv ~ lstat * age, Boston) ## '*' automatically includes the main effects as well
summary(fit5) ## the interaction is indicated by the ':'
```
The main effect for `age` is still not significant, but the interaction is somewhat significant.
Since we saw that there is a non-linear looking scatter plot between `medv` and `lstat`, we can try a **quadratic term**:
```{r fit6}
fit6 = lm(medv ~ lstat + I(lstat^2), Boston) ## now we have to explicitly put in the main effect
summary(fit6)
```
Both coefficients are strongly significant.
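An F-test comparing the two nested models is another way to see how much the quadratic term adds over the purely linear fit:
```{r anova quadratic}
anova(fit1, fit6) ## linear fit vs. fit with the quadratic term
```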
## Let's plot the quadratic fit:
```{r plot3}
attach(Boston)
plot(medv ~ lstat)
points(lstat, fitted(fit6), col = "red", pch = 20) ## overlay fitted values from our model; pch = plotting character
## the same plot in ggplot2
Boston %>%
  ggplot(aes(x = lstat, y = medv)) +
  geom_point(size = 3, alpha = .3) +
  geom_point(aes(x = lstat, y = fitted(fit6)), color = "blue") ## fitted values from the quadratic model
```
There's actually an easier way of fitting polynomials using the `poly()` function:
```{r fit7}
fit7 = lm(medv ~ poly(lstat, 4)) ## uses the attached Boston data
plot(medv ~ lstat)
points(lstat, fitted(fit6), col = "red", pch = 20)  ## quadratic fit
points(lstat, fitted(fit7), col = "blue", pch = 20) ## fourth-degree polynomial fit
```
The fourth degree polynomial seems to over-fit the data a bit.
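Note that `poly()` uses orthogonal polynomials by default, so its coefficients are not directly comparable to those of `fit6`; passing `raw = TRUE` fits the raw powers instead (the fitted values are the same either way). A quick check, using `fit6b` as a throwaway name:
```{r poly raw}
fit6b = lm(medv ~ poly(lstat, 2, raw = TRUE), Boston)
coef(fit6b) ## same parameterisation as medv ~ lstat + I(lstat^2)
```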
# Qualitative Predictors
We'll use the `Carseats` dataset for this:
```{r carseats}
summary(Carseats)
```
For the model, we'll add interactions between `Income` & `Advertising` and between `Age` & `Price`. The `:` operator adds only the interaction term, whereas `*` also adds the main effects:
```{r fit8}
fit8 = lm(Sales ~ . + Income:Advertising + Age:Price, Carseats)
summary(fit8)
```
The `Income:Advertising` interaction is significant, but `Age:Price` is not.
`ShelveLoc` is a **qualitative variable**; you can use `contrasts()` to see how R coded it for the linear model:
```{r qualitative predictors}
contrasts(Carseats$ShelveLoc)
```
Since it is a factor with three levels, R coded it as two dummy variables, 'Good' and 'Medium' (with 'Bad' as the baseline level).
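The level without its own dummy variable ('Bad' here) acts as the baseline. If you'd rather use a different baseline, one option is to `relevel()` the factor before fitting:
```{r relevel shelveloc}
contrasts(relevel(Carseats$ShelveLoc, ref = "Medium")) ## 'Medium' becomes the baseline level
```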
# Making a regression function
```{r function1, message=FALSE, warning=FALSE}
regplot = function(x, y) {
  fit = lm(y ~ x)
  plot(x, y)
  abline(fit, col = "red")
}
attach(Carseats)
regplot(Price,Sales)
```
Let's add `...` to the function so that we can pass through any number of additional plotting arguments:
```{r function 2}
regplot = function(x, y, ...) {
  fit = lm(y ~ x)
  plot(x, y, ...)
  abline(fit, col = "red")
}
regplot(Price, Sales, xlab = "Price", ylab = "Sales", col = "blue", pch = 20)
```
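Since the function just takes two vectors (plus any extra plotting arguments), it works with other data too; for example, the `Boston` variables from earlier:
```{r regplot boston}
regplot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20)
```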