Statistical Learning

Notes

Inference is what statistics is mostly about, prediction is what machine learning is mostly about. – Statistics vs Machine Learning, fight!

Models

  • Parametric models reduce the problem of estimating \(f\) to estimating a few parameters. Easy to interpret, but might not fit the data well.
  • Non-parametric models do not assume any particular functional form for \(f\). They usually fit the data better than parametric models, provided that a large number of observations is available, far more than parametric models require.

Learning

  • Supervised learning: For each observation \(i\) of predictor measurements \(x_i\) there is an associated response measurement \(y_i\).
  • Unsupervised learning: For each observation \(i\) there is a vector of measurements \(x_i\) but no associated response measurement.
  • Semi-supervised learning: Response measurements are available for some of the observations but not all.

Variables & Problems

  • Quantitative variables: numerical
  • Qualitative variables: categorical
  • Regression problems have quantitative response.
  • Classification problems have qualitative response.

Accuracy

There is no free lunch in statistics.

  • Mean squared error is typically used for regression problems.
  • Error rate, the proportion of mistakes made when we apply the estimate \(\hat{f}\) to the data, is typically used for classification problems.
  • Test error rate is minimized by the Bayes Classifier.
  • Cross-validation is used to estimate test MSE using training data.
  • Variance is the change in the estimate \(\hat{f}\) of \(f\) due to change in training data.
  • Bias is the error due to approximating a complicated problem with a simpler model.
  • To reduce test error we need a model with low variance and low bias.
  • Increasing flexibility of the model generally decreases bias but increases variance.
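The two accuracy measures listed above can be sketched in a few lines. This is a minimal illustration only: the function names and the toy values are assumptions made for this example and do not come from the notes.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between responses and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(labels_true, labels_pred):
    """Proportion of observations whose predicted class is wrong."""
    mistakes = sum(t != p for t, p in zip(labels_true, labels_pred))
    return mistakes / len(labels_true)


# Toy values chosen only for illustration.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
err = error_rate(["Red", "Green", "Red", "Red"], ["Red", "Red", "Red", "Green"])
print(mse, err)
```

Computed on training data these are the training MSE and training error rate; the same functions applied to held-out data give the test versions.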

Exercises

Question 1

a) (n > p) We can expect the flexible method to perform better. With a large sample size, the flexible method can fit the data more closely than the inflexible method without overfitting.

b) (n < p) Since the sample size is too small, we can expect the flexible method to overfit. The inflexible method will perform better.

c) The inflexible method will suffer from high bias. The flexible method will perform better.

d) The flexible method might fit the erroneous observations. It will perform worse than the inflexible method.

Question 2

a) The response is quantitative. This is a regression problem. We are trying to infer how the CEO salary depends on the various factors. We are not trying to predict the CEO salary.

b) The response is qualitative. This is a classification problem. We are trying to predict whether the product will be a success or a failure.

c) The response is quantitative. This is a regression problem. We are interested in prediction.

Question 3

The sketches are as follows:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

flexibility = np.linspace(0, 10, 100)
squared_bias = 0.02 * (10 - flexibility) ** 2
variance = 0.02 * flexibility ** 2
training_error = 0.003 * (10 - flexibility) ** 3
test_error = 3 - 0.6 * flexibility + 0.06 * flexibility ** 2
bayes_error = np.ones_like(flexibility)

plt.close('all') # To prevent memory consumption
fig, ax = plt.subplots()
ax.plot(flexibility, squared_bias, label="Squared Bias")
ax.plot(flexibility, variance, label="Variance")
ax.plot(flexibility, training_error, label="Training Error")
ax.plot(flexibility, test_error, label="Test Error")
ax.plot(flexibility, bayes_error, label="Bayes Error")
ax.set_xlabel("Flexibility")
ax.legend(loc="upper center")

sns.despine()

fig.savefig("img/bv-decomp.png", dpi=90)

These curves are not exact representations of how the actual bias, variance, etc. would behave; they are rough sketches that convey the general idea.

The (squared) bias decreases with increasing flexibility because the model fits to the data better and better. On the other hand the model is more sensitive to training data with increasing flexibility, resulting in the increasing trend for the variance.

The training error behaves much like the bias: it decreases steadily as flexibility increases, because a more flexible model can follow the training data more closely. The test error initially decreases, since with increasing flexibility the model has a better chance of predicting the test response. However, beyond a certain flexibility the model is overfitted to the training data and gives sub-optimal results on the test data.

The Bayes error is independent of the flexibility of the model. It completely depends on the data.
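The trends described above can be checked with a small simulation. The following is a sketch under stated assumptions, not part of the original solution: the true function (a sine curve), the noise level, the sample sizes, and the polynomial degrees used as a stand-in for "flexibility" are all hypothetical choices. It repeatedly draws training sets, fits polynomials of increasing degree, and estimates the squared bias and the variance of the predictions on a fixed test grid.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_f(x):
    """Assumed true function, used only for this illustration."""
    return np.sin(2 * np.pi * x)


noise_sd = 0.3                        # standard deviation of the irreducible error
x_test = np.linspace(0.05, 0.95, 50)  # fixed test points
n_train, n_repeats = 30, 200

results = {}
for degree in (1, 3, 9):  # higher degree = more flexible model
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh training set and fit a polynomial of the given degree.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        fit = np.polynomial.Polynomial.fit(x, y, degree)
        preds[r] = fit(x_test)
    # Squared bias: gap between the average prediction and the truth.
    squared_bias = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    # Variance: spread of the predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (squared_bias, variance)
    print(degree, round(squared_bias, 4), round(variance, 4))
```

Under these assumptions the straight-line fit should show the largest squared bias and the degree-9 fit the largest variance, which matches the shape of the sketched curves.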

Question 4

a) Three real-life applications of classification are:

  • Predicting if an email is spam or non-spam.
  • Predicting if a customer will remain loyal to the brand or not.

b) Three real-life applications of regression are:

  • Predicting real estate prices based on certain factors, like location, size, etc. The price will be the response, and location, size, etc. will be the predictors.

c) Three real-life applications of cluster analysis are:

  • Grouping galaxies based on the profile of the light that they emit.
  • Checking how similar two documents are. This could be useful in preventing plagiarism.


Question 5

Advantages:

  • Flexible approach is better able to fit the training data.
  • Flexible approach might be able to estimate the underlying function better than the inflexible approach.

Disadvantages:

  • Flexible approaches are prone to overfitting.
  • Flexible approaches are more difficult to interpret than inflexible approaches.

A more flexible approach is more suitable when we are interested in prediction and a large amount of training data is available. A less flexible approach is more suitable when we are interested in inference, or if we do not have sufficient data.

Question 6

In a parametric approach we first choose a model to fit the data to. This reduces the problem of estimating the true function to estimating the parameters of the model. Non-parametric approaches do not make any assumption about the form of the true function. One advantage of a parametric approach is that it does not require as much training data as a non-parametric approach. It is also easier to interpret and less prone to overfitting. On the other hand, the model used in a parametric approach might be nothing like the true function.
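To make the contrast concrete, here is a minimal sketch that fits noise-free quadratic data with a parametric straight-line model and with a non-parametric nearest-neighbour average. The helper names `fit_line` and `knn_predict`, and the toy data, are assumptions introduced for this example.

```python
def fit_line(xs, ys):
    """Parametric: estimate the two parameters of y = b0 + b1 * x by least squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - b1 * x_bar, b1


def knn_predict(xs, ys, x0, k=3):
    """Non-parametric: average the responses of the k training points nearest x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


# Toy data from a quadratic, so a straight line is the wrong functional form.
xs = [i / 2 for i in range(-6, 7)]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
line_at_zero = b0 + b1 * 0.0             # parametric prediction at x = 0
knn_at_zero = knn_predict(xs, ys, 0.0)   # non-parametric prediction at x = 0
print(line_at_zero, knn_at_zero)
```

The true value at x = 0 is 0. The line only has two parameters to estimate but misses badly because its assumed form is wrong, while the neighbour average stays close because it relies only on nearby observations.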

Question 7

The data is as follows:

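| Obs. | X_1 | X_2 | X_3 | Y     |
|------+-----+-----+-----+-------|
|    1 |   0 |   3 |   0 | Red   |
|    2 |   2 |   0 |   0 | Red   |
|    3 |   0 |   1 |   3 | Red   |
|    4 |   0 |   1 |   2 | Green |
|    5 |  -1 |   0 |   1 | Green |
|    6 |   1 |   1 |   1 | Red   |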

The test point is X_1 = X_2 = X_3 = 0. The Euclidean distances between the observations and the test point are calculated as follows:

import pandas as pd
from tabulate import tabulate

df = pd.DataFrame.from_dict({'X1': [0, 2, 0, 0, -1, 1], 'X2': [3, 0, 1, 1, 0, 1], 'X3': [0, 0, 3, 2, 1, 1], 'Y':['Red', 'Red', 'Red', 'Green', 'Green', 'Red']})
test = np.array([0, 0, 0])
df['Distance'] = np.linalg.norm(df[['X1', 'X2', 'X3']].values-test, axis=1)
pd.set_option('display.precision', 5)  # the bare 'precision' key is ambiguous in recent pandas
print(tabulate(df, df.columns, tablefmt="orgtbl"))
|    |   X1 |   X2 |   X3 | Y     |   Distance |
|----+------+------+------+-------+------------|
|  0 |    0 |    3 |    0 | Red   |    3       |
|  1 |    2 |    0 |    0 | Red   |    2       |
|  2 |    0 |    1 |    3 | Red   |    3.16228 |
|  3 |    0 |    1 |    2 | Green |    2.23607 |
|  4 |   -1 |    0 |    1 | Green |    1.41421 |
|  5 |    1 |    1 |    1 | Red   |    1.73205 |

If \(K = 1\), then the prediction is Green. From the above table we see that the test point is closest to the fifth observation (row index 4 in the table), so we classify it in the same group as that observation.

For \(K = 3\), the neighbors are observations 2, 5, and 6. The responses for 2 and 6 are Red. The response for 5 is Green. The probability for being Red is higher than being Green (2/3 > 1/3). Using the idea of the Bayes classifier we predict that the response will be Red.

If the Bayes decision boundary is highly nonlinear then the best value for \(K\) will be small. A smaller \(K\) gives a more flexible classifier: each prediction depends on only a few nearby neighbors, so the decision boundary is better able to capture local non-linearities.
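The reasoning for \(K = 1\) and \(K = 3\) can be reproduced with a short K-nearest-neighbours sketch that uses only the standard library. The function name `knn_classify` is an assumption for this example, and ties in the vote (which do not occur here) would be broken by the order in which labels are first seen.

```python
from collections import Counter
from math import dist

# Training observations from the table above: (X1, X2, X3) -> Y.
observations = [
    ((0, 3, 0), "Red"),
    ((2, 0, 0), "Red"),
    ((0, 1, 3), "Red"),
    ((0, 1, 2), "Green"),
    ((-1, 0, 1), "Green"),
    ((1, 1, 1), "Red"),
]


def knn_classify(point, data, k):
    """Return the majority label among the k observations nearest to point."""
    nearest = sorted(data, key=lambda obs: dist(obs[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


test_point = (0, 0, 0)
pred_k1 = knn_classify(test_point, observations, k=1)
pred_k3 = knn_classify(test_point, observations, k=3)
print(pred_k1, pred_k3)  # Green Red
```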

Question 8

The College data set

college = pd.read_csv("data/College.csv")
print(tabulate(college.head(), college.columns, tablefmt="orgtbl"))

College names as index

college.set_index("Unnamed: 0", inplace=True)
college.index.name = "Names"

headers = [college.index.name] + list(college.columns)
print(tabulate(college.head(), headers, tablefmt="orgtbl"))

Summary of data

# describe() keeps only the numeric columns, so take the headers from its own keys.
print(tabulate(college.describe(), headers="keys", tablefmt="orgtbl"))

Scatter plot matrix

plot_columns = list(college.columns)[:10]
plt.close('all')
spm = sns.pairplot(college[plot_columns])
spm.fig.set_size_inches(12, 12)
spm.savefig("img/college_scatter.png", dpi=90)

Box plots

plt.close('all')
bp1 = sns.boxplot(x="Private", y="Outstate", data=college)
sns.despine()
plt.tight_layout()
bp1.get_figure().savefig("img/college_outstate_private.png", dpi=90)

Elite universities

college["Elite"] = college["Top10perc"].apply(lambda x: "Yes" if x > 50 else "No")
print(college["Elite"].value_counts())

There are 78 elite universities, where more than 50% of their students come from the top 10% of their high school classes.

plt.close('all')
bp2 = sns.boxplot(x="Elite", y="Outstate", data=college)
sns.despine()
plt.tight_layout()
bp2.get_figure().savefig("img/college_outstate_elite.png", dpi=90)

Binning and histograms

We are going to produce histograms for some of the quantitative variables with differing number of bins. We first need to bin these quantitative variables.

print(college.info())
<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Private      777 non-null    object
 1   Apps         777 non-null    int64
 2   Accept       777 non-null    int64
 3   Enroll       777 non-null    int64
 4   Top10perc    777 non-null    int64
 5   Top25perc    777 non-null    int64
 6   F.Undergrad  777 non-null    int64
 7   P.Undergrad  777 non-null    int64
 8   Outstate     777 non-null    int64
 9   Room.Board   777 non-null    int64
 10  Books        777 non-null    int64
 11  Personal     777 non-null    int64
 12  PhD          777 non-null    int64
 13  Terminal     777 non-null    int64
 14  S.F.Ratio    777 non-null    float64
 15  perc.alumni  777 non-null    int64
 16  Expend       777 non-null    int64
 17  Grad.Rate    777 non-null    int64
 18  Elite        777 non-null    object
dtypes: float64(1), int64(16), object(2)
memory usage: 141.4+ KB
None

We see that there are 17 quantitative variables. For this activity I will choose Enroll, Books, PhD, and Grad.Rate as the quantitative variables to plot. To keep things simple we will bin these variables into either 3 bins or 5 bins.

cut_bins3 = ["Low", "Medium", "High"]
cut_bins5 = ["Very Low", "Low", "Medium", "High", "Very High"]
college["Enroll2"] = pd.cut(college["Enroll"], 5, labels=cut_bins5)
college["Books2"] = pd.cut(college["Books"], 3, labels=cut_bins3)
college["PhD2"] = pd.cut(college["PhD"], 3, labels=cut_bins3)
college["Grad.Rate2"] = pd.cut(college["Grad.Rate"], 5, labels=cut_bins5)

plt.close("all")
fig, axs = plt.subplots(2, 2)
# Pass the data with the x keyword; recent seaborn releases no longer accept it positionally.
sns.countplot(x=college["Enroll2"], ax=axs[0, 0])
sns.countplot(x=college["Books2"], ax=axs[0, 1])
sns.countplot(x=college["PhD2"], ax=axs[1, 0])
sns.countplot(x=college["Grad.Rate2"], ax=axs[1, 1])
sns.despine()

# Rotate the tick labels on every subplot so they do not overlap.
for ax in axs.flat:
    plt.setp(ax.get_xticklabels(), rotation=40, ha="right")

plt.subplots_adjust(wspace=0.4, hspace=1)
fig.savefig("img/college_hist.png", dpi=90)
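As a quick check of what `pd.cut` does with an integer number of bins, the sketch below bins a small made-up series. The values are hypothetical and were chosen so that the bin edges (0, 20, 40, 60) are easy to verify by hand; they are not taken from the College data.

```python
import pandas as pd

# Hypothetical values, chosen so the bin edges are easy to verify by hand.
values = pd.Series([0, 10, 20, 30, 40, 50, 60])

# Three equal-width bins over the observed range; right edges are inclusive.
binned = pd.cut(values, 3, labels=["Low", "Medium", "High"])
counts = binned.value_counts().sort_index()
print(counts)
```

The bins have equal width rather than equal counts, so a skewed variable can end up with most observations in the lowest bin.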

Question 9

Predictors of the Auto data set

auto = pd.read_csv("data/Auto.csv")
auto.dropna(inplace=True)
print(auto.info())

We see that there are two qualitative predictors, horsepower and name. While name is expected to be qualitative, horsepower should presumably be quantitative. We should check the data in the horsepower column and see if we can convert that to a numeric form.

print(auto["horsepower"].unique())

So the reason that horsepower is not numeric is that some missing values are represented by “?”. We need to remove the rows containing the missing data, and then make this column numeric.

auto.drop(auto[auto.horsepower == "?"].index, inplace=True)
auto["horsepower"] = pd.to_numeric(auto["horsepower"])
print(auto.info())

Now name is the only qualitative predictor.
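An alternative to dropping the “?” rows by hand is to tell pandas about the placeholder when reading the file, using the `na_values` parameter of `pd.read_csv`. The sketch below uses a tiny in-memory stand-in for the file; its rows are hypothetical and are not taken from Auto.csv.

```python
import io

import pandas as pd

# A tiny stand-in for Auto.csv (hypothetical rows); "?" marks a missing value.
csv_text = "mpg,horsepower,name\n18.0,130,car a\n25.0,?,car b\n22.0,95,car c\n"

# Treat "?" as missing while parsing, so horsepower is read as a numeric column.
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")
sample = sample.dropna()
print(sample.dtypes)
print(len(sample))
```

With this approach the existing `dropna` call removes the affected rows and no separate `pd.to_numeric` conversion is needed.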

Range of quantitative predictors

from pprint import pprint

quant = auto.select_dtypes(exclude="object").columns
ranges = {col: (min(auto[col]), max(auto[col])) for col in quant}
pprint(ranges)

Mean and standard deviation of quantitative predictors

msd = {col: {"mean": round(np.mean(auto[col]), 2), "std": round(np.std(auto[col]), 2)} for col in quant}
pprint(msd)

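# An alternative is to use the following aggregate method. Note that pandas
# computes the sample standard deviation (ddof=1), whereas np.std defaults to
# the population standard deviation (ddof=0), so the values differ slightly:
# auto[quant].agg(["mean", "std"])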

Data subset

We remove the 10^th through 85^th observations, and then calculate the ranges, mean and standard deviation of the remaining data set. Since positions are zero-based, these observations correspond to the slice 9:85.

auto2 = auto.drop(auto.index[9:85])

ranges = {col: (min(auto2[col]), max(auto2[col])) for col in quant}
pprint(ranges)
msd = {col: {"mean": round(np.mean(auto2[col]), 2), "std": round(np.std(auto2[col]), 2)} for col in quant}
pprint(msd)

Pair plots

plt.close('all')
spm = sns.pairplot(auto[["mpg", "horsepower", "weight", "displacement", "acceleration"]])
spm.fig.set_size_inches(6, 6)
spm.savefig("img/auto_pair.png")

We observe that the gas mileage mpg decreases somewhat linearly as horsepower, weight, and displacement increase. This seems reasonable. Similarly, displacement is positively correlated with weight and horsepower. The relation between acceleration and the other variables is not easy to interpret from these plots.

Predicting gas mileage

As we observed earlier, mpg has a roughly linear relation with horsepower, weight, and displacement. We can therefore use these variables to predict mpg.
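One way such a prediction could be set up is an ordinary least-squares fit. The sketch below is an illustration under stated assumptions: it uses synthetic stand-in data rather than the Auto data set, and the coefficients used to generate that data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT the Auto data set): mpg falls as
# horsepower and weight rise, plus some random noise.
n = 200
horsepower = rng.uniform(50, 200, n)
weight = rng.uniform(1800, 5000, n)
mpg = 46.0 - 0.05 * horsepower - 0.006 * weight + rng.normal(0, 1.0, n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), horsepower, weight])
coef, *_ = np.linalg.lstsq(X, mpg, rcond=None)

# Predict mpg for a hypothetical car with 100 horsepower weighing 3000 lb.
new_car = np.array([1.0, 100.0, 3000.0])
predicted_mpg = new_car @ coef
print(coef, predicted_mpg)
```

On the real data the same steps would apply with `auto[["horsepower", "weight"]]` as predictors, although the curvature visible in the pair plots suggests that a purely linear fit is only a first approximation.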

Question 10

Boston data set

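# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this block requires an older scikit-learn release.
from sklearn.datasets import load_boston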

lb = load_boston()
boston = pd.DataFrame(lb.data, columns=lb.feature_names)
boston['MEDV'] = lb.target
print(tabulate(boston.head(), boston.columns, tablefmt="orgtbl"))
print(lb['DESCR'])

There are 506 rows and 14 columns in this data set. The last column shows the median value of owner-occupied homes in Boston suburbs, and the other columns show the values of the different factors / predictors on which the median value presumably depends. Each row holds the data collected for one of 506 Boston suburbs.

Pair plots

plt.close("all")
spm = sns.pairplot(boston, plot_kws = {'s': 10})
spm.fig.set_size_inches(12, 12)
spm.savefig("img/boston_scatter.png", dpi=90)

Looking at the plots we can easily identify that the median value has a positive linear correlation with the number of rooms (RM), and a negative, possibly non-linear, correlation with the “% lower status of the population” (LSTAT). We also see that RM has a negative correlation with LSTAT. This makes sense, since houses with more rooms are expected to be more expensive, and someone belonging to the low-income group will not be able to afford such a house. It is harder to determine from the plot how the median value depends on the other predictors.

Association with per capita crime rate

print(boston.corrwith(boston["CRIM"]).sort_values())

From the correlation values we can expect RAD (accessibility to radial highways) and TAX (property tax rates) to be associated with the per capita crime rate.

plt.close("all")
sns.scatterplot(x="TAX", y="CRIM", data=boston)
sns.despine()
plt.tight_layout()
plt.savefig("img/boston_crim_tax.png", dpi=90)
plt.close("all")
sns.boxplot(x="RAD", y="CRIM", data=boston)
sns.despine()
plt.tight_layout()
plt.savefig("img/boston_crim_rad.png", dpi=90)

These plots show that the average per capita crime rate is much higher when the tax rate is \(\sim 660\) or the index of accessibility to radial highways is 24.

Predictor ranges

ranges = {col: (boston[col].min(), boston[col].max()) for col in boston.columns[:-1]}
pprint(ranges)

The per capita crime rate varies a lot across Boston suburbs, from a low of 0.00632 to a high of 88.9762. This shows that there are suburbs that have particularly high crime rates:

high_crime = boston.nlargest(5, "CRIM")
print(tabulate(high_crime, boston.columns, tablefmt="orgtbl"))

Similarly the tax rate also shows considerable variation from 187.0 to 711.0. There are suburbs with particularly high tax rates.

high_tax = boston.nlargest(5, "TAX")
print(tabulate(high_tax, boston.columns, tablefmt="orgtbl"))

On the other hand the pupil-to-teacher ratio does not vary much between the different Boston suburbs. There are no suburbs with a particularly high pupil-to-teacher ratio.

Suburbs bounding the Charles river

print(boston["CHAS"].value_counts())

There are 35 suburbs that bound the Charles river.

Median pupil to teacher ratio

print(boston["PTRATIO"].median())

The median pupil-to-teacher ratio is 19.05.

Suburb with lowest median value

print(tabulate(boston.nsmallest(1, "MEDV"), boston.columns, tablefmt="orgtbl"))

The suburb at index 398 (the 399^th row, since the index is zero-based) has the lowest median value. From the ranges that we obtained earlier we can see that this suburb has:

  • relatively high crime rate,
  • relatively high proportion of non-retail business acres,
  • relatively high tax rate,
  • relatively high nitric oxides concentration,
  • relatively high proportion of low-status people,
  • old houses.

print(tabulate(boston.describe(), boston.columns, tablefmt="orgtbl"))

We in fact see that for this suburb the crime rate, the nitric oxides concentration, and the proportion of low-status people are higher than their respective 75% quantiles, while the proportion of non-retail business acres and the tax rate are equal to their respective 75% quantiles.

Average number of rooms

rm7 = np.sum(boston["RM"] > 7)
rm8 = np.sum(boston["RM"] > 8)
print(rm7, rm8)

There are 64 suburbs which average more than seven rooms per dwelling and 13 suburbs which average more than eight rooms per dwelling.

eight_rooms = boston[boston["RM"] > 8]
print(tabulate(eight_rooms.describe(), boston.columns, tablefmt="orgtbl"))

These suburbs have higher median values for homes compared to the other suburbs, and correspondingly lower crime rates, lower proportions of low-status people, and lower proportions of non-retail business acres.