Inference is what statistics is mostly about, prediction is what machine learning is mostly about. – Statistics vs Machine Learning, fight!
- Parametric models reduce the problem of estimating \(f\) to estimating a few parameters. Easy to interpret, but might not fit the data well.
- Non-parametric models do not assume any particular functional form for \(f\). They usually fit the data better than parametric models, provided that a large number of observations is available, far more than parametric models require.
- Supervised learning: For each observation \(i\) of predictor measurements \(x_i\) there is an associated response measurement \(y_i\).
- Unsupervised learning: For each observation \(i\) there is a vector of measurements \(x_i\) but no associated response measurement.
- Semi-supervised learning: Response measurements are available for some of the observations but not all.
- Quantitative variables: numerical
- Qualitative variables: categorical
- Regression problems have quantitative response.
- Classification problems have qualitative response.
There is no free lunch in statistics.
- Mean squared error is typically used for regression problems.
- Error rate, the proportion of mistakes made when we apply the estimate \(\hat{f}\) to the data, is typically used for classification problems.
- Test error rate is minimized by the Bayes classifier.
- Cross-validation is used to estimate test MSE using training data.
- Variance is the change in the estimate \(\hat{f}\) of \(f\) due to change in training data.
- Bias is the error due to approximating a complicated problem with a simpler model.
- To reduce test error we need a model with low variance and low bias.
- Increasing flexibility of the model generally decreases bias but increases variance.
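The points above about MSE and flexibility can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" function `true_f`, the noise level, the sample sizes, and the polynomial degrees are all made up for illustration and do not come from the exercises. Because the polynomial models are nested, the training MSE cannot increase with the degree, whereas the test MSE is under no such constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" function, chosen only for illustration
    return np.sin(2 * np.pi * x)

# Simulated training and test sets: y = f(x) + noise
x_train = rng.uniform(0, 1, 50)
y_train = true_f(x_train) + rng.normal(0, 0.3, 50)
x_test = rng.uniform(0, 1, 500)
y_test = true_f(x_test) + rng.normal(0, 0.3, 500)

def mse(y, y_hat):
    """Mean squared error between observed and predicted responses."""
    return np.mean((y - y_hat) ** 2)

degrees = [1, 3, 6]  # increasing flexibility
train_mse, test_mse = {}, {}
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_mse[d] = mse(y_train, np.polyval(coefs, x_train))
    test_mse[d] = mse(y_test, np.polyval(coefs, x_test))
    print(f"degree={d}: training MSE={train_mse[d]:.3f}, test MSE={test_mse[d]:.3f}")
```

Comparing the two printed columns for different degrees gives a rough feel for how training error and test error can diverge as flexibility grows.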
a) (n > p) We can expect the flexible method to perform better. With a large sample size, the flexible method will be able to fit the data better than the inflexible method.
b) (n < p) Since the sample size is too small, we can expect the flexible method to overfit. The inflexible method will perform better.
c) The inflexible method will suffer from high bias. The flexible method will perform better.
d) The flexible method might fit the erroneous observations. It will perform worse than the inflexible method.
a) The response is quantitative. This is a regression problem. We are trying to infer how the CEO salary depends on the various factors. We are not trying to predict the CEO salary.
b) The response is qualitative. This is a classification problem. We are trying to predict whether the product will be a success or a failure.
c) The response is quantitative. This is a regression problem. We are interested in prediction.
The sketches are as follows:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("ticks")
flexibility = np.linspace(0, 10, 100)
squared_bias = 0.02 * (10 - flexibility) ** 2
variance = 0.02 * flexibility ** 2
training_error = 0.003 * (10 - flexibility) ** 3
test_error = 3 - 0.6 * flexibility + 0.06 * flexibility ** 2
bayes_error = np.ones_like(flexibility)
plt.close('all') # To prevent memory consumption
fig, ax = plt.subplots()
ax.plot(flexibility, squared_bias, label="Squared Bias")
ax.plot(flexibility, variance, label="Variance")
ax.plot(flexibility, training_error, label="Training Error")
ax.plot(flexibility, test_error, label="Test Error")
ax.plot(flexibility, bayes_error, label="Bayes Error")
ax.set_xlabel("Flexibility")
ax.legend(loc="upper center")
sns.despine()
fig.savefig("img/bv-decomp.png", dpi=90)
These graphs are not exact representations of how the actual bias, variance, etc. would look, but rough sketches that convey the general idea.
The (squared) bias decreases with increasing flexibility because the model fits to the data better and better. On the other hand the model is more sensitive to training data with increasing flexibility, resulting in the increasing trend for the variance.
The training error behaves like the bias: it keeps decreasing as flexibility increases, because a more flexible model can follow the training data more closely. The test error initially decreases, since with increasing flexibility the model has a better chance of predicting the test response. However, beyond a certain flexibility the model is overfitted to the training data and gives sub-optimal results on the test data.
The Bayes error is independent of the flexibility of the model. It completely depends on the data.
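For reference, the relation between these curves is the standard bias-variance decomposition of the expected test MSE at a point \(x_0\), written here in the usual textbook notation (\(\epsilon\) denotes the irreducible noise term, a symbol not used elsewhere in these notes):

\[
E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
\]

The first two terms change with flexibility, while \(\mathrm{Var}(\epsilon)\) does not. Since all three terms are non-negative, the expected test error can never fall below the irreducible error, which is why the test error curve stays above the Bayes error line in the sketch.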
a) Three real-life applications of classification are:
- Predicting if an email is spam or non-spam.
- Predicting if a customer will remain loyal to the brand or not.
b) Three real-life applications of regression are:
- Predicting real estate prices based on certain factors, like location, size, etc. The price will be the response, and location, size, etc. will be the predictors.
c) Three real-life applications of cluster analysis are:
- Grouping galaxies based on the profile of the light that they emit.
- Checking how similar two documents are. This could be useful in preventing plagiarism.
Advantages:
- A flexible approach is better able to fit the training data.
- A flexible approach might be able to estimate the underlying function better than an inflexible approach.
Disadvantages:
- Flexible approaches are prone to overfitting.
- Flexible approaches are more difficult to interpret than inflexible approaches.
A more flexible approach is more suitable when we are interested in prediction and a large amount of training data is available. A less flexible approach is more suitable when we are interested in inference, or if we do not have sufficient data.
In a parametric approach we first choose a model to fit the data to. This reduces the problem of estimating the true function to estimating the parameters of the model. Non-parametric approaches do not make any assumption about the form of the true function. One advantage of a parametric approach is that it does not require as much training data as a non-parametric approach. It is also easier to interpret and less prone to overfitting. On the other hand, the model used in a parametric approach might be nothing like the true function.
The data is as follows:
Obs. | X_1 | X_2 | X_3 | Y |
---|---|---|---|---|
1 | 0 | 3 | 0 | Red |
2 | 2 | 0 | 0 | Red |
3 | 0 | 1 | 3 | Red |
4 | 0 | 1 | 2 | Green |
5 | -1 | 0 | 1 | Green |
6 | 1 | 1 | 1 | Red |
The test point is X_1 = X_2 = X_3 = 0. The Euclidean distances between the observations and the test point are calculated as follows:
import pandas as pd
from tabulate import tabulate
df = pd.DataFrame.from_dict({'X1': [0, 2, 0, 0, -1, 1], 'X2': [3, 0, 1, 1, 0, 1], 'X3': [0, 0, 3, 2, 1, 1], 'Y':['Red', 'Red', 'Red', 'Green', 'Green', 'Red']})
test = np.array([0, 0, 0])
df['Distance'] = np.linalg.norm(df[['X1', 'X2', 'X3']].values-test, axis=1)
pd.set_option('display.precision', 5)  # full option name; the bare 'precision' is ambiguous in recent pandas
print(tabulate(df, df.columns, tablefmt="orgtbl"))
|    |   X1 |   X2 |   X3 | Y     |   Distance |
|----+------+------+------+-------+------------|
|  0 |    0 |    3 |    0 | Red   |    3       |
|  1 |    2 |    0 |    0 | Red   |    2       |
|  2 |    0 |    1 |    3 | Red   |    3.16228 |
|  3 |    0 |    1 |    2 | Green |    2.23607 |
|  4 |   -1 |    0 |    1 | Green |    1.41421 |
|  5 |    1 |    1 |    1 | Red   |    1.73205 |
If \(K = 1\), then the prediction is Green. From the above table we see that the test point is closest to the fifth observation (index 4 in the table), so we classify it in the same group as the fifth observation.
For \(K = 3\), the neighbors are observations 2, 5, and 6. The responses for 2 and 6 are Red. The response for 5 is Green. The probability for being Red is higher than being Green (2/3 > 1/3). Using the idea of the Bayes classifier we predict that the response will be Red.
If the Bayes decision boundary is highly nonlinear then the best value for \(K\) will be small. A smaller \(K\) results in more granular grouping, that is for small \(K\) the decision boundary is better able to capture the local non-linearities, because there will be very few neighbors.
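The K-nearest-neighbors reasoning above can be checked with a few lines of NumPy. This is a minimal sketch, not a full implementation: `knn_predict` is a hypothetical helper written for this example, and it makes no attempt to handle ties in distance or in votes.

```python
import numpy as np
from collections import Counter

# Training data from the table above
X = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3], [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
y = np.array(["Red", "Red", "Red", "Green", "Green", "Red"])

def knn_predict(X, y, x0, k):
    """Majority vote among the k training points nearest to x0 (hypothetical helper)."""
    distances = np.linalg.norm(X - x0, axis=1)  # Euclidean distance to each observation
    nearest = np.argsort(distances)[:k]         # positions of the k closest observations
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0]

x0 = np.array([0, 0, 0])  # the test point
pred_k1 = knn_predict(X, y, x0, k=1)
pred_k3 = knn_predict(X, y, x0, k=3)
print(pred_k1, pred_k3)
```

With this data the sketch reproduces the two answers above: Green for \(K = 1\) and Red for \(K = 3\).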
college = pd.read_csv("data/College.csv")
print(tabulate(college.head(), college.columns, tablefmt="orgtbl"))
college.set_index("Unnamed: 0", inplace=True)
college.index.name = "Names"
headers = [college.index.name] + list(college.columns)
print(tabulate(college.head(), headers, tablefmt="orgtbl"))
print(tabulate(college.describe(), college.columns, tablefmt="orgtbl"))
plot_columns = list(college.columns)[:10]
plt.close('all')
spm = sns.pairplot(college[plot_columns])
spm.fig.set_size_inches(12, 12)
spm.savefig("img/college_scatter.png", dpi=90)
plt.close('all')
bp1 = sns.boxplot(x="Private", y="Outstate", data=college)
sns.despine()
plt.tight_layout()
bp1.get_figure().savefig("img/college_outstate_private.png", dpi=90)
college["Elite"] = college["Top10perc"].apply(lambda x: "Yes" if x > 50 else "No")
print(college["Elite"].value_counts())
There are 78 elite universities, where more than 50% of their students come from the top 10% of their high school classes.
plt.close('all')
bp2 = sns.boxplot(x="Elite", y="Outstate", data=college)
sns.despine()
plt.tight_layout()
bp2.get_figure().savefig("img/college_outstate_elite.png", dpi=90)
We are going to produce histograms for some of the quantitative variables with differing number of bins. We first need to bin these quantitative variables.
print(college.info())
<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Private      777 non-null    object
 1   Apps         777 non-null    int64
 2   Accept       777 non-null    int64
 3   Enroll       777 non-null    int64
 4   Top10perc    777 non-null    int64
 5   Top25perc    777 non-null    int64
 6   F.Undergrad  777 non-null    int64
 7   P.Undergrad  777 non-null    int64
 8   Outstate     777 non-null    int64
 9   Room.Board   777 non-null    int64
 10  Books        777 non-null    int64
 11  Personal     777 non-null    int64
 12  PhD          777 non-null    int64
 13  Terminal     777 non-null    int64
 14  S.F.Ratio    777 non-null    float64
 15  perc.alumni  777 non-null    int64
 16  Expend       777 non-null    int64
 17  Grad.Rate    777 non-null    int64
 18  Elite        777 non-null    object
dtypes: float64(1), int64(16), object(2)
memory usage: 141.4+ KB
None
We see that there are 17 quantitative variables. For this activity I will choose Enroll, Books, PhD, and Grad.Rate as the quantitative variables to plot.
To keep things simple we will bin these variables into either 3 bins or 5 bins.
cut_bins3 = ["Low", "Medium", "High"]
cut_bins5 = ["Very Low", "Low", "Medium", "High", "Very High"]
college["Enroll2"] = pd.cut(college["Enroll"], 5, labels=cut_bins5)
college["Books2"] = pd.cut(college["Books"], 3, labels=cut_bins3)
college["PhD2"] = pd.cut(college["PhD"], 3, labels=cut_bins3)
college["Grad.Rate2"] = pd.cut(college["Grad.Rate"], 5, labels=cut_bins5)
plt.close("all")
fig, axs = plt.subplots(2, 2)
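# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=college["Enroll2"], ax=axs[0, 0])
sns.countplot(x=college["Books2"], ax=axs[0, 1])
sns.countplot(x=college["PhD2"], ax=axs[1, 0])
sns.countplot(x=college["Grad.Rate2"], ax=axs[1, 1])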
sns.despine()
axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=40, ha="right")
axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=40, ha="right")
axs[1, 0].set_xticklabels(axs[1, 0].get_xticklabels(), rotation=40, ha="right")
axs[1, 1].set_xticklabels(axs[1, 1].get_xticklabels(), rotation=40, ha="right")
plt.subplots_adjust(wspace=0.4, hspace=1)
fig.savefig("img/college_hist.png", dpi=90)
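As an alternative to binning with `pd.cut` and then counting, a histogram routine can choose equal-width bins directly from a bin count. The sketch below uses `np.histogram` on synthetic values, since the point is only to show how the bin count changes the summary; the mean, spread, and bin counts are assumptions made for illustration, not properties of the College data. The same `bins` argument is accepted by `Axes.hist` in matplotlib and by `sns.histplot` in seaborn.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a quantitative column such as Grad.Rate
values = rng.normal(65, 15, size=777)

hist_counts = {}
for n_bins in (3, 5, 10):
    # counts has n_bins entries; edges has n_bins + 1 equally spaced boundaries
    counts, edges = np.histogram(values, bins=n_bins)
    hist_counts[n_bins] = counts
    print(n_bins, counts)
```

Whatever the number of bins, the counts always add up to the number of observations; only the resolution of the picture changes.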
auto = pd.read_csv("data/Auto.csv")
auto.dropna(inplace=True)
print(auto.info())
We see that there are two qualitative predictors, horsepower and name. While name is expected to be qualitative, horsepower should presumably be quantitative. We should check the data in the horsepower column and see if we can convert it to a numeric form.
print(auto["horsepower"].unique())
So the reason that horsepower is not numeric is that some missing values are represented by “?”. We need to remove the rows containing the missing data, and then make this column numeric.
auto.drop(auto[auto.horsepower == "?"].index, inplace=True)
auto["horsepower"] = pd.to_numeric(auto["horsepower"])
print(auto.info())
Now name is the only qualitative predictor.
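The same cleanup can be done in other ways. The sketch below shows two options that pandas provides: declaring “?” as a missing-value marker when reading the file, or coercing non-numeric entries to NaN afterwards. The tiny in-memory CSV and its values are made up for illustration; only the column names mirror the Auto data.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Auto.csv; the values are made up for illustration
csv_text = "mpg,horsepower,name\n18.0,130,car a\n25.0,?,car b\n22.0,95,car c\n"

# Option 1: treat "?" as missing while reading, then drop incomplete rows
auto_demo = pd.read_csv(io.StringIO(csv_text), na_values="?").dropna()

# Option 2: read as-is, coerce non-numeric entries to NaN, then drop them
raw = pd.read_csv(io.StringIO(csv_text))
raw["horsepower"] = pd.to_numeric(raw["horsepower"], errors="coerce")
raw = raw.dropna()

print(auto_demo.dtypes)
```

Option 1 has the advantage that the column is numeric from the start, so a single `dropna` call removes the affected rows.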
from pprint import pprint
quant = auto.select_dtypes(exclude="object").columns
ranges = {col: (min(auto[col]), max(auto[col])) for col in quant}
pprint(ranges)
msd = {col: {"mean": round(np.mean(auto[col]), 2), "std": round(np.std(auto[col]), 2)} for col in quant}
pprint(msd)
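# An alternative is to use the following aggregate method
# (note that pandas computes the sample standard deviation, ddof=1,
# while np.std defaults to the population version, ddof=0):
# auto.agg(["mean", "std"])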
We remove the 10th through 85th observations, and then calculate the ranges, mean and standard deviation of the remaining data set.
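# Positions 9..84 (0-based) are the 10th through 85th observations
auto2 = auto.drop(auto.index[9:85])
ranges = {col: (min(auto2[col]), max(auto2[col])) for col in quant}
pprint(ranges)
# Use auto2 (not auto) so the statistics reflect the reduced data set
msd = {col: {"mean": round(np.mean(auto2[col]), 2), "std": round(np.std(auto2[col]), 2)} for col in quant}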
pprint(msd)
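Two details of this step are easy to get wrong: how 1-based observation numbers map to 0-based positions, and which definition of the standard deviation is being used. The following sketch checks both on a toy frame; the frame `demo` and its single column are hypothetical and stand in for the Auto data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"value": np.arange(1, 101)})  # observations numbered 1..100

# In 0-based positional slicing, the 10th through 85th observations
# are positions 9 through 84, i.e. the slice [9:85]
remaining = demo.drop(demo.index[9:85])

pop_std = np.std(remaining["value"].to_numpy())  # NumPy default: ddof=0 (population)
sample_std = remaining["value"].std()            # pandas default: ddof=1 (sample)
print(len(remaining), remaining["value"].tolist()[8:10], pop_std, sample_std)
```

Observations 1 to 9 and 86 to 100 remain, and the sample standard deviation is slightly larger than the population one because it divides by n - 1 rather than n.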
plt.close('all')
spm = sns.pairplot(auto[["mpg", "horsepower", "weight", "displacement", "acceleration"]])
spm.fig.set_size_inches(6, 6)
spm.savefig("img/auto_pair.png")
We observe that the gas mileage mpg decreases somewhat linearly as horsepower, weight, and displacement increase. This seems reasonable. Similarly, displacement is positively correlated with weight and horsepower. The relation between acceleration and the other variables is not easy to interpret from these plots.
As we observed earlier, mpg has a roughly linear relation with horsepower, weight, and displacement. We can therefore use these variables to predict mpg.
# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version
from sklearn.datasets import load_boston
lb = load_boston()
boston = pd.DataFrame(lb.data, columns=lb.feature_names)
boston['MEDV'] = lb.target
print(tabulate(boston.head(), boston.columns, tablefmt="orgtbl"))
print(lb['DESCR'])
There are 506 rows and 14 columns in this data set. The last column shows the median value of owner-occupied homes in Boston suburbs, and the other columns show the values of the different factors / predictors on which the median value presumably depends. Each row corresponds to one of 506 Boston suburbs.
plt.close("all")
spm = sns.pairplot(boston, plot_kws = {'s': 10})
spm.fig.set_size_inches(12, 12)
spm.savefig("img/boston_scatter.png", dpi=90)
Looking at the plots we can easily identify that the median value has a positive linear correlation with the number of rooms (RM), and a negative, possibly non-linear, correlation with the “% lower status of the population” (LSTAT). We also see that RM has a negative correlation with LSTAT. This makes sense, since houses with more rooms are expected to be more expensive, and someone belonging to the low-income group will not be able to afford such a house. It is harder to determine from the plot how the median value depends on the other predictors.
print(boston.corrwith(boston["CRIM"]).sort_values())
From the correlation values we can expect RAD (accessibility to radial highways) and TAX (property tax rates) to be associated with the per capita crime rate.
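When reading off associations from `corrwith`, it can help to drop the column's correlation with itself and to sort by absolute value, so that strong negative correlations are not buried at the other end of the list. The sketch below does this on synthetic data; the column names `target`, `related`, and `unrelated` are hypothetical and are not part of the Boston data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "target": base,
    "related": 2 * base + rng.normal(scale=0.5, size=n),  # built to track target closely
    "unrelated": rng.normal(size=n),                       # independent noise
})

corr = (
    toy.corrwith(toy["target"])
    .drop("target")                               # remove the trivial self-correlation
    .sort_values(key=np.abs, ascending=False)     # strongest association first
)
print(corr)
```

The column that was constructed from `target` comes out on top, which is the pattern used above to single out RAD and TAX.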
plt.close("all")
sns.scatterplot(x="TAX", y="CRIM", data=boston)
sns.despine()
plt.tight_layout()
plt.savefig("img/boston_crim_tax.png", dpi=90)
plt.close("all")
sns.boxplot(x="RAD", y="CRIM", data=boston)
sns.despine()
plt.tight_layout()
plt.savefig("img/boston_crim_rad.png", dpi=90)
These plots show that the average per capita crime rate is much higher when the tax rate is \(\sim 660\) or the index of accessibility to radial highways is 24.
ranges = {col: (boston[col].min(), boston[col].max()) for col in boston.columns[:-1]}
pprint(ranges)
The per capita crime rate varies a lot across Boston suburbs, from a low of 0.00632 to a high of 88.9762. This shows that there are suburbs that have particularly high crime rates:
high_crime = boston.nlargest(5, "CRIM")
print(tabulate(high_crime, boston.columns, tablefmt="orgtbl"))
Similarly the tax rate also shows considerable variation from 187.0 to 711.0. There are suburbs with particularly high tax rates.
high_tax = boston.nlargest(5, "TAX")
print(tabulate(high_tax, boston.columns, tablefmt="orgtbl"))
On the other hand the pupil-to-teacher ratio does not vary much between the different Boston suburbs. There are no suburbs with a particularly high pupil-to-teacher ratio.
print(boston["CHAS"].value_counts())
There are 35 suburbs that bound the Charles River.
print(boston["PTRATIO"].median())
The median pupil-to-teacher ratio is 19.05.
print(tabulate(boston.nsmallest(1, "MEDV"), boston.columns, tablefmt="orgtbl"))
The suburb at index 398 has the lowest median value. From the ranges that we obtained earlier we can see that this suburb has:
- relatively high crime rate,
- relatively high proportion of non-retail business acres,
- relatively high tax rate,
- relatively high nitric oxides concentration,
- relatively high proportion of low-status people,
- old houses.
print(tabulate(boston.describe(), boston.columns, tablefmt="orgtbl"))
We in fact see that for this suburb the crime rate, the nitric oxides concentration, and the proportion of low-status people are higher than their respective 75% quantiles, while the proportion of non-retail business acres and the tax rate are equal to their respective 75% quantiles.
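This kind of comparison against the 75% quantiles can also be done programmatically rather than by reading two tables side by side. The following is a small sketch on a made-up frame; the column names `crime`, `rooms`, and `value` and all numbers are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "crime": [0.1, 0.2, 0.3, 0.4, 9.0],
    "rooms": [6.0, 6.5, 7.0, 5.5, 4.0],
    "value": [24.0, 30.0, 35.0, 20.0, 5.0],
})

lowest = toy.nsmallest(1, "value").iloc[0]  # the row with the lowest "value"
q75 = toy.quantile(0.75)                    # column-wise 75% quantiles

# One line per column: the row's value, the quantile, and whether it exceeds it
comparison = pd.DataFrame({"lowest": lowest, "q75": q75, "above_q75": lowest > q75})
print(comparison)
```

Applied to the Boston frame, the same pattern would flag which predictors of the lowest-value suburb lie above the upper quartile.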
rm7 = np.sum(boston["RM"] > 7)
rm8 = np.sum(boston["RM"] > 8)
print(rm7, rm8)
There are 64 suburbs which average more than seven rooms per dwelling and 13 suburbs which average more than eight rooms per dwelling.
eight_rooms = boston[boston["RM"] > 8]
print(tabulate(eight_rooms.describe(), boston.columns, tablefmt="orgtbl"))
These suburbs have higher median values for homes compared to the other suburbs, and correspondingly lower crime rates, lower proportions of low-status people, and lower proportions of non-retail business acres.