---
title: 'STAT540 - Seminar 10: Supervised learning, classification, cross validation, variable selection'
output: github_document
---
## Attributions
Contributors: Gabriela Cohen Freue, W. Evan Durno, Jasleen Grewal, and Keegan Korthauer
## Learning objectives
By the end of this tutorial, you should be able to:
- Filter gene expression data to remove uninformative features.
- Further subset variables using a fixed information criterion, and know when doing so may not be required or possible.
- Understand the concepts of a held-out test set, a training set, and a validation set.
- Select a suitable classifier (based on the size of the training dataset) and train a model with the input training data.
- Get predicted labels for a validation set from the trained classifier model.
- Understand the utility of cross validation in selecting an optimal model.
- Be able to identify at what part of the supervised learning process CV should be used.
- Explain how cross validation is different from bootstrapping.
- Assess the best selected model's performance on the held-out test set.
- Recognize why we can't go back and reselect models after assessment on the held-out test set.
- Be able to distinguish between the following scenarios for using supervised machine learning, and enumerate metrics for assessing performance in each case:
  - 2-class classification problem
  - Multi-class classification problem
- Be able to define accuracy, precision, sensitivity, specificity, F1-score, and Kappa score.
- Replicate the analysis in either MLInterfaces, GLMnet, CMA, or Caret (take-home exercise, optional).
# Introduction
In this Seminar we go over packages and code for performing supervised learning and evaluation in R. In supervised learning, one is given a training data set with a known response (such as a class label) and a set of covariates or variables. The goal is to use this set to build a model that predicts the response from the covariates in a new dataset. A supervised learning process can be decomposed into the following steps:
*Step 1*: Data preprocessing. Before getting started with feature selection and training our model, we must make sure we have adjusted the data to account for any batch effects (technical sources of variation like measuring samples in groups) or biases from outliers (individuals that look distinctly "different" from the rest of our sample).
*Step 2*: Select Features. Before training a model, in many applications, it is usually important to perform a pre-filtering step in which one retains only the most informative features (e.g. genes) as candidate "biomarkers". The number of features retained to select and train a model is up to the analyst and the methods used in the next steps. For example, running some methods may be infeasible or outrageously slow with a large number of features.
*Step 3*: Select and train a classifier. Once the set of candidate markers has been selected, the next step is to select and train a model to predict the labels of test data. We will also tune parameters for each classifier using cross-validation.
*Step 4*: Test. Finally, a model is chosen and used to predict the labels of the test data, and from there we evaluate the model's performance.
## R packages
There are many packages for performing supervised learning in R, each of which may implement one or more algorithms. There have also been at least two major efforts to unify these libraries under a common framework to make them easier to use: `MLInterfaces` and `CMA`. Although these may be useful and save you a lot of time in your analysis, it is important that you understand what these packages are doing and what they are *not* doing. Thus, I will not use these packages in this Seminar but I encourage you to reproduce the analysis using at least one of them! (I recommend `CMA`).
Install the following packages from Bioconductor: `CMA` and `GEOquery`, and from CRAN: `ROCR`, `car`, `e1071` (for SVM), and `glmnet` along with their dependencies.
```{r, eval=FALSE}
library(BiocManager)
install('GEOquery')
install('CMA')
install('ROCR')
install(c('e1071','glmnet','mlbench','gbm', 'car','dimRed'))
install('caret')
install('kernlab')
install('gbm')
```
```{r, message=F, warning=F}
library(MASS)
library(tidyverse)
theme_set(theme_bw())
library(car)
library(limma)
library(e1071)
library(glmnet)
library(ROCR)
library(CMA)
library(class)
library(GEOquery)
```
## Data Set
This seminar is based on a dataset that comes from a paper by [Smeets et al. 2010](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23177), who studied Affymetrix expression profiles from primary breast tumors. The Smeets group was interested in whether tumors which had spread to lymph nodes (LN positive, generally a bad sign) have different gene expression profiles than LN negative tumors. If so, a gene expression signature could be used to predict tumor class.
Their data set contains 24236 genes on 116 samples. The status of the lymph node is known for each sample, with 59 LN positive and 57 LN negative. Samples were divided into two parts: 96 samples (48 LN positive and 48 LN negative) were used as a "training" set and 20 samples (11 LN positive and 9 LN negative) were used as a "test" set. There is also a quantitative measure, "LnRatio", the fraction of affected lymph nodes, presumably reflecting "how bad" the LnStatus is. Thus, we can use this dataset to illustrate classification and regularization methods! This seminar will focus on the first task, i.e., classification. In the past, Paul selected this dataset to illustrate the challenges of supervised learning tasks!
In the paper, the authors trained a support vector machine classifier to distinguish between LN positive and LN negative tumors (i.e., classification), and evaluated the results using ROC curves. After some optimization, they got an area under the ROC curve (AUC) of 0.66 on the training set and 0.65 on the test data. This is better than chance, but still not very convincing results about the relevance of the derived molecular signature (random chance would give an AUC of 0.5; perfect classification would give an AUC of 1.0).
### Data Preparation
First, let's retrieve our datasets from GEO with `getGEO` from `GEOquery` package. Warning: this may take several minutes! So to avoid re-downloading in the future, save it as an RData object.
```{r fetchGEO, message = FALSE}
# Returns a list of expressionsets
datgeo <- getGEO('GSE23177', GSEMatrix = TRUE, AnnotGPL = TRUE)
dat <- datgeo[[1]] #Note that dat is an ExpressionSet
dat
```
Now we'll do some data wrangling to pull out the metadata variables we are interested in, and recode some of them.
```{r wrangle}
str(pData(dat), max.level = 0)
# extract only those variables of interest
pData(dat) <- pData(dat) %>%
rename(sample_id = geo_accession,
LnStatus = characteristics_ch1.2,
LnRatio = characteristics_ch1.3,
Set = characteristics_ch1) %>%
select(sample_id, LnStatus, LnRatio, Set) %>%
mutate(LnStatus = factor(gsub("ln: ", "", LnStatus))) %>%
mutate(LnRatio = as.numeric(gsub("lnratio: ", "", LnRatio))) %>%
mutate(Set = ifelse(Set == "patient type: training set", "training", "test"))
str(pData(dat))
#Note: LNRatio will not be used in this Seminar. However, you can use it to try some of the regularization techniques learned in class
```
Great! Next, let's split the `ExpressionSet` object into two different parts - one for the training and one for the test set.
```{r split}
# split the ExpressionSet into training and test sets.
table(pData(dat)$Set)
train.es <- dat[, dat$Set == "training"]
test.es <- dat[ , dat$Set == "test"]
```
Now, we can do some exploratory analysis of the data before trying some classification methods.
```{r eda}
# understand your data for classification
table(pData(train.es)$LnStatus)
table(pData(test.es)$LnStatus)
# understand the continuous response
tapply(pData(train.es)$LnRatio,pData(train.es)$LnStatus,summary)
tapply(pData(test.es)$LnRatio,pData(test.es)$LnStatus,summary)
# look at the expression of 3 randomly picked genes in both training and test sets
set.seed(1234)
rangenes <- sample(1:nrow(dat), size = 3)
# function to create tidy data table of expression and metadata
toLonger <- function(expset) {
stopifnot(class(expset) == "ExpressionSet")
expressionMatrix <- exprs(expset) %>%
as.data.frame() %>%
rownames_to_column("gene") %>%
pivot_longer(cols = !gene,
values_to = "expression",
names_to = "sample_id") %>%
left_join(pData(expset), by = "sample_id")
return(expressionMatrix)
}
toLonger(dat[rangenes,]) %>%
ggplot(aes(y = expression, x = LnStatus)) +
facet_wrap(Set ~ gene) +
geom_jitter(width = 0.2, alpha = 0.5)
```
# Classification
The prediction of a discrete response is usually referred to as *classification*. A response taking values over a finite set of labels is essentially the same thing as a factor. We will use the dataset from Smeets et al. to find the *best-trained* classifier and use it to predict the `LnStatus` of the 20 samples in the test set, i.e., classify them as "lymph node positive" or "negative".
## Data preprocessing
We should check to ensure there are no missing values in our data.
```{r missing}
sum(is.na(exprs(train.es)))
sum(is.na(exprs(test.es)))
```
Here we see there are no missing values in our dataset, so we don't have to worry about that.
Other pre-processing operations that can be done are:
- centering, scaling, normalizing
- imputing missing data
- transforming individual features (like boolean measurements)
When you are working with expression data, you may also need to apply normalization methods to ensure your variables are comparable across all your samples. For example, when working with count data, it is advisable to log-transform and normalize (e.g. quantile or library-size normalization) your data so that your samples have similar distributions.
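As a rough illustration only (the microarray data in this seminar is already processed), here is a minimal sketch of that kind of preprocessing applied to a hypothetical count matrix, using `limma::normalizeQuantiles` for the quantile step; the object names are made up for the example.
```{r preprocSketch, eval=FALSE}
# Hypothetical raw count matrix: 100 genes x 10 samples (simulated for illustration)
counts <- matrix(rpois(1000, lambda = 20), nrow = 100)
logdat <- log2(counts + 1)                      # log-transform
qdat   <- limma::normalizeQuantiles(logdat)     # quantile normalization
scaled <- t(scale(t(qdat)))                     # center and scale each gene (row)
```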
## Feature and Model Selection
We will now use cross-validation to find the best set of features to predict our response. Thus, I will divide the training set into 6 folds (the authors used 10 folds). We also want the proportion of positive and negative examples in each split to be approximately the same as for the full data set (i.e., stratified 6-fold CV with 8 positive and 8 negative samples within each fold). For each round of cross-validation, we use one fold as the test data and the rest of the data as training data to select features and train different classifiers.
### Cross-validation
Although it makes sense to proceed as described above, many methods are available and many constants within methods need to be selected in these steps. Thus, *cross-validation* is usually required to *evaluate* how well different trained models perform and to select the *best* model to proceed with. Note that although you may only want to select among the different choices available in Step 3 (classifier selection and training), the cross-validation needs to start at Step 2 (feature selection). Why? The results of the cross-validation will be over-optimistic and biased if the samples in the test sets of the cross-validation (i.e., the left-out folds) were used to *select* the most promising features in Step 2! For example, if the performance of a complex model is (artificially) good, you may not penalize the regression coefficients enough in Step 3, which may yield poor performance in Step 4 (testing).
This is why, as a general rule, we **never evaluate a model on the same data we used to train that model.**
In many studies, in the absence of a test set, cross-validation is used to estimate performance. In those cases, *nested cross-validation* is required! The inner cross-validation is used to select features and tune parameters, while the outer cross-validation is used to test each selected model.
In this seminar, we are working with a dataset that has both a training and a test set. Thus, we will not do a nested cross-validation. However, keep it in mind for your project or future work, especially if you have a small number of samples/subjects!
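To make the idea concrete, here is a minimal runnable sketch of nested CV on simulated data (every object name here is hypothetical and unrelated to this seminar's dataset): the inner loop tunes `k` for kNN using only the outer-training samples, and the outer loop estimates the error of that whole selection procedure.
```{r nestedCVsketch, eval=FALSE}
set.seed(1)
X <- matrix(rnorm(60 * 10), nrow = 60)               # 60 samples, 10 features
y <- factor(rep(c("pos", "neg"), each = 30))
outer_folds <- sample(rep(1:3, length.out = 60))     # 3 outer folds
ks <- c(1, 5, 10)
outer_err <- numeric(3)
for (i in 1:3) {
  tr_X <- X[outer_folds != i, ]; tr_y <- y[outer_folds != i]
  te_X <- X[outer_folds == i, ]; te_y <- y[outer_folds == i]
  # inner CV: choose k using the outer-training samples only
  inner_folds <- sample(rep(1:3, length.out = nrow(tr_X)))
  inner_err <- sapply(ks, function(k) {
    mean(sapply(1:3, function(j) {
      pred <- class::knn(tr_X[inner_folds != j, ], tr_X[inner_folds == j, ],
                         tr_y[inner_folds != j], k = k)
      mean(pred != tr_y[inner_folds == j])
    }))
  })
  best_k <- ks[which.min(inner_err)]
  # outer evaluation: refit with the chosen k and test once on the left-out fold
  outer_err[i] <- mean(class::knn(tr_X, te_X, tr_y, k = best_k) != te_y)
}
mean(outer_err)   # estimate of the error of the full selection procedure
```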
#### Making cross validation splits
This is not the only way to create splits of the training data for cross-validation. Note that if the samples could not be evenly divided into the `nfold` folds you specified, you would need to pad the matrices below with NAs and, within each fold, use only the entries that are not NA (see the sketch after the next chunk).
```{r cvsplit}
nfold <- 6
splitData <- function(train.es, nfold, seed = NA){
tabTrain <- table(train.es$LnStatus)
indlist <- sapply(names(tabTrain), function(z) which(train.es$LnStatus == z), simplify = FALSE)
if(!is.na(seed)){
set.seed(seed)
}
#Each row contains 8 pos and 8 negative samples for 6 folds
res <- list()
res$fold.pos <- matrix(sample(indlist[["pos"]]),nrow=nfold)
res$fold.neg <- matrix(sample(indlist[["neg"]]),nrow=nfold)
return(res)
}
result <- splitData(train.es, nfold, seed = 1234)
fold.pos <- result$fold.pos
fold.neg <- result$fold.neg
```
*Note*: with `CMA` you can use the command `GenerateLearningsets` to split the training data into folds. However, it does not show you how the data were split. Thus, you either use CMA for everything or you write your own script. Below is how we would generate the splits above using CMA:
```{r}
splits <- GenerateLearningsets(y = train.es$LnStatus, method="CV", fold=6, strat= TRUE)
```
### Loop for feature selection and modeling
To illustrate how to select a model, I will use the top 50 genes selected by `limma` (within each fold). Note that this number is quite arbitrary, and other options may make more sense, such as using a p-value threshold or testing different values with this CV. For this example, I'm using only the top 50 genes because methods like LDA and Logit cannot be run with more features than samples. However, other methods like KNN or SVM will do well with more features.
In this example, I will compare 7 different models: KNN for k={1,5,10,15}, LDA, Logit, SVM. Feel free to add other methods to the list!
```{r compare}
#Define here the constants that you will not evaluate. For example, I will use the top-50 limma genes
set.seed(495)
ngenes <- 50
nmethod <- 7 #number of methods you plan to compare.
#Define an output object here to store results
pr.err <- matrix(-1, nfold,nmethod, dimnames=list(paste0("Fold",1:nfold),c("1NN","5NN","10NN", "15NN","LDA","Logit","SVM")))
for(i in 1:nfold){
#Test Fold for the i-th step
testDat.fold<-exprs(train.es)[,c(fold.pos[i,],fold.neg[i,])]
#I will create a factor of classes for the test set of the i_th fold
testclass.fold<-train.es$LnStatus[c(fold.pos[i,],fold.neg[i,])]
#The rest of the samples are the training set for the i-th step
trainDat.fold<-exprs(train.es)[,-c(fold.pos[i,],fold.neg[i,])]
trainclass.fold<-train.es$LnStatus[-c(fold.pos[i,],fold.neg[i,])]
#Step 1: feature selection (do you remember limma?).
# Note that a different set of genes will be selected for each fold! you can then compare how consistent these sets were.
limma.dat<-as.data.frame(trainDat.fold)
desMat <- model.matrix(~ trainclass.fold, limma.dat) #design matrix
trainFit <- lmFit(limma.dat, desMat)
eBtrainFit <- eBayes(trainFit)
# top-50 limma genes
top.fold <- topTable(eBtrainFit, coef = which(colnames(coef(trainFit)) != "(Intercept)"),
n = ngenes,sort.by="P")
#Retain the top-50 limma genes from the train and test sets
trainDat.fold <- trainDat.fold[rownames(top.fold),]
testDat.fold <- testDat.fold[rownames(top.fold),]
#STEP 2: select a classifier
#Set a counter for the method tested
l <- 0
#kNN classifiers
for(kk in c(1,5,10,15)) {
#every time you get inside this loop, the l counter gets redefined (i.e., 1, 2, etc for method 1, method 2, etc)
l <- l+1
#knn needs samples in rows
yhat.knn <- knn(train=t(trainDat.fold), test=t(testDat.fold), cl=trainclass.fold,
k = kk)
#Store the prediction error for each kk within this fold
pr.err[i,l]<- mean(testclass.fold != yhat.knn)
} #end of kNN loop
#LDA method. Note that you can change the prior parameter to reflect a different proportion of case and control samples. The default is to use the class proportions from the training set.
m.lda <- lda(x=t(trainDat.fold), group=trainclass.fold, prior=c(.5, .5))
yhat.lda <- predict(m.lda, newdata=t(testDat.fold))$class
pr.err[i,"LDA"] <-mean(testclass.fold != yhat.lda)
#Logit
glm.dat <- data.frame(t(trainDat.fold), group=trainclass.fold)
# 50 factors still will cause optimization warnings
# Try without warning suppression to see
# To further reduce parameters, regularized regression can be used
# To use regularized regression uncomment lines followed by "uncomment for regularized regression"
suppressWarnings( m.log <- glm(group ~ ., data=glm.dat,family=binomial) )
# uncomment for regularized regression
# m.log <- glmnet( t(trainDat.fold) , trainclass.fold ,family="binomial")
pr.log <- predict(m.log,newdata=data.frame(t(testDat.fold)),type="response")
# uncomment for regularized regression
# pr.log <- predict(m.log,newdata=data.frame(t(testDat.fold)),type="response",newx=t(testDat.fold))
pr.cl <- rep(0,length(testclass.fold))
pr.cl[pr.log > 1/2] <- "pos"
pr.cl[pr.log <= 1/2] <- "neg"
pr.cl <- factor(pr.cl)
pr.err[i,"Logit"] <- mean( pr.cl != testclass.fold )
#SVM
m.svm <- svm(x=t(trainDat.fold), y=trainclass.fold, cost=1, type="C-classification",
kernel="linear")
pr.svm <- predict(m.svm,newdata=t(testDat.fold))
pr.err[i,"SVM"] <- mean( pr.svm != testclass.fold )
} #end of CV loop
```
### Error Rates
Now you can get the average prediction error for all methods. Note that the prediction errors are high! Not much hope for the real test run!
```{r err}
cv.err <- colMeans(pr.err)
# mean - 1 sd (sd of the 6 error rates)
ls <- cv.err - apply(pr.err, 2, sd)
# mean + 1 sd (sd of the 6 error rates)
us <- cv.err + apply(pr.err, 2, sd)
# plot the results
plot(1:nmethod, cv.err, ylim=c(0, 1), xlim=c(1, (nmethod+.5)),type='n',
axes=FALSE, xlab='Classifier', ylab='Error rate',main="6-fold CV Error")
for(j in 1:ncol(pr.err))
points(jitter(rep(j, 6), factor=2), jitter(pr.err[,j]), cex=0.8, pch='X', col='gray')
for(i in 1:nmethod)
lines(c(i, i), c(ls[i], us[i]), lwd=2, col='gray')
points(1:nmethod, ls, pch=19, col='red')
points(1:nmethod, us, pch=19, col='green')
points(1:nmethod, cv.err, pch=19, cex=1.5, col='black')
axis(2, ylab='Error rate')
axis(1, 1:nmethod, colnames(pr.err))
box()
```
### Results of the CV
According to these results, LDA and 10NN may be the best classifiers to try on the test data. However, remember that these CV results depend on the first split of the data we made. Thus, we should repeat this CV before selecting a model.
**Exercise 1** (not marked): perform 100 runs of this CV before selecting a model to test! We can do this by running different random splits using the `splitData` function we created without a seed, and selecting the best average error across each run of CV. Add at least one rule to select data for use in the underlying models, such as a P value threshold for genes selected in limma rather than just the top 50 genes by P value, or a different cost for the SVM model.
```{r ex1}
# your code here
```
**Exercise 2** (not marked): Use AUC as a criterion to select a model based on the training data! Tip: extract the predicted probabilities from each method and use the `calculateAUC` function defined below:
```{r ex2}
require(ROCR)
calculateAUC <- function(predictions,labels){
tmp_predictions <- ROCR::prediction(
as.numeric(predictions),
as.numeric(labels)
)
perf <- performance(tmp_predictions, measure = "auc")
return(unlist(slot(perf,"y.values")))
}
# Example: Calculate the AUC for our SVM predictions
calculateAUC(pr.svm,testclass.fold)
# your code here
```
If you have never heard of AUC, please work through the rest of this seminar, and the [Additional metrics for evaluation](#additional-metrics-for-evaluation) section in particular, before completing these two tasks.
## Testing the selected model
Now that we have decided which method we are going to use to classify samples in the test set, we need to train the model using the *FULL* training set and then classify samples of the test set. I will use the 10NN model.
```{r test}
yhat.knn <- knn(train=t(exprs(train.es)), test=t(exprs(test.es)), cl=train.es$LnStatus,
k = 10)
# store the prediction error on the test set
pr.errTest<- mean(test.es$LnStatus != yhat.knn)
pr.errTest
```
What does the prediction error mean?
In this instance, we have computed how often the prediction did *not* match the actual lymph node status, out of the total number of cases; this error rate is simply one minus the **accuracy** metric.
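For a quick sanity check, the same information can be viewed as a confusion matrix (rows are the true `LnStatus`, columns are the 10NN predictions), and the accuracy is simply one minus the test error computed above.
```{r testConfusion}
# Confusion matrix of true vs. predicted LN status on the test set
table(truth = test.es$LnStatus, predicted = yhat.knn)
# Accuracy on the test set
1 - pr.errTest
```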
Not good! In real practice, you should not keep trying until you get a good result! Instead, you must use cross-validation on the training dataset to evaluate different parameters and classifiers, and only evaluate the generalizability of the *best* model on the test set.
However, in this seminar, I encourage you to try different options **as an exercise** and to see how much the results can change.
## CMA
Many steps of the CV defined above can be easily done with CMA. For example, Step 1 in the loop above can also be done using `CMA` with the function `GeneSelection`, which selects the most informative features (e.g., genes) to build a classifier within each of the splits generated by `GenerateLearningsets`. Some learning algorithms do better if you only give them "useful" features.
```{r cma}
featureScores<-GeneSelection(X=t(exprs(train.es)), y=train.es$LnStatus, learningsets=splits, method="limma")
#Compare list of selected genes using:
toplist(featureScores)
#We can aggregate the results across the 6 splits.
seliter<-numeric()
for(i in 1:nfold) seliter<-c(seliter, toplist(featureScores, iter=i, top = 10, show=FALSE)$index)
(sort(table(seliter), dec=T)) # summarize
# Choose the 20 probes which are chosen most commonly in the 6 splits
bestprobes<-as.numeric(names(sort(table(seliter), dec=T)))[1:20]
# examine the feature data for the best probes
fData(dat)[bestprobes, c("ID", "Gene symbol", "Gene title", "Chromosome location")]
```
This looks promising since I get TTC3 and at least a couple of other genes that show up on Table 3 of the paper.
Similarly, you can use CMA to train and test a classifier within each CV fold (learningsets). However, there are things you cannot do within CMA, or that CMA does not do well. For example, CMA cannot do a full nested cross-validation. Additionally, it is not trivial to train the selected model on the full training set and then test it on the test set; CMA is designed more for CV. Thus, it is good to know how to do these things by hand as well.
Paul solved this problem in the following way: he made a `learningsets` object that has just one "split" defined by the samples in the training set.
```{r byhand}
m<-matrix(which(dat$Set == "training"), 1)
full.learningset<-new("learningsets", learnmatrix=m, method="my own", ntrain=96, iter=1)
fullFeatureScores<-GeneSelection(X=t(exprs(dat)), learningsets= full.learningset, y=dat$LnStatus, method="t.test")
testclassif<-classification(X=t(exprs(dat)), y=dat$LnStatus, learningsets= full.learningset, genesel=fullFeatureScores, nbgene = 100, classifier =pknnCMA, k=5)
#Evaluation:
tres<-testclassif[[1]]
ftable(tres)
roc(tres)
```
Note: his optimized classifier did terribly as well.
## Multiclass learning
You won't always have a binary learning problem, where you are classifying a sample into 1 of 2 classes. Sometimes we might want to train a classifier with more than two classes.
Here we will use the *caret* and *mlbench* packages for classification on a multi-class problem.
```{r, message = FALSE}
library(caret)
library(mlbench)
```
We will be using the Soybean dataset, where the goal is to predict which disease problem affects a given soybean crop sample. The dataset has 683 samples, with 35 features measured for each sample, and 19 categories (classes).
```{r loaddata}
cv_folds <- trainControl(method="cv", number=5)
data(Soybean)
unique(Soybean$Class)
summary(Soybean)
```
Let us pre-process our data.
```{r preprocess}
#Remove rows (samples) with missing values
soybean_x = Soybean[rowSums(is.na(Soybean)) == 0,]
#Then remove columns (features) with missing values
soybean_x = soybean_x[,colSums(is.na(soybean_x)) == 0]
dim(soybean_x)
```
We are left with 562 samples and 35 attributes. The first column, `Class`, describes the categories. We will re-level this factor, since by removing samples with missing values we may also have dropped some of the 19 classes.
```{r refactor}
soybean_x$Class = (as.factor(as.character(soybean_x$Class)))
unique(soybean_x$Class)
```
Now we have 15 classes!
In this instance, we don't have an external test set for evaluation, so we will assess performance on a held-out test set. First, we create the held-out test set.
*Note* that we are holding out this test set from our training data, and then performing data-splitting for validation within our training subset. In practice, as we did earlier, you would want to repeat this multiple times, holding out a test set and training on the remainder of the dataset (using cross validation or bootstrapping to estimate your model's accuracy on the test set); a sketch follows the next chunk.
```{r holdout}
trainIndex <- createDataPartition(soybean_x$Class, p = .8, list = FALSE, times = 1)
soyTrain <- soybean_x[ trainIndex,]
soyTest <- soybean_x[-trainIndex,]
```
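If you did want to repeat the hold-out split several times as suggested above, `createDataPartition` can generate multiple stratified splits at once via its `times` argument; a sketch (not run here):
```{r repeatedHoldoutSketch, eval=FALSE}
# Five stratified 80/20 splits; each column holds the training indices for one repeat
repIndex <- createDataPartition(soybean_x$Class, p = 0.8, list = FALSE, times = 5)
head(repIndex)
```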
**Cross validation results**
We set up our cross-validation folds. Note we can also choose the option 'cv' or 'LOOCV' instead of 'repeatedcv'.
With 'repeatedcv', we don't have to manually set up the multiple loops we did when we were using CMA.
```{r cvresults, message= FALSE,warning=FALSE}
#Prepare resampling method for Cross-Validation folds
set.seed(7)
cv_control = trainControl(method="repeatedcv", number=5, repeats=10, classProbs=FALSE) #summaryFunction=mnLogLoss)
modelSvm_cv <- train(Class~., data=soyTrain, method="svmRadial", metric="Accuracy", trControl=cv_control, trace=FALSE)
# display results
print(modelSvm_cv)
y_pred_cv = predict(modelSvm_cv, soyTest[,-1])
```
We can also use bootstrap re-sampling instead of cross-validation folds. This means we take random samples from the dataset (with replacement) and evaluate the model against each of them. This gives us an idea about the variance of the model itself.
Bootstrapping differs from cross validation in that the latter splits the entire dataset into folds *without replacement*. It is a robust method to estimate the accuracy of our model.
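The contrast is easy to see on a toy set of indices (a small sketch, unrelated to the soybean data): a bootstrap resample draws with replacement, so some indices repeat and others are left out, while CV folds partition the indices so that each appears exactly once.
```{r resamplingContrast}
set.seed(1)
sort(sample(1:10, replace = TRUE))                  # bootstrap resample of 1:10
split(sample(1:10), rep(1:5, length.out = 10))      # 5 CV folds: each index used once
```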
**Bootstrapped results**
This part will take some time.
```{r bootstrap, message=FALSE,warning=FALSE}
#Prepare resampling method for Bootstrapping
set.seed(7)
cv_boot = trainControl(method="boot", number=20, classProbs=FALSE)
modelSvm_boot <- train(Class~., data=soyTrain, method="svmRadial",
metric="Accuracy", trControl=cv_boot, trace=FALSE)
# display results
print(modelSvm_boot)
y_pred_boot = predict(modelSvm_boot, soyTest[,-1])
```
**Fitting other models**
This part will take a while.
You can consider other algorithmic approaches for classifiers too, as follows:
```{r othermodels, message=FALSE,warning=FALSE}
# trace = FALSE or verbose = FALSE silences the underlying fitting functions; which argument applies depends on the method passed to train()
# train the GBM model
set.seed(7)
modelGbm <- train(Class~., data=soyTrain, method="gbm", metric="Accuracy",
trControl=cv_boot, verbose = FALSE, preProcess = "zv")
# train the SVM model
set.seed(7)
modelSvm <- train(Class~., data=soyTrain, method="svmRadial",
metric="Accuracy", trControl=cv_boot, trace=FALSE)
# collect resamples
results <- resamples(list(GBM=modelGbm, SVM=modelSvm))
# summarize the distributions
summary(results)
# boxplots of results
bwplot(results)
```
> You can fine-tune the parameters for different models in caret using `tuneGrid`; see the sketch below.
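For example, for `method = "svmRadial"` the tunable parameters are `sigma` and `C`. A minimal sketch (the grid values are arbitrary examples, and this is not run here to save time):
```{r tuneGridSketch, eval=FALSE}
# Explicit tuning grid for the radial-kernel SVM; train() picks the best
# combination according to the resampling scheme in cv_boot defined above
svm_grid <- expand.grid(sigma = c(0.01, 0.05), C = c(0.5, 1, 2))
modelSvm_tuned <- train(Class ~ ., data = soyTrain, method = "svmRadial",
                        metric = "Accuracy", trControl = cv_boot,
                        tuneGrid = svm_grid)
modelSvm_tuned$bestTune
```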
### Additional metrics for evaluation
We can also consider other metrics to assess the performance of a model. This becomes particularly important when, for example, your test set is imbalanced. In that scenario, evaluating the accuracy of a model might not be the best indication that your classifier works well. In fact, it may be biased by the over-represented class in your test set.
#### Kappa-score
Overall accuracy can be a misleading statistic for unbalanced datasets. The Kappa statistic overcomes this by comparing the observed accuracy to the accuracy expected by chance alone.
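A convenient way to see both accuracy and Kappa at once is `caret::confusionMatrix`, for example on the held-out soybean test set with the cross-validated SVM predictions obtained above:
```{r kappaSketch}
# Accuracy and Kappa (plus per-class sensitivity/specificity) on the soybean test set
confusionMatrix(y_pred_cv, soyTest$Class)
```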
#### ROC, Precision & Recall
Receiver-Operator Characteristic (ROC) Curves can be used to characterize the performance of our model in a binary classification setting.
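For instance, an ROC curve can be drawn with `ROCR` from the class probabilities of a binary classifier. Here is a sketch (not run) using the logistic-regression probabilities from the last CV fold of the breast-cancer example above (`pr.log` and `testclass.fold`), purely for illustration:
```{r rocSketch, eval=FALSE}
roc_pred <- ROCR::prediction(as.numeric(pr.log), as.numeric(testclass.fold))
roc_perf <- ROCR::performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf)
abline(0, 1, lty = 2)   # the diagonal corresponds to random guessing
```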
For binary classification, we might also be interested in precision and recall, which are computed from the counts of true positives, false positives, and false negatives.
```{r definemetrics}
binary_preds = as.integer(y_pred_cv)-1
binary_true = as.integer(soyTest$Class)-1
precision <- sum(binary_preds & binary_true) / sum(binary_preds)
recall <- sum(binary_preds & binary_true) / sum(binary_true)
```
Precision tells us how often we are correct when we predict class 'y' (true positives over all predicted positives).
Recall tells us what fraction of the true 'y' instances we predict correctly (true positives over all actual positives).
```{r showmetrics}
#Precision
print(precision)
#Recall
print(recall)
```
In case of multi-class classification, another metric that comes in handy is the F1-score. This is the harmonic mean of precision and recall. A high F1-score will tell you that you are quite precise and sensitive in your prediction of *all* classes.
```{r f1}
Fmeasure <- 2 * precision * recall / (precision + recall)
Fmeasure
```
> In the models we fit for our multiclass problem, what was the Kappa statistic on our test set? What was the Accuracy?
> Which model selection approach worked better, the bootstrapped SVM or the repeated 5-fold CV approach? How do we assess this (on the test set, or from the cross-validation results)?
## Discussion points
How good do you think a classifier will have to be to be clinically relevant? What level of specificity or sensitivity do you think is "good enough"? The authors' data set has half lymph-node-positive and half lymph-node-negative samples. The incidence of lymph-node-positive breast cancer is about 33% at the time of diagnosis (according to http://seer.cancer.gov/statfacts/html/breast.html). How does that affect your opinion about the classifier?
# References
1: Varma S, Simon R: Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006, 7:91.
2: Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.