---
title: "Building KNN Function"
author: "Nouman Riaz"
date: "October 25, 2016"
output: pdf_document
---
# The objective is to implement KNN on the 'Default' dataset from the ISLR package.
#### First, we load the dataset from the ISLR package.
```{r}
library(ISLR)  # provides the Default dataset
data("Default")
summary(Default)
```
#### Now we recode the labels: 'Yes' becomes '1' and 'No' becomes '0'.
```{r}
# Recode the response in one step: "Yes" -> "1", "No" -> "0"
Default$default <- factor(ifelse(Default$default == "Yes", "1", "0"), levels = c("0", "1"))
```
#### We split the dataset into 70% training and 30% testing data, calling 'set.seed' first so that the split is reproducible.
```{r}
smp_size <- floor(0.70 * nrow(Default))
set.seed(123)
train_ind <- sample(seq_len(nrow(Default)), size = smp_size)
train <- Default[train_ind, ]
test <- Default[-train_ind, ]
```
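As a quick sanity check on the split pattern above, the sketch below applies the same `sample()` approach to a small synthetic data frame and confirms that the two parts do not overlap and together cover every row. The names `toy`, `toy_ind`, `toy_train` and `toy_test` are hypothetical stand-ins, used so the chunk runs without touching `Default`.

```{r}
# Hypothetical stand-in data; not part of the Default dataset
set.seed(123)
toy <- data.frame(id = 1:100, x = rnorm(100))

toy_size <- floor(0.70 * nrow(toy))                     # 70% of the rows
toy_ind <- sample(seq_len(nrow(toy)), size = toy_size)  # sampled without replacement
toy_train <- toy[toy_ind, ]
toy_test <- toy[-toy_ind, ]

# The two parts should be disjoint and should cover all rows
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
cat(nrow(toy_train), nrow(toy_test))
```

The same three checks can be run on `train` and `test` by comparing their row names instead of an `id` column.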
#### We now add one column per k value to the test set to hold the predicted labels.
```{r}
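# Start each column as an all-NA factor with levels "0" and "1"
test$nlabel_k7 <- factor(rep(NA, nrow(test)), levels = c("0", "1"))  # results for k=7
test$nlabel_k11 <- factor(rep(NA, nrow(test)), levels = c("0", "1")) # results for k=11
test$nlabel_k15 <- factor(rep(NA, nrow(test)), levels = c("0", "1")) # results for k=15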
```
#### We will now predict labels using the KNN technique. That is, for each test data point we first find its k nearest training points and then assign the label that wins the majority vote among them.
##### a) For k=7
```{r}
for (i in seq_len(nrow(test))) {
  # Euclidean distance from test point i to every training point
  dist <- sqrt((test$income[i] - train$income)^2 + (test$balance[i] - train$balance)^2)
  ind <- order(dist)[1:7]        # indices of the 7 nearest training points
  labels <- train$default[ind]   # labels of those neighbours
  # Majority vote: the most frequent label among the neighbours
  test$nlabel_k7[i] <- names(which.max(table(labels)))
}
# Test error: share of test points whose predicted label differs from the true one
error_k7 <- mean(test$default != test$nlabel_k7)
cat(error_k7)
```
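The loop above can also be wrapped in a reusable function. The following is a minimal sketch under stated assumptions, not the author's implementation: `knn_predict` and its argument names are hypothetical, the data are a small synthetic example rather than `Default`, and a tied vote goes to the first factor level.

```{r}
# Hypothetical helper: classify each row of test_x by majority vote among
# its k nearest rows of train_x (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  train_y <- as.factor(train_y)
  pred <- character(nrow(test_x))
  for (i in seq_len(nrow(test_x))) {
    # Subtract test row i from every training row, feature by feature
    diffs <- sweep(train_x, 2, test_x[i, ])
    dist <- sqrt(rowSums(diffs^2))
    nearest <- order(dist)[seq_len(k)]   # indices of the k closest rows
    votes <- table(train_y[nearest])     # count of each label
    pred[i] <- names(which.max(votes))   # most frequent label
  }
  factor(pred, levels = levels(train_y))
}

# Toy data (an assumption for illustration): two well-separated clusters
toy_x <- data.frame(a = c(1, 2, 1, 2, 9, 10, 9, 10),
                    b = c(1, 1, 2, 2, 9, 9, 10, 10))
toy_y <- factor(c("0", "0", "0", "0", "1", "1", "1", "1"))
new_x <- data.frame(a = c(1.5, 9.5), b = c(1.5, 9.5))

toy_pred <- knn_predict(toy_x, toy_y, new_x, k = 3)
print(toy_pred)
```

On the data in this document, a call such as `knn_predict(train[, c("income", "balance")], train$default, test[, c("income", "balance")], k = 7)` would play the role of the loop.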
##### b) For k=11
```{r}
for (i in seq_len(nrow(test))) {
  # Euclidean distance from test point i to every training point
  dist <- sqrt((test$income[i] - train$income)^2 + (test$balance[i] - train$balance)^2)
  ind <- order(dist)[1:11]       # indices of the 11 nearest training points
  labels <- train$default[ind]   # labels of those neighbours
  # Majority vote: the most frequent label among the neighbours
  test$nlabel_k11[i] <- names(which.max(table(labels)))
}
# Test error: share of test points whose predicted label differs from the true one
error_k11 <- mean(test$default != test$nlabel_k11)
cat(error_k11)
```
##### c) For k=15
```{r}
for (i in seq_len(nrow(test))) {
  # Euclidean distance from test point i to every training point
  dist <- sqrt((test$income[i] - train$income)^2 + (test$balance[i] - train$balance)^2)
  ind <- order(dist)[1:15]       # indices of the 15 nearest training points
  labels <- train$default[ind]   # labels of those neighbours
  # Majority vote: the most frequent label among the neighbours
  test$nlabel_k15[i] <- names(which.max(table(labels)))
}
# Test error: share of test points whose predicted label differs from the true one
error_k15 <- mean(test$default != test$nlabel_k15)
cat(error_k15)
```
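One point worth keeping in mind about the distance used in the loops: `income` is measured in tens of thousands while `balance` is in the hundreds or low thousands, so the unscaled Euclidean distance is driven mostly by `income`. A common remedy is to standardise both features with the training mean and standard deviation before computing distances. The sketch below shows the effect on made-up numbers; the `sc_*` names and the values are assumptions for illustration, not results from `Default`.

```{r}
# Made-up training and test features on very different scales
sc_train <- data.frame(income = c(20000, 40000, 60000),
                       balance = c(500, 1500, 1000))
sc_test <- data.frame(income = 31000, balance = 550)

# Standardise with the TRAINING mean and sd, then apply the same
# transformation to the test row
sc_mu <- colMeans(sc_train)
sc_sigma <- apply(sc_train, 2, sd)
sc_train_z <- scale(sc_train, center = sc_mu, scale = sc_sigma)
sc_test_z <- scale(sc_test, center = sc_mu, scale = sc_sigma)

# Distances from the test row to each training row, before and after scaling
raw_dist <- sqrt((sc_test$income - sc_train$income)^2 +
                 (sc_test$balance - sc_train$balance)^2)
z_dist <- sqrt((sc_test_z[, "income"] - sc_train_z[, "income"])^2 +
               (sc_test_z[, "balance"] - sc_train_z[, "balance"])^2)

nearest_raw <- unname(which.min(raw_dist))  # row 2: closest on income alone
nearest_z <- unname(which.min(z_dist))      # row 1: closest once balance counts too
cat(nearest_raw, nearest_z)
```

Whether scaling lowers the test error on `Default` is an empirical question; the sketch only shows that the choice of neighbours can change.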