Basic Data Manipulation
========================================================
Reading and Saving data
-----------------------
Let's say we want to:
* read this [http://data.princeton.edu/wws509/datasets/effort.dat](http://data.princeton.edu/wws509/datasets/effort.dat) file
* make some summaries
* save it to a CSV file on our computer
```{r}
fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
head(fpe)
names(fpe)
nrow(fpe)
ncol(fpe)
summary(fpe)
write.table(fpe, file="./effort.dat", sep=";")
write.csv(fpe, file="./effort.csv") # we can also save as a csv
```
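The same save-and-reload cycle can be tried offline. The sketch below uses a made-up data frame (`toy`, with invented values, not the effort data) and a temporary file. It assumes the default behaviour of `write.csv()`, which stores row names as the first column, so `row.names = 1` is passed to `read.csv()` to restore them.

```{r}
# hypothetical toy data standing in for a downloaded table
toy <- data.frame(score = c(46.5, 74.2, 89.1),
                  effort = c(0.5, 1.5, 16.5),
                  row.names = c("region_a", "region_b", "region_c"))
csv_path <- tempfile(fileext = ".csv")          # temporary file, removed below
write.csv(toy, file = csv_path)                 # row names become the first column
toy_back <- read.csv(csv_path, row.names = 1)   # use that column as row names again
all.equal(toy, toy_back)                        # compare original and reloaded data
unlink(csv_path)
```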
Other functions are
+ dput(x, file): opens file and deparses the object x into that file
+ dget(file): parses the file and returns an object (eg: df <- dget("file.txt"))
+ dump(vector of objects, file): takes a vector of names of R objects and produces text representations of the objects on a file
+ source(file): recovers the objects saved by dump
+ save: writes an external representation of R objects to the specified file
+ load: reload datasets written with the function save
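+ serialize(x, connection): writes a binary representation of object x to a connection (or returns it as a raw vector when connection is NULL)
+ unserialize(connection): recovers an object from its serialized form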
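A minimal sketch of three of these pairs, round-tripping one small object through temporary files. The object `obj` and all file paths are made up for illustration.

```{r}
obj <- list(id = 1:3, label = "demo")    # hypothetical object to store

# dput()/dget(): text representation of a single object
txt_path <- tempfile(fileext = ".txt")
dput(obj, file = txt_path)
obj_text <- dget(txt_path)

# save()/load(): binary file of named objects; load() restores them by name
rda_path <- tempfile(fileext = ".RData")
save(obj, file = rda_path)
restored <- new.env()                    # load into a separate environment
load(rda_path, envir = restored)

# serialize()/unserialize(): with connection = NULL the result is a raw vector
raw_bytes <- serialize(obj, connection = NULL)
obj_raw <- unserialize(raw_bytes)

identical(obj, obj_text)
identical(obj, restored$obj)
identical(obj, obj_raw)
unlink(c(txt_path, rda_path))            # clean up the temporary files
```

Loading into a separate environment avoids silently overwriting an object of the same name in the workspace.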
Vectors
-------
Vectors contain only data from one class.
```{r}
1:6
rep(1,6)
seq(0,1,.1)
v <- -5:5
v[1] # use operator [] to access vector elements, indexes start at 1
v <- c(.1,.2,.3,.4,.8,.9,1.0,1.5) # function c() is used to join vectors
v
v[2:4] # subvector
v[-1] # vector except the first element
v[-3:-1] # vector except the first three elements
length(v) # size of the vector
v[length(v)] # last element
v[-length(v)] # all except last element
sum(v) # sum all vector elements
vector() # empty vector
vector("numeric",10)
c(1.7,"a") # implicit coercion for vectors
c(T,2)
c("a",T)
as.numeric(1:6) # explicit coercion
as.logical(0:6)
as.character(1:6)
as.complex(1:6)
as.numeric(c("a","b","c")) # but strings that are not numbers become NA (with a warning)
v1 <- 1:3
names(v1) <- c("data1","data2","data3") # add names to elems
v1
v1["data1"] # accessing elements using names
letters # pre-defined vector
# more complex operations
vector <- seq(1,100,3)
vector
u <- vector %% 2 == 0 # TRUE only for even numbers
u
v <- vector[u] # keep only the even numbers
v
# str gives the structure of a data structure
str(v)
# typeof gives the type
typeof(v)
# vectors are homogeneous structures, but R coerces to the most flexible type
c("a",1)
c(TRUE,2)
# subsetting
v <- 1:5
v[c(1,2)] <- 10:11
v
v[-1] <- 20:23 # The length of the LHS needs to match the RHS
v
v[c(T,F)] <- 0 # a logical index shorter than the vector is recycled
v
# subsetting can be used for lookup tables:
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)
lookup[x]
unname(lookup[x])
# Matching and merging by hand
grades <- c(1, 2, 2, 3, 1)
info <- data.frame(
  grade = 3:1,
  desc = c("Excellent", "Good", "Poor"),
  fail = c(F, F, T)
)
id <- match(grades, info$grade) # returns a vector of the positions of (1st) matches of its 1st argument in its 2nd
id
info
info[id,]
# NA is a logical vector!
typeof(NA)
NA & TRUE
NA & FALSE
# There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ (all are reserved words)
```
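The coercion rule used above can be checked directly with `typeof()`. This is a small sketch; the vector `flags` is made up for illustration.

```{r}
# implicit coercion picks the most flexible type present:
# logical < integer < double < character
typeof(c(TRUE, 1L))      # logical + integer
typeof(c(1L, 2.5))       # integer + double
typeof(c(2.5, "a"))      # double + character

# logical values are coerced to 0/1 in arithmetic, which allows counting
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)               # number of TRUE values
mean(flags)              # proportion of TRUE values
```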
Matrices
--------
Matrices are vectors with a dimension attribute.
```{r}
m <- 1:16 # just a vector for now
m
class(m) # function to determine the object's class
dim(m) <- c(4,4) # make a matrix out of it (rows, columns)
m
class(m)
dim(m) <- c(2,8) # reshape into a different matrix
m
dim(m) <- c(4,2,2) # make a 3D array
m
m <- matrix(1:16, nrow=2, ncol=8, byrow=T)
m
m <- matrix(1:16, nrow=2, ncol=8,
            dimnames = list(c("row.1","row.2"),letters[1:8]) )
m
m <- matrix(1:6,3,2)
dim(m)
m
dim(m) <- c(2,3)
m
m[1,] # first row
m[,2] # second column
m[1,2] # element in first row, 2nd col
m[,c(1,3)] # the first and third column
m1 <- 1:3
m2 <- 10:12
cbind(m1,m2) # matrix formation with binding cols or rows
rbind(m1,m2)
m1 <- matrix(1:9, nrow=3, ncol=3)
m2 <- matrix(seq(18,2,-2), nrow=3, ncol=3)
m1
m2
m1+m2
m1*m2 # element-wise product
m1%*%m2 # true matrix multiplication
t(matrix(1:6, nrow=2, ncol=3)) # transpose
sum(1:5 * 5:1) # inner vector product
outer(1:5,5:1) # outer vector product
diag(x=1,nrow=5,ncol=3)
m3 <- diag(1:4) # makes a diagonal matrix using the vector to initialize diagonal
m3
m3[upper.tri(m3, diag=T)] <- NA
m3
```
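Since a matrix is a vector plus a `dim` attribute, each element can also be reached through the underlying vector. The sketch below assumes the default column-by-column fill; the names `mat`, `i` and `j` are only illustrative.

```{r}
mat <- matrix(1:6, nrow = 2)   # filled column by column
attributes(mat)                # the "dim" attribute is what makes it a matrix
i <- 2; j <- 3
mat[i, j]                      # matrix-style indexing
mat[(j - 1) * nrow(mat) + i]   # same element through the underlying vector
as.vector(mat)                 # dropping dim gives back the plain vector
```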
Lists
-----
Lists can contain values of different types (including lists)
```{r}
l1 <- list(atr1=1:4, atr2=0.6)
l1
l1[1]
l1["atr1"] # same thing
l1[[1]] # operator [[]] extracts a single element
class(l1[[1]])
class(l1[1])
l1$atr1 # operator $ extracts part of the object
l1[["atr1"]] # same thing, except $ does partial matching
l2 <- list(atr1=1:4, atr2=0.6, atr3="hello")
l2
l2[c(1,3)]
l3 <- list(a=list(10,12,14), b = c(3.14,2.81))
l3
l3[[c(1,3)]]
l3[[1]][[3]]
l3a <- list(a = list(b = list(c = list(d = 1))))
l3a
l3a[[c("a", "b", "c", "d")]] # Same as l3a[["a"]][["b"]][["c"]][["d"]]
l4 <- list(aardvark=1.5, ox=3.4)
l4$a # partial matching is possible with $ (proceed with caution!)
l5 <- list(list(list(list())))
str(l5)
is.recursive(l5) # returns TRUE if arg has a recursive (list-like) structure
# c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to list before combining them.
l6 <- list(list(1, 2), c(3, 4))
l7 <- c(list(1, 2), c(3, 4))
str(l6)
str(l7)
# coerce with as.list(...)
# check with is.list(...)
# convert to vector with unlist()
```
> Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described below), and linear models objects (as produced by lm()) are lists [ref](http://adv-r.had.co.nz/Data-structures.html)
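
The claim in the quote can be checked with the `cars` dataset that ships with R. The model below is only a throwaway illustration.

```{r}
cars_fit <- lm(dist ~ speed, data = cars)   # `cars` comes from the datasets package
is.list(cars)                 # a data frame is a list of columns
is.list(cars_fit)             # a fitted model is a list of components
names(cars_fit)               # the components can be listed like any other names
cars_fit[["coefficients"]]    # and extracted with [[ ]] or $
```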
Data Frames
-----------
Data frames are used to store tabular data. They are lists of same-length vectors, aligned vertically as columns, which makes them useful for holding datasets.
```{r}
df <- data.frame(col1=1:4,col2=c(TRUE,TRUE,FALSE,TRUE))
df
df$col1 # show a column, ie, an attribute
df[,2]
df[1,] # show a row, ie, an observation
df$newAtr <- letters[1:4] # add a new attribute
df
names(df) # the name of the columns
row.names(df) # the name of the rows
row.names(df) <- c("first","second","3rd","4th")
df
nrow(df) # number of rows
ncol(df) # number of cols
df[5,] <- list(5, FALSE, "e") # add a new observation (one value per column)
df[df$col2 == T,] # select observations where col2 is true
mean(df$col1) # find statistics over a certain column
df <- data.frame(x = 1:3) # it is possible for a data frame to have a column that is a list:
df$y <- list(1:2, 1:3, 1:4)
df
```
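Because a data frame is a list of columns, the usual list tools work on it. This is a short sketch with a made-up data frame `people`.

```{r}
people <- data.frame(age = c(31, 45, 27),
                     smoker = c(TRUE, FALSE, FALSE))   # made-up data
typeof(people)          # the underlying type is "list"
length(people)          # length counts columns, not rows
sapply(people, class)   # list tools iterate over the columns
people[["age"]]         # [[ ]] returns the column vector itself
people["age"]           # [ ] returns a one-column data frame
```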
Attributes
----------
All objects can have arbitrary additional attributes. These can be thought of as a named list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes().
```{r}
v1 <- 1:5
attr(v1, "text") <- "this is a vector"
v1
str(v1)
# The structure() function returns a new object with modified attributes
structure(1:10, my_attribute = "This is a vector")
# There are 3 special attributes:
# names(), character vector of element names
# class(), used to implement the S3 object system
# dim(), used to turn vectors into high-dimensional structures
```
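A small sketch of reading attributes back and of when they survive. The `unit` attribute is a hypothetical name chosen for illustration.

```{r}
temps <- structure(c(18.5, 21.0, 19.2), unit = "Celsius")   # hypothetical attribute
attr(temps, "unit")       # read one attribute
attributes(temps)         # all attributes, as a named list
attributes(temps * 2)     # arithmetic keeps the attribute
attributes(temps[1:2])    # subsetting drops it
```

Custom attributes are easy to lose, so they are best kept for metadata that can be recomputed or reattached.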
Names
-----
You can name a vector in three ways:
+ During creation: x <- c(a = 1, b = 2, c = 3)
+ By modifying an existing vector: x <- 1:3; names(x) <- c("a", "b", "c")
+ By creating a modified vector: x <- setNames(1:3, c("a", "b", "c"))
Names should be unique
```{r}
v1 <- c(a=1,2,3)
v1
names(v1) <- c("a","b","c")
v1
names(v1) <- NULL # erase names
v1
```
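The three naming routes listed above give the same result, which can be checked with `identical()`. The last lines sketch why unique names matter; all variable names here are illustrative.

```{r}
by_creation <- c(a = 1L, b = 2L, c = 3L)
by_modifying <- 1:3
names(by_modifying) <- c("a", "b", "c")
by_setnames <- setNames(1:3, c("a", "b", "c"))
identical(by_creation, by_modifying)
identical(by_creation, by_setnames)

# duplicated names are allowed, but subsetting by name only finds the first match
dup <- c(a = 1, a = 2)
dup["a"]
```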
Factors
-------
A factor is a vector that can contain only predefined values.
Factors have two key attributes: their class(), "factor", which controls their behaviour; and their levels(), the set of allowed values.
Factors represent categorical data, which can be ordered or unordered.
A factor can be seen as an integer vector where each integer has a label.
Factors are commonly used for the categorical columns of tabular data.
Check [www.stat.berkeley.edu/classes/s133/factors.html](http://www.stat.berkeley.edu/classes/s133/factors.html) for more information.
```{r}
f1 <- factor(c("yes","no","yes","yes"))
f1
# make a contingency table, ie, displays the frequency distribution
# of the variables
table(f1)
f1a <- factor(c("yes","no","yes","yes"), levels=c("yes","no")) # redefine the order of the levels
f1a
levels(f1)
# Examples of use
set.seed(143) # deterministic random generation
lets <- factor(sample(letters,size=15,replace=T))
lets
levels(lets)
table(lets[1:10])
# A strange example: when used as an index, each value of the factor is
# replaced by i, the position of its level in levels(lets). Since "p" is a
# vector with a single element, the result is "p" only where 'lets' holds
# its first level (i = 1) and NA everywhere else
"p"[lets]
levels(lets)[lets] # left as an eg :-)
```
> While factors look (and often behave) like character vectors, they are actually integers under the hood and you need to be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer IDs. For this reason, it's usually best to explicitly convert factors to strings when modifying their levels.
> Unfortunately, most data loading functions in R automatically convert character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels and their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. [ref](http://adv-r.had.co.nz/Data-structures.html)
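
A short sketch of the "integers under the hood" pitfall and of the explicit conversion recommended in the quotes. The vectors `sizes` and `survey` are made up for illustration.

```{r}
sizes <- factor(c("10", "20", "10", "5"))   # made-up numeric-looking labels
levels(sizes)                      # levels are sorted as strings
as.integer(sizes)                  # integer codes, not the original numbers
as.numeric(as.character(sizes))    # convert via character to recover the values

# keep strings as strings when building a data frame, then convert deliberately
survey <- data.frame(answer = c("yes", "no", "yes"), stringsAsFactors = FALSE)
class(survey$answer)
survey$answer <- factor(survey$answer, levels = c("yes", "no"))
levels(survey$answer)
```

Setting `levels` explicitly fixes both the allowed values and their order, rather than relying on alphabetical sorting.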