basisRData.Rmd

Basic Data Manipulation
========================================================

Reading and Saving data
-----------------------

Let's say we want to:
* read this [http://data.princeton.edu/wws509/datasets/effort.dat](http://data.princeton.edu/wws509/datasets/effort.dat) file
* make some summaries
* save it to a cvs file in our computer

```{r}
fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")

head(fpe)
names(fpe)
nrow(fpe)
ncol(fpe)
summary(fpe)
write.table(fpe, file="./effort.dat", sep=";")
write.csv(fpe, file="./effort.csv") # we can also save as a csv
```

Other functions are
+ dput(x, file): opens file and deparses the object x into that file
+ dget(file): parses the file and returns an object (eg: df <- dget("file.txt"))
+ dump(vector of objects, file): takes a vector of names of R objects and produces text representations of the objects on a file
+ source(file): recovers the objects saved by dump
+ save: writes an external representation of R objects to the specified file
+ load: reload datasets written with the function save
+ serialize(x, file): turns object x into a binary file
+ unserialize(file): recovers binary object kept on file

Vectors
-------

Vectors contain only data from one class.

```{r}
1:6         
rep(1,6)    
seq(0,1,.1) 
v <- -5:5
v[1]  # use operator [] to access vector elements, indexes start at 1
v <- c(.1,.2,.3,.4,.8,.9,1.0,1.5)  # function c() is used to join vectors
v
v[2:4]  # subvector
v[-1] # vector except the first element
v[-3:-1] # vector except the first three elements
length(v) # size of the vector
v[length(v)] # last element
v[-length(v)] # all except last element
sum(v) # sum all vector elements
vector() # empty vector
vector("numeric",10) 
c(1.7,"a")  # implicit coercion for vectors
c(T,2)      
c("a",T)    
as.numeric(1:6)  # explicit coercion
as.logical(0:6)   
as.character(1:6)
as.complex(1:6)   
as.numeric(c("a","b","c"))  #but...
v1 <- 1:3
names(v1) <- c("data1","data2","data3") # add names to elems
v1
v1["data1"]  # accessing elements using names
letters # pre-defined vector
# more complex operations
vector <- seq(1,100,3)
vector
u <- vector %% 2 == 0 # only T for pairs
u
v <- vector[u]  # subset only with pairs
v
# str gives the structure of a data structure
str(v)
# typeof gives the type
typeof(v)
# vectors are homogenuous structures, but R coerces to the most flexibe type
c("a",1)
c(TRUE,2)
# subsetting
v <- 1:5
v[c(1,2)] <- 10:11
v
v[-1] <- 20:23 # The length of the LHS needs to match the RHS
v
v[c(T,F)] <- 0 # the subsetting cycles if it reaches the end
v
# subsetting can be used for lookup tables:
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)
lookup[x]
unname(lookup[x])
# Matching and merging by hand 
grades <- c(1, 2, 2, 3, 1)

info <- data.frame(
  grade = 3:1,
  desc = c("Excellent", "Good", "Poor"),
  fail = c(F, F, T)
)

id <- match(grades, info$grade) # returns a vector of the positions of (1st) matches of its 1st argument in its 2nd
id
info
info[id,]
# NA is a logical vector!
typeof(NA)
NA & TRUE
NA & FALSE
# There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ (all are reserved words)
```

Matrixes
--------

Matrixes are vectors with dimensions

```{r}
m <- 1:16  # just a vector for now
m
class(m) # function to determine the object's type
dim(m) <- c(4,4) # make a matrix out of it (rows, columns)
m
class(m) 
dim(m) <- c(2,8) # make a diff matrix 
m
dim(m) <- c(4,2,2) # make a 3D matrix
m
m <- matrix(1:16, nrow=2, ncol=8, byrow=T)
m
m <- matrix(1:16, nrow=2, ncol=8,     
            dimnames = list(c("row.1","row.2"),letters[1:8]) )
m
m <- matrix(1:6,3,2)
dim(m)
m
dim(m) <- c(2,3)
m
m[1,]  # first row
m[,2]  # second column
m[1,2] # element in first row, 2nd col
m[,c(1,3)] # the first and third column
m1 <- 1:3
m2 <- 10:12
cbind(m1,m2)  # matrix formation with binding cols or rows
rbind(m1,m2)
m1 <-  matrix(1:9, nrow=3, ncol=3)
m2 <-  matrix(seq(18,2,-2), nrow=3, ncol=3)
m1
m2
m1+m2
m1*m2    # product item by item
m1%*%m2  # real matrix multiplication
t(matrix(1:6, nrow=2, ncol=3)) # transpose

sum(1:5 * 5:1) # inner vector product
outer(1:5,5:1) # outer vector product

diag(x=1,nrow=5,ncol=3)
m3 <- diag(1:4) # makes a diagonal matrix using the vector to initialize diagonal
m3
m3[upper.tri(m3, diag=T)] <- NA  
m3
```

Lists
-----

Lists can contain values of different types (including lists)

```{r}
l1 <- list(atr1=1:4, atr2=0.6)
l1
l1[1]
l1["atr1"] # same thing
l1[[1]] # operator [[]] extracts a single element
class(l1[[1]])
class(l1[1])
l1$atr1 # operator $ extracts part of the object
l1[["atr1"]]  # same thing, except $ does partial matching
l2 <- list(atr1=1:4, atr2=0.6, atr3="hello")
l2
l2[c(1,3)]
l3 <- list(a=list(10,12,14), b = c(3.14,2.81))
l3
l3[[c(1,3)]]  
l3[[1]][[3]]  
l3a <- list(a = list(b = list(c = list(d = 1))))
l3a
l3a[[c("a", "b", "c", "d")]]  # Same as l3a[["a"]][["b"]][["c"]][["d"]]
l4 <- list(aarvark=1.5, ox=3.4)  
l4$a  # partial matching is possible with $ (proceed with caution!)
l5 <- list(list(list(list())))
str(l5)
is.recursive(l5)  # returns TRUE if arg has a recursive (list-like) structure
# c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to list before combining them.
l6 <- list(list(1, 2), c(3, 4))
l7 <- c(list(1, 2), c(3, 4))
str(l6)
str(l7)
# coerce with as.list(...)
# check with is.list(...)
# convert to vector with unlist()
```

> Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described below), and linear models objects (as produced by lm()) are lists [ref](http://adv-r.had.co.nz/Data-structures.html)

Data Frames
-----------

Data frames are used to store tabular data, they are lists of same-length vectors vertically aligned. Useful to keep datasets

```{r}
df <- data.frame(col1=1:4,col2=c(TRUE,TRUE,FALSE,TRUE))
df
df$col1 # show a column, ie, an attribute
df[,2]
df[1,] # show a row, ie, an observation
df$newAtr <- letters[1:4] # add a new attribute 
df
names(df)     # the name of the columns
row.names(df) # the name of the rows
row.names(df) <- c("first","second","3rd","4th")
df
nrow(df)  # number of rows
ncol(df)  # number of cols
df[5,] = list(5,FALSE)  # add a new observation

df[df$col2 == T,]  # select observations where col2 is true
mean(df$col1) # find statistics over a certain column
df <- data.frame(x = 1:3)  # it is possible for a data frame to have a column that is a list:
df$y <- list(1:2, 1:3, 1:4)
df
```

Attributes
----------

All objects can have arbitrary additional attributes. These can be thought of as a named list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes().

```{r}
v1 <- 1:5
attr(v1, "text") <- "this is a vector"
v1
str(v1)
# The structure() function returns a new object with modified attributes
structure(1:10, my_attribute = "This is a vector")
# There are 3 special attributes:
# names(), character vector of element names
# class(), used to implement the S3 object system, described in the next section
# dim(), used to turn vectors into high-dimensional structures
```

Names
-----

You can name a vector in three ways:

+ During creation: x <- c(a = 1, b = 2, c = 3)
+ By modifying an existing vector: x <- 1:3; names(x) <- c("a", "b", "c")
+ By creating a modified vector: x <- setNames(1:3, c("a", "b", "c"))

Names should be unique

```{r}
v1 <- c(a=1,2,3)
v1
names(v1) <- c("a","b","c")
v1
names(v1) <- NULL # erase names
v1
```

Factors
-------

A factor is a vector that can contain only predefined values.

Factors have two key attributes: their class(), "factor", which controls their behaviour; and their levels(), the set of allowed values.

Factors represent categorical data, can be ordered or not
can be seen an integer vector where each int has a label
used to store tabular data.

Check [www.stat.berkeley.edu/classes/s133/factors.html](http://www.stat.berkeley.edu/classes/s133/factors.html) for more information

```{r}
f1 <- factor(c("yes","no","yes","yes"))
f1
# make a contingency table, ie, displays the frequency distribution 
# of the variables
table(f1) 
f1a <- factor(c("yes","no","yes","yes"), levels=c("yes","no")) # redefine the order of the levels
f1a
levels(f1)
# Egs of use
set.seed(143)  # deterministic random generation
lets = factor(sample(letters,size=15,replace=T))
lets
levels(lets)
table(lets[1:10])
# A strange eg: each value of the factor is translated into i, where 
# i is its i-th level. Since "p" is a vector of one position, only
# where 'lets' as values "a" (which are elements of the 1st level)
# does the result is not NA
"p"[lets] 
levels(lets)[lets] # left as an eg :-)
```

> While factors look (and often behave) like character vectors, they are actually integers under the hood and you need to be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer IDs. For this reason, it's usually best to explicitly convert factors to strings when modifying their levels. 

> Unfortunately, most data loading functions in R automatically convert character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels and their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. [ref](http://adv-r.had.co.nz/Data-structures.html)