forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathOO-essentials.Rmd
106 lines (68 loc) · 7.96 KB
/
OO-essentials.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
# OO field guide {#oo}
This chapter is a field guide for recognising and working with R's objects in the wild. R has three object oriented systems (plus the base types), so it can be a bit intimidating. The goal of this guide is not to make you an expert in all four systems, but to help you identify which system you're working with and to help you use it effectively. \index{object-oriented programming}
Central to any object-oriented system are the concepts of class and method. A __class__ defines the behaviour of __objects__ by describing their attributes and their relationship to other classes. The class is also used when selecting __methods__, functions that behave differently depending on the class of their input. Classes are usually organised in a hierarchy: if a method does not exist for a child, then the parent's method is used instead; the child __inherits__ behaviour from the parent.
R's three OO systems differ in how classes and methods are defined:
* __S3__ implements a style of OO programming called generic-function OO.
This is different from most programming languages, like Java, C++, and C#,
which implement message-passing OO. With message-passing, messages (methods)
are sent to objects and the object determines which function to call.
Typically, this object has a special appearance in the method call, usually
appearing before the name of the method/message: e.g.,
`canvas.drawRect("blue")`. S3 is different. While computations are still
carried out via methods, a special type of function called a
__generic function__ decides which method to call, e.g.,
`drawRect(canvas, "blue")`. S3 is a very casual system. It has no
formal definition of classes.
* __S4__ works similarly to S3, but is more formal. There are two major
differences to S3. S4 has formal class definitions, which describe the
representation and inheritance for each class, and has special helper
functions for defining generics and methods. S4 also has multiple dispatch,
which means that generic functions can pick methods based on the class of
any number of arguments, not just one.
* __Reference classes__, called RC for short, are quite different from S3
and S4. RC implements message-passing OO, so methods belong to classes,
not functions. `$` is used to separate objects and methods, so method calls
look like `canvas$drawRect("blue")`. RC objects are also mutable: they don't
use R's usual copy-on-modify semantics, but are modified in place. This
makes them harder to reason about, but allows them to solve problems that
are difficult to solve with S3 or S4.
There's also one other system that's not quite OO, but it's important to mention here:
* __base types__, the internal C-level types that underlie the other OO
systems. Base types are mostly manipulated using C code, but they're
important to know about because they provide the building blocks for the
other OO systems.
The following chapters describe each system in turn, starting with base types. You'll learn how to recognise the OO system that an object belongs to, how method dispatch works, and how to create new objects, classes, generics, and methods for that system. The chapter concludes with a few remarks on when to use each system.
## Why OO?
The primary use of OO programming in R is for print, summary and plot methods. These methods allow us to have one generic function, e.g. `print()`, that displays the object differently depending on its type: printing a linear model is very different to printing a data frame.
## Why generic functions?
There are too big differences between S3 and S4, and R6 and most other modern programming languages: mutability and namespacing. These are trade-offs, and like all tradeoffs they have pros and cons. In my opinion, however, the trade-offs are particularly well suited for data analysis.
In encapsulated OO, you should consider carefully when you need a new class; but you should create methods freely. In generic function OO, you should carefully consider when you need a new generic; but you should create classes freely.
### Namespacing
```{r, eval = FALSE}
method(arg1, arg2, arg3)
class$method(arg1, arg2)
```
In encapsulated OO languages, these two methods may have nothing in common apart from the name.
```{r, eval = FALSE}
strawberries$dust("sugar")
table$dust(duster)
```
(inspired by <https://www.grammarly.com/blog/10-verbs-contronyms/>)
Generic functions don't have this property: they are global. That means use must define them carefully, and you should avoid using broadly applicable verb names (instead add a prefix, or assume people will use via a namespace.)
The reason that this works well is in data analyses you often want to do the same thing to different types of objects. For example, every model function in R understands `summary()` and `predict()`.
This is also supports the use of pipes. In contrast to method chaining (where only the class author can add a new method), anyone can write a function that works in a chain, and it will do the right thing. This is a small but pervasive tension that in python tends to lead to large monolithic packages.
This is a different school of though to most popular programming languages, but is a good fit to the problem of data analysis. Knowing this fact probably won't help you much in your day-to-day programming, but it will avoid some fundamental confusing if you're coming from another OO programmming language. \index{functions!generics|see{generics}} \index{S3!generics} \index{generics!S3}
(In fact this message is so powerful that I've talked to programmers who moved to R from javascript and it took them a while to figure out that they're not calling the `frame` method of the `data` object.)
### Mutability
## Picking a system {#picking-a-system}
Three OO systems is a lot for one language, but for most R programming, S3 suffices. In R you usually create fairly simple objects and methods for pre-existing generic functions like `print()`, `summary()`, and `plot()`. S3 is well suited to this task, and the majority of OO code that I have written in R is S3. S3 is a little quirky, but it gets the job done with a minimum of code. \index{objects!which system?}
```{r, eval = FALSE, echo = FALSE}
packageVersion("Matrix")
library(Matrix)
gs <- getGenerics("package:Matrix")
sum(gs@package == "Matrix")
length(getClasses("package:Matrix", FALSE))
```
If you are creating more complicated systems of interrelated objects, S4 may be more appropriate. A good example is the `Matrix` package by Douglas Bates and Martin Maechler. It is designed to efficiently store and compute with many different types of sparse matrices. As of version 1.1.3, it defines 102 classes and 20 generic functions. The package is well written and well commented, and the accompanying vignette (`vignette("Intro2Matrix", package = "Matrix")`) gives a good overview of the structure of the package. S4 is also used extensively by Bioconductor packages, which need to model complicated interrelationships between biological objects. Bioconductor provides many [good resources](https://www.google.com/search?q=bioconductor+s4) for learning S4. If you've mastered S3, S4 is relatively easy to pick up; the ideas are all the same, it is just more formal, more strict, and more verbose.
If you've programmed in a mainstream OO language, RC will seem very natural. But because they can introduce side effects through mutable state, they are harder to understand. For example, when you call `f(a, b)` in R you can usually assume that `a` and `b` will not be modified. But if `a` and `b` are RC objects, they might be modified in the place. Generally, when using RC objects you want to minimise side effects as much as possible, and use them only where mutable states are absolutely required. The majority of functions should still be "functional", and free of side effects. This makes code easier to reason about and easier for other R programmers to understand.
It is possible to have mutable generic function OO, and immutable encapsulated OO, but they don't feel as natural.