# Collective Geoms
```{r 04-load-libraries, include=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
```
## General Housekeeping Items
- This is a learning opportunity so feel free to ask any question at any time.
- Take time to learn the theory, in particular Grammar of Graphics.
- Please do the chapter exercises. Second-best learning opportunity!
- Please plan to facilitate one of the discussions. Best learning opportunity!
------------------------------------------------------------------------
## Learning Objectives
- Understand the difference between individual geoms and collective geoms
- Explore some plots that use individual and collective geoms together
- Reinforce understanding of the Grammar of Graphics (particularly the use of layers) to create plots
------------------------------------------------------------------------
## Quick Intuition on Collective Geoms
- Last chapter was on individual geoms. This chapter is on collective geoms.
- Oversimplification (but maybe useful)
    - individual numbers vs the sum of the numbers
        - sum converts a series of numbers ("individual"): `4, 7, 9, 3, 3`
        - to a single number ("collective"): `26`
    - home prices
        - under *individual geoms* each home price has a point on a plot/table
        - under *collective geoms* we may use `median` as a single number that summarizes all individuals
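The oversimplification above can be sketched in a few lines of base R. The "home prices" simply reuse the numbers from the list and are made up for illustration, not real data.

```r
# Five "individual" values (hypothetical home prices, in arbitrary units)
prices <- c(4, 7, 9, 3, 3)

# Individual view: one value per home
n_homes <- length(prices)

# Collective view: a single number that summarises all of them
total  <- sum(prices)     # 26, as in the list above
middle <- median(prices)  # 4

c(n_homes = n_homes, total = total, median = middle)
```

An individual geom would draw `n_homes` marks; a collective geom would draw one mark for `total` or `middle`.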
[This blog post](https://drsimonj.svbtle.com/plotting-individual-observations-and-group-means-with-ggplot2) by Simon Jackson illustrates these foundations using `mtcars`. The points are individual geoms and the bars are a collective geom showing the average of the individual observations.
```{r 04-drsimonj-blog-post, message=FALSE, warning=FALSE}
id <- mtcars %>%
  tibble::rownames_to_column() %>%
  as_tibble() %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

# Group means: the data behind the collective geom
gd <- id %>%
  group_by(am) %>%
  summarise(hp = mean(hp))

ggplot(id, aes(x = am, y = hp, color = am, fill = am)) +
  geom_bar(data = gd, stat = "identity", alpha = 0.3) +
  ggrepel::geom_text_repel(aes(label = rowname), color = "black", size = 2.5, segment.color = "grey") +
  geom_point() +
  guides(color = "none", fill = "none") +
  theme_bw() +
  labs(
    title = "Car horsepower by transmission type",
    x = "Transmission",
    y = "Horsepower"
  )
```
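As an aside, the group means do not have to be computed by hand. A minimal sketch, assuming ggplot2 3.3.0 or later (where `stat_summary()` takes a `fun` argument), lets the collective layer summarise the individual observations itself. This is an alternative to the blog post's approach, not what the post does, and `cars_df` is just an illustrative name.

```r
library(ggplot2)

# Recode am the same way as in the chunk above
cars_df <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

p <- ggplot(cars_df, aes(x = am, y = hp)) +
  # collective geom: one bar per group, height = mean(hp) within the group
  stat_summary(fun = mean, geom = "bar", alpha = 0.3) +
  # individual geoms: one point per car
  geom_point()

# The bar heights computed by the stat should match the hand-computed means
bar_heights <- layer_data(p, 1)$y
group_means <- as.numeric(tapply(cars_df$hp, cars_df$am, mean))
all.equal(bar_heights, group_means)
```

Computing `gd` explicitly, as the blog post does, keeps the summary available for other uses such as labels or tables; `stat_summary()` is shorter when the summary is only needed for the plot.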
The same blog post also includes a longitudinal example, which is useful here because the book's example is longitudinal too. It uses the `financing_healthcare` data from the `ourworldindata` package, which tracks healthcare spending per country over time.
```{r 04-plot-ourworldindata, message=FALSE, warning=FALSE}
# Install the data package from GitHub if needed:
# devtools::install_github("drsimonj/ourworldindata")
library(ourworldindata)

id <- financing_healthcare %>%
  filter(continent %in% c("Oceania", "Europe") & between(year, 2001, 2005)) %>%
  select(continent, country, year, health_exp_total) %>%
  na.omit()
```
- raw data
```{r 04-id-raw-data, message=FALSE, warning=FALSE}
id
```
- each row is one country-year observation. For plotting, though, the "individual geom" is a country: one line connecting all of that country's yearly observations.
```{r 04-ourworldindata-plot, message=FALSE, warning=FALSE}
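# Continent-level yearly means: the data behind the collective geom
gd <- id %>%
  group_by(continent, year) %>%
  summarise(health_exp_total = mean(health_exp_total))

ggplot(id, aes(x = year, y = health_exp_total, color = continent)) +
  # individual geoms: one faint line per country
  geom_line(aes(group = country), alpha = 0.3) +
  # collective geom: one thick line per continent
  # (linewidth replaces size for lines in ggplot2 >= 3.4.0)
  geom_line(data = gd, alpha = 0.8, linewidth = 3) +
  theme_bw() +
  labs(
    title = "Changes in healthcare spending\nacross countries and world regions",
    x = NULL,
    y = "Total healthcare investment ($)",
    color = NULL
  )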
```
## From the ggplot2 book
- the `Oxboys` dataset (from the `nlme` package) records the age and corresponding height of 26 boys from Oxford.
- it is also a longitudinal study.
- note that the age is standardized.
```{r 04-Oxboys-data, message=FALSE, warning=FALSE}
data(Oxboys, package = "nlme")
head(Oxboys, 9)
```
### Multiple Groups, One Aesthetic
As the book says:
> In many situations, you want to separate your data into groups, but render them in the same way. In other words, you want to be able to distinguish individual subjects but not identify them.
- sometimes you want the individual geom to be a **group** of observations *for the same individual*.
- you do this by mapping the `group` aesthetic inside `aes()`.
- If you're trying to figure out which variable to use as the grouping variable, fill in the blank "I have multiple observations for each \_\_\_\_\_". Or for longitudinal studies, "I want to plot one line over time for each \_\_\_\_\_".
<details>
<summary> What's the grouping variable for Oxboys?</summary>
In the case of Oxboys, we want to plot a line over time for each boy, so `Subject` is the grouping variable in the aesthetic.
</details>
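One quick way to check the fill-in-the-blank sentence is to count rows per candidate variable. The sketch below uses only base R and the `nlme` package that ships the data; the object name is illustrative.

```r
data(Oxboys, package = "nlme")

# How many rows does each boy contribute?
obs_per_boy <- table(Oxboys$Subject)

length(obs_per_boy)  # number of boys
range(obs_per_boy)   # fewest and most measurements per boy
```

Because every boy appears several times, `Subject` is the variable that belongs in `group`.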
```{r 04-plot-Oxboys, message=FALSE, warning=FALSE}
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_point() +
  geom_line()
```
- omitting (or incorrectly specifying) the grouping variable leads to a "characteristic sawtooth appearance", because `geom_line()` then connects all of the points as a single series.
```{r 04-sawtooth}
ggplot(Oxboys, aes(age, height)) +
  geom_point() +
  geom_line()
```
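To see what ggplot2 actually did in the two plots above, one option is to inspect the computed layer data with `layer_data()`. This is only a diagnostic sketch, and the object names are made up for illustration.

```r
library(ggplot2)
data(Oxboys, package = "nlme")

p_grouped <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line()
p_ungrouped <- ggplot(Oxboys, aes(age, height)) +
  geom_line()

# Number of distinct groups (that is, separate lines) in the line layer
n_grouped   <- length(unique(layer_data(p_grouped, 1)$group))
n_ungrouped <- length(unique(layer_data(p_ungrouped, 1)$group))

c(grouped = n_grouped, ungrouped = n_ungrouped)
```

With a single group, `geom_line()` joins every point in order of `age`, zig-zagging between boys measured at nearly the same age, which is what produces the sawtooth.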
### Different Groups on Different Layers
From the book:
> Sometimes we want to plot summaries that use different levels of aggregation: one layer might display individuals, while another displays an overall summary.
- now that we have plotted individual geoms, let's add a collective geom which is the trendline for all boys together.
```{r 04-group-at-ggplot-layer}
ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
#> `geom_smooth()` using formula 'y ~ x'
```
- something doesn't look right
- expecting a collective geom (one summary line for all subjects), but we got individual geoms again -- a trendline for each individual instead of a trendline for all individuals.
- "grouping controls both the display of the geoms, and the operation of the stats: one statistical transformation is run for each group".
- we got one smooth per boy because the grouping was set in the `ggplot()` call, so it is inherited by every layer of the plot.
- to get what we intend, move the grouping out of the `ggplot()` call and into the only layer that needs it, `geom_line()`. The `geom_smooth()` layer then falls back to the default grouping (no special grouping, i.e. the whole dataset treated as one group).
```{r 04-group-at-line-layer}
ggplot(Oxboys, aes(age, height)) +
  geom_line(aes(group = Subject)) +
  geom_point() +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_smooth(method = "lm", linewidth = 2, se = FALSE)
#> `geom_smooth()` using formula 'y ~ x'
```
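A different route to the same result, sketched here as an alternative that these notes do not use, keeps `group = Subject` in `ggplot()` and overrides it only for the smoother by mapping `group` to a constant.

```r
library(ggplot2)
data(Oxboys, package = "nlme")

p <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line() +
  geom_point() +
  # group = 1 puts every row in the same group, for this layer only
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x, se = FALSE)

# The lines (layer 1) are still per boy; the smoother (layer 3) is one fit
n_line_groups   <- length(unique(layer_data(p, 1)$group))
n_smooth_groups <- length(unique(layer_data(p, 3)$group))

c(lines = n_line_groups, smooths = n_smooth_groups)
```

Which version reads better is a matter of taste: set the grouping where most layers need it, and override it in the layers that do not.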
### Overriding the Default Grouping
In the last exercise, we finally got the grouping right.
This hints at the approach of overriding the default grouping.
By adding the grouping to `geom_line`, we overrode the default grouping, which was "no special grouping".
Here's another example to help illustrate this point a little better. It comes from [this blog post](https://www.gl-li.com/2017/08/13/ggplot2-group-overrides-default-grouping/).
Subtitles are added to these plots to describe what's going on.
```{r 04-overriding}
ggplot(mpg, aes(drv, hwy)) +
  geom_jitter() +
  stat_boxplot(fill = NA) +
  labs(subtitle = "stat_boxplot automatically uses the groups set by the categorical variable drv.\nNotice that there is only one boxplot for each value of drv.")

ggplot(mpg, aes(drv, hwy, color = factor(year))) +
  geom_jitter() +
  stat_boxplot(fill = NA) +
  labs(subtitle = "by now adding color based on year, it creates a new group for the boxplots as well,\nand there are now two for each categorical. This may not be what you want.")

ggplot(mpg, aes(drv, hwy, color = factor(year))) +
  geom_jitter() +
  stat_boxplot(fill = NA, aes(group = drv)) +
  labs(subtitle = "we override the default or earlier grouping by adding\na group -- inside the aes -- on the layer where we want it")
```
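The subtitles can be checked numerically. The sketch below rebuilds the three boxplot layers without the jitter and counts how many boxes each one computes; `count_boxes()` is a made-up helper for this illustration, not a ggplot2 function.

```r
library(ggplot2)

# Count the boxes computed by the first layer of a plot
count_boxes <- function(p) nrow(layer_data(p, 1))

p_default <- ggplot(mpg, aes(drv, hwy)) +
  stat_boxplot(fill = NA)

p_colour <- ggplot(mpg, aes(drv, hwy, color = factor(year))) +
  stat_boxplot(fill = NA)

p_override <- ggplot(mpg, aes(drv, hwy, color = factor(year))) +
  stat_boxplot(fill = NA, aes(group = drv))

n_boxes <- c(
  default  = count_boxes(p_default),   # one box per drv
  colour   = count_boxes(p_colour),    # one box per drv-by-year combination
  override = count_boxes(p_override)   # back to one box per drv
)
n_boxes
```

Depending on the ggplot2 version, the third plot may warn that the colour aesthetic was dropped: once the boxes are grouped by `drv` alone, `year` is no longer constant within a box.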
### A couple of exercises
```{r 04-exercises-1}
mpg %>% head(2)

# Exercise: Draw a boxplot of hwy for each value of cyl, without turning cyl
# into a factor. What extra aesthetic do you need to set?

# Wrong: cyl is an integer, which ggplot2 treats as continuous, so there is
# no default grouping and we get a single boxplot.
ggplot(mpg, aes(cyl, hwy)) +
  geom_boxplot()

# Right: set the group aesthetic explicitly (no factor conversion needed).
ggplot(mpg, aes(cyl, hwy, group = cyl)) +
  geom_boxplot()
```
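On the question raised in the comment: yes, ggplot2 gives integer columns a continuous scale, and the default grouping is built only from the discrete variables in the plot. A small sketch, with illustrative object names, confirms this by counting boxes.

```r
library(ggplot2)

class(mpg$cyl)  # "integer", so cyl gets a continuous scale

# suppressWarnings() hides any hint ggplot2 may print about a missing group
n_without_group <- suppressWarnings(
  nrow(layer_data(ggplot(mpg, aes(cyl, hwy)) + geom_boxplot(), 1))
)

n_with_group <- nrow(
  layer_data(ggplot(mpg, aes(cyl, hwy, group = cyl)) + geom_boxplot(), 1)
)

c(without_group = n_without_group, with_group = n_with_group)
```

Without `group`, all rows fall into one box; with `group = cyl`, there is one box per distinct value of `cyl`.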
```{r 04-exercises-2}
# Exercise: Modify the following plot so that you get one boxplot per integer
# value of displ.
ggplot(mpg, aes(displ, cty)) +
  geom_boxplot()

# One approach: round displ up to the next integer and group on that.
# There are probably better ways, especially to line the boxes up with the x-axis.
ggplot(mpg, aes(x = ceiling(displ), y = cty, group = ceiling(displ))) +
  geom_boxplot()
```
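One of the "better ways" hinted at in the comment may be `cut_width()`, a ggplot2 helper that bins a continuous variable into intervals of a given width; the `geom_boxplot()` documentation uses it in a similar example. This is a sketch of that option, not the presenter's solution.

```r
library(ggplot2)

# Keep displ itself on the x-axis and group by 1-unit-wide bins of displ
p <- ggplot(mpg, aes(displ, cty, group = cut_width(displ, 1))) +
  geom_boxplot()

# One box per non-empty bin
n_boxes <- nrow(layer_data(p, 1))
n_bins  <- nlevels(droplevels(cut_width(mpg$displ, 1)))

c(boxes = n_boxes, bins = n_bins)
```

Because `x` is still the original `displ`, each box is drawn over the range of values it summarises. By default the bins appear to be centred on whole numbers (for example 1.5 to 2.5), so this groups by rounding instead of by `ceiling()`; `cut_width()` also accepts `boundary` and `center` arguments to shift the bins.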
### Matching Aesthetics to Graphic Objects
(Not covered in the presentation.)
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/1xOs5oZqzz4")`
<details>
<summary> Meeting chat log </summary>
```
00:21:57 Michael Haugen: only thing I can think of is if 1 equals the first column of the data frame.
01:02:43 priyanka gagneja: thanks Ryan and everyone else
01:06:10 Jiwan Heo: https://github.com/r4ds/bookclub-ps4ds
```
</details>