-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy path05-Statistical_summaries.Rmd
329 lines (253 loc) · 13.2 KB
/
05-Statistical_summaries.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
# Statistical Summaries
```{r 05-01, include = FALSE}
library(ggplot2)
```
- ***Learning Objectives:***
- Use ggplot2 to plot possible uncertainty in your data
- Determine which geometric object (geom) best presents your type of data
## Defintions (in this Chapter)
> **discrete value**: a finite number, something that is countable with beginning and end (input user definition welcomed)
> **continuous value**: infinite number, something that never ends. Infinity is continous. (input user definition welcomed)
> **grobs**: graphical object
> **overplotting**: too much data on scatterplot making underlying relationships obscure
## Revealing Uncertainty
Four primary types of geometric objects (geom) are used:
1. Discrete x, range: `geom_errorbar()`, `geom_linerange()`
2. Discrete x, range & center: `geom_crossbar()`, `geom_pointrange()`
3. Continuous x, range: `geom_ribbon()`
4. Continuous x, range & center: `geom_smooth(stat = "identity")`
```{r 05 first group of plots, include=TRUE}
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_crossbar()
base + geom_pointrange()
base + geom_smooth(stat = "identity")
df
```
```{r 05 lots of plots, include=TRUE}
base + geom_errorbar()
base + geom_linerange()
base + geom_ribbon()
```
## Weighted Data
If each row of your dataframe contains multiple observations, we can use a weight to visually give scale to observations
```{r 05 miss-basic, include=TRUE}
# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point()
# Weight by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
```
```{r 05 weight-lm, include=TRUE}
# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point() +
geom_smooth(method = lm, size = 1)
# Weighted by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
scale_size_area(guide = "none")
```
```{r 05 weight-hist, include=TRUE}
ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(binwidth = 1) +
ylab("Counties")
ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(aes(weight = poptotal), binwidth = 1) +
ylab("Population (1000s)")
```
>Question for the group: Is the above `ylab` correct? Check out the next two figures, can you see the difference?
```{r 05 weight-hist-option-1, include=TRUE}
ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(aes(weight = poptotal/1e3), binwidth = 1) +
ylab("Population (1000s)")
```
```{r 05 weight-hist-option-2, include=TRUE}
ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(aes(weight = poptotal/1e6), binwidth = 1) +
ylab("Population (millions)")
```
## Displaying distributions
Using built-in `diamonds` dataset
```{r 05-diamond-dimensions, fig.cap="Diamond Dimensions", echo=FALSE}
knitr::include_graphics("images/diamond-dimensions.png", dpi = 300)
```
For 1-Dimensional continuous data (1d), the histogram is arguably the most important geom
```{r 05 geom-1d-con, include=TRUE}
ggplot(diamonds, aes(depth)) +
geom_histogram()
ggplot(diamonds, aes(depth)) +
geom_histogram(binwidth = 0.1) +
xlim(55, 70)
```
Never rely on the defaults. Always adjust your `bin` or `xlim` to "zoom" in our out of your data. There is no hard or fast rule, only experimentation to discover coorelation in your plot.
For your audience/reader, ensure you add a caption for your scale, for example `binwidth`.
Three ways to compare distribution:
- Show small multiples of the histogram, `facet_wrap(~ var)`.
- Use colour and a frequency polygon, `geom_freqpoly()`.
- Use a “conditional density plot”, `geom_histogram(position = "fill")`.
```{r 05 compare-dist, include=TRUE}
ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
```
You can also plot density using `geom_density()`. Use a density plot when you know that the underlying density is smooth, continuous and unbounded.
```{r 05 geom-density, include=TRUE}
ggplot(diamonds, aes(depth)) +
geom_density(na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.2, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
```
It is often the case and advisable to sacrifice quality for quantity. The following three types of graph provide examples of this thought.
`geom_boxplot()`:
```{r 05 geom-boxplot, include=TRUE}
ggplot(diamonds, aes(clarity, depth)) +
geom_boxplot()
ggplot(diamonds, aes(carat, depth)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
```
`geom_violin()`:
```{r 05 geom_violin(), include=TRUE}
ggplot(diamonds, aes(clarity, depth)) +
geom_violin()
ggplot(diamonds, aes(carat, depth)) +
geom_violin(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
```
`geom_dotplot()`:
### Exercise:
1. What binwidth tells you the most interesting story about the distribution
of `carat`?
>The number of bins or the binwidth should be exploration exercise. There is not direct hard or fast rule for scaling the binwidth. What is important is to find the appropriate size that best captures the representation (or distribution) of your analysis. This correlates to your story as you are explaining the importance. Find a binwidth that best captures your ideas.
2. Draw a histogram of `price`. What interesting patterns do you see?
```{r 05 Exercise #2, include=TRUE}
ggplot(diamonds, aes(price)) +
geom_histogram(binwidth = 5)
```
>The smaller the quantity (assuming quality), the higher the price. I presume that carat size would also have a strong correlation with quantity and price.
3. How does the distribution of `price` vary with `clarity`?
```{r 05 Exercise #3, include=TRUE}
ggplot(diamonds, aes(clarity, price)) +
geom_violin()
ggplot(diamonds, aes(clarity, price)) +
geom_boxplot()
```
>I presume using different geoms, the higher the clarity, the higher the price, the fewer the quantity.
4. Overlay a frequency polygon and density plot of `depth`. What computed
variable do you need to map to `y` to make the two plots comparable?
(You can either modify `geom_freqpoly()` or `geom_density()`.)
>Not completed.
## Dealing with overplotting
- Scatterplot is a very important tool for assessing relationship
- Too large a dataset may obscure any true relationship
- This is called *Over plotting*
To compensate for Over plotting, tweaking the aesthetic can help. Techniques like hollow glyphs can help.
```{r 05 overp-glyph, dev = "png", include=TRUE}
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()
norm + geom_point(shape = 1) # Hollow circles
norm + geom_point(shape = 96) # Pixel sized
```
Alternative ways using large data sets, you can use alpha blending (transparency). If you specify `alpha` as a ratio, the denominator gives the number of points that must be over plotted to give a solid color.
```{r 05 overp-alpha, dev = "png", include=TRUE}
norm + geom_point(alpha = 1 / 3)
norm + geom_point(alpha = 1 / 5)
norm + geom_point(alpha = 1 / 10)
```
`geom_jitter()` can be used if your data has *some* discreteness. By default, 40% is used. You can overide the default with `width` and `height` arguments.
Alternatively, we can think of overplotting as a 2d density estimation problem, which gives rise to two more approaches:
- Bin the points and count the number in each bin, then visualise that count (the 2d generalisation of the histogram), `geom_bin2d()`.
- The code below compares square and hexagonal bins, using parameters `bins` and `binwidth` to control the number and size of the bins.
```{r 05 overp-bin, include=TRUE}
norm + geom_bin2d()
norm + geom_bin2d(bins = 10)
```
```{r overp-bin-hex, include=TRUE}
library(hexbin)
norm + geom_hex()
norm + geom_hex(bins = 10)
```
Another approach to dealing with overplotting is to add data summaries to help guide the eye to the true shape of the pattern within the data.
## Statistical Summaries
`geom_histogram()` and `geom_bin2d()` use a familiar geom, `geom_bar()` and `geom_raster()`, combined with a new statistical transformation, `stat_bin()` and `stat_bin2d()`. `stat_bin()` and `stat_bin2d()` combine the data into bins and count the number of observations in each bin. But what if we want a summary other than count? So far, we've just used the default statistical transformation associated with each geom. Now we're going to explore how to use `stat_summary_bin()` to `stat_summary_2d()` to compute different summaries.
```{r 05 geom_bar(), include=TRUE}
ggplot(diamonds, aes(color)) +
geom_bar()
ggplot(diamonds, aes(color, price)) +
geom_bar(stat = "summary_bin", fun = mean)
```
```{r 05 geom_bin2d(), include=TRUE}
ggplot(diamonds, aes(table, depth)) +
geom_bin2d(binwidth = 1, na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
ggplot(diamonds, aes(table, depth, z = price)) +
geom_raster(binwidth = 1, stat = "summary_2d", fun = mean,
na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
```
So far we've considered two classes of geoms:
- Simple geoms where there's a one-on-one correspondence between rows in the data frame and physical elements of the geom
- Statistical geoms where introduce a layer of statistical summaries in between the raw data and the result
Although ggplot2 does not have direct 3d support, it does provide the ability to plot 2d images representing 3d data. These include: contours, colored tiles, and bubble plots.
```{r 05 geom_contour(), include=TRUE}
ggplot(faithfuld, aes(eruptions, waiting)) +
geom_contour(aes(z = density, colour = ..level..))
```
```{r 05 geom_raster(), include = TRUE}
ggplot(faithfuld, aes(eruptions, waiting)) +
geom_raster(aes(fill = density))
```
```{r 05 bubble plot, include=TRUE}
# Bubble plots work better with fewer observations
small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) +
geom_point(aes(size = density), alpha = 1/3) +
scale_size_area()
```
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/zufpaidIkGU")`
<details>
<summary> Meeting chat log </summary>
```
00:32:41 Michael Haugen: geom_errorbar() otherwise known as tie fighter plot
00:33:12 Gustavo R. Brito: There's some good explanations about geom_smooth (and se too) in rdocumentation: https://rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/geom_smooth
00:46:10 priyanka gagneja: the Grey area is the confidence interval
00:48:32 Federica Gazzelloni: The geom_smooth() function in ggplot2 can plot fitted lines from models with a simple structure. Supported model types include models fit with lm() , glm() , nls() , and mgcv::gam() . ... By default you will get confidence intervals plotted in geom_smooth() .
00:49:53 Federica Gazzelloni: This is a linear model fit, so I use method = "lm".
00:50:21 Stan Piotrowski: You can use “scale_y_continuous()” and some of the functions from the “scales” package to modify axes.
00:50:27 June Choe: the model gets fitted by StatSmooth$compute_group() here, if you're curious about the code! https://github.com/tidyverse/ggplot2/blob/759c63c2fd9e00ba3322c1b74b227f63c98d2e06/R/stat-smooth.r#L156-L173
00:51:31 Federica Gazzelloni: https://aosmith.rbind.io/2018/11/16/plot-fitted-lines/
00:56:58 Federica Gazzelloni: some formula from the documentation: Formula to use in smoothing function, eg. y ~ x, y ~ poly(x, 2), y ~ log(x). NULL by default, in which case method = NULL implies formula = y ~ x when there are fewer than 1,000 observations and formula = y ~ s(x, bs = "cs") otherwise.
00:57:52 priyanka gagneja: thats ok , keep going. we can pick up the rest next time we meet.
00:58:21 Lydia Gibson: I’m going to run to my appointment. See you all next week!
00:58:41 Lydia Gibson: Sorry, in two weeks.
00:59:26 Ryan S: as a side note -- it seemed like the topic of stats (i.e., stat = "identity")…. this topic seemed to get very light treatment in the text. to me it seems like this idea of how stats work is a huge topic that requires a lot of understanding and practice.
00:59:52 Stan Piotrowski: I agree, Ryan.
01:00:18 Ryan S: suggest someone who understands this topic (and has the capacity to talk to it) may be willing to take 15 mins on it next time?
01:01:13 June Choe: I also agree (and would be happy to do this, just not this month!) I feel like we could use another week on stat before we're thrown into the Extending ggplot2 section - maybe around when we cover scales
01:02:25 Michael Haugen: Stat part 2 next week?
01:02:33 Stan Piotrowski: That sounds like a good idea to me! That’ll give us some time to dig into the code and figure out what’s going on
01:02:48 Ryan S: I think we're two weeks away (US holiday next week)
01:02:49 priyanka gagneja: +1 Michael
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/GiXu9zCKQCc")`