# Iteration
**Learning objectives:**
- Modify multiple columns using the same patterns.
- Filter based on the contents of multiple columns.
- Process multiple files.
- Write multiple files.
```{r iteration-packages_used, message=FALSE, warning=FALSE}
library(tidyverse)
```
## Intro to iteration {-}
**Iteration** = repeatedly performing the same action on different objects
R is full of *hidden* iteration!
- `ggplot2::facet_wrap()` / `ggplot2::facet_grid()`
- `dplyr::group_by()` + `dplyr::summarize()`
- `tidyr::unnest_wider()` / `tidyr::unnest_longer()`
- Anything with a vector!
- `1:10 + 1` would need an explicit loop in many other languages!
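
As a minimal sketch of that last point, the vectorized expression can be compared with the loop it hides. The names `vectorized` and `looped` are made up for this illustration.

```{r iteration-hidden-sketch}
# Vectorized: R applies `+ 1` to every element for us.
vectorized <- 1:10 + 1

# The same computation with the loop written out by hand.
looped <- numeric(10)
for (i in seq_along(looped)) {
  looped[i] <- i + 1
}

# Both approaches produce the same values.
all.equal(vectorized, looped)
```
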
## Summarize w/ `across()`: setup {-}
```{r iteration-summarize-setup}
df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10))
glimpse(df)
```
## Summarize w/ `across()`: motivation {-}
```{r iteration-summarize-motivation}
messy <- df |> summarize(
  n = n(),
  a = median(a),
  b = median(b),
  c = median(c)
)
```
## Summarize w/ `across()`: cleaner {-}
```{r iteration-summarize-across}
clean <- df |> summarize(
  n = n(),
  dplyr::across(a:c, median)
)
identical(clean, messy)
```
## Selecting columns {-}
- `everything()` for all non-grouping columns
- `where()` to select based on a condition
- `where(is.numeric)` = all numeric columns
- `where(is.character)` = all character columns
- `starts_with("a") & !where(is.numeric)` = all columns that start with "a" and are not numeric
- `where(\(x) any(stringr::str_detect(x, "name")))` = all columns that contain the word "name" in at least one value
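
The selectors above can be tried on a small table. This is only a sketch: the `people` tibble and its columns are hypothetical, and the last selector adds an `is.character()` guard (an assumption on my part) so that only text columns are searched.

```{r iteration-selectors-sketch}
library(dplyr)

# Hypothetical toy data: two numeric columns and two character columns.
people <- tibble(
  age = c(31, 45),
  alias = c("name: Al", "Bo"),
  city = c("Austin", "Boise"),
  height = c(1.7, 1.8)
)

# All numeric columns.
people |> select(where(is.numeric)) |> names()

# Columns that start with "a" and are not numeric.
people |> select(starts_with("a") & !where(is.numeric)) |> names()

# Character columns with "name" in at least one value.
people |>
  select(where(\(x) is.character(x) && any(stringr::str_detect(x, "name")))) |>
  names()
```
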
## Passing functions {-}
Pass actual function to `across()`, ***not*** a call to the function!
- ✅ `across(a:c, mean)`
- ❌ `across(a:c, mean())`
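
A short sketch of the difference, using a made-up `scores` tibble. The failing call is wrapped in `try()` only so that the error does not stop the document.

```{r iteration-passing-functions-sketch}
library(dplyr)

scores <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))

# Passing the function itself: across() calls it once per column.
by_name <- scores |> summarize(across(a:b, mean))

# An anonymous function is also a function object, so this works too.
by_lambda <- scores |> summarize(across(a:b, \(x) mean(x)))

all.equal(by_name, by_lambda)

# mean() is a call with no arguments, so it fails before any column is used.
result <- try(scores |> summarize(across(a:b, mean())), silent = TRUE)
inherits(result, "try-error")
```
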
## Multiple functions {-}
```{r iteration-multiple_functions}
df |> summarize(
  across(a:c, list(mean, median))
)
```
## Multiple functions with names {-}
```{r iteration-multiple_functions-names}
df |> summarize(across(a:c,
  list(mean = mean, median = median)
))
```
## Multiple functions with names & args {-}
```{r iteration-multiple_functions-args}
df |> summarize(across(a:c,
  list(
    mean = \(x) mean(x, na.rm = TRUE),
    median = \(x) median(x, na.rm = TRUE)
  )
))
```
## Fancier naming {-}
```{r iteration-across-glue_names}
df |> summarize(across(a:c,
  list(
    mean = \(x) mean(x, na.rm = TRUE),
    median = \(x) median(x, na.rm = TRUE)
  ),
  .names = "{.fn}_of_{.col}"
))
```
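
To see how the output names are built, here is a small sketch that collects only the column names under three naming setups. The `toy` tibble and the `*_names` variables are hypothetical.

```{r iteration-names-sketch}
library(dplyr)

toy <- tibble(a = c(1, 2), b = c(3, 4))

# Unnamed list: {.fn} falls back to the position of the function in the list.
unnamed_names <- toy |> summarize(across(a:b, list(mean))) |> names()

# Named list: the default pattern is "{.col}_{.fn}".
default_names <- toy |> summarize(across(a:b, list(mean = mean))) |> names()

# Custom glue specification, as on this slide.
custom_names <- toy |>
  summarize(across(a:b, list(mean = mean), .names = "{.fn}_of_{.col}")) |>
  names()

list(unnamed = unnamed_names, default = default_names, custom = custom_names)
```
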
## Filtering with `if_any()` and `if_all()` {-}
```{r iteration-if_any}
df2 <- tibble(x = 1:3, y = c(1, 2, NA), z = c(NA, 2, 3))
df2 |> filter(if_any(everything(), is.na))
df2 |> filter(if_all(everything(), \(x) !is.na(x)))
```
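
One way to read these helpers is as shorthand for combining a condition on every column with `|` or `&`. The sketch below checks that reading on a copy of the data; `gaps` and the `*_short` / `*_long` names are made up for the comparison.

```{r iteration-if_any-sketch}
library(dplyr)

gaps <- tibble(x = 1:3, y = c(1, 2, NA), z = c(NA, 2, 3))

# if_any(): keep rows where at least one column is missing.
any_short <- gaps |> filter(if_any(everything(), is.na))
any_long <- gaps |> filter(is.na(x) | is.na(y) | is.na(z))

# if_all(): keep rows where every column is present.
all_short <- gaps |> filter(if_all(everything(), \(col) !is.na(col)))
all_long <- gaps |> filter(!is.na(x) & !is.na(y) & !is.na(z))

c(any = isTRUE(all.equal(any_short, any_long)),
  all = isTRUE(all.equal(all_short, all_long)))
```
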
## `across()` in functions: setup {-}
```{r iteration-across-in-functions}
summarize_datatypes <- function(df) {
  df |> summarize(
    across(
      where(is.numeric),
      list(mean = \(x) mean(x, na.rm = TRUE))
    ),
    across(
      where(is.factor) | where(is.character),
      list(n_distinct = n_distinct)
    )
  ) |>
    glimpse()
}
```
## `across()` in functions: mpg {-}
```{r iteration-across_in_functions-mpg}
mpg |> summarize_datatypes()
```
## `across()` in functions: diamonds {-}
```{r iteration-across_in_functions-diamonds}
diamonds |> summarize_datatypes()
```
## Iterate over files {-}
```{r iteration-map_files, eval = FALSE}
list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
  set_names(basename) |>
  purrr::map(readxl::read_excel) |>
  # Placeholders: replace each `\(df) df` with a cleaning step that returns a data frame.
  map(\(df) df) |> # Fix something that might be weird in each df
  map(\(df) df) |> # Fix a different thing
  purrr::list_rbind(names_to = "filename")
```
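
The same pattern can be run end to end without the gapminder files. This is a sketch under stated assumptions: it writes two tiny CSV files to a temporary folder (the folder name, file names, and columns are all invented) and uses base `read.csv()` in place of `readxl::read_excel()`.

```{r iteration-map_files-sketch}
library(purrr)

# Hypothetical setup: two small CSV files in a temporary folder.
demo_dir <- file.path(tempdir(), "iteration-demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(country = "A", pop = 1000),
          file.path(demo_dir, "1952.csv"), row.names = FALSE)
write.csv(data.frame(country = "B", pop = 2000),
          file.path(demo_dir, "1957.csv"), row.names = FALSE)

combined <- list.files(demo_dir, pattern = "[.]csv$", full.names = TRUE) |>
  set_names(basename) |>
  map(read.csv) |>
  # Example cleaning step applied to every file: add a derived column.
  map(\(df) transform(df, pop_thousands = pop / 1000)) |>
  # Stack the pieces and record which file each row came from.
  list_rbind(names_to = "filename")

combined
```

Because the list is named with `set_names(basename)` before reading, `list_rbind(names_to = "filename")` can carry each file name into the combined table.
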
## One vs everything {-}
> We recommend this approach [perform each step on each file instead of in a function] because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.
Discuss!
- Jon's preference: Do 1-2 files first, iterate on iteration
- Book: Do everything on everything
## Walk vs map {-}
- Use `purrr::walk()` to call a function for its side effects without keeping the result
- Book example: Saving things
- `purrr::map2()` & `purrr::walk2()`: 2 inputs
- `purrr::pmap()` & `purrr::pwalk()`: list of inputs (largely replaced by `across()`)
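
A rough sketch of the three variants, in the spirit of the book's "saving things" example. The output folder, file names, and variable names are assumptions made for this illustration.

```{r iteration-walk-sketch}
library(purrr)

# Split a built-in data set into one data frame per cylinder count.
pieces <- split(mtcars, mtcars$cyl)

# Hypothetical output location: one CSV per piece in a temporary folder.
out_dir <- file.path(tempdir(), "walk-demo")
dir.create(out_dir, showWarnings = FALSE)
paths <- file.path(out_dir, paste0("cyl-", names(pieces), ".csv"))

# walk2(): two inputs, called only for the side effect of writing files.
walk2(pieces, paths, \(piece, path) write.csv(piece, path))
file.exists(paths)

# map2(): two inputs, and the results are kept in a list.
labels <- map2(names(pieces), map_int(pieces, nrow),
               \(cyl, n) paste(cyl, "cylinders:", n, "rows"))

# pmap(): a list of inputs, matched to the function's arguments by name.
sums <- pmap(list(x = 1:3, y = 4:6, z = 7:9), \(x, y, z) x + y + z)
```
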
## Meeting Videos {-}
### Cohort 5 {-}
`r knitr::include_url("https://www.youtube.com/embed/0rsV1jlxhws")`
<details>
<summary> Meeting chat log </summary>
```
00:03:23 Becki R. (she/her): I'm having trouble with a buzz again
00:03:37 Njoki Njuki Lucy: I so look forward to the discussion. I have been struggling with understanding this particular chapter! :)
00:26:58 Jon Harmon (jonthegeek): > x <- purrr::set_names(month.name, month.abb)
> x
00:27:21 Jon Harmon (jonthegeek): x[["Jan"]]
00:30:12 Jon Harmon (jonthegeek): results <- purrr::set_names(vector("list", length(x)), names(x))
00:35:10 lucus w: A data frame is simply a list in disguise
00:45:37 Njoki Njuki Lucy: It makes sense now, thanks!
00:48:05 Ryan Metcalf: Sorry team, I have to drop. Sister-in-law is stranded and needs a jump-start. I’ll finish watching the recording and catch any questions.
00:51:19 Jon Harmon (jonthegeek): > paste(month.name, collapse = "")
[1] "JanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecember"
> paste(month.name, collapse = " ")
[1] "January February March April May June July August September October November December"
01:09:10 Njoki Njuki Lucy: that's so cool! I wondered! Ha!!
01:10:30 Federica Gazzelloni: thanks Jon!!
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/CEKPAUWTA3c")`
<details>
<summary> Meeting chat log </summary>
```
00:10:17 Becki R. (she/her): brb!
00:26:00 Njoki Njuki Lucy: does this mean, in this case, we will have 3 regression models?
00:26:21 Federica Gazzelloni: can you specify: mtcars%>%map(.f=
00:27:12 Federica Gazzelloni: mtcars%>%split(.$cyl)%>%map(.f=lm(….))
00:27:41 Ryan Metcalf: I’m reading it as `mpg` being dependent (y) and `wt` being independent.
00:29:08 Jon Harmon (jonthegeek): # A more realistic example: split a data frame into pieces, fit a
# model to each piece, summarise and extract R^2
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(summary) %>%
map_dbl("r.squared")
00:29:55 Jon Harmon (jonthegeek): mtcars %>%
split(.$cyl) %>%
map(.f = ~ lm(mpg ~ wt, data = .x))
00:30:22 Jon Harmon (jonthegeek): mtcars %>%
split(.$cyl) %>%
map(.f = lm(mpg ~ wt, data = .x))
00:45:11 Federica Gazzelloni: coalesce()
01:17:01 Ryan Metcalf: Great Job Becki!!!
01:17:25 Becki R. (she/her): Thanks :)
```
</details>
### Cohort 6 {-}
`r knitr::include_url("https://www.youtube.com/embed/NVUHFpYUmA4")`
<details>
<summary> Meeting chat log </summary>
```
00:04:39 Marielena Soilemezidi: I'll be back in 3'! No haste with the setup:)
00:04:52 Adeyemi Olusola: Ok
00:07:22 Adeyemi Olusola: Let me know when you return
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/YnZSfzMGhTE")`
<details>
<summary> Meeting chat log </summary>
```
00:11:10 Marielena Soilemezidi: hello! :)
00:20:31 Marielena Soilemezidi: yep, it looks good!
00:46:18 Daniel Adereti: How does it get the list of 3?
00:47:55 Marielena Soilemezidi: sorry, got disconnected for some time, but I'm back!
00:48:05 Daniel Adereti: No worries!
00:58:28 Adeyemi Olusola: [email protected]
```
</details>
### Cohort 7 {-}
`r knitr::include_url("https://www.youtube.com/embed/vPEgWgs0q7s")`
<details>
<summary> Meeting chat log </summary>
```
00:06:31 Oluwafemi Oyedele: Hi Tim!!!
00:06:47 Tim Newby: Hi Oluwafemi, can you here me?
00:06:48 Oluwafemi Oyedele: We will start in 6 minutes time!!!
00:13:26 Oluwafemi Oyedele: start
00:56:02 Oluwafemi Oyedele: https://adv-r.hadley.nz/functionals.html
00:56:25 Oluwafemi Oyedele: stop
```
</details>
### Cohort 8 {-}
`r knitr::include_url("https://www.youtube.com/embed/TQabUIBbJKs")`