---
title: "eda"
author: "Manye Dong"
date: "2023-10-05"
output: github_document
---
```{r}
library(tidyverse)  # attaches dplyr, ggplot2, tidyr, and the rest of the core tidyverse
```
```{r}
weather_df =
  rnoaa::meteo_pull_monitors(
    c("USW00094728", "USW00022534", "USS0023B17S"),
    var = c("PRCP", "TMIN", "TMAX"),
    date_min = "2021-01-01",
    date_max = "2022-12-31") |>
  mutate(
    name = recode(
      id,
      USW00094728 = "CentralPark_NY",
      USW00022534 = "Molokai_HI",
      USS0023B17S = "Waterhole_WA"),
    # convert temperatures from tenths of a degree C to degrees C
    tmin = tmin / 10,
    tmax = tmax / 10,
    # floor_date() rounds each date down to the first day of its month
    month = lubridate::floor_date(date, unit = "month")) |>
  select(name, id, everything())
```
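As a quick check of what the `month` column holds, here is a minimal sketch of `lubridate::floor_date()` applied to a few made-up dates. The dates are illustrative only and are not taken from the NOAA data.

```{r}
library(lubridate)

# hypothetical dates, chosen only for illustration
example_dates = as.Date(c("2021-01-01", "2021-01-31", "2021-02-15"))

# every date is rounded down to the first day of its month
floored_dates = floor_date(example_dates, unit = "month")
floored_dates
```

Because every day in a month maps to the same value, `month` works as a grouping variable in the chunks that follow.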
```{r}
weather_df
```
## Initial Numeric Work
```{r}
weather_df |>
  ggplot(aes(x = prcp)) +
  geom_histogram()
```
Here are the big outliers. Any formal analysis that uses precipitation as a predictor or outcome might be influenced by them.
```{r}
weather_df |> filter(prcp > 1000)
```
```{r}
weather_df |>
  filter(tmax >= 20, tmax <= 30) |>
  ggplot(aes(x = tmin, y = tmax, color = name, shape = name)) +
  geom_point(alpha = .75)
```
Waterhole is doing something fundamentally different from the other two stations.
## Grouping
```{r}
weather_df |>
  # grouping is mostly invisible; the only sign is the "Groups:" note printed above the tibble
  group_by(name, month) |>
  # count() adds the aggregated column; `name =` sets that column's name
  count(name = "n_obs")
```
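For reference, `count()` is shorthand for `group_by()` followed by `summarize(n())`. The sketch below checks this on a small made-up tibble; `toy_df` and its values are hypothetical and only stand in for `weather_df`.

```{r}
library(dplyr)

# hypothetical stand-in for weather_df
toy_df = tibble(
  name  = c("A", "A", "A", "B", "B"),
  month = c(1, 1, 2, 1, 1)
)

# long form: group, then count the rows in each group
long_form =
  toy_df |>
  group_by(name, month) |>
  summarize(n_obs = n(), .groups = "drop")

# short form: count() groups and counts in one step
short_form =
  toy_df |>
  count(name, month, name = "n_obs")

# compare the two results
all.equal(as.data.frame(long_form), as.data.frame(short_form))
```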
`count()` produces a data frame you can use or manipulate directly:
```{r}
weather_df |>
  count(name, month) |>
  # pivot_wider() deliberately untidies the data into a human-readable layout
  pivot_wider(
    # names_from: the variable whose values become the new column names
    names_from = name,
    values_from = n
  ) |>
  # render the result as a formatted table in the knitted document
  knitr::kable(digits = 2)
```
## General Summaries
```{r}
weather_df |>
  group_by(name) |>
  summarize(
    mean_tmax = mean(tmax, na.rm = TRUE),
    std_tmax = sd(tmax, na.rm = TRUE),
    median_tmax = median(tmax, na.rm = TRUE)
  )
# na.rm defaults to FALSE, in which case a single NA in a group makes that group's summary NA
```
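To see why `na.rm = TRUE` matters in the summaries above, here is a small sketch on a made-up vector. The values are illustrative only.

```{r}
# hypothetical temperatures with one missing value
tmax_example = c(10, NA, 20)

# default na.rm = FALSE: the missing value propagates to the result
mean_with_na = mean(tmax_example)
mean_with_na

# na.rm = TRUE: the missing value is dropped before averaging
mean_without_na = mean(tmax_example, na.rm = TRUE)
mean_without_na
```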
Plot the monthly mean maximum temperature for each station as a line plot:
```{r}
weather_df |>
  group_by(name, month) |>
  summarize(
    mean_tmax = mean(tmax, na.rm = TRUE)
  ) |>
  ggplot(aes(x = month, y = mean_tmax, color = name)) +
  geom_point() +
  geom_line()
```
## Grouped Mutate
```{r}
weather_df |>
  group_by(name) |>
  mutate(
    mean_tmax = mean(tmax, na.rm = TRUE),
    centered_tmax = tmax - mean_tmax) |>
  ggplot(aes(x = date, y = centered_tmax, color = name)) +
  geom_point()
```
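Unlike `summarize()`, a grouped `mutate()` keeps one row per observation and computes the summary within each group. A minimal sketch on made-up data, where `toy_df` is hypothetical:

```{r}
library(dplyr)

# hypothetical stand-in for weather_df
toy_df = tibble(
  name = c("A", "A", "B", "B"),
  tmax = c(10, 20, 0, 4)
)

centered_df =
  toy_df |>
  group_by(name) |>
  mutate(
    mean_tmax = mean(tmax),            # group mean, repeated on every row of the group
    centered_tmax = tmax - mean_tmax   # deviation from the group's own mean
  ) |>
  ungroup()

centered_df
```

The result has as many rows as the input, which is why it can go straight into a per-day plot.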
```{r}
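weather_df |>
  group_by(name, month) |>
  # rank days within each station-month from coldest (1) to warmest
  mutate(temp_ranking = min_rank(tmax)) |>
  # keep the coldest day(s) of each station-month
  filter(temp_ranking < 2)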
```
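`min_rank()` gives tied values the same rank, so a filter on rank can keep more than one row per group. A small sketch with made-up values:

```{r}
library(dplyr)

# hypothetical temperatures with a tie for the lowest value
temps = c(12, 8, 8, 15)

# ties share the smallest rank, and the next rank is skipped
rank_cold_first = min_rank(temps)
rank_cold_first

# wrap the vector in desc() to rank from warmest to coldest instead
rank_warm_first = min_rank(desc(temps))
rank_warm_first
```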
### Window Functions: Previous Day and Later Day
```{r}
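weather_df |>
  # group first so the lag never crosses from one station into another
  group_by(name) |>
  # lag() shifts by one row by default, which gives the previous day's value
  mutate(yesterday_tmax = lag(tmax))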
```
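`lag()` and `lead()` shift a vector by a number of positions (one by default), which is what makes day-over-day comparisons possible. A minimal sketch on made-up data; `toy_df` is hypothetical and assumes the rows are already sorted by date within each station.

```{r}
library(dplyr)

# hypothetical stand-in for weather_df, sorted by date within each station
toy_df = tibble(
  name = c("A", "A", "A", "B", "B"),
  tmax = c(10, 12, 9, 20, 25)
)

shifted_df =
  toy_df |>
  group_by(name) |>
  mutate(
    yesterday_tmax = lag(tmax),          # previous row within the station
    tomorrow_tmax = lead(tmax),          # next row within the station
    temp_change = tmax - yesterday_tmax  # day-over-day change
  ) |>
  ungroup()

shifted_df
```

Without `group_by()`, the first row of station B would pick up the last value of station A.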