---
title: "eda"
author: "Manye Dong"
date: "2023-10-05"
output: github_document
---
```{r}
library(tidyverse)
library(dplyr)
```

```{r}
weather_df = 
  rnoaa::meteo_pull_monitors(
    c("USW00094728", "USW00022534", "USS0023B17S"),
    var = c("PRCP", "TMIN", "TMAX"), 
    date_min = "2021-01-01",
    date_max = "2022-12-31") |>
  mutate(
    name = recode(
      id, 
      USW00094728 = "CentralPark_NY", 
      USW00022534 = "Molokai_HI",
      USS0023B17S = "Waterhole_WA"),
    tmin = tmin / 10,
    tmax = tmax / 10,
    # floor_date() rounds each date down to the first day of its month
    month = lubridate::floor_date(date, unit = "month")) |>
  select(name, id, everything())
```
```{r}
weather_df
```

## Initial Numeric work
```{r}
weather_df |> 
  ggplot(aes(x = prcp)) + 
  geom_histogram()
```

Here are the big outliers. Any formal analysis involving precipitation as a predictor or outcome might be influenced by them.
```{r}
weather_df |> filter(prcp > 1000)
```

```{r}
weather_df |> 
  filter(tmax >= 20, tmax <= 30) |> 
  ggplot(aes(x = tmin, y = tmax, color = name, shape = name)) + 
  geom_point(alpha = .75)
```
Waterhole is doing something fundamentally different from the other two stations.

## Grouping
```{r}
weather_df |>
  # grouping is mostly invisible: the printed tibble only shows a "# Groups:" note
  # listing the grouping variables and the number of unique groups
  group_by(name, month) |>
  # the next line is what gives you a new aggregated column, named n_obs
  count(name = "n_obs")
```
`count()` produces a data frame that you can use or manipulate directly.
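
As a small check of the point above, the sketch below compares `count()` with the longer `group_by()` plus `summarize(n())` route. `toy_df` and its values are made up for illustration and are not part of `weather_df`.

```{r}
library(dplyr)

# Toy data (hypothetical values, only for illustration)
toy_df = tibble(
  name  = c("A", "A", "A", "B", "B"),
  month = c("Jan", "Jan", "Feb", "Jan", "Jan")
)

# Option 1: group explicitly, then count rows per group with n()
by_summarize = 
  toy_df |>
  group_by(name, month) |>
  summarize(n_obs = n(), .groups = "drop")

# Option 2: count() groups and counts in one step;
# the name argument sets the name of the count column
by_count = 
  toy_df |>
  count(name, month, name = "n_obs")

by_summarize
by_count
```

Both results are ordinary tibbles with one row per `name` and `month` combination, so either can be piped into further steps.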

```{r}
weather_df |>
  count(name, month) |>
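  # pivot_wider() "untidies" the counts into a wider, easier-to-read layout
  pivot_wider(
    # names_from is the column whose values become the new column names
    names_from = name,
    # values_from is the column whose values fill the new cells
    values_from = n
  ) |>
  # render the result as a formatted table in the knitted document
  knitr::kable(digits = 2)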
```
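
To see what `pivot_wider()` does to the shape of the data, here is a minimal sketch on made-up counts. `toy_counts` and its numbers are hypothetical, not taken from `weather_df`.

```{r}
library(dplyr)
library(tidyr)

# Long ("tidy") toy counts: one row per name/month pair (hypothetical values)
toy_counts = tibble(
  name  = c("A", "A", "B", "B"),
  month = c("Jan", "Feb", "Jan", "Feb"),
  n     = c(31L, 28L, 30L, 27L)
)

# Each distinct value of name becomes its own column, filled from n
wide = 
  toy_counts |>
  pivot_wider(
    names_from = name,
    values_from = n
  )

wide
```

The wide version has one row per `month` and one column per station, which is easier to read but harder to work with in later dplyr steps.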

## General Summaries
```{r}
weather_df |>
  group_by(name) |>
  summarize(
    mean_tmax = mean(tmax, na.rm = TRUE), 
    std_tmax = sd(tmax, na.rm = TRUE), 
    median_tmax = median(tmax, na.rm = TRUE)
  )
# na.rm defaults to FALSE, in which case any NA in the input makes the result NA
```
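
The effect of `na.rm` is easy to see on a short vector. The vector `x` below is made up for illustration.

```{r}
# A toy vector with one missing value (hypothetical values)
x = c(10, 12, NA, 14)

# Default na.rm = FALSE: the missing value propagates to the result
mean_with_na = mean(x)

# na.rm = TRUE drops the NA first, so the mean is taken over 10, 12, and 14
mean_without_na = mean(x, na.rm = TRUE)

mean_with_na
mean_without_na
```

The same argument works the same way in `sd()` and `median()`.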

Plot the monthly mean of `tmax` for each station as a line plot:
```{r}
weather_df |>
  group_by(name, month) |>
  summarize(
    mean_tmax = mean(tmax, na.rm = TRUE)
    ) |>
  ggplot(aes(x = month, y = mean_tmax, color = name)) + 
  geom_point() +
  geom_line()
```

## Grouped Mutate
```{r}
weather_df |>
  group_by(name) |>
  mutate(
    mean_tmax = mean(tmax, na.rm = TRUE),
    centered_tmax = tmax - mean_tmax) |> 
  ggplot(aes(x = date, y = centered_tmax, color = name)) + 
    geom_point()
```

```{r}
weather_df |>
  group_by(name, month) |>
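  # min_rank() ranks tmax within each name/month group; ties share the lowest rank
  mutate(temp_ranking = min_rank(tmax)) |>
  # keep the coldest day(s) of each month at each station
  filter(temp_ranking < 2)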
```
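
A minimal sketch of how `min_rank()` behaves within groups, including ties, and how wrapping the variable in `desc()` flips the ranking to pick the warmest rows instead. `toy_df` and its values are hypothetical.

```{r}
library(dplyr)

# Toy data with a tie in group "A" (hypothetical values)
toy_df = tibble(
  name = c("A", "A", "A", "B", "B"),
  tmax = c(5, 3, 3, 10, 8)
)

# Coldest row(s) per group: tied values both get rank 1, so both are kept
coldest = 
  toy_df |>
  group_by(name) |>
  filter(min_rank(tmax) < 2) |>
  ungroup()

# Warmest row per group: desc() reverses the ordering before ranking
warmest = 
  toy_df |>
  group_by(name) |>
  filter(min_rank(desc(tmax)) < 2) |>
  ungroup()

coldest
warmest
```

Because ties share a rank, a filter like this can return more than one row per group.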

### Window functions: previous day, later day
```{r}
weather_df |>
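  group_by(name) |> # group first so values are not lagged across stations
  mutate(yesterday_tmax = lag(tmax)) # lag(tmax, 3) would give the value from three rows (days) earlier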
```
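
A small sketch of why the grouping matters for `lag()`, and of one common use: the day-to-day change in temperature. `toy_df` and its values are made up, and the sketch assumes one row per station per day, sorted by date within each station.

```{r}
library(dplyr)

# Toy data: two stations, consecutive days (hypothetical values)
toy_df = tibble(
  name = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                   "2021-01-01", "2021-01-02")),
  tmax = c(5, 7, 4, 20, 18)
)

lagged = 
  toy_df |>
  group_by(name) |>
  # make sure rows are in date order within each station before lagging
  arrange(date, .by_group = TRUE) |>
  mutate(
    # previous row's tmax within the same station; NA for each station's first day
    yesterday_tmax = lag(tmax),
    temp_change = tmax - yesterday_tmax
  ) |>
  ungroup()

lagged
```

Without `group_by(name)`, the first day of station "B" would be paired with the last day of station "A", which is not a meaningful comparison.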