05-Statistical_summaries.Rmd

# Statistical Summaries

```{r 05-01, include = FALSE}
library(ggplot2)
```

- ***Learning Objectives:***
  - Use ggplot2 to plot possible uncertainty in your data
  - Determine which geometric object (geom) best presents your type of data

## Defintions (in this Chapter)

> **discrete value**: a finite number, something that is countable with beginning and end (input user definition welcomed)

> **continuous value**: infinite number, something that never ends. Infinity is continous. (input user definition welcomed)

> **grobs**: graphical object

> **overplotting**: too much data on scatterplot making underlying relationships obscure

## Revealing Uncertainty

Four primary types of geometric objects (geom) are used:
1. Discrete x, range: `geom_errorbar()`, `geom_linerange()`
2. Discrete x, range & center: `geom_crossbar()`, `geom_pointrange()`
3. Continuous x, range: `geom_ribbon()`
4. Continuous x, range & center: `geom_smooth(stat = "identity")`

```{r 05 first group of plots, include=TRUE}
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))

base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
base + geom_crossbar()
base + geom_pointrange()
base + geom_smooth(stat = "identity")

df
```

```{r 05 lots of plots, include=TRUE}
base + geom_errorbar()
base + geom_linerange()
base + geom_ribbon()
```

## Weighted Data
If each row of your dataframe contains multiple observations, we can use a weight to visually give scale to observations

```{r 05 miss-basic, include=TRUE} 
# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point()
# Weight by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
```

```{r 05 weight-lm, include=TRUE} 
# Unweighted
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point() + 
  geom_smooth(method = lm, size = 1)
# Weighted by population
ggplot(midwest, aes(percwhite, percbelowpoverty)) + 
  geom_point(aes(size = poptotal / 1e6)) + 
  geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
  scale_size_area(guide = "none")
```

```{r 05 weight-hist, include=TRUE} 
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1) + 
  ylab("Counties")
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal), binwidth = 1) +
  ylab("Population (1000s)")
```

>Question for the group: Is the above `ylab` correct? Check out the next two figures, can you see the difference?

```{r 05 weight-hist-option-1, include=TRUE} 
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal/1e3), binwidth = 1) +
  ylab("Population (1000s)")
```

```{r 05 weight-hist-option-2, include=TRUE} 
ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal/1e6), binwidth = 1) +
  ylab("Population (millions)")
```

## Displaying distributions
Using built-in `diamonds` dataset

```{r 05-diamond-dimensions, fig.cap="Diamond Dimensions", echo=FALSE}
knitr::include_graphics("images/diamond-dimensions.png", dpi = 300)
```

For 1-Dimensional continuous data (1d), the histogram is arguably the most important geom

```{r 05 geom-1d-con, include=TRUE} 
ggplot(diamonds, aes(depth)) + 
  geom_histogram()
ggplot(diamonds, aes(depth)) + 
  geom_histogram(binwidth = 0.1) + 
  xlim(55, 70)
```

Never rely on the defaults. Always adjust your `bin` or `xlim` to "zoom" in our out of your data. There is no hard or fast rule, only experimentation to discover coorelation in your plot.

For your audience/reader, ensure you add a caption for your scale, for example `binwidth`.

Three ways to compare distribution:
- Show small multiples of the histogram, `facet_wrap(~ var)`.
- Use colour and a frequency polygon, `geom_freqpoly()`.
- Use a “conditional density plot”, `geom_histogram(position = "fill")`.

```{r 05 compare-dist, include=TRUE} 
ggplot(diamonds, aes(depth)) + 
  geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
  xlim(58, 68) + 
  theme(legend.position = "none")
ggplot(diamonds, aes(depth)) + 
  geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
    na.rm = TRUE) +
  xlim(58, 68) + 
  theme(legend.position = "none")
```

You can also plot density using `geom_density()`. Use a density plot when you know that the underlying density is smooth, continuous and unbounded.

```{r 05 geom-density, include=TRUE} 
ggplot(diamonds, aes(depth)) +
  geom_density(na.rm = TRUE) + 
  xlim(58, 68) + 
  theme(legend.position = "none")
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
  geom_density(alpha = 0.2, na.rm = TRUE) + 
  xlim(58, 68) + 
  theme(legend.position = "none")
```

It is often the case and advisable to sacrifice quality for quantity. The following three types of graph provide examples of this thought.

`geom_boxplot()`:

```{r 05 geom-boxplot, include=TRUE} 
    ggplot(diamonds, aes(clarity, depth)) + 
      geom_boxplot()
    ggplot(diamonds, aes(carat, depth)) + 
      geom_boxplot(aes(group = cut_width(carat, 0.1))) + 
      xlim(NA, 2.05)
```
    
`geom_violin()`:

```{r 05 geom_violin(), include=TRUE}
ggplot(diamonds, aes(clarity, depth)) + 
  geom_violin()
ggplot(diamonds, aes(carat, depth)) + 
  geom_violin(aes(group = cut_width(carat, 0.1))) + 
  xlim(NA, 2.05)
```

`geom_dotplot()`:

### Exercise:

1.  What binwidth tells you the most interesting story about the distribution
    of `carat`?
    >The number of bins or the binwidth should be exploration exercise. There is not direct hard or fast rule for scaling the binwidth. What is important is to find the appropriate size that best captures the representation (or distribution) of your analysis. This correlates to your story as you are explaining the importance. Find a binwidth that best captures your ideas.
    
2.  Draw a histogram of `price`. What interesting patterns do you see?

```{r 05 Exercise #2, include=TRUE} 
ggplot(diamonds, aes(price)) + 
  geom_histogram(binwidth = 5)
```

>The smaller the quantity (assuming quality), the higher the price. I presume that carat size would also have a strong correlation with quantity and price.

3.  How does the distribution of `price` vary with `clarity`?

```{r 05 Exercise #3, include=TRUE}
    ggplot(diamonds, aes(clarity, price)) + 
      geom_violin()

    ggplot(diamonds, aes(clarity, price)) + 
      geom_boxplot()
```

>I presume using different geoms, the higher the clarity, the higher the price, the fewer the quantity. 

4.  Overlay a frequency polygon and density plot of `depth`. What computed
    variable do you need to map to `y` to make the two plots comparable? 
    (You can either modify `geom_freqpoly()` or `geom_density()`.)
    
>Not completed.

## Dealing with overplotting

- Scatterplot is a very important tool for assessing relationship
  - Too large a dataset may obscure any true relationship
  - This is called *Over plotting*
  
To compensate for Over plotting, tweaking the aesthetic can help. Techniques like hollow glyphs can help.

```{r 05 overp-glyph, dev = "png", include=TRUE} 
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()
norm + geom_point(shape = 1) # Hollow circles
norm + geom_point(shape = 96) # Pixel sized
```

Alternative ways using large data sets, you can use alpha blending (transparency). If you specify `alpha` as a ratio, the denominator gives the number of points that must be over plotted to give a solid color.

```{r 05 overp-alpha, dev = "png", include=TRUE} 
norm + geom_point(alpha = 1 / 3)
norm + geom_point(alpha = 1 / 5)
norm + geom_point(alpha = 1 / 10)
```

`geom_jitter()` can be used if your data has *some* discreteness. By default, 40% is used. You can overide the default with `width` and `height` arguments.

Alternatively, we can think of overplotting as a 2d density estimation problem, which gives rise to two more approaches:

- Bin the points and count the number in each bin, then visualise that count (the 2d generalisation of the histogram), `geom_bin2d()`.
  - The code below compares square and hexagonal bins, using parameters `bins` and `binwidth` to control the number and size of the bins.
  
```{r 05 overp-bin, include=TRUE}
norm + geom_bin2d()
norm + geom_bin2d(bins = 10)
```

```{r overp-bin-hex, include=TRUE}
library(hexbin)
norm + geom_hex()
norm + geom_hex(bins = 10)
```

Another approach to dealing with overplotting is to add data summaries to help guide the eye to the true shape of the pattern within the data.

## Statistical Summaries

`geom_histogram()` and `geom_bin2d()` use a familiar geom, `geom_bar()` and `geom_raster()`, combined with a new statistical transformation, `stat_bin()` and `stat_bin2d()`. `stat_bin()` and `stat_bin2d()` combine the data into bins and count the number of observations in each bin. But what if we want a summary other than count? So far, we've just used the default statistical transformation associated with each geom. Now we're going to explore how to use  `stat_summary_bin()` to `stat_summary_2d()` to compute different summaries.

```{r 05 geom_bar(), include=TRUE}
ggplot(diamonds, aes(color)) + 
  geom_bar()
ggplot(diamonds, aes(color, price)) + 
  geom_bar(stat = "summary_bin", fun = mean)
```

```{r 05 geom_bin2d(), include=TRUE}
ggplot(diamonds, aes(table, depth)) + 
  geom_bin2d(binwidth = 1, na.rm = TRUE) + 
  xlim(50, 70) + 
  ylim(50, 70)
ggplot(diamonds, aes(table, depth, z = price)) + 
  geom_raster(binwidth = 1, stat = "summary_2d", fun = mean, 
    na.rm = TRUE) + 
  xlim(50, 70) + 
  ylim(50, 70)
```

So far we've considered two classes of geoms:

- Simple geoms where there's a one-on-one correspondence between rows in the data frame and physical elements of the geom

- Statistical geoms where introduce a layer of statistical summaries in between the raw data and the result

Although ggplot2 does not have direct 3d support, it does provide the ability to plot 2d images representing 3d data. These include: contours, colored tiles, and bubble plots.

```{r 05 geom_contour(), include=TRUE}
ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_contour(aes(z = density, colour = ..level..))
```

```{r 05 geom_raster(), include = TRUE}
ggplot(faithfuld, aes(eruptions, waiting)) + 
  geom_raster(aes(fill = density))
```

```{r 05 bubble plot, include=TRUE}
# Bubble plots work better with fewer observations
small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) + 
  geom_point(aes(size = density), alpha = 1/3) + 
  scale_size_area()
```


## Meeting Videos

### Cohort 1

`r knitr::include_url("https://www.youtube.com/embed/zufpaidIkGU")`

<details>
  <summary> Meeting chat log </summary>
```
00:32:41	Michael Haugen:	geom_errorbar() otherwise known as tie fighter plot
00:33:12	Gustavo R. Brito:	There's some good explanations about geom_smooth (and se too) in rdocumentation: https://rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/geom_smooth
00:46:10	priyanka gagneja:	the Grey area is the confidence interval
00:48:32	Federica Gazzelloni:	The geom_smooth() function in ggplot2 can plot fitted lines from models with a simple structure. Supported model types include models fit with lm() , glm() , nls() , and mgcv::gam() . ... By default you will get confidence intervals plotted in geom_smooth() .
00:49:53	Federica Gazzelloni:	This is a linear model fit, so I use method = "lm".
00:50:21	Stan Piotrowski:	You can use “scale_y_continuous()” and some of the functions from the “scales” package to modify axes.
00:50:27	June Choe:	the model gets fitted by StatSmooth$compute_group() here, if you're curious about the code! https://github.com/tidyverse/ggplot2/blob/759c63c2fd9e00ba3322c1b74b227f63c98d2e06/R/stat-smooth.r#L156-L173
00:51:31	Federica Gazzelloni:	https://aosmith.rbind.io/2018/11/16/plot-fitted-lines/
00:56:58	Federica Gazzelloni:	some formula from the documentation: Formula to use in smoothing function, eg. y ~ x, y ~ poly(x, 2), y ~ log(x). NULL by default, in which case method = NULL implies formula = y ~ x when there are fewer than 1,000 observations and formula = y ~ s(x, bs = "cs") otherwise.
00:57:52	priyanka gagneja:	thats ok , keep going. we can pick up the rest next time we meet.
00:58:21	Lydia Gibson:	I’m going to run to my appointment. See you all next week!
00:58:41	Lydia Gibson:	Sorry, in two weeks.
00:59:26	Ryan S:	as a side note -- it seemed like the topic of stats (i.e., stat = "identity")…. this topic seemed to get very light treatment in the text.  to me it seems like this idea of how stats work is a huge topic that requires a lot of understanding and practice.
00:59:52	Stan Piotrowski:	I agree, Ryan.
01:00:18	Ryan S:	suggest someone who understands this topic (and has the capacity to talk to it) may be willing to take 15 mins on it next time?
01:01:13	June Choe:	I also agree (and would be happy to do this, just not this month!) I feel like we could use another week on stat before we're thrown into the Extending ggplot2 section - maybe around when we cover scales
01:02:25	Michael Haugen:	Stat part 2 next week?
01:02:33	Stan Piotrowski:	That sounds like a good idea to me!  That’ll give us some time to dig into the code and figure out what’s going on
01:02:48	Ryan S:	I think we're two weeks away (US holiday next week)
01:02:49	priyanka gagneja:	+1 Michael
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/GiXu9zCKQCc")`