# Dealing with missing data
```{r setup, include = FALSE}
pacman::p_load(webex)
```
Missing data can pose serious problems for any data analysis.
`r video_code("FT6oYGyWTGY")`
## Example
Let's imagine that we want to analyse the effect of a mentoring programme on boys and girls. To do so, the students take a test (`score1`), then experience the mentoring, and then take another test (`score2`). Let's load the required packages.
```{r message=FALSE, warning=FALSE}
pacman::p_load(tidyverse) #As always
pacman::p_load(naniar) #To summarise and visualise missing data
pacman::p_load(simputation) #For imputation of missing data
```
I will simulate the data from our experiment as follows.
```{r message=FALSE, warning=FALSE}
set.seed(1234)
results <- data.frame(gender = c(rep("M", 250), rep("F", 250)),
                      score1 = round(c(rnorm(250, 45, 10), rnorm(250, 60, 10))),
                      score2 = round(c(rnorm(250, 55, 10), rnorm(250, 70, 10))))
results$score2[results$score1 < 40] <- NA #Students scoring below 40 do not take the second test
```
Note that I have assumed that only students who pass the first test (i.e. get at least 40) return for the second test. For all others, `score2` is missing. What pattern of missingness is that? `r mcq(c("MCAR", answer="MAR", "MNAR"))`
`r hide("Explanation")`
The data is **missing at random (MAR)** - whether score2 is missing depends on the value of score1, so that it is *not* missing completely at random (MCAR). Whether score2 is missing does not depend on the value of score2 (beyond any association it might have with score1) - therefore it is also *not* missing not at random (MNAR).
`r unhide()`
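To make the distinction between the three mechanisms more concrete, here is a minimal base R sketch that generates one missingness indicator of each kind. The seed, the means, the 15% rate and the cut-off of 50 for the MNAR case are arbitrary assumptions chosen for illustration, and the object names are hypothetical.

```r
set.seed(2024)  # arbitrary seed, used only for this sketch
n <- 500
score1 <- round(rnorm(n, 50, 10))
score2 <- round(rnorm(n, 60, 10))

# MCAR: every score2 value has the same chance of being missing
miss_mcar <- runif(n) < 0.15
# MAR: missingness depends only on the fully observed score1
miss_mar <- score1 < 40
# MNAR: missingness depends on the value of score2 itself
miss_mnar <- score2 < 50

# Compare the rows that would be missing under each mechanism
sapply(list(MCAR = miss_mcar, MAR = miss_mar, MNAR = miss_mnar), function(m) {
  c(prop_missing = mean(m),
    mean_score1_if_missing = mean(score1[m]),
    mean_score2_if_missing = mean(score2[m]))
})
```

Under MAR the rows that go missing stand out on `score1`, which we can still observe, whereas under MNAR they stand out on `score2`, which is exactly the variable we can no longer see.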
Let's see how much missing data that results in.
```{r message=FALSE, warning=FALSE}
miss_var_summary(results)
```
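If `naniar` is not available, similar counts and percentages can be computed in base R. The sketch below uses a small made-up data frame (`toy`, not the simulated `results`) and also breaks the missing values down by gender, which is a useful check when missingness may differ between groups.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  gender = c("M", "M", "M", "F", "F", "F"),
  score1 = c(35, 48, 38, 62, 39, 71),
  score2 = c(NA, 57, NA, 68, NA, 80)
)

n_miss <- colSums(is.na(toy))            # number of missing values per column
pct_miss <- 100 * colMeans(is.na(toy))   # percentage of missing values per column
data.frame(variable = names(toy), n_miss = n_miss, pct_miss = pct_miss,
           row.names = NULL)

# Number of missing score2 values within each gender
miss_by_gender <- tapply(is.na(toy$score2), toy$gender, sum)
miss_by_gender
```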
### Case deletion
15% missing data is quite a lot, but still within the range where analysts might choose to use case deletion and move on.
Given that the data is MAR, do you think that case deletion can bias the results? `r mcq(c(answer="Yes", "No"))`
Let's try case deletion, and test whether the effects on boys and girls are different.
```{r}
resultsDel <- results %>%
  na.omit() %>% #Deletes all rows with missing values
  mutate(change = score2 - score1)
t.test(change ~ gender, resultsDel)
```
Here we would conclude with great confidence (*p* < .001) that the intervention is more effective for girls than for boys. But given that we simulated the data ourselves, we know that this conclusion is false. It arises only because, among the boys, the participants with the lowest first scores, who were the most likely to score higher the next time round (*regression to the mean*), did not show up for the second test. This shows that in such a case, imputation is critical.
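Because the data are simulated, the bias can be checked directly by keeping a copy of the complete data before any values are removed. The following base R sketch reuses the seed and parameters of the simulation above, so it is assumed to regenerate the same scores; `complete` and `deleted` are names introduced here for illustration. By construction, the expected change is 10 points for both genders (55 - 45 and 70 - 60).

```r
set.seed(1234)  # same seed and parameters as the simulation above
complete <- data.frame(
  gender = c(rep("M", 250), rep("F", 250)),
  score1 = round(c(rnorm(250, 45, 10), rnorm(250, 60, 10))),
  score2 = round(c(rnorm(250, 55, 10), rnorm(250, 70, 10)))
)
complete$change <- complete$score2 - complete$score1

# Rows that survive case deletion: only students with score1 >= 40 have a score2
deleted <- complete[complete$score1 >= 40, ]

mean_change_complete <- tapply(complete$change, complete$gender, mean)
mean_change_deleted <- tapply(deleted$change, deleted$gender, mean)
rbind(complete = mean_change_complete, after_deletion = mean_change_deleted)

# The same t-test on the complete data, which we never get to see in practice
t.test(change ~ gender, data = complete)
```

Comparing the two rows of the table shows how much of the apparent gender difference is created by the deletion itself rather than by the intervention.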
### Imputation
Imputation means filling in the gaps, to try to ensure that the dataset we use for our analysis looks as similar as possible to the complete dataset, without missing data.
For imputation, we need to decide how the missing data should be predicted based on the observed data. Here, it would seem to make most sense to use linear regression to predict score2 from score1. However, before we do that, we need to make sure that we can identify imputed values later on by saving which values were missing to begin with. We use two functions from the `naniar` package for that.
```{r}
resultsImp <- results %>% bind_shadow(only_miss = TRUE) %>% add_label_missings()
head(resultsImp)
```
`bind_shadow()` has added a new column for each column that had missing values - in this case, `score2_NA` was added, with two values showing whether the value was missing (NA) or not missing (!NA). Without the `only_miss = TRUE` argument, it would add one 'shadow' column for each column. `add_label_missings()` added just one new column (`any_missing`) indicating whether there were any missing values in that row.
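To see why this bookkeeping has to happen *before* imputation, here is a hand-written base R approximation of the two added columns on a made-up data frame. It only mimics the labels described above and is not how `naniar` implements them; the imputed value of 50 is a placeholder.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(score1 = c(35, 48, 62), score2 = c(NA, 57, 68))

# Shadow column: records whether score2 was missing ("NA") or observed ("!NA")
toy$score2_NA <- factor(ifelse(is.na(toy$score2), "NA", "!NA"),
                        levels = c("!NA", "NA"))

# Row-level label: does the row contain any missing value in the original columns?
toy$any_missing <- ifelse(rowSums(is.na(toy[c("score1", "score2")])) > 0,
                          "Missing", "Not Missing")

# After a (placeholder) imputation, is.na() can no longer find the gap ...
toy$score2[is.na(toy$score2)] <- 50
n_na_after <- sum(is.na(toy$score2))

# ... but the shadow column still remembers which value was filled in
toy
```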
Now we can impute the missing data. Functions from the `simputation` package help with that.
```{r}
resultsImp <- resultsImp %>%
  group_by(gender) %>%
  impute_lm(score2 ~ score1, add_residual = "observed")
```
This tells `simputation` to split the data by gender, calculate a regression to predict score2 from score1 for each gender, use that to impute score2 values and then *add a residual* to each score2 value. The last part is crucial - if we don't do that, then all imputed values will be right on the regression line, which will artificially increase the correlation between score1 and score2 and the fit of any regression model we want to calculate later. `add_residual` can either be `"observed"` or `"normal"`. In the first case, one of the observed residuals is randomly sampled, in the second case, the added residuals follow a normal distribution. I would always recommend `"observed"` as that makes one fewer assumption about the distribution of our data.
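The logic described above can be sketched by hand in base R. This is a simplified illustration of regression imputation with observed residuals, not `simputation`'s actual implementation; the seed, the smaller sample size and the helper `impute_one_group()` are assumptions made for this example, and residuals are sampled with replacement.

```r
set.seed(99)  # arbitrary seed, used only for this sketch
toy <- data.frame(
  gender = rep(c("M", "F"), each = 100),
  score1 = round(c(rnorm(100, 45, 10), rnorm(100, 60, 10))),
  score2 = round(c(rnorm(100, 55, 10), rnorm(100, 70, 10)))
)
toy$score2[toy$score1 < 40] <- NA

# Hypothetical helper: impute score2 within one gender group
impute_one_group <- function(d) {
  miss <- is.na(d$score2)
  if (!any(miss)) return(d)                     # nothing to impute in this group
  fit <- lm(score2 ~ score1, data = d)          # lm() drops incomplete rows by default
  predicted <- predict(fit, newdata = d[miss, , drop = FALSE])
  # Add a randomly drawn observed residual so imputed values scatter around the line
  sampled_residuals <- sample(residuals(fit), sum(miss), replace = TRUE)
  d$score2[miss] <- predicted + sampled_residuals
  d
}

# Split by gender, impute each group separately, and recombine
toy_imputed <- do.call(rbind, lapply(split(toy, toy$gender), impute_one_group))
anyNA(toy_imputed$score2)
```

Dropping the `sampled_residuals` term would place every imputed value exactly on its group's regression line, which is the artificial inflation of fit that the residual step is meant to avoid.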
Let's see whether the imputed values look reasonable and whether they differ from the observed values in a way that is likely to influence our results.
```{r fig.cap='Visualisation shows that imputed data matters'}
resultsImp <- resultsImp %>% mutate(change = score2 - score1)
ggplot(resultsImp, aes(x=gender, y=change)) + geom_boxplot() + geom_jitter(aes(color = score2_NA))
```
Note that the colour aesthetic is only assigned inside `geom_jitter()` (which is the same as `geom_point()` but adds some random noise to the points to reduce overlap). If it were assigned within `ggplot()`, we would get separate boxplots for observed and imputed values, which is not the aim here.
Looking at the distribution of values that were missing and have now been imputed, we can see that the values are generally plausible, in that they fall within the observed range. However, particularly for boys, they are higher than the observed values and are therefore likely to influence our conclusion.
Let's test again whether there is a difference between boys and girls.
```{r}
t.test(change ~ gender, resultsImp)
```
This time, we get the correct result - there is no difference between boys and girls. This shows that when data is MAR (i.e. when missingness depends on other variables we are interested in) we have to consider imputation.
## Further resources {#further-resources-missing-data}
* The [official vignette](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html) for `naniar` provides good step-by-step instructions for exploring and visualising missing data
* Similarly, the [official vignette](https://cran.r-project.org/web/packages/simputation/vignettes/intro.html) for `simputation` contains details on how to use different imputation models and how to impute several variables at once