-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathGroup_1_ADS_ICA.Rmd
498 lines (480 loc) · 32 KB
/
Group_1_ADS_ICA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
---
title: "Group_1_ADS_ICA"
author: "Group 1"
date: "2024-04-08"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(message=FALSE)
knitr::opts_chunk$set(warning=FALSE)
```
We use `caption` to label figures as **Figure a.b**, representing the figure as the **b^th^** image of part **a**.
```{r library}
library(tidyverse)
library(effsize)
library(knitr)
```
### Data import and clean
```{r collapse = TRUE, tidy=FALSE}
subuse = read.csv('substance_use.csv')
# Brief check, No reformat as out of box does not affect for str() result judgement
str(subuse)
anyNA(subuse)
sum(duplicated(subuse))
```
```{r, eval=FALSE}
# Detailed check, results too long so we don't show them here.
for(i in c(1:7)){
writeLines(colnames(subuse)[i])
print(table(subuse[, i]))
}
for(i in c(8:10)){
writeLines(colnames(subuse)[i])
print(summary(subuse[,i]))
}
```
```{r, collapse = TRUE}
# Check for violations of "upper >= val >= lower"
which(!subuse$upper>=subuse$val)
which(!subuse$upper>=subuse$lower)
which(!subuse$val>=subuse$lower)
```
According to the results, we found and decide to treat problems in following sequences:
1. " - WB" after each location could be deleted.
2. The metric column isn't essential, so we would delete it and add "_percent" to last 3 columns.
3. Some columns' classes are wrong(`chr`), so we would convert them to factor class and relocate the general sequence of columns.
```{r, tidy=FALSE}
# delete the " - WB"
subuse_noWB = subuse %>% as_tibble()
subuse_noWB$location = gsub(" - WB", "", subuse$location)
# delete the "metric" column and add "_percent"
subuse_noWB_nometric = subuse_noWB[,-6]
colnames(subuse_noWB_nometric)[7:9] = paste(colnames(subuse_noWB_nometric)[7:9],
"percent", sep = "_")
# change class to factor and change column order
subuse_nono_factor_sequence = subuse_noWB_nometric %>%
mutate_at(.,c(1:5),as.factor) %>% relocate(.,c(6,2,3,4,5,1,9,7,8))
subuse_final=subuse_nono_factor_sequence
# check result
head(subuse_final, 1)
```
### Part 1: Exploring the data
**Q1: In 2019, what region of the world has the highest rate of alcohol-related deaths among men aged 40-44?**
To solve this question, we first filter out the interested subset(`subuse_1`). We draw a plot(Figure 1.1) to show the upper, val and lower percent of each location to have a more thorough comparison.
```{r, tidy=FALSE, out.height="35%", fig.align='center'}
# subset filter out
subuse_1 = subuse_final %>% filter(year == 2019, age == '40 to 44', sex == 'Male',
measure == 'Deaths', cause == 'Alcohol use disorders')
# deaths in different location (Figure 1.1)
ggplot(subuse_1) +
geom_errorbar(aes(x = location, y = val_percent, ymax = upper_percent,
ymin = lower_percent, col = location), width=0.2) +
geom_point(aes(x = location, y = val_percent, col = location),size = 0.7) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Alcohol-Related Deaths Among Men Aged 40-44 by Location in 2019",
x = "", y = "Deaths (%)", caption = "Figure 1.1")
```
According to the Figure 1.1, even when considering "upper" and "lower" percent, **Europe & Central Asia** has the highest rate of alcohol-related deaths among men aged 40-44 in 2019.
**Q2: Looking at the prevalence of alcohol-related disease in the East Asia and Pacific region, how has this changed over time and in the different age groups? Is there a difference between men and women?**
To solve this question, we first filter out the interested subset(`subuse_2`) and subsets in each sex.
```{r, tidy=FALSE}
subuse_2 = subuse_final %>% filter(location == 'East Asia & Pacific',
cause == 'Alcohol use disorders', measure == 'Prevalence')
subuse_2_female = subuse_2 %>% filter(sex == "Female")
subuse_2_male = subuse_2 %>% filter(sex == "Male")
```
As $\frac{Male_{prev}}{Male}+\frac{Female_{prev}}{Female}\neq\frac{Male_{prev}+Female_{prev}}{Male+Female}$, we can't simply add two gender's prevalence together if we don't know their population size. We divide them to different group for better analysis and clearer comparison. Therefore, we first focus on prevalence relationship between different age groups. We plot both male and female's prevalence in Figure 1.2. As each sex's prevalence range differs, we than facet each sex's data under their proper scales in Figure 1.3.
```{r, tidy=FALSE, fig.show = 'hold', out.width = '50%'}
# Both sex (Figure 1.2)
ggplot(data = subuse_2, aes(x = year, y = val_percent, color = age, shape = sex)) +
geom_point(size = 2) +
geom_line(aes(group = interaction(age, sex)), size = 1) +
labs(title = "Prevalence of Alcohol-Related Disease in East Asia and Pacific Over Time",
x = "Year", y = "Prevalence (%)", caption="Figure 1.2") +
theme(plot.title = element_text(size = 12))
# Facet for male and female (Figure 1.3)
ggplot(data = subuse_2, aes(x = year, y = val_percent, color = age, shape = sex)) +
geom_point(size = 1.3) +
geom_line(aes(group = interaction(age, sex)), size = 0.5) +
facet_wrap( ~ sex, ncol = 1, scales = "free") +
labs(title = "Prevalence of Alcohol-Related Disease in East Asia and Pacific Over Time by sex",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.3") +
theme(plot.title = element_text(size = 10.5))
```
According to Figure 1.3, we can see: Generally, the younger the age group is, the higher the prevalence is. For exception in **female**, **25 to 29**'s prevalence is between age group **30 to 34** and **35 to 39**, and becomes the highest between 1997 and 2010. In **male**, **25 to 29** is between **40 to 44** and **45 to 49**. **30 to 34** and **35 to 39** age groups have similar prevalence, which are also the highest. To have a better understanding of how has this over-time prevalence trend varies in different age groups, we divided the patterns into following subsets in each gender. **First in Female**:
```{r, tidy=FALSE, out.height="30%", fig.width=10, fig.align='center'}
# divide data according to different pattern
subuse_2_female_pattern = subuse_2_female %>% mutate(age_pattern = case_when(
age %in% c("25 to 29","30 to 34", "35 to 39" ) ~ "pattern_1",
age %in% c("40 to 44", "45 to 49", "50 to 54",
"55 to 59", "60 to 64", "65 to 69") ~ "pattern_2"))
# Pattern 1 and pattern 2 in female (Figure 1.4)
ggplot(data = subuse_2_female_pattern, aes(x = year, y = val_percent, color = age)) +
geom_point(size = 1) +
geom_line(aes(group = age),size = 0.4) +
facet_wrap(~age_pattern) +
labs(title = "Female Prevalence Pattern of Alcohol-Related Disease in East Asia and Pacific",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.4") +
theme(axis.text.x = element_text(angle = 45, size=7))
```
**Female Pattern 1: 25 to 29, 30 to 34 and 35 to 39**
Their prevalence value reach the highest in year 2000. From 1990 to 2000, the prevalence continuously rise. The 25 to 29 group has the highest increasing and decreasing rate with the highest value of prevalence among all the age groups in 2000. After 2000, the 30 to 34 group shows a continuous downstream trend, while the other two groups decreases with periodic increases. For the 35 to 39 group, there is an increase from 2010 to 2015 and for the youngest group, a small increase appears in 2015-2017.
**Female Pattern 2: 40 to 44, 45 to 49, 50 to 54, 55 to 59, 60 to 64, 65 to 69**
Their prevalence approximately goes up before 1995; from 1995 to around 2015, they firstly decrease and later increase and after around 2015 they all go downstream. Notably, 65 to 69 group manifested a slightly decrease before 1995. The oldest four groups have the minimum prevalence at 2005 while the other two groups at 2010. The 60 to 64 and 65 to 69 age groups has a gentle decrease from 2000 to 2005.
**We then check Male's pattern:**
```{r, tidy=FALSE, out.height="30%", fig.width=10, fig.align='center'}
# divide data according to different pattern
subuse_2_male_pattern = subuse_2_male %>% mutate(age_pattern = case_when(
age %in% c("25 to 29", "30 to 34", "35 to 39") ~ "pattern_1",
age %in% c("40 to 44", "45 to 49", "50 to 54") ~ "pattern_2",
age %in% c("55 to 59", "60 to 64", "65 to 69") ~ "pattern_3"))
# Pattern 1, 2, 3 in male (Figure 1.5)
ggplot(data = subuse_2_male_pattern, aes(x = year, y = val_percent, color = age)) +
geom_point(size = 1) +
geom_line(aes(group = age),size = 0.4) +
facet_wrap(~age_pattern) +
labs(title = "Male Prevalence Pattern of Alcohol-Related Disease in East Asia and Pacific",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.5") +
theme(axis.text.x = element_text(angle = 45, size=7))
```
According to Figure 1.5, the trends of prevalence in each age group were similar from 1990 to 2005. There was a rapid increase from 1990 to 1995, with the peak around 1995, followed by a rapid decline from 1995 to 2005. After 2005, we roughly categorize the age groups into three patterns:
**Male Pattern 1: 25 to 29, 30 to 34, 35 to 39**
From 2005 to 2015, there was minimal change in prevalence among males aged "30 to 34" and "35 to 39", with slight fluctuations and a slight upward trend. However, among males aged "25 to 29", there was an initial increase followed by a decline. From 2015 to 2019, there was a rapid increase in prevalence among males in all three age groups.
**Male Pattern 2: 40 to 44, 45 to 49, 50 to 54**
From 2005 to 2010, only males in the "50 to 54" age group experienced a slight increase in prevalence, while the other two age groups experienced slight declines. From 2010 to 2019, there was a rapid increase in prevalence among males in all three age groups.
**Male Pattern 3: 55 to 59, 60 to 64, 65 to 69**
From 2005 to 2017, there was a rapid increase in prevalence among males in all three age groups. From 2017 to 2019, prevalence among males aged "55 to 59" remained relatively unchanged, while prevalence among males aged "60 to 64" and "65 to 69" rapidly declined.
**To have a better understanding of difference between men and women, we compare their prevalence in each age groups in Figure 1.6.**
```{r, tidy=FALSE}
# Sexes' prevalence comparison in each age group (Figure 1.6)
ggplot(data = subuse_2, aes(x = year, y = val_percent, color = sex, shape = sex)) +
geom_point(size = 1) +
facet_wrap(~age) +
geom_smooth(aes(ymin = lower_percent, ymax = upper_percent,fill=sex),
stat = "identity", alpha = 0.2, linewidth=0.5) +
labs(title = "Prevalence of Alcohol-Related Disease in East Asia and Pacific Over Time",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.6") +
theme(axis.text.x = element_text(size=7.5, angle = 45), plot.title = element_text(size = 12))
```
The top end of the ribbons in the graph represents the upper_percent and the bottom end the lower_percent. This makes it possible to analyse in detail the differences in the prevalence of alcohol-related diseases between genders and different age groups.
We can observe differences in patterns and trends in the prevalence of alcohol-related diseases between male and female from 1990 to 2010. Firstly, the prevalence varies over time, but peaks around 1995 and declines after this date until a gradual increase after 2005, with a general trend of increasing and decreasing increments. The prevalence is much higher in males than in females in all age groups. Additionally, the prevalence is always higher in the younger than in the older age groups, and this is particularly evident in the changing profile of males, except in the 25-29 group where it is lower than in the 30-34. However, this is not so evident in women. It can also be seen that the difference in prevalence with age is higher in males than in females.
**Q3: In the United States, there is talk of an “Opioid epidemic”. Part of the problem is that since the late 1990s, doctors have increasingly been prescribing pain killers which can be highly addictive. Looking at the data from the United States, can you confirm an increase in the prevalence of diseases related to opioid use? What age group is the most affected?**
The question refers to the highly addictive painkiller that has been increasingly prescribed **since the late 1990s**, which is considered the cause of the **opioid epidemic**. To confirm the increase in disease prevalence due to opioid use, we chose **1997** as the breakpoint and separated the data into pre-opioid-epidemic and post-opioid-epidemic. Though prescribing of opioids may take time to affect patients, if we want to capture the subtle changes, we need to collect data earlier. So it's reasonable to choose 1997.
Same as Q2, we would analyse separately in each gender. We first filter out the data(`subuse_3`) and separate the data to pre-data(`subuse_pre`) and post-data(`subuse_post`).
```{r}
subuse_3=filter(subuse_final, measure=="Prevalence", location=="North America", cause=="Opioid use disorders")
subuse_pre = filter(subuse_3, year >= 1990 & year < 1997)
subuse_post = filter(subuse_3, year >= 1997)
```
We first plot the data to check prevalence change over year (Figure 1.7).
```{r, tidy=FALSE, out.height="40%", fig.align='center'}
# prevalence in all age groups in male and female (Figure 1.7)
ggplot(data = subuse_3,
aes(x = year, y = val_percent, color = age, group = interaction(age, sex))) +
geom_point(aes(shape = sex), size = 0.7) +
geom_line(linewidth = 0.3) +
labs(title = "Opioid-related Disease Prevalance in United States Over time",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.7") +
facet_grid( ~ sex) +
theme(axis.text.x = element_text(angle = 45, size = 7))
```
In each sex and each age group, the opioid-related prevalence increase over time. To confirm an increase in prevalence of diseases related to opioid use, we need to compare the pre- and post-data in a more quantitative way. To preserve the integrity and inter-connectedness of the data as much as possible, we decided to **use the pre-data to get the predicted post-data, and compare the predicted post-data with the real post-data to determine the difference**.
From the above plot, we can see the there is a generally **linear relationship** between year and prevalence in all age groups and sexes before 1997. So we decided to model the data by linear regression. We first plot the general relationship of predicted value(dotdashed line) and real value (Figure 1.8)
```{r, tidy=FALSE}
# each age group figure with predicted line (Figure 1.8)
ggplot(data = subuse_3) +
geom_line(aes(x = year, y = val_percent, color = sex, group = sex)) +
geom_smooth(data = subuse_pre, aes(x = year, y = val_percent, color = sex, group = sex),
method = "lm", formula = y~x, linetype = "dotdash", size = 0.4,
fullrange = TRUE, se = FALSE) +
facet_wrap(~age, scales="free") +
labs(title = "Opioid-related Prevalance in United States Over time by Age",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.8")+
theme(axis.text.x = element_text(angle = 45, size=6.5))
```
We then quantify all the linear models for each age groups. We write a function `lm_quantify` to make the process reusable.
```{r}
lm_quantify <- function(gender) {
# get all the age groups
age_groups <- unique(subuse_final$age)
# create Rable to store each age groups value
Rable = data.frame(matrix(nrow = length(age_groups),ncol = 3))
# Rable setup
rownames(Rable)=age_groups
colnames(Rable)=c("adjusted R square", "intercept_pvalue", "year_pvalue")
# do linear regression, get and store each age group's value
for (i in age_groups) {
g_pre <- filter(subuse_pre,age==i&sex==gender)
reg <- lm(val_percent~year,g_pre)
Rable[i, "adjusted R square"] <- summary(reg)$adj.r.squared
Rable[i, "intercept_pvalue"] <- summary(reg)$coefficients["(Intercept)", "Pr(>|t|)"]
Rable[i, "year_pvalue"] = summary(reg)$coefficients["year", "Pr(>|t|)"]
}
return(kable(Rable))
}
lm_quantify("Male")
lm_quantify("Female")
```
According to lecture and source we found online, generally, all age groups in both sexes looks nice, except age group **60 to 64** and **65 to 69** in **Male**. We decided to plot the relationship between linear model and original data to see more closely what had happened(Figure 1.9).
```{r, tidy=FALSE, out.height="35%", fig.align='center'}
# filter out special age groups' data
special_age=filter(subuse_3,age %in% c("60 to 64", "65 to 69"), year<1997)
# plot original data and predicted line together (Figure 1.9)
ggplot(data = special_age) +
geom_point(aes(x = year, y = val_percent, color = sex, group = sex)) +
geom_smooth(aes(x = year, y = val_percent, color = sex, group = sex),
method = "lm", formula = y~x, linetype = "dotdash",
size = 0.4, fullrange = TRUE, se = FALSE) +
facet_wrap(~age) +
labs(title = "Opioid-related Prevalance in United States in special age group",
x = "Year", y = "Prevalence (%)", caption = "Figure 1.9") +
theme(axis.text.x = element_text(angle = 45, size=6.5))
```
According to the plot, the linear regression is generally fine. We also find that in some situation, small R^2^ may be due to the size or distribution of the data itself. As our purpose of the model is to understand relationships between year and val_percent that are evident in visual analyses, then the model may still be useful. So we would still use the models for these two age groups.
To determine the relationship, we then do paired `t.test` or `wilcox.test` in different situation. We created function `clever_test` for better reusability.
```{r}
# x, y, al=alternative, pair_v=TRUE or FALSE(pair or not)
clever_test <- function(x, y, al, pair_v) {
if (pair_v) {
shapiro_result <- shapiro.test(x - y)$p.value # test normal
if (shapiro_result <= 0.05) {
return(wilcox.test(x, y, alternative = al, paired = TRUE))
} else {
return(t.test(x, y, alternative = al, paired = TRUE))
}
} else {
# test normal and var_equal
shapiro_result_x <- shapiro.test(x)$p.value
shapiro_result_y <- shapiro.test(y)$p.value
var_result <- var.test(x, y)$p.value
if (shapiro_result_x > 0.05 & shapiro_result_y > 0.05) {
if (var_result > 0.05) {
return(t.test(x, y, alternative = al, paired = FALSE, var.equal = TRUE))
} else {
return(t.test(x, y, alternative = al, paired = FALSE, var.equal = FALSE))
}
} else {
return(wilcox.test(x, y, alternative = al, paired = FALSE))
}
}
}
```
Here are our null hypothesis and alternative hypothesis.
H~0~: The **real prevalence** is lower or no difference when compared to **predicted prevalence**.
H~A~: The **real prevalence** is larger when compared to **predicted prevalence**.
```{r}
age_groups <- unique(subuse_final$age)
# create pable to store each age groups' value
pable <- data.frame(matrix(nrow = length(age_groups), ncol = 2))
rownames(pable) <- age_groups
colnames(pable) <- c("Male_p_value", "Female_p_value")
for (g in age_groups) {
for (gender in c("Male", "Female")) {
# get specific gender's specific age group's value
g_pre <- filter(subuse_pre, age == g, sex == gender)
g_post <- filter(subuse_post, age == g, sex == gender)
# Order the data.frames based on year
g_pre <- g_pre[order(g_pre$year), ]
g_post <- g_post[order(g_post$year), ]
# make the model and get predicted value
reg <- lm(val_percent ~ year, g_pre)
years <- g_post$year
pre_data <- predict(reg, newdata = data.frame(year = years))
real_data <- g_post$val_percent
# calculate the p.value
pable[g, paste0(gender, "_p_value")] <- clever_test(real_data, pre_data, "greater", TRUE)$p.value
}
}
kable(pable)
```
According to pable, we can't reject H~0~ in **male in 25 to 29** and **female in 40 to 44 and 45 to 49**, whereas we can reject H~0~ in rest groups. Therefore we can confirm a prevalence increase due to opioid epidemic in them. We then use cohen'd, which is a nice method to measure effect size, to measure difference size between predicted and real data to find the most affected age group.
```{r}
# create cable to store each age groups' value
cable <- data.frame(matrix(nrow = length(age_groups), ncol = 2))
rownames(cable) <- age_groups
colnames(cable) <- c("Male_cohens_d", "Female_cohens_d")
# calculate each age groups cohen.d
for (g in age_groups) {
for (gender in c("Male", "Female")) {
if (!((gender == "Male" && g %in% c("25 to 29")) ||
(gender == "Female" && g %in% c("40 to 44", "45 to 49")))) {
g_pre <- filter(subuse_pre, age == g, sex == gender)
g_post <- filter(subuse_post, age == g, sex == gender)
g_pre <- g_pre[order(g_pre$year), ]
g_post <- g_post[order(g_post$year), ]
reg <- lm(val_percent ~ year, g_pre)
years <- g_post$year
pre_data <- predict(reg, newdata = data.frame(year = years))
real_data <- g_post$val_percent
cable[g, paste0(gender, "_cohens_d")] <- cohen.d(real_data,pre_data)$estimate
}
}
}
kable(cable)
```
According to cable, we determined the age group that has biggest change is both **65 to 69** in male and female.
### Part 2: Ask own question
When we were searching parameters of `ggplot2` package, we found that `ggplot2` allows us to plot data onto world map. We thought that this was very similar to our part1Q1 situation, so we want to **visualize the data from part1Q1 in a more intuitive way and do some comparison with it**.
However, during our search, we found that there is no existed function that can map data directly to World Bank location, so we designed a new function to achieve this goal(`WBdataMerge`). We can then use this function to solve our own question mentioned above.
Note: Exact name in each vector are omitted here for clearer presentation of function logic and some countries(regions) are omitted as they are not shown in world bank website.
```{r, include=FALSE}
WBdataMerge <- function(UseData){
# Following vectors stores the mapping relationship between Worldbank location and real country(region) name that are stored in map_data("world")
# some countriess(regions) are omitted as they are not shown in world bank website.
South_Asia=c("Afghanistan", "Bangladesh", "Bhutan", "India", "Maldives", "Nepal", "Pakistan", "Sri Lanka")
Middle_East_North_Africa=c("Algeria", "Bahrain", "Djibouti", "Egypt", "Iran", "Iraq", "Israel", "Jordan", "Kuwait", "Lebanon", "Libya", "Malta", "Morocco", "Oman", "Qatar", "Saudi Arabia", "Syria", "Tunisia", "United Arab Emirates", "Yemen")
East_Asia_Pacific=c("American Samoa", "Australia", "Brunei", "Cambodia", "China", "Fiji", "French Polynesia", "Guam", "Indonesia", "Japan", "Kiribati", "North Korea", "South Korea", "Laos", "Malaysia", "Marshall Islands", "Micronesia", "Mongolia", "Myanmar", "Nauru", "New Caledonia", "New Zealand", "Palau", "Papua New Guinea", "Philippines", "Samoa", "Singapore", "Solomon Islands", "Thailand", "Timor-Leste", "Tonga", "Vanuatu", "Vietnam", "Cook Islands", "Taiwan")
North_America=c("Bermuda", "Canada", "USA")
Sub_Saharan_Africa=c("Angola", "Benin", "Botswana", "Burkina Faso", "Burundi", "Cape Verde", "Cameroon", "Central African Republic", "Chad", "Comoros", "Democratic Republic of the Congo", "Republic of Congo", "Ivory Coast", "Equatorial Guinea", "Eritrea", "Ethiopia", "Gabon", "Gambia", "Ghana", "Guinea", "Guinea-Bissau", "Kenya", "Lesotho", "Liberia", "Madagascar", "Malawi", "Mali", "Mauritania", "Mauritius", "Mozambique", "Namibia", "Niger", "Nigeria", "Rwanda", "Sao Tome and Principe", "Senegal", "Seychelles", "Sierra Leone", "Somalia", "South Africa", "South Sudan", "Sudan", "Tanzania", "Togo", "Uganda", "Zambia", "Zimbabwe", "Swaziland")
Europe_Central_Asia=c("Albania", "Andorra", "Armenia", "Austria", "Azerbaijan", "Belarus", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Faroe Islands", "Finland", "France", "Georgia", "Germany", "Greece", "Greenland","Hungary", "Iceland", "Ireland","Isle of Man", "Italy", "Kazakhstan", "Kosovo", "Kyrgyzstan", "Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "North Macedonia", "Moldova", "Monaco", "Montenegro", "Netherlands", "Norway", "Poland", "Portugal", "Romania", "Russia", "San Marino", "Serbia", "Slovakia", "Slovenia", "Spain", "Sweden", "Switzerland", "Tajikistan", "Turkey", "Turkmenistan", "Ukraine", "UK", "Uzbekistan", "Guernsey", "Jersey")
Latin_America_Caribbean=c("Antigua", "Barbuda", "Argentina", "Aruba", "Bahamas", "Barbados", "Belize", "Bolivia", "Brazil","Virgin Island","Cayman Islands", "Chile", "Colombia", "Costa Rica", "Cuba", "Dominica", "Dominican Republic", "Ecuador", "El Salvador", "Grenada", "Guatemala", "Guyana", "Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua", "Panama", "Paraguay", "Peru","Puerto Rico", "Sint Maarten", "Saint Kitts", "Nevis", "Saint Lucia", "Saint Martin", "Saint Vincent", "Grenadines", "Suriname", "Trinidad", "Tobago","Turks and Caicos Islands", "Uruguay", "Venezuela", "Anguilla", "Saint Barthelemy","Curacao", "French Guiana")
# wname to store country(region) name
wname = c()
# wdata to store location value
wdata = c()
# although following dataframe construction process are quite repetitive, as we want to update the wname and wdata, we didn't use function
# map location data to country(region)
AddData = filter(UseData, location=="South Asia")$val_percent
for(i in South_Asia){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="Middle East & North Africa")$val_percent
for(i in Middle_East_North_Africa){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="Middle East & North Africa")$val_percent
for(i in Middle_East_North_Africa){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="East Asia & Pacific")$val_percent
for(i in East_Asia_Pacific){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="North America")$val_percent
for(i in North_America){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="Sub-Saharan Africa")$val_percent
for(i in Sub_Saharan_Africa){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="Europe & Central Asia")$val_percent
for(i in Europe_Central_Asia){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location=="Latin America & Caribbean")$val_percent
for(i in Latin_America_Caribbean){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
# combine wname and wdata together to dataframe
result_data = as.data.frame(cbind(wname, wdata))
# change the data column to numeric class
result_data$wdata = as.numeric(result_data$wdata)
# get the world map
world_map = map_data("world")
# merge according to the name
merged_data <- left_join(world_map, result_data, by = c('region' = 'wname'))
return(merged_data)
}
```
```{r, eval=FALSE, tidy=FALSE}
WBdataMerge <- function(UseData){
# following vectors store how Worldbank data map to country names.
South_Asia = c("...")
Middle_East_North_Africa = c("...")
East_Asia_Pacific = c("...")
North_America = c("...")
Sub_Saharan_Africa = c("...")
Europe_Central_Asia = c("...")
Latin_America_Caribbean = c("...")
# wname to store country(region) name
wname = c()
# wdata to store location value
wdata = c()
# though following "map location data to country" process are quite repetitive, as we
# want to update the wname and wdata, we didn't use function.
# Same content are shown as ... for clearer logic since 2nd for-loop.
AddData = filter(UseData, location == "South Asia")$val_percent
for(i in South_Asia){
wname = c(wname, i)
wdata = c(wdata, AddData)
}
AddData = filter(UseData, location == "Middle East & North Africa")$val_percent
for(i in Middle_East_North_Africa){...}
AddData = filter(UseData, location == "Middle East & North Africa")$val_percent
for(i in Middle_East_North_Africa){...}
AddData = filter(UseData, location == "East Asia & Pacific")$val_percent
for(i in East_Asia_Pacific){...}
AddData = filter(UseData, location == "North America")$val_percent
for(i in North_America){...}
AddData = filter(UseData, location == "Sub-Saharan Africa")$val_percent
for(i in Sub_Saharan_Africa){...}
AddData = filter(UseData, location == "Europe & Central Asia")$val_percent
for(i in Europe_Central_Asia){...}
AddData = filter(UseData, location == "Latin America & Caribbean")$val_percent
for(i in Latin_America_Caribbean){...}
# combine wname and wdata together to dataframe
result_data = as.data.frame(cbind(wname, wdata))
# change the data column to numeric class
result_data$wdata = as.numeric(result_data$wdata)
# get the world map
world_map = map_data("world")
# merge according to the name
merged_data <- left_join(world_map, result_data, by = c('region' = 'wname'))
return(merged_data)
}
```
As in part1Q1, we only checked the data for 40-44 aged male's alcohol-related deaths in different region. We wonder how about the female? And what's the relationship between them? If answers to these questions are known, doctors and governments could take more targeted and effective measures to deal with this kind of death.
Therefore, we use `WBdataMerge` and `ggplot` to visualize **40-44 aged male and female alcohol-related deaths in different region** (Figure 2.1 for male and Figure 2.2 for female).
```{r, tidy=FALSE, fig.width=10, fig.height=6, fig.show="hold"}
# filter out the data
subuse_4_male = subuse_1
subuse_4_female = subuse_final %>%
filter(year==2019, age == '40 to 44', sex == 'Female',
measure == 'Deaths', cause == 'Alcohol use disorders')
# To uniform each map's color-data relationship, we first get their range
subuse_4_min = min(c(subuse_4_male$val_percent, subuse_4_female$val_percent))
subuse_4_max = max(c(subuse_4_male$val_percent, subuse_4_female$val_percent))
# slightly move the boundary to plot all regions with color
subuse_4_range = range(subuse_4_min - 1e-5, subuse_4_max + 1e-5)
# Male plot (Figure 2.1)
ggplot(WBdataMerge(subuse_4_male),
aes(x = long, y = lat, group = group, fill = wdata)) +
geom_polygon(color = "black") +
labs(title = "40-44 aged male alcohol-related deaths in 2019",
x="longitude", y="latitude", caption = "Figure 2.1") +
theme_minimal() +
scale_fill_gradient(limits = subuse_4_range, low = "lightblue", high = "darkblue") +
guides(fill = guide_colorbar(title = "Data"))
# Female Plot (Figure 2.2)
ggplot(WBdataMerge(subuse_4_female),
aes(x = long, y = lat, group = group, fill = wdata)) +
geom_polygon(color = "black") +
labs(title = "40-44 aged female alcohol-related deaths in 2019",
x="longitude", y="latitude", caption = "Figure 2.2") +
theme_minimal() +
scale_fill_gradient(limits = subuse_4_range, low = "lightblue", high = "darkblue") +
guides(fill = guide_colorbar(title = "Data"))
```
According to figure 2.1 and 2.2, male has higher death rate than female in 40 to 44 age group in 2019. To quantitative the relationship, we then use the `clever_test` to test their difference.
H~0~: In 2019 and age group of 40 to 44, **male**'s alcohol-related death rate is lower than or no different from **female**.
H~A~: In 2019 and age group of 40 to 44, **male**'s alcohol-related death rate is larger than **female**.
To save the relationship between each location, we use paired version:
```{r, collapse=TRUE}
# ordered the data to same sequence and get val percent
subuse_4_male_val = subuse_4_male[order(subuse_4_male$location), ]$val_percent
subuse_4_female_val = subuse_4_female[order(subuse_4_female$location), ]$val_percent
# compare
clever_test(subuse_4_male_val, subuse_4_female_val,"greater", TRUE)
```
The p.value is smaller than $\alpha$ (we use 0.05 here), so we could reject the null hypothesis.