-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
198 lines (153 loc) · 5.66 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: "Obtaining lists of articles to replicate"
author: "Lars Vilhuber"
date: "3/1/2019"
output:
html_document:
keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Sources
We have two sources:
- the existing list of articles to replicate on Google Sheets
- CrossRef, for new articles
We load relevant libraries here.
```{r config_libs,include=FALSE,message=FALSE}
source(file.path(rprojroot::find_root(rprojroot::has_file("pathconfig.R")),"pathconfig.R"),echo=TRUE)
source(file.path(basepath,"global-libraries.R"),echo=TRUE)
source(file.path(basepath,"config.R"),echo=TRUE)
```
## Instructions
This file, when executed, will
- download the latest Replication list from Google Drive
- download DOI for all publications for a number of journals from CrossRef
- identify the ones that are new
- provide a CSV that can be manually uploaded
The program will check for prior files, and will NOT download new data if those files are present. Thus, to get a fresh run,
- delete ` `r repllist.file` ` if you want to re-download the list of articles from Google Drive
- delete ` `r file.path(interwrk,paste0("new.Rds"))` ` to re-download files from CrossRef
- revert ` `r issns.file` ` (which stores the last query date, and is updated at the end of this process)
## Data locations
Permanent data is in
> `r dataloc`
and should be committed to the repository.
Temporary data is in
> `r interwrk`
and can (should) be deleted after completion.
## Current list of DOI
We first obtain the current list of DOI. This is not failsafe - it assumes there *is* such a list.
```{r read_list}
if (file.exists(repllist.file)) {
print(paste0("File ",repllist.file," exists."))
} else {
gs_auth()
# Extract Google Sheet Information Object
replication_list.gs <- gs_key(replication_list_KEY)
# Print worksheet names
gs_ws_ls(replication_list.gs)
#
ws <- gs_ws_ls(replication_list.gs)
for (x in 1:length(ws)) {
# Extract list and tidy
tmp.ws <- gs_read(replication_list.gs,ws=x) %>% select(DOI)
tmp.ws$worksheet <- ws[x]
names(tmp.ws) <- sub("\\?","",names(tmp.ws))
# Save
saveRDS(tmp.ws,file = file.path(interwrk,paste0("replication_list_",x,".Rds")))
# Pause so Google doesn't freak out
Sys.sleep(10)
rm(tmp.ws)
}
# End of else statement
}
# now we combine them, and clean them up
```
```{r read_list2}
# Compile all the worksheets except for "2009 missing online material"
if (file.exists(repllist.file)) {
repllist <- readRDS(file = repllist.file)
} else {
repllist <- NA
for ( x in 1:length(ws) ) {
if ( ws[x] != "2009 missing online material" ) {
print(paste("Processing",ws[x]))
if ( x == 1 ) {
# Read in the first list and set variable types
repllist <- readRDS(file = file.path(interwrk,paste0("replication_list_",x,".Rds")))
} else {
# Read in the subsequent lists and set variable types
tmp <- readRDS(file = file.path(interwrk,paste0("replication_list_",x,".Rds")))
# Add to master dataframe
repllist <- bind_rows(repllist,tmp)
rm(tmp)
}
}
}
saveRDS(repllist,file = repllist.file)
# end of else
}
uniques <- repllist %>% select(DOI) %>% distinct() %>% rename(doi = DOI)
```
- We read `r nrow(repllist)` records on `r Sys.Date()`.
- There are **`r nrow(uniques)`** unique records.
```{r issn}
# Each journal has a ISSN
if (!file.exists(issns.file)) {
issns <- data.frame(matrix(ncol=3,nrow=5))
names(issns) <- c("journal","issn","lastdate")
tmp.date <- c("2016-01")
issns[1,] <- c("American Economic Journal: Applied Economics","1945-7790",tmp.date)
issns[2,] <- c("American Economic Journal: Economic Policy","1945-774X",tmp.date)
issns[3,] <- c("American Economic Journal: Macroeconomics", "1945-7715",tmp.date)
issns[4,] <- c("American Economic Journal: Microeconomics", "1945-7685",tmp.date)
issns[5,] <- c("The American Economic Review","1944-7981",tmp.date)
saveRDS(issns, file= issns.file)
}
issns <- readRDS(file = issns.file)
```
Now read DOI for all later dates.
```{r read_new}
if (!file.exists(issns.file)) {
new.df <- NA
for ( x in 1:nrow(issns) ) {
new <- cr_journals(issn=issns[x,"issn"], works=TRUE,
filter=c(from_pub_date=issns[x,"lastdate"]),
select=c("DOI","title","published-print","volume","issue","container-title"),
limit= 500)
if ( x == 1 ) {
new.df <- as.data.frame(new$data)
} else {
new.df <- bind_rows(new.df,as.data.frame(new$data))
}
}
saveRDS(new.df, file= file.path(interwrk,paste0("new.Rds")))
}
new.df <- readRDS(file= file.path(interwrk,paste0("new.Rds")))
```
We read `r nrow(new.df)` records for `r nrow(new.df %>% select(container.title) %>% distinct())` journals:
```{r stats1, echo=FALSE}
knitr::kable(new.df %>% group_by(container.title) %>% summarise(records = n()))
```
```{r write_addtl, include=FALSE}
# we remove those we already have
addtl.df <- anti_join(new.df,uniques,by=c("doi"))
write.csv(addtl.df, file = addtl.file)
```
Of these, `r nrow(addtl.df)` records for `r nrow(addtl.df %>% select(container.title) %>% distinct())` journals were new:
```{r stats2, echo=FALSE}
addtl.stats <- addtl.df %>%
group_by(container.title) %>%
summarise(records = n(), lastdate = max(published.print)) %>%
rename(journal = container.title )
knitr::kable(addtl.stats)
```
The new records can be found [here](`r addtl.file`). We now update the file we use to track the updates, ` `r issns.file` `. If you need to run the process anew, simply revert the file ` `r issns.file` ` and run this document again.
```{r update}
issns <- addtl.stats %>% select(journal,lastdate) %>%
right_join(issns,by=c("journal")) %>%
mutate( lastdate = coalesce(lastdate.x,lastdate.y)) %>%
select(-lastdate.x, -lastdate.y)
#saveRDS(issns, file= issns.file)
```