README.Rmd

---
title: "Obtaining lists of articles to replicate"
author: "Lars Vilhuber"
date: "3/1/2019"
output: 
  html_document: 
    keep_md: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Sources
We have two sources:

- the existing list of articles to replicate on Google Sheets
- CrossRef, for new articles

We load relevant libraries here.

```{r config_libs,include=FALSE,message=FALSE}
source(file.path(rprojroot::find_root(rprojroot::has_file("pathconfig.R")),"pathconfig.R"),echo=TRUE)
source(file.path(basepath,"global-libraries.R"),echo=TRUE)
source(file.path(basepath,"config.R"),echo=TRUE)
```
## Instructions
This file, when executed, will

- download the latest Replication list from Google Drive
- download DOI for all publications for a number of journals from CrossRef
- identify the ones that are new
- provide a CSV that can be manually uploaded

The program will check for prior files, and will NOT download new data if those files are present. Thus, to get a fresh run, 

- delete ` `r repllist.file` ` if you want to re-download the list of articles from Google Drive
- delete ` `r file.path(interwrk,paste0("new.Rds"))` ` to re-download files from CrossRef
- revert ` `r issns.file` ` (which stores the last query date, and is updated at the end of this process)

## Data locations

Permanent data is in

> `r dataloc`

and should be committed to the repository.

Temporary data is in

> `r interwrk`

and can (should) be deleted after completion.

## Current list of DOI
We first obtain the current list of DOI. This is not failsafe - it assumes there *is* such a list.


```{r read_list}

if (file.exists(repllist.file)) {
	print(paste0("File ",repllist.file," exists."))
} else	{
gs_auth()
# Extract Google Sheet Information Object
replication_list.gs <- gs_key(replication_list_KEY)

# Print worksheet names
gs_ws_ls(replication_list.gs)

# 
ws <- gs_ws_ls(replication_list.gs)
for (x in 1:length(ws)) {

  # Extract list and tidy
	tmp.ws <- gs_read(replication_list.gs,ws=x) %>% select(DOI)
	tmp.ws$worksheet <- ws[x]
	names(tmp.ws) <- sub("\\?","",names(tmp.ws))

	# Save
	saveRDS(tmp.ws,file = file.path(interwrk,paste0("replication_list_",x,".Rds")))

	# Pause so Google doesn't freak out
	Sys.sleep(10)
	rm(tmp.ws)

}
# End of else statement
} 
# now we combine them, and clean them up
```


```{r read_list2}
# Compile all the worksheets except for "2009 missing online material"
if (file.exists(repllist.file)) {
	repllist <- readRDS(file = repllist.file)
} else {
repllist <- NA
for ( x in 1:length(ws) ) {
  if ( ws[x] != "2009 missing online material" ) {
    print(paste("Processing",ws[x]))
    if ( x == 1 ) {
      # Read in the first list and set variable types
      repllist <- readRDS(file = file.path(interwrk,paste0("replication_list_",x,".Rds")))
    } else {
      # Read in the subsequent lists and set variable types
      tmp <- readRDS(file = file.path(interwrk,paste0("replication_list_",x,".Rds")))

      # Add to master dataframe
      repllist <- bind_rows(repllist,tmp)
      rm(tmp)
    }
  }
}
saveRDS(repllist,file = repllist.file)
# end of else
}
uniques <- repllist %>% select(DOI) %>% distinct() %>% rename(doi = DOI)
```

- We read `r nrow(repllist)` records on `r Sys.Date()`.
- There are **`r nrow(uniques)`** unique records. 

```{r issn}
# Each journal has a ISSN
if (!file.exists(issns.file)) {
issns <- data.frame(matrix(ncol=3,nrow=5))
names(issns) <- c("journal","issn","lastdate")
tmp.date <- c("2016-01")
issns[1,] <- c("American Economic Journal: Applied Economics","1945-7790",tmp.date)
issns[2,] <- c("American Economic Journal: Economic Policy","1945-774X",tmp.date)
issns[3,] <- c("American Economic Journal: Macroeconomics", "1945-7715",tmp.date)
issns[4,] <- c("American Economic Journal: Microeconomics", "1945-7685",tmp.date)
issns[5,] <- c("The American Economic Review","1944-7981",tmp.date)

saveRDS(issns, file= issns.file)
}

issns <- readRDS(file = issns.file)

```

Now read DOI for all later dates.
```{r read_new}
if (!file.exists(issns.file)) {
	new.df <- NA
	for ( x in 1:nrow(issns) ) {
		new <- cr_journals(issn=issns[x,"issn"], works=TRUE,
				   filter=c(from_pub_date=issns[x,"lastdate"]),
				   select=c("DOI","title","published-print","volume","issue","container-title"),
				   limit= 500)
		if ( x == 1 ) {
      		new.df <- as.data.frame(new$data)  
    	} else {
      		new.df <- bind_rows(new.df,as.data.frame(new$data))
    	}
	}
	saveRDS(new.df, file= file.path(interwrk,paste0("new.Rds")))
}
new.df <- readRDS(file= file.path(interwrk,paste0("new.Rds")))
```

We read `r nrow(new.df)` records for `r nrow(new.df %>% select(container.title) %>% distinct())` journals:

```{r stats1, echo=FALSE}
knitr::kable(new.df %>% group_by(container.title) %>% summarise(records = n()))
```


```{r write_addtl, include=FALSE}
# we remove those we already have
addtl.df <- anti_join(new.df,uniques,by=c("doi"))
write.csv(addtl.df, file = addtl.file)

```
Of these, `r nrow(addtl.df)` records for `r nrow(addtl.df %>% select(container.title) %>% distinct())` journals were new:

```{r stats2, echo=FALSE}
addtl.stats <- addtl.df %>% 
	group_by(container.title) %>% 
	summarise(records = n(), lastdate = max(published.print)) %>%
	rename(journal = container.title )
knitr::kable(addtl.stats)
```

The new records can be found [here](`r addtl.file`). We now update the file we use to track the updates, ` `r issns.file` `. If you need to run the process anew, simply revert the file ` `r issns.file` ` and run this document again.

```{r update}
issns <- addtl.stats %>% select(journal,lastdate) %>% 
	right_join(issns,by=c("journal")) %>%
	mutate( lastdate = coalesce(lastdate.x,lastdate.y)) %>%
	select(-lastdate.x, -lastdate.y)
#saveRDS(issns, file= issns.file)
```