---
title: "Obtaining lists of articles to replicate"
author: "Lars Vilhuber"
date: "3/1/2019"
output:
  html_document:
    keep_md: true
---

Sources

We have two sources:

  • the existing list of articles to replicate on Google Sheets
  • CrossRef, for new articles

We load relevant libraries here.
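
The library-loading chunk is not echoed in the rendered output. A minimal sketch of what it plausibly contains, inferred from the functions used below (gs_* from googlesheets, cr_journals from rcrossref, the pipe and verbs from dplyr); the path definitions and the sheet key are assumptions based on file names mentioned later:

library(googlesheets)  # gs_auth, gs_key, gs_ws_ls, gs_read
library(rcrossref)     # cr_journals
library(dplyr)         # select, rename, bind_rows, right_join, coalesce

# Directory and file names used throughout (inferred from the text below)
outputs  <- file.path("data", "outputs")   # permanent, committed
interwrk <- file.path("data", "interwrk")  # temporary, deletable
repllist.file <- file.path(outputs, "replication_list_DOI.Rds")
issns.file    <- file.path(outputs, "issns.Rds")
# Key of the Google Sheet holding the replication list (placeholder value)
replication_list_KEY <- "YOUR-GOOGLE-SHEET-KEY"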

Instructions

This file, when executed, will

  • download the latest Replication list from Google Drive
  • download DOIs for all publications in a set of journals from CrossRef
  • identify the ones that are not yet on the replication list
  • provide a CSV of the new articles that can be manually uploaded to the Google Sheet

The program checks for prior files, and will NOT download new data if those files are present. Thus, to force a fresh run (a code sketch follows this list):

  • delete data/outputs/replication_list_DOI.Rds if you want to re-download the list of articles from Google Drive
  • delete data/interwrk/new.Rds to re-download files from CrossRef
  • revert data/outputs/issns.Rds (which stores the last query date, and is updated at the end of this process)
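
As a code sketch, assuming the paths from the setup chunk and that issns.Rds is tracked by git:

# Force a fresh run by removing cached files and reverting the query-date tracker
file.remove(file.path(outputs, "replication_list_DOI.Rds"))  # re-download from Google Drive
file.remove(file.path(interwrk, "new.Rds"))                  # re-query CrossRef
system("git checkout -- data/outputs/issns.Rds")             # revert the last query dates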

Data locations

Permanent data is in

data/outputs

and should be committed to the repository.

Temporary data is in

data/interwrk

and can (should) be deleted after completion.

Current list of DOIs

We first obtain the current list of DOIs. This is not fail-safe: it assumes such a list exists.

if (file.exists(repllist.file)) {
  print(paste0("File ", repllist.file, " exists."))
} else {
  gs_auth()

  # Extract the Google Sheet information object
  replication_list.gs <- gs_key(replication_list_KEY)

  # List the worksheet names
  ws <- gs_ws_ls(replication_list.gs)
  print(ws)

  for (x in seq_along(ws)) {

    # Extract the DOI column from each worksheet and tidy the names
    tmp.ws <- gs_read(replication_list.gs, ws = x) %>% select(DOI)
    tmp.ws$worksheet <- ws[x]
    names(tmp.ws) <- sub("\\?", "", names(tmp.ws))

    # Save each worksheet separately
    saveRDS(tmp.ws, file = file.path(interwrk, paste0("replication_list_", x, ".Rds")))

    # Pause so Google doesn't throttle us
    Sys.sleep(10)
    rm(tmp.ws)
  }
}
## [1] "File data/outputs/replication_list_DOI.Rds exists."
# Combine the per-worksheet files and clean them up.
# Compile all worksheets except "2009 missing online material".
if (file.exists(repllist.file)) {
  repllist <- readRDS(file = repllist.file)
} else {
  repllist <- NA
  for (x in seq_along(ws)) {
    if (ws[x] != "2009 missing online material") {
      print(paste("Processing", ws[x]))
      if (x == 1) {
        # Read in the first list and set variable types
        repllist <- readRDS(file = file.path(interwrk, paste0("replication_list_", x, ".Rds")))
      } else {
        # Read in the subsequent lists
        tmp <- readRDS(file = file.path(interwrk, paste0("replication_list_", x, ".Rds")))

        # Append to the master dataframe
        repllist <- bind_rows(repllist, tmp)
        rm(tmp)
      }
    }
  }
  saveRDS(repllist, file = repllist.file)
}
uniques <- repllist %>% select(DOI) %>% distinct() %>% rename(doi = DOI)
  • We read 1097 records on 2019-03-01.
  • There are 846 unique records.
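
These counts come from inline R expressions in the source; the equivalent computation, given the objects above, is simply:

# Counts reported above
nrow(repllist)   # 1097 records read (as of 2019-03-01)
nrow(uniques)    # 846 unique DOIs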
# Each journal has an ISSN; seed the table with a starting query date
if (!file.exists(issns.file)) {
  issns <- data.frame(matrix(ncol = 3, nrow = 5))
  names(issns) <- c("journal", "issn", "lastdate")
  tmp.date <- "2016-01"
  issns[1,] <- c("American Economic Journal: Applied Economics", "1945-7790", tmp.date)
  issns[2,] <- c("American Economic Journal: Economic Policy", "1945-774X", tmp.date)
  issns[3,] <- c("American Economic Journal: Macroeconomics", "1945-7715", tmp.date)
  issns[4,] <- c("American Economic Journal: Microeconomics", "1945-7685", tmp.date)
  issns[5,] <- c("The American Economic Review", "1944-7981", tmp.date)

  saveRDS(issns, file = issns.file)
}

issns <- readRDS(file = issns.file)

Now retrieve DOIs for everything published after each journal's last query date.

# Check the cached CrossRef download (data/interwrk/new.Rds), not issns.file,
# so that deleting new.Rds forces a re-query (see the instructions above)
new.file <- file.path(interwrk, "new.Rds")
if (!file.exists(new.file)) {
  new.df <- NA
  for (x in 1:nrow(issns)) {
    new <- cr_journals(issn = issns[x, "issn"], works = TRUE,
                       filter = c(from_pub_date = issns[x, "lastdate"]),
                       select = c("DOI", "title", "published-print", "volume", "issue", "container-title"),
                       limit = 500)
    if (x == 1) {
      new.df <- as.data.frame(new$data)
    } else {
      new.df <- bind_rows(new.df, as.data.frame(new$data))
    }
  }
  saveRDS(new.df, file = new.file)
}
new.df <- readRDS(file = new.file)
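
Note that the call above retrieves at most limit = 500 works per journal. If a journal ever accumulates more than 500 new records, rcrossref's deep-paging arguments could be used instead; a sketch, assuming the same filter and select:

new <- cr_journals(issn = issns[x, "issn"], works = TRUE,
                   filter = c(from_pub_date = issns[x, "lastdate"]),
                   select = c("DOI", "title", "published-print", "volume", "issue", "container-title"),
                   cursor = "*", cursor_max = 5000)  # page through up to 5000 records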

We read 580 records for 4 journals:

container.title                               records
American Economic Journal: Applied Economics      149
American Economic Journal: Economic Policy        166
American Economic Journal: Macroeconomics         116
American Economic Journal: Microeconomics         149

Of these, 333 records for 4 journals were new:

journal                                       records lastdate
American Economic Journal: Applied Economics       50 2019-01
American Economic Journal: Economic Policy        166 2019-02
American Economic Journal: Macroeconomics          54 2019-01
American Economic Journal: Microeconomics          63 2019-02

The new records can be found here. We now update the file we use to track the updates, data/outputs/issns.Rds. If you need to run the process anew, simply revert the file data/outputs/issns.Rds and run this document again.
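
The filtering to new records and the per-journal summary (the addtl.stats object used below) happen in chunks that are not echoed. A minimal sketch of what they plausibly do, assuming dplyr; the column names mirror what cr_journals returns after as.data.frame, and the CSV filename is a placeholder:

# Keep only DOIs that are not already on the replication list
new.records <- new.df %>%
  anti_join(uniques, by = "doi")

# Per-journal record counts and latest publication month, to update issns.Rds
addtl.stats <- new.records %>%
  rename(journal = container.title) %>%
  group_by(journal) %>%
  summarise(records = n(),
            lastdate = max(substr(published.print, 1, 7)))

# CSV of new records for manual upload (placeholder filename)
write.csv(new.records, file = file.path(outputs, "new_records.csv"), row.names = FALSE)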

# Merge each journal's new lastdate into the ISSN tracking table
issns <- addtl.stats %>%
  select(journal, lastdate) %>%
  right_join(issns, by = c("journal")) %>%
  mutate(lastdate = coalesce(lastdate.x, lastdate.y)) %>%
  select(-lastdate.x, -lastdate.y)
saveRDS(issns, file = issns.file)
