Use api.crossref.org/works/doi/agency route to test Crossref indexing status #152

Closed
njahn82 opened this issue Feb 22, 2021 · 3 comments
njahn82 commented Feb 22, 2021

The Crossref API provides a route that returns the registration agency for a given DOI. rcrossref already implements it as rcrossref::cr_agency(), so there's no need to use api.doi.org to check DOI indexing status (#147, https://github.com/subugoe/biblids/issues/38).

Another advantage is that we can call the API as Metadata Plus members, which likely improves performance (#133, https://github.com/subugoe/biblids/issues/37).

Here's a reprex

library(rcrossref)
library(purrr)
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dois <- c(
  # medra
  "10.1393/ncc/i2020-20143-y",
  # crossref
  "10.1002/asi.24460",
  # datacite
  "10.4119/UNIBI/UB.2014.18",
  # non-registered doi
  "10.1002/asi.i.am.not.a.doi"
)
doi_agency <- rcrossref::cr_agency(dois = dois)
#> Warning: 404 (client error): /works/10.1002/asi.i.am.not.a.doi/agency - Resource
#> not found.
# print
doi_agency
#> $`10.1393/ncc/i2020-20143-y`
#> $`10.1393/ncc/i2020-20143-y`$DOI
#> [1] "10.1393/ncc/i2020-20143-y"
#> 
#> $`10.1393/ncc/i2020-20143-y`$agency
#> $`10.1393/ncc/i2020-20143-y`$agency$id
#> [1] "medra"
#> 
#> $`10.1393/ncc/i2020-20143-y`$agency$label
#> [1] "mEDRA"
#> 
#> 
#> 
#> $`10.1002/asi.24460`
#> $`10.1002/asi.24460`$DOI
#> [1] "10.1002/asi.24460"
#> 
#> $`10.1002/asi.24460`$agency
#> $`10.1002/asi.24460`$agency$id
#> [1] "crossref"
#> 
#> $`10.1002/asi.24460`$agency$label
#> [1] "Crossref"
#> 
#> 
#> 
#> $`10.4119/UNIBI/UB.2014.18`
#> $`10.4119/UNIBI/UB.2014.18`$DOI
#> [1] "10.4119/unibi/ub.2014.18"
#> 
#> $`10.4119/UNIBI/UB.2014.18`$agency
#> $`10.4119/UNIBI/UB.2014.18`$agency$id
#> [1] "datacite"
#> 
#> $`10.4119/UNIBI/UB.2014.18`$agency$label
#> [1] "DataCite"
#> 
#> 
#> 
#> $`10.1002/asi.i.am.not.a.doi`
#> NULL
# tibble with one row per input DOI; both columns are NA where the
# resource was not found
cr_agency_df <- tibble(
  doi = purrr::map_chr(doi_agency, "DOI", .default = NA_character_) %>%
    unname(),
  agency = purrr::map(doi_agency, "agency") %>%
    map_chr("label", .default = NA_character_)
)
# only crossref dois
cr_agency_df %>%
  filter(agency == "Crossref") %>%
  .$doi
#> [1] "10.1002/asi.24460"

Created on 2021-02-22 by the reprex package (v0.3.0)
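
For reference, the route that cr_agency() hits can be sketched as a small URL builder. This is only an illustration: `build_agency_url()` is a hypothetical helper (it exists in neither rcrossref nor biblids), the path shape is taken from the 404 warning in the reprex, and keeping the slashes unescaped while percent-encoding everything else is an assumption about what the endpoint expects, not something confirmed here.

```r
# Hypothetical helper for illustration; not part of rcrossref or biblids.
# Assumes `doi` holds raw (not already percent-encoded) DOIs.
build_agency_url <- function(doi, base = "https://api.crossref.org/works") {
  stopifnot(is.character(doi), !anyNA(doi))
  if (length(doi) == 0L) {
    return(character(0))
  }
  encode_one <- function(d) {
    # Encode each slash-separated segment, then restore the slashes so the
    # path keeps the /works/<prefix>/<suffix>/agency shape from the warning.
    segments <- strsplit(d, "/", fixed = TRUE)[[1]]
    encoded <- vapply(
      segments,
      function(s) utils::URLencode(s, reserved = TRUE, repeated = TRUE),
      FUN.VALUE = character(1),
      USE.NAMES = FALSE
    )
    paste(encoded, collapse = "/")
  }
  paths <- vapply(doi, encode_one, FUN.VALUE = character(1), USE.NAMES = FALSE)
  paste0(base, "/", paths, "/agency")
}

build_agency_url("10.1002/asi.24460")
#> [1] "https://api.crossref.org/works/10.1002/asi.24460/agency"
```

The helper only builds strings and makes no request, so it can be checked offline before deciding how the HTTP call itself should be made.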

@maxheld83

Agreed, it'd be good to use an existing function for this purpose rather than reinventing the wheel.
I'll address this as part of #60.

Questions:

  • Is rcrossref with Metadata Plus faster or slower than doi.org? By how much? Does it even matter?
  • How much dev effort is necessary for calling the doi.org API over and above is_doi_resolveable()? Openapc check #38
  • Are URLs and DOIs properly escaped?
  • Can the functions be used to build
    • properly type- and length-stable wrappers?
    • proper predicate functions?
  • Does the downstream implementation fetch metadata, or is it just a header-only singleton (the latter would be preferable)?

Above all, of course, metacheck should remain small but very reliable.
Speed is nice, but most important are minimal complexity and maximum reproducibility.
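
To make the wrapper question concrete, here is a minimal sketch of a type- and length-stable predicate over the result of cr_agency(). It is only an illustration: `is_crossref_agency()` is a hypothetical name, and the input shape (one element per DOI, either NULL or a list with `$agency$id`) is assumed from the printed output in the reprex. The mock list stands in for a real API response so the sketch runs offline.

```r
# Hypothetical helper for illustration; not part of rcrossref or biblids.
# `agency_list` is assumed to look like the cr_agency() output in the reprex:
# one element per DOI, each either NULL or a list with $agency$id.
is_crossref_agency <- function(agency_list) {
  ids <- vapply(
    agency_list,
    function(x) {
      id <- x[["agency"]][["id"]]
      if (is.character(id) && length(id) == 1L) id else NA_character_
    },
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
  )
  # Missing agencies become FALSE, so the result is always a logical vector
  # of the same length as the input and never contains NA.
  !is.na(ids) & ids == "crossref"
}

# Mocked response mirroring the structure printed in the reprex.
mock <- list(
  "10.1393/ncc/i2020-20143-y" = list(
    DOI = "10.1393/ncc/i2020-20143-y",
    agency = list(id = "medra", label = "mEDRA")
  ),
  "10.1002/asi.24460" = list(
    DOI = "10.1002/asi.24460",
    agency = list(id = "crossref", label = "Crossref")
  ),
  "10.1002/asi.i.am.not.a.doi" = NULL
)

is_crossref_agency(mock)
#> [1] FALSE  TRUE FALSE
```

Mapping "not found" to FALSE rather than NA is a design choice made here for predicate use; a wrapper that needs to distinguish "registered elsewhere" from "not registered at all" would have to return something richer.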

maxheld83 commented Feb 26, 2021

The benchmark suggests that doi.org, even without any bells and whistles, is much faster than Crossref with them:

some_dois <- c(tu_dois(), "10.1000/IDONOTEXIST")
is_doi_cr <- function(x) {
  res <- purrr::map_chr(
    suppressWarnings(rcrossref::cr_agency(x)),
    c("agency", "id"),
    .default = "foo"
  )
  unname(res == "crossref")
}
bench::mark(
  biblids::is_doi_found(some_dois),
  is_doi_cr(some_dois)
)
# A tibble: 2 x 13
  expression                            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory                 time       gc           
  <bch:expr>                       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>                 <list>     <list>       
1 biblids::is_doi_found(some_dois)   24.31s   24.31s   0.0411     2.94MB   0.288      1     7     24.31s <lgl [291… <Rprofmem[,3] [2,962 … <bch:tm [… <tibble [1 ×…
2 is_doi_cr(some_dois)                3.34m    3.34m   0.00499   16.55MB   0.0349     1     7      3.34m <lgl [291… <Rprofmem[,3] [22,962… <bch:tm [… <tibble [1 ×…

The raw speed alone is less of a concern though.

The bigger problem is that I can't get multithreaded Crossref calls to work via rcrossref, though I am not entirely sure why.

@maxheld83

Closing in favor of biblids::is_doi_ra(), which is much faster as per #174.
The duplicate check is removed in #174.

Note that the benchmarks above are somewhat off the mark: they measure biblids::is_doi_found(), which stays in because Crossref has no equivalent.
The exception might be a header singleton test in #176, but we don't have that yet.
And anyway, doi.org is much faster.
