non-lowercase DOIs #83

Open
2 of 3 tasks
bnewbold opened this issue Jun 7, 2021 · 1 comment
Labels
bug Something isn't working content Bulk imports and updates to existing production catalog

Comments


bnewbold commented Jun 7, 2021

Fatcat has a general policy that DOIs should always be normalized and stored in lower-case. It turns out this has not actually been enforced at the API level, and the clean_doi() helper function in Python was not normalizing to lower-case. As a result, many release entities have been created with non-lower-case DOIs, and many of those are likely duplicates.

The usual DOI importers (Crossref, Datacite) did lower-case, and the lookup API also lower-cases, which has minimized the scope of the problem, but there are still on the order of 140k release records with non-lower-case DOIs, many of them likely duplicates:

zcat release_extid.tsv.gz | cut -f3 | rg '[A-Z]' | pv -l | wc -l
139964

Here is an example of two release entities for the same work. The Pubmed-sourced import happened first, and resulted in a release with upper-case DOI. The Crossref import happened second (same day!) with lowercase DOI:

Fixing this could include multiple stages:

  • fix clean_doi() in Python to lower-case DOIs
  • have the API enforce lower-casing, at least at creation time (e.g., reject creation of entities whose DOI is not lower-case, but don't clobber existing records)
  • update and/or merge existing entities
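
The first stage above could look roughly like the following. This is only a minimal sketch, not the actual fatcat helper: the name clean_doi_sketch, the list of stripped prefixes, and the shape check are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical list of prefixes to strip; the real helper may handle others.
_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for clean_doi(): strip, lower-case, sanity-check."""
    if not raw:
        return None
    doi = raw.strip().lower()  # lower-casing is the step that was missing
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Minimal shape check (an assumption): "10.<registrant>/<suffix>", no spaces
    if not doi.startswith("10.") or "/" not in doi:
        return None
    if any(ch.isspace() for ch in doi):
        return None
    return doi
```

Lower-casing before any other processing means every importer that calls the helper gets the same normalized form, whether or not the source metadata used upper-case characters.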
bnewbold added the bug and content labels Jun 7, 2021
bnewbold added a commit that referenced this issue Jul 2, 2021
Code in a number of places (including Pubmed importer) assumed that this
was already lower-casing DOIs, resulting in some broken metadata getting
created.

See also: #83

This is just the first step of mitigation.
bnewbold added a commit that referenced this issue Oct 13, 2021
See also: #83

This commit is no behavior change, just leaving a note to self.
bnewbold added a commit that referenced this issue Oct 13, 2021
See also: #83

This commit is no behavior change, just leaving a note to self.
bnewbold added a commit that referenced this issue Oct 14, 2021
See also: #83

This commit is no behavior change, just leaving a note to self.
bnewbold commented

All non-lower-case DOIs in the current fatcat catalog have now been updated to be lower-case. This impacted about 140k release entities.

One part of cleanup from this will be the many duplicate DOIs that this introduced, but that can be handled as part of generic DOI de-duplication.
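
As a rough illustration of what finding those duplicates might involve, here is a small sketch that groups release identifiers by DOI. The function name, the (ident, doi) row shape, and the identifiers in the usage note are hypothetical and not taken from the fatcat codebase.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_duplicate_dois(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group release idents by lower-cased DOI; keep only DOIs seen more than once."""
    by_doi: Dict[str, List[str]] = defaultdict(list)
    for ident, doi in rows:
        # Lower-casing here is defensive; stored DOIs should already be lower-case.
        by_doi[doi.lower()].append(ident)
    return {doi: idents for doi, idents in by_doi.items() if len(idents) > 1}
```

For example, rows ("rel_a", "10.1/ABC") and ("rel_b", "10.1/abc") would end up in one group keyed by "10.1/abc", which is a candidate pair for merging.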

A remaining task is to strictly enforce DOI lower-casing in the fatcat API daemon.
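
The enforcement logic could be as simple as the check sketched below. It is written in Python purely to illustrate the rule; this thread does not describe how the API daemon is implemented, and the function name and error type are assumptions.

```python
def check_doi_for_create(doi: str) -> str:
    """Reject, rather than silently rewrite, a DOI that is not lower-case."""
    if doi != doi.lower():
        raise ValueError(f"DOI must be lower-case: {doi!r}")
    return doi
```

Rejecting instead of rewriting keeps the failure visible to the client, so an importer that forgets to normalize is noticed rather than quietly corrected.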
