non-lowercase DOIs #83

bnewbold · 2021-06-07T21:38:08Z

Fatcat has a general policy that DOIs should always be normalized and stored in lower-case. It turns out this has not actually been enforced at the API level, and the clean_doi() helper function in Python was not normalizing to lower-case. As a result, many release entities have been created with non-lower-case DOIs, and many of those are likely duplicates.

The usual DOI importers (Crossref, Datacite) did lower-case, and the lookup API also lower-cases, which has minimized the scope of the problem, but there are still roughly 140k release records with a non-lower-case DOI:

zcat release_extid.tsv.gz | cut -f3 | rg '[A-Z]' | pv -l | wc -l
139964

Here is an example of two release entities for the same work. The Pubmed-sourced import happened first, and resulted in a release with upper-case DOI. The Crossref import happened second (same day!) with lowercase DOI:

Fixing this could include multiple stages:

fix clean_doi() in python to lower-case DOIs
have API creation endpoint enforce lower-casing, at least for creation (eg, don't allow creation of entities if DOI is not lower-case, but don't clobber existing records)
update and/or merge existing entities
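The first two stages could be sketched in Python roughly as follows. This is a minimal illustration, not fatcat's actual code: `normalize_doi()` is a hypothetical stand-in for a fixed `clean_doi()`, `check_doi_for_creation()` is a hypothetical creation-time check, and the prefix list and the loose shape regex are assumptions.

```python
import re
from typing import Optional

# Assumption: a loose shape check ("10.", 4-9 digit registrant, "/", suffix).
# This is not a full DOI validator.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Assumption: URL/scheme prefixes commonly seen in front of raw DOIs.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for clean_doi(): strip, drop a prefix, lower-case."""
    if not raw:
        return None
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # The step the issue says was missing: always lower-case the result.
    doi = doi.strip().lower()
    if not DOI_SHAPE.match(doi):
        return None
    return doi


def check_doi_for_creation(doi: str) -> None:
    """Reject a non-lower-case DOI on creation instead of silently rewriting it."""
    if doi != doi.lower():
        raise ValueError(f"DOI must be lower-case: {doi!r}")
```

Rejecting on creation, rather than rewriting, matches the "don't clobber existing records" idea above: existing entities are left alone, and only new writes are held to the policy.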


Code in a number of places (including Pubmed importer) assumed that this was already lower-casing DOIs, resulting in some broken metadata getting created. See also: #83 This is just the first step of mitigation.

See also: #83 This commit is no behavior change, just leaving a note to self.

bnewbold · 2021-11-12T19:49:44Z

All non-lower-case DOIs in the current fatcat catalog have now been updated to be lower-case. This impacted about 140k release entities.

This update left many release entities sharing the same DOI. Those duplicates still need cleanup, but that can be handled as part of generic DOI de-duplication.
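As a rough sketch of how such duplicates could be located, the snippet below groups release identifiers by lower-cased DOI and keeps only the groups with more than one member. The `(ident, doi)` row shape, the function name `find_doi_collisions()`, and the sample identifiers are all hypothetical; the real dump layout and merge logic may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_doi_collisions(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group release idents by lower-cased DOI; return groups with 2+ members."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ident, doi in rows:
        groups[doi.lower()].append(ident)
    return {doi: idents for doi, idents in groups.items() if len(idents) > 1}


# Hypothetical sample rows: (release ident, DOI as stored).
rows = [
    ("release_a", "10.1234/ABC"),
    ("release_b", "10.1234/abc"),
    ("release_c", "10.5678/xyz"),
]
collisions = find_doi_collisions(rows)
```

Each resulting group is only a merge candidate; deciding which release to keep is a separate step.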

A remaining task is to strictly enforce DOI lower-casing in the fatcat API daemon.

bnewbold added the "bug" (Something isn't working) and "content" (Bulk imports and updates to existing production catalog) labels · Jun 7, 2021

bnewbold added a commit that referenced this issue Oct 13, 2021

rust: prep for possible DOI lowercase enforcement

3f825db

See also: #83 This commit is no behavior change, just leaving a note to self.

bnewbold added a commit that referenced this issue Oct 13, 2021

rust: prep for possible DOI lowercase enforcement

e79dfe3

See also: #83 This commit is no behavior change, just leaving a note to self.

bnewbold added a commit that referenced this issue Oct 14, 2021

rust: prep for possible DOI lowercase enforcement

54cb27a

See also: #83 This commit is no behavior change, just leaving a note to self.
