Fatcat has a general policy that DOIs should always be normalized and stored in lower-case. It turns out this has not actually been enforced at the API level, and the clean_doi() helper function in Python was not normalizing to lower-case. As a result, many release entities have been created with non-lower-case DOIs, and many of those are likely duplicates.
The usual DOI importers (Crossref, Datacite) did lower-case, and the lookup API also lower-cases, which has minimized the scope of the problem, but there are still on the order of 134k duplicate records:
Here is an example of two release entities for the same work. The Pubmed-sourced import happened first, and resulted in a release with an upper-case DOI. The Crossref import happened second (same day!) with a lower-case DOI:
Fixing this could include multiple stages:
update the clean_doi() helper in Python to lower-case DOIs
have the API enforce lower-casing, at least at creation time (e.g., reject new entities whose DOI is not lower-case, but don't clobber existing records)
update and/or merge existing entities
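The normalization and creation-time check described above could look roughly like the following. This is a minimal sketch, not Fatcat's actual code: `normalize_doi()` and `check_doi_for_create()` are hypothetical names, and the DOI sanity check is deliberately loose.

```python
from typing import Optional


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for clean_doi(): trim whitespace and lower-case.

    Returns None when the input does not look like a DOI at all.
    """
    if not raw:
        return None
    doi = raw.strip().lower()
    # Loose sanity check (an assumption): a DOI starts with "10." and has a "/".
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi


def check_doi_for_create(doi: str) -> None:
    """Hypothetical creation-time check: reject DOIs that are not lower-case.

    Only meant for new entities; existing records are left untouched.
    """
    if doi != doi.lower():
        raise ValueError(f"DOI must be lower-case: {doi!r}")
```

Rejecting rather than silently rewriting at creation time makes a client that forgot to normalize fail loudly, while the helper gives importers a single place to do the lower-casing.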
Code in a number of places (including Pubmed importer) assumed that this
was already lower-casing DOIs, resulting in some broken metadata getting
created.
See also: #83
This is just the first step of mitigation.
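A later clean-up step would need to find the release entities whose DOIs differ only by case. One possible sketch, assuming the releases are available as (identifier, DOI) pairs; the function name and the identifiers in the usage example are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_lowercase_doi(
    releases: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group release identifiers by lower-cased DOI.

    Only DOIs shared by more than one release are returned, since those
    are the candidates for merging.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for ident, doi in releases:
        groups[doi.lower()].append(ident)
    return {doi: idents for doi, idents in groups.items() if len(idents) > 1}


# Hypothetical data: two releases for the same work, one with an upper-case DOI.
candidates = group_by_lowercase_doi(
    [
        ("rel-a", "10.1234/ABC"),
        ("rel-b", "10.1234/abc"),
        ("rel-c", "10.5555/xyz"),
    ]
)
```

Here `candidates` would contain a single group for `10.1234/abc`; whether two releases in a group are true duplicates still needs to be confirmed before merging.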