Add data integrity tests for unique IRIs #164

Merged 14 commits on Sep 24, 2021

cthoyt (Member) commented Sep 23, 2021

This PR adds tests to ensure that all of the IRIs are unique when we generate a prefix map using the Bioregistry. Offhand, I can think of two scenarios in which duplicates can be resolved:

  1. When entries are in a parent/child relationship, like kegg with kegg.compound, kegg.drug, etc., we can enforce that the part_of annotation is used for all of the children. If exactly one prefix in a group of prefixes with the same IRI doesn't have a part_of relationship, we'll assume it's the parent and default to using it.
  2. When entries are annotated with provides, they are automatically left out of the prefix map, such as ctd.gene, which provides for ncbigene.
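To make the two scenarios concrete, here is a minimal sketch of the resolution logic. It is not the Bioregistry implementation: the `REGISTRY` dictionary, the `build_prefix_map` function, and the `example.org` IRIs are hypothetical placeholders, and only the `part_of` and `provides` keys come from the description above.

```python
from collections import defaultdict

# Hypothetical records; the IRIs are placeholders, not real Bioregistry data.
REGISTRY = {
    "kegg": {"iri": "https://example.org/kegg/"},
    "kegg.compound": {"iri": "https://example.org/kegg/", "part_of": "kegg"},
    "kegg.drug": {"iri": "https://example.org/kegg/", "part_of": "kegg"},
    "ncbigene": {"iri": "https://example.org/ncbigene/"},
    "ctd.gene": {"iri": "https://example.org/ncbigene/", "provides": "ncbigene"},
}


def build_prefix_map(registry):
    """Return (prefix_map, unresolved) after applying scenarios 1 and 2."""
    by_iri = defaultdict(list)
    for prefix, record in registry.items():
        # Scenario 2: a prefix that provides for another is skipped entirely.
        if "provides" in record:
            continue
        by_iri[record["iri"]].append(prefix)

    prefix_map, unresolved = {}, {}
    for iri, prefixes in by_iri.items():
        if len(prefixes) == 1:
            prefix_map[prefixes[0]] = iri
            continue
        # Scenario 1: the single prefix without part_of is assumed to be the parent.
        parents = [p for p in prefixes if "part_of" not in registry[p]]
        if len(parents) == 1:
            prefix_map[parents[0]] = iri
        else:
            # Ambiguous group: leave it for manual curation.
            unresolved[iri] = sorted(prefixes)
    return prefix_map, unresolved


prefix_map, unresolved = build_prefix_map(REGISTRY)
```

With this toy data, only kegg and ncbigene end up in the prefix map. Any group that neither rule resolves is reported in `unresolved` rather than guessed at.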

When neither of these works, this PR also introduces the has_canonical relationship between prefixes, which maps one prefix to another.

Some entries seem to be duplicates of each other, like glycomedb and glytoucan. In this case, glytoucan is the actual name of the database, so glycomedb gets "has_canonical": "glytoucan", denoting that glytoucan has higher priority (see identifiers-org/identifiers-org.github.io#167). I wasn't confident enough to merge them, but the wb and wormbase entries are now indeed merged.
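As a rough illustration of how a has_canonical annotation could break a tie, the sketch below drops any prefix that points at another prefix in the same group. The `pick_canonical` helper, the record layout, and the `example.org` IRI are assumptions for illustration only; the actual code in this PR may differ.

```python
# Hypothetical records; the shared IRI is a placeholder.
REGISTRY = {
    "glytoucan": {"iri": "https://example.org/glycan/"},
    "glycomedb": {
        "iri": "https://example.org/glycan/",
        "has_canonical": "glytoucan",
    },
}


def pick_canonical(prefixes, registry):
    """Keep only prefixes that do not defer to another prefix in the group."""
    group = set(prefixes)
    return sorted(
        prefix
        for prefix in prefixes
        # A missing has_canonical yields None, which is never in the group.
        if registry[prefix].get("has_canonical") not in group
    )


winners = pick_canonical(["glycomedb", "glytoucan"], REGISTRY)
```

If exactly one prefix remains, it can be used in the prefix map. If more than one remains, the group still needs curation.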

The tests enabled exhaustive curation of scenarios 1 and 2, and has_canonical is used for the rest.

cthoyt marked this pull request as ready for review September 24, 2021 09:02
cthoyt merged commit 15fb869 into main Sep 24, 2021
cthoyt deleted the add-iri-uniqueness-tests branch September 24, 2021 09:20
cthoyt (Member, Author) commented Sep 24, 2021

@matentzn FYI this is the big step necessary in improving the curation in the Bioregistry to support generating prefix maps that are guaranteed to be bidirectionally unique :)
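For reference, "bidirectionally unique" can be checked along these lines. A Python dict already guarantees that each prefix maps to one IRI, so the sketch only needs to look for IRIs shared by several prefixes. The `find_duplicate_iris` helper and the sample maps are hypothetical, not part of the Bioregistry API.

```python
from collections import Counter


def find_duplicate_iris(prefix_map):
    """Return {iri: [prefixes]} for every IRI used by more than one prefix."""
    counts = Counter(prefix_map.values())
    return {
        iri: sorted(p for p, value in prefix_map.items() if value == iri)
        for iri, count in counts.items()
        if count > 1
    }


# Placeholder IRIs for illustration only.
good = {"a": "https://example.org/a/", "b": "https://example.org/b/"}
bad = {"a": "https://example.org/a/", "b": "https://example.org/a/"}

good_duplicates = find_duplicate_iris(good)
bad_duplicates = find_duplicate_iris(bad)
```

A data integrity test could then simply assert that the returned dictionary is empty and print the offending groups when it is not.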
