Add data integrity tests for unique IRIs #164
Merged
This PR adds tests to ensure that when we generate a prefix map using the Bioregistry, all of the IRIs are unique. There are two scenarios I can think of offhand that we can use to resolve duplicates:

1. For a parent prefix like `kegg` with `kegg.compound`, `kegg.drug`, etc., we can enforce that the `part_of` annotation is used for all of the children. If only one prefix in a group of prefixes with the same IRI doesn't have a `part_of` relationship, we'll assume it's the parent and default to using it.
2. If prefixes have a `provides` annotation, then they will automatically not be considered in the prefix map, such as `ctd.gene`, which provides for `ncbigene`.
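
The two scenarios above can be sketched roughly as follows. This is a minimal illustration and not the Bioregistry's actual implementation: the record layout, the `uri_prefix` key, the `build_prefix_map` function, and the `example.org` IRIs are all assumptions made for the example. Only the `part_of` and `provides` annotations and the prefix names come from this PR.

```python
from collections import defaultdict

# Hypothetical records; only `part_of`, `provides`, and the prefix names
# come from the PR text. The IRIs are placeholders.
REGISTRY = {
    "kegg": {"uri_prefix": "https://example.org/kegg/"},
    "kegg.compound": {"uri_prefix": "https://example.org/kegg/", "part_of": "kegg"},
    "kegg.drug": {"uri_prefix": "https://example.org/kegg/", "part_of": "kegg"},
    "ncbigene": {"uri_prefix": "https://example.org/ncbigene/"},
    "ctd.gene": {"uri_prefix": "https://example.org/ncbigene/", "provides": "ncbigene"},
}


def build_prefix_map(registry):
    """Return {prefix: uri_prefix}, resolving shared IRIs with scenarios 1 and 2."""
    by_iri = defaultdict(list)
    for prefix, record in registry.items():
        if "provides" in record:
            # Scenario 2: providers are left out of the prefix map entirely.
            continue
        by_iri[record["uri_prefix"]].append(prefix)

    prefix_map = {}
    for iri, prefixes in by_iri.items():
        if len(prefixes) > 1:
            # Scenario 1: the single prefix without `part_of` is taken as the parent.
            parents = [p for p in prefixes if "part_of" not in registry[p]]
            if len(parents) != 1:
                raise ValueError(f"cannot pick a parent for {iri}: {sorted(prefixes)}")
            prefixes = parents
        prefix_map[prefixes[0]] = iri
    return prefix_map


PREFIX_MAP = build_prefix_map(REGISTRY)
```

With these placeholder records, only `kegg` and `ncbigene` remain in the map. A group with zero or several candidate parents raises an error, which is the kind of case that would need manual curation.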
When neither of these works, this PR has also introduced the `has_canonical` relationship between prefixes, which maps one prefix to another.

There are some entries that seem to be duplicates of each other, like `glycomedb` and `glytoucan`. In this case, `glytoucan` is the actual name of the database, so `glycomedb` gets `"has_canonical": "glytoucan"`, denoting that `glytoucan` is higher priority (see identifiers-org/identifiers-org.github.io#167). I wasn't confident enough to merge them, but the `wb` and `wormbase` entries are now indeed merged.

The tests enabled exhaustive curation of scenarios 1 and 2, and the `pmapto` is used for the rest.
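
As a rough sketch of how a `has_canonical` annotation could feed a uniqueness check, the snippet below drops any prefix that points at a canonical one and then looks for IRIs still claimed by more than one prefix. The helper names `canonical_prefix_map` and `find_duplicate_iris`, the `uri_prefix` key, and the `example.org` IRI are hypothetical and are not taken from the Bioregistry code base.

```python
# Hypothetical records; only `has_canonical` and the two prefix names
# come from the PR text. The IRI is a placeholder.
REGISTRY = {
    "glytoucan": {"uri_prefix": "https://example.org/glytoucan/"},
    "glycomedb": {
        "uri_prefix": "https://example.org/glytoucan/",
        "has_canonical": "glytoucan",
    },
}


def canonical_prefix_map(registry):
    """Return {prefix: uri_prefix}, skipping prefixes that defer to a canonical one."""
    return {
        prefix: record["uri_prefix"]
        for prefix, record in registry.items()
        if "has_canonical" not in record
    }


def find_duplicate_iris(prefix_map):
    """Return {iri: [prefixes]} for every IRI claimed by more than one prefix."""
    seen = {}
    for prefix, iri in prefix_map.items():
        seen.setdefault(iri, []).append(prefix)
    return {iri: sorted(ps) for iri, ps in seen.items() if len(ps) > 1}


CANONICAL_MAP = canonical_prefix_map(REGISTRY)
DUPLICATES = find_duplicate_iris(CANONICAL_MAP)
```

A data integrity test in this style would assert that `DUPLICATES` is empty, so any newly added prefix that reuses an existing IRI fails until it is annotated or merged.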