arxiv: improve author name parsing ("and") #61
Comments
I am not sure which arXiv API is being used here, but I can see that they return a properly structured list of authors.
We use the OAI-PMH feed, in the If I recall correctly, the reason for this is that the other schemas do not include information at the article-version level. OAI-PMH is also preferred for harvesting because we can pull daily updates. In theory we should be able to pull from multiple API endpoints and merge metadata, but that would be a larger change to the harvest/import pipeline.
This is a strange OAI-PMH API, they have
Yes. Also on the subject of arxiv author names, I think they have some sort of "canonical" representation as a string in their database somewhere, but not a unique identifier. They use this string to do author lookups (e.g., if you click an author name on arxiv.org, it will try to show all papers by that author, and this might work better than a naive search for the author name string as listed in the PDF). I don't remember if this is documented. ORCID usage is not (yet) widespread enough to use as a true author identifier, but maybe that is changing and folks should require authors to have an ORCID when submitting.

My impression is that there has been extensive work in progress towards a new arxiv.org API, but that it hasn't launched yet. I think the transition from Cornell Libraries to the CS department resulted in a lot of staff turnover. Recent replies on the API discussion mailing list have come from unpaid folks (very appreciated!), not paid staff. If/when a new API is available which includes both granular author metadata and granular version metadata, we would switch to that.
I think they use the last name, then a comma, followed by the initial of the first name and the initial of the middle name, if present. This form of canonicalization returns a lot of false positives. For example, it returns 206 results when clicking on my name, but only 8 results when searching for my full name.
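To illustrate why a key built from the last name plus initials collides so easily, here is a minimal sketch. It assumes the scheme guessed at in the comment above; `canonical_key` is a hypothetical helper, not arXiv's actual code, and the names are made up.

```python
def canonical_key(full_name: str) -> str:
    """Hypothetical key of the form 'Last, F M' for a 'First Middle Last' name.

    This is an assumption about the scheme, not arXiv's documented behaviour.
    """
    parts = full_name.split()
    if not parts:
        return ""
    last = parts[-1]
    # One upper-case initial per given name (first, middle, ...)
    initials = " ".join(p[0].upper() for p in parts[:-1])
    return f"{last}, {initials}" if initials else last


# Two different (made-up) people end up with the same lookup key.
print(canonical_key("Jane Ann Doe"))
print(canonical_key("John Adam Doe"))
```

Any two authors who share a last name and initials map to the same key, which would be consistent with a lookup by key returning far more results than a search on the full name.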
Our arxiv harvester receives author metadata as a single string, with individual author names separated by commas and "and".
Here is the function: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/arxiv.py#L24
In some cases, as discovered by Sawood, this doesn't work and all author names come through as a single string. For example:
https://fatcat.wiki/release/c5s6d7f7w5b3himgditfbiu5nq
https://fatcat.wiki/release/f7j4lf4aqfeqlaqfrtayt62rwe
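As a rough illustration of the parsing described above, here is a minimal sketch of splitting a flat author string on commas and the word "and". It is not the fatcat implementation linked above; `split_author_string` and the sample names are hypothetical, and it assumes names are written as plain "First Last" with no embedded commas.

```python
import re
from typing import List

# Separators: a comma (optionally followed by "and"), or a standalone "and".
# Requiring whitespace around "and" keeps names like "Sandy" or "Anderson" intact.
_SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def split_author_string(raw: str) -> List[str]:
    """Split a single author string into individual names (hypothetical helper)."""
    # Collapse newlines and repeated whitespace, which can appear in harvested metadata
    flat = " ".join(raw.split())
    parts = _SEPARATOR.split(flat)
    # Drop empty fragments left by leading or trailing separators
    return [p.strip() for p in parts if p.strip()]


print(split_author_string("Alice Smith, Bob Jones,\n  and Carol White"))
```

A sketch like this would still mis-split names that contain commas themselves (for example a trailing "Jr." suffix or inline affiliations), so those cases would need separate handling.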
Fixing this will include: