many file webarchive (wayback) URLs have only 12 of 14 timestamp digits #81

Open
bnewbold opened this issue May 19, 2021 · 1 comment
Labels: bug (Something isn't working), content (Bulk imports and updates to existing production catalog)

bnewbold commented May 19, 2021

For example:

https://fatcat.wiki/file/rcbebk4ox5esbnnpipbnegy7si

Some file entities have two wayback URLs, one with 12 digits and one with the full 14. In the majority of cases, however, there is only a single URL with 12 digits. Informally, this seems to impact something like 10% to 30% of all file entities (!).

The root cause was a bug in the old arabesque pipeline, which was used for crawl-specific imports before the sandcrawler/crawl-bot pipeline was adopted. The bot agent that created the bad metadata was fatcat_tools.ArabesqueMatchImporter, but the underlying bug was in arabesque itself, which stored only 12 timestamp digits in sqlite.

Among other problems, having only 12 digits causes an extra wayback redirect at fetch time (inefficient) and breaks exact string comparisons, which results in multiple wayback URLs being added to the same file entity.
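
To illustrate the string comparison problem, here is a minimal sketch that extracts the timestamp from a wayback URL and flags short ones. The regex, the helper names (`wayback_timestamp`, `is_short_timestamp`), and the example.com URLs are assumptions for illustration only, not code from fatcat or arabesque.

```python
import re

# Assumed shape of a wayback replay URL:
#   https://web.archive.org/web/<timestamp digits>[modifier]/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)([a-z_]*)/(.+)$")


def wayback_timestamp(url):
    """Return the timestamp digits of a wayback URL, or None if it is not one."""
    m = WAYBACK_RE.match(url)
    return m.group(1) if m else None


def is_short_timestamp(url):
    """True if the URL is a wayback URL with fewer than 14 timestamp digits."""
    ts = wayback_timestamp(url)
    return ts is not None and len(ts) < 14


# Hypothetical example URLs pointing at the same capture
full = "https://web.archive.org/web/20170811115414/http://example.com/paper.pdf"
short = "https://web.archive.org/web/201708111154/http://example.com/paper.pdf"

# The two strings differ, so an exact-match check treats them as distinct URLs
same_string = (full == short)
```

Because `same_string` is false, an importer that only compares strings would add the 14-digit URL next to the existing 12-digit one instead of recognizing it as the same capture.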

Cleanup jobs will need to be written, tested, and executed which:

  • carefully remove duplicate wayback URLs when there are multiple
  • add the extra digits to wayback URLs when known
  • verify that the problem has been cleaned up
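
The three steps above could be sketched roughly as follows. This is not the actual cleanup code: `clean_wayback_urls`, `remaining_short`, and the `known_timestamps` lookup (original URL mapped to a full 14-digit timestamp, for example from crawl logs) are hypothetical names, and a real job would work on file entities through the fatcat API rather than on plain lists of strings.

```python
import re

# Assumed shape of a wayback replay URL (prefix, timestamp, modifier, original URL)
WAYBACK_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d+)([a-z_]*)/(.+)$")


def clean_wayback_urls(urls, known_timestamps):
    """Expand short wayback timestamps when the full one is known, then
    drop exact duplicates while preserving the original order.

    known_timestamps: hypothetical dict of original URL -> 14-digit timestamp.
    """
    cleaned = []
    seen = set()
    for url in urls:
        m = WAYBACK_RE.match(url)
        if m:
            prefix, ts, modifier, original = m.groups()
            full_ts = known_timestamps.get(original)
            # Only expand if the known timestamp extends the short one,
            # so we never silently point at a different capture.
            if (len(ts) < 14 and full_ts and len(full_ts) == 14
                    and full_ts.startswith(ts)):
                url = f"{prefix}{full_ts}{modifier}/{original}"
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


def remaining_short(urls):
    """Verification step: return wayback URLs that still have < 14 digits."""
    out = []
    for url in urls:
        m = WAYBACK_RE.match(url)
        if m and len(m.group(2)) < 14:
            out.append(url)
    return out
```

Expanding before deduplicating means a 12-digit URL and its 14-digit twin collapse into a single entry. URLs whose full timestamp is unknown are left untouched and show up in `remaining_short` for follow-up.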

@bnewbold added the bug and content labels May 19, 2021
@bnewbold self-assigned this May 19, 2021

bnewbold (Contributor, Author) commented:

The vast majority of these, more than 9.5 million file entities, have now been updated. In addition to the 12-digit problem, many URLs with 4-digit (year-only) timestamps were also expanded.

See notes at:

The remaining task is to check for any still-invalid URLs after the next bulk metadata export, and to investigate why a small fraction of URLs could not be fixed.
