Some file entities have two wayback URLs, one with 12 digits and one with the full 14. In the majority of cases, however, there is only a single URL with 12 digits. Informally, this seems to impact something like 10% to 30% of all file entities (!).
The root cause was a bug in the old arabesque pipeline for crawl-specific imports, which was used before the sandcrawler/crawl-bot pipeline was adopted: arabesque itself stored only 12 timestamp digits in sqlite. The bot agent that created the bad metadata was fatcat_tools.ArabesqueMatchImporter.
Among other problems, having only 12 digits results in an extra wayback redirect at fetch time (inefficient) and breaks exact string comparisons, which has resulted in multiple wayback URLs being added to the same file entity.
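To make the string-comparison problem concrete, here is a minimal sketch in Python. It assumes wayback URLs of the form `https://web.archive.org/web/<timestamp>/<original-url>`; the regex and the helper names `wayback_timestamp` and `is_short_timestamp` are illustrative and not part of fatcat_tools.

```python
import re

# Assumed URL shape: https://web.archive.org/web/<digits>[modifier]/<original-url>
# The optional two-letter modifier (for example "id_") is tolerated but ignored.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)(?:[a-z]{2}_)?/(.+)$")


def wayback_timestamp(url):
    """Return the timestamp digits of a wayback URL, or None if not a wayback URL."""
    m = WAYBACK_RE.match(url)
    return m.group(1) if m else None


def is_short_timestamp(url):
    """True when the URL is a wayback URL with fewer than 14 timestamp digits."""
    ts = wayback_timestamp(url)
    return ts is not None and len(ts) < 14


# Two URLs for the same capture: one truncated to 12 digits, one with all 14.
short_url = "https://web.archive.org/web/201708111154/https://example.com/a.pdf"
full_url = "https://web.archive.org/web/20170811115414/https://example.com/a.pdf"

# Exact string comparison treats them as different URLs, which is how an
# importer that checks "is this URL already present?" can end up adding both.
urls_differ = short_url != full_url
```

With exact matching, `urls_differ` is true even though the 12-digit timestamp is a prefix of the 14-digit one, so a prefix-aware comparison is needed to recognize the pair.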
Cleanup jobs will need to be written, tested, and executed which:
carefully remove duplicate wayback URLs when there are multiple
add the extra digits to wayback URLs when known
verify that the problem has been cleaned up
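The three steps above could be sketched roughly as follows. This is not the actual cleanup job: it assumes each file entity's URLs are available as a plain list of strings, and that full 14-digit timestamps, where known, are supplied in a hypothetical `known` mapping from original URL to a list of timestamps (for example, from a capture index lookup). The function names are illustrative.

```python
import re

# Groups: prefix, timestamp digits, optional modifier (e.g. "id_"), original URL.
WAYBACK_RE = re.compile(
    r"^(https?://web\.archive\.org/web/)(\d+)((?:[a-z]{2}_)?)/(.+)$"
)


def clean_wayback_urls(urls, known=None):
    """Drop short-timestamp duplicates and expand short timestamps when known.

    `known` is a hypothetical mapping: original URL -> list of 14-digit timestamps.
    """
    known = known or {}
    parsed = [(u, WAYBACK_RE.match(u)) for u in urls]
    # (original URL, timestamp) pairs that already carry all 14 digits
    full = {(m.group(4), m.group(2)) for _, m in parsed if m and len(m.group(2)) == 14}
    cleaned = []
    for url, m in parsed:
        if m and len(m.group(2)) < 14:
            prefix, ts, modifier, original = m.groups()
            # Step 1: a full URL for the same capture exists, so drop the short one.
            if any(o == original and f.startswith(ts) for o, f in full):
                continue
            # Step 2: expand only when exactly one known timestamp matches the prefix.
            candidates = {
                k for k in known.get(original, []) if len(k) == 14 and k.startswith(ts)
            }
            if len(candidates) == 1:
                url = f"{prefix}{candidates.pop()}{modifier}/{original}"
        if url not in cleaned:
            cleaned.append(url)
    return cleaned


def remaining_short(urls):
    """Step 3: list wayback URLs that still have fewer than 14 timestamp digits."""
    result = []
    for url in urls:
        m = WAYBACK_RE.match(url)
        if m and len(m.group(2)) < 14:
            result.append(url)
    return result
```

Expanding only when a single candidate timestamp matches keeps the sketch conservative: ambiguous cases are left untouched and surface in `remaining_short` for manual investigation.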
The vast majority of these, more than 9.5 million file entities, have now been updated. In addition to the 12-digit problem, many URLs with 4-digit (year-only) timestamps were also expanded.
The remaining task is to check for any invalid URLs still present after the next bulk metadata export, and to investigate why a small fraction of URLs could not be fixed.
For example:
https://fatcat.wiki/file/rcbebk4ox5esbnnpipbnegy7si