Reduce crawler load on servers (#1581)
The crawler has a hard time crawling all specs nowadays due to more stringent restrictions on servers that lead to network timeouts and errors. See: w3c/webref#1244

The goal of this update is to reduce the load of the crawler onto servers. Two changes:

1. The list of specs to crawl gets sorted to distribute origins. This should help with diluting requests sent to a specific server at once. The notion of "origin" used in the code is loose and more meant to identify the server that serves the resource than the actual origin.
2. Requests sent to a given origin are serialized, and sent 2 seconds minimum after the last request was sent (and processed). The crawler still processes the list 4 specs at a time otherwise (provided the specs are to be retrieved from different origins).

The consequence of 1. is that the specs are no longer processed in order, so logs will make the crawler look a bit drunk, processing specs seemingly randomly, as in:

```
1/610 - https://aomediacodec.github.io/afgs1-spec/ - crawling
8/610 - https://compat.spec.whatwg.org/ - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - crawling
12/610 - https://datatracker.ietf.org/doc/html/draft-davidben-http-client-hint-reliability - done
16/610 - https://drafts.css-houdini.org/css-typed-om-2/ - crawling
13/610 - https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-rfc6265bis - done
45/610 - https://fidoalliance.org/specs/fido-v2.1-ps-20210615/fido-client-to-authenticator-protocol-v2.1-ps-errata-20220621.html - crawling
https://compat.spec.whatwg.org/ [error] Multiple event handler named orientationchange, cannot associate reliably to an interface in Compatibility Standard
8/610 - https://compat.spec.whatwg.org/ - done
66/610 - https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html - crawling
https://aomediacodec.github.io/afgs1-spec/ [log] extract refs without rules
1/610 - https://aomediacodec.github.io/afgs1-spec/ - done
```
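
The two changes described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: `interleaveByOrigin` and `createOriginThrottle` are hypothetical names, the round-robin interleaving is just one possible way to "distribute origins", and `new URL(url).origin` stands in for the looser notion of origin that the commit message mentions. The pool that processes 4 specs at a time is left out.

```typescript
// Hypothetical helper: reorder URLs so that consecutive entries target
// different origins where possible (round-robin across per-origin groups).
function interleaveByOrigin(urls: string[]): string[] {
  const groups = new Map<string, string[]>();
  for (const url of urls) {
    // Assumption: the strict URL origin approximates "the server".
    const origin = new URL(url).origin;
    const group = groups.get(origin) ?? [];
    group.push(url);
    groups.set(origin, group);
  }
  const queues = Array.from(groups.values());
  const result: string[] = [];
  // Round i takes the i-th URL of every origin that still has one.
  for (let i = 0; result.length < urls.length; i++) {
    for (const queue of queues) {
      if (i < queue.length) result.push(queue[i]);
    }
  }
  return result;
}

// Hypothetical helper: serialize tasks per origin, and start each one at
// least `minDelayMs` after the previous task for that origin has settled.
function createOriginThrottle(minDelayMs: number) {
  const tails = new Map<string, Promise<void>>();
  const sleep = (ms: number) =>
    new Promise<void>(resolve => setTimeout(resolve, ms));

  return function schedule<T>(url: string, task: () => Promise<T>): Promise<T> {
    const origin = new URL(url).origin;
    const previous = tails.get(origin) ?? Promise.resolve();
    const result = previous.then(task);
    // The next task for this origin waits for this one to finish
    // (successfully or not), then for the minimum delay.
    tails.set(
      origin,
      result.then(() => undefined, () => undefined).then(() => sleep(minDelayMs))
    );
    return result;
  };
}
```

With a 2000 ms delay, `createOriginThrottle(2000)` would mirror the 2 second minimum from the commit message: tasks scheduled for different origins run independently, while tasks for the same origin queue up behind each other.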
Showing 2 changed files with 154 additions and 20 deletions.