Cloudflare check timed out & Link extraction timed out ! #768

Closed · hamoudak opened this issue Feb 12, 2025 · 9 comments

hamoudak commented Feb 12, 2025

When I crawl the domain https://shamela.ws, which is a well-known Arabic library, I get warning messages such as "Link extraction timed out" and sometimes "Cloudflare check timed out". I have noticed they show up when I crawl big, somewhat lazy-loading books or a whole category; it does not happen with light content. At the end of the crawl I also get issues with the ZIM file path for some pages, and I found pages in the Kiwix app that do not work beyond the "skipping record" warnings that warc2zim produced during the crawl. The odd thing is that there are no issues with them in the log file, so I guess it comes from the server itself (500 Internal Server Error, 502, 525, etc.), which means the crawler triggers silent Cloudflare error codes. I am using zimit 2.1.5 with Browsertrix-Crawler 1.3.4 (warcio.js 2.3.1), and I have also tried the latest version, zimit 2.1.7.

Refer also to: #441
Could you please fix this? I can't crawl it at all, even though it's a simple website.

Some logs:

log-information.txt

There are no issues in the crawler log, but the errors show up with warc2zim; they are silent ones.

silent-errors.txt

@ikreymer (Member)

Hmm, the pages do appear to have 3000+ anchor tags; we could perhaps raise the link extraction timeout, given that there are so many links...
Otherwise, the crawl seems to have run fine. Those links may have been retrieved on other pages, since it appears each page has 3000+ links covering the whole book...
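
As an illustration of what a link-extraction time budget means, a minimal sketch might look like this (assumptions only: this is not Browsertrix Crawler's actual implementation; the parser, deadline mechanism, and class name are hypothetical):

```python
# Illustration only: a sketch of why pulling 3000+ <a href> values under a
# fixed time budget can trip a "link extraction timed out" warning on very
# link-heavy pages, such as a full book's table of contents.
import time
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes until a deadline (monotonic timestamp) passes."""

    def __init__(self, deadline):
        super().__init__()
        self.deadline = deadline
        self.links = []

    def handle_starttag(self, tag, attrs):
        if time.monotonic() > self.deadline:
            raise TimeoutError("link extraction timed out")
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Usage: give the parser a larger budget for link-heavy pages.
# extractor = LinkExtractor(deadline=time.monotonic() + 10.0)
# extractor.feed(page_html)  # page_html: the fetched HTML as a string
# print(len(extractor.links))
```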

For warc2zim errors, I suggest opening an issue on: https://github.com/openzim/warc2zim/issues


hamoudak commented Feb 12, 2025

Thank you for helping with this. You mean you'll find a way to solve the link extraction issue, and warc2zim is a separate one to consider? I want to share a screenshot that I see occasionally after crawling this domain; maybe it can give you more insight.
It's a temporary message: when I give it a break for 2 or 3 minutes, I can browse normally again.
I've edited the reference link and silent-errors.txt.

[screenshot]

This is one of the errors in the last log I've provided above:
[warc2zim::2025-02-12 08:02:37,842] DEBUG:Skipping record with unprocessable HTTP return code 502 ZimPath(shamela.ws/book/97978/1153)

[screenshot]

@benoit74 (Contributor)

When the server returns a 502, the crawler and warc2zim can unfortunately do nothing other than store the 502... How can the software know this is not a real 502? What would you expect it to do?


hamoudak commented Feb 13, 2025

Sorry, I was hesitant to open that warc2zim issue, and I thought the same. But I think retrying could solve this when these codes are returned. A real 502 is a transient server error and resolves after retries; a fake 502 persists after retries.
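
As a rough illustration of that retry idea, a sketch might look like the following (assumptions only: this is not how zimit or warc2zim work; the retryable status codes, retry count, and backoff values are placeholders):

```python
# Sketch of the "real vs. fake 502" distinction: retry a URL a few times with
# backoff; if the 5xx clears, treat it as transient, otherwise flag it as persistent.
import time
import requests

RETRYABLE = {500, 502, 525}  # codes reported above as coming from Cloudflare

def fetch_with_retries(url, max_retries=3, backoff=2.0):
    """Return (response, recovered): recovered=True if an earlier attempt failed."""
    resp = None
    for attempt in range(max_retries + 1):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in RETRYABLE:
            # A "real" (transient) 502 clears here after one or more retries.
            return resp, attempt > 0
        time.sleep(backoff * (attempt + 1))  # simple linear backoff between attempts
    # Still failing after all retries: behaves like a persistent ("fake") 502.
    return resp, False

# e.g. resp, recovered = fetch_with_retries("https://shamela.ws/book/97978/1153")
```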

@benoit74 (Contributor)

Browsertrix Crawler already has the --maxPageRetries option. I'm not especially familiar with this feature; doesn't it help you?

@hamoudak (Author)

I'm using just zimit for now because I have limited space on my OS; maybe I'll try it later.

@benoit74 (Contributor)

Zimit will also support this soon.


hamoudak commented Feb 14, 2025

Thank you.
Will the retries option catch these silent errors, like Cloudflare blocks or empty pages that only show up with warc2zim and not in the crawler itself?

I have read about this and found a hash-based strategy, since plain retries don't detect silent Cloudflare issues (a rough sketch follows below):

- Crawl and compute the SHA-256 hash of each page.
- Check it against a predefined list of known error hashes. If a page's hash matches a known error, retry it; in my case, for example, Cloudflare 525, 502, and 500 error pages.
- If multiple different URLs return the same hash, flag them for retry.
- If a page is unusually small and matches a common error hash, retry it.
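
A rough sketch of that strategy might look like the following (assumptions only: the known-hash list, the size threshold, and the shared-body threshold are placeholders that would need tuning against real crawls; this is not crawler code):

```python
# Sketch of the hash-based retry check described above: hash each fetched page
# body, compare against known error-page hashes, and flag bodies shared by many URLs.
import hashlib
from collections import defaultdict

KNOWN_ERROR_HASHES = set()       # SHA-256 digests of known Cloudflare 500/502/525 bodies
MIN_PAGE_SIZE = 2048             # assumed threshold for an "unusually small" page, in bytes
seen_hashes = defaultdict(list)  # digest -> URLs whose bodies produced that digest

def should_retry(url, body):
    """Flag a page for retry if its body looks like a known or repeated error page."""
    digest = hashlib.sha256(body).hexdigest()
    seen_hashes[digest].append(url)
    if digest in KNOWN_ERROR_HASHES:
        return True                                   # matches a known error page
    if len(seen_hashes[digest]) > 3:
        return True                                   # many different URLs share one body
    if len(body) < MIN_PAGE_SIZE and len(seen_hashes[digest]) > 1:
        return True                                   # small page with a repeated body
    return False
```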

@ikreymer (Member)

We are tracking retry + rate limit work in #758; there could be multiple strategies, including error code checks, hashes, etc.
Closing this issue for now, as the main question has been answered and this falls under the rate limit + retry improvements.

github-project-automation bot moved this from Triage to Done! in Webrecorder Projects on Feb 20, 2025