Cloudflare check timed out & Link extraction timed out ! #768

Closed · hamoudak opened this issue Feb 12, 2025 · 9 comments

hamoudak commented Feb 12, 2025

When I crawl the domain https://shamela.ws, which is a well-known Arabic library, I get warning messages such as "Link extraction timed out" and sometimes "Cloudflare check timed out". I have noticed they show up when I crawl big, somewhat lazy-loading books or a whole category; it does not happen with light content. At the end of the crawl I also get issues with the ZIM file path for some pages, and I found pages in the Kiwix app that do not work beyond the "skipping record" warnings that warc2zim produced during the crawl. The odd thing is that there are no issues with them in the log file, so I guess it comes from the server itself (500 Internal Server Error, 502, 525, etc.), which means the crawler triggers silent Cloudflare error codes. I am using zimit 2.1.5 with Browsertrix-Crawler 1.3.4 (warcio.js 2.3.1), and I have also tried the latest version, zimit 2.1.7.

Refer also to: #441
Could you please fix this? I can't crawl it at all, even though it's a simple website.

Some logs:

log-information.txt

There are no issues in the crawler log, but the errors show up with warc2zim; they are silent ones.

silent-errors.txt

@ikreymer (Member)

Hmm, the pages do appear to have 3000+ anchor tags; we could perhaps raise the link extraction timeout, given that there are so many links...
Otherwise, the crawl seems to have run fine. Those links may have been retrieved on other pages, since it appears each page has 3000+ links covering the whole book...
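
As an illustration of what a link-extraction time budget means, a minimal sketch might look like this (assumptions only: this is not Browsertrix Crawler's actual implementation; the parser, deadline mechanism, and class name are hypothetical):

```python
# Illustration only: a sketch of why pulling 3000+ <a href> values under a
# fixed time budget can trip a "link extraction timed out" warning on very
# link-heavy pages, such as a full book's table of contents.
import time
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes until a deadline (monotonic timestamp) passes."""

    def __init__(self, deadline):
        super().__init__()
        self.deadline = deadline
        self.links = []

    def handle_starttag(self, tag, attrs):
        if time.monotonic() > self.deadline:
            raise TimeoutError("link extraction timed out")
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Usage: give the parser a larger budget for link-heavy pages.
# extractor = LinkExtractor(deadline=time.monotonic() + 10.0)
# extractor.feed(page_html)  # page_html: the fetched HTML as a string
# print(len(extractor.links))
```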

For warc2zim errors, I suggest opening an issue on: https://github.com/openzim/warc2zim/issues


hamoudak commented Feb 12, 2025

Thank you for helping with this. You mean you'll find a way to solve the link extraction issue, and warc2zim is a separate one to consider? I want to share a screenshot that I see occasionally after crawling this domain; maybe it can give you more insight.
It's a temporary message: when I give it a break for 2 or 3 minutes, I can browse normally again.
I've edited the reference link and silent-errors.txt.

[screenshot]

This is one of the errors in the last log I've provided above:
[warc2zim::2025-02-12 08:02:37,842] DEBUG:Skipping record with unprocessable HTTP return code 502 ZimPath(shamela.ws/book/97978/1153)

[screenshot]

@benoit74 (Contributor)

When the server returns a 502, the crawler and warc2zim can unfortunately do nothing other than store the 502... How can the software know this is not a real 502? What would you expect it to do?


hamoudak commented Feb 13, 2025

Sorry, I was hesitant to open that warc2zim issue, and I thought the same. But I think retrying could solve this when these codes are returned. A real 502 is a transient server error and resolves after retries; a fake 502 persists after retries.
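
As a rough illustration of that retry idea, a sketch might look like the following (assumptions only: this is not how zimit or warc2zim work; the retryable status codes, retry count, and backoff values are placeholders):

```python
# Sketch of the "real vs. fake 502" distinction: retry a URL a few times with
# backoff; if the 5xx clears, treat it as transient, otherwise flag it as persistent.
import time
import requests

RETRYABLE = {500, 502, 525}  # codes reported above as coming from Cloudflare

def fetch_with_retries(url, max_retries=3, backoff=2.0):
    """Return (response, recovered): recovered=True if an earlier attempt failed."""
    resp = None
    for attempt in range(max_retries + 1):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in RETRYABLE:
            # A "real" (transient) 502 clears here after one or more retries.
            return resp, attempt > 0
        time.sleep(backoff * (attempt + 1))  # simple linear backoff between attempts
    # Still failing after all retries: behaves like a persistent ("fake") 502.
    return resp, False

# e.g. resp, recovered = fetch_with_retries("https://shamela.ws/book/97978/1153")
```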

@benoit74 (Contributor)

Browsertrix Crawler already has the --maxPageRetries option. I'm not especially familiar with this feature; doesn't it help you?

@hamoudak (Author)

I'm using just zimit for now because I have limited space on my OS; maybe I'll try it later.

@benoit74 (Contributor)

Zimit will also support this soon.


hamoudak commented Feb 14, 2025

Thank you.
Will the retries option catch these silent errors, like Cloudflare blocks or empty pages that only show up with warc2zim and not in the crawler itself?

I have read about this and found a hash-based strategy, since plain retries don't detect silent Cloudflare issues (a rough sketch follows below):

- Crawl and compute the SHA-256 hash of each page.
- Check it against a predefined list of known error hashes. If a page's hash matches a known error, retry it; in my case, for example, Cloudflare 525, 502, and 500 error pages.
- If multiple different URLs return the same hash, flag them for retry.
- If a page is unusually small and matches a common error hash, retry it.
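
A rough sketch of that strategy might look like the following (assumptions only: the known-hash list, the size threshold, and the shared-body threshold are placeholders that would need tuning against real crawls; this is not crawler code):

```python
# Sketch of the hash-based retry check described above: hash each fetched page
# body, compare against known error-page hashes, and flag bodies shared by many URLs.
import hashlib
from collections import defaultdict

KNOWN_ERROR_HASHES = set()       # SHA-256 digests of known Cloudflare 500/502/525 bodies
MIN_PAGE_SIZE = 2048             # assumed threshold for an "unusually small" page, in bytes
seen_hashes = defaultdict(list)  # digest -> URLs whose bodies produced that digest

def should_retry(url, body):
    """Flag a page for retry if its body looks like a known or repeated error page."""
    digest = hashlib.sha256(body).hexdigest()
    seen_hashes[digest].append(url)
    if digest in KNOWN_ERROR_HASHES:
        return True                                   # matches a known error page
    if len(seen_hashes[digest]) > 3:
        return True                                   # many different URLs share one body
    if len(body) < MIN_PAGE_SIZE and len(seen_hashes[digest]) > 1:
        return True                                   # small page with a repeated body
    return False
```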

@ikreymer (Member)

We are tracking retry + rate limit work in #758; there could be multiple strategies, including error code checks, hashes, etc.
Closing this issue for now, as the main question has been answered and this falls under the rate limit + retry improvements.

github-project-automation bot moved this from Triage to Done! in Webrecorder Projects on Feb 20, 2025