Cloudflare check timed out & Link extraction timed out! #768
Hmm, the pages do appear to have 3000+ anchor tags - we could perhaps raise the link extraction timeout, given that there are so many links... For warc2zim errors, I suggest opening an issue on: https://github.com/openzim/warc2zim/issues
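(For readers hitting the same warning: the mechanism is roughly a fixed timeout wrapped around the extraction step, so a page with thousands of anchors can simply run out of time. The sketch below is a minimal Python illustration of that pattern; the function names and the 30-second value are hypothetical, not Browsertrix Crawler's actual code.)

```python
import asyncio
import re

# Hypothetical value for illustration; not the crawler's actual default.
LINK_EXTRACTION_TIMEOUT = 30  # seconds


async def extract_links(html: str) -> list[str]:
    # Toy stand-in for the crawler's in-page link extraction. A page with
    # thousands of anchor tags makes this step slow, which is why a fixed
    # timeout can fire on heavy pages.
    await asyncio.sleep(0)  # yield, as a real browser round-trip would
    return re.findall(r'href="([^"]+)"', html)


async def extract_links_with_timeout(html: str) -> list[str]:
    try:
        return await asyncio.wait_for(extract_links(html), LINK_EXTRACTION_TIMEOUT)
    except asyncio.TimeoutError:
        # Mirrors the "Link extraction timed out" warning: the page itself
        # is kept, but its outgoing links are not queued.
        print("WARN: link extraction timed out")
        return []


if __name__ == "__main__":
    page = '<a href="/book/1">x</a>' * 3000  # ~3000 anchors, like the reported pages
    links = asyncio.run(extract_links_with_timeout(page))
    print(f"extracted {len(links)} links")
```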
Thank you for helping with this. So you mean you'll find a way to solve the link extraction issue, and warc2zim is a separate one to consider. I want to share a screenshot; I see this only occasionally after crawling this domain, so maybe it can give you more insight. [screenshot] This is one of the errors in the last log I provided above. [screenshot]
When the server returns a 502, the crawler and warc2zim can unfortunately do nothing other than store a 502... How can the software know this is not a real 502? What would you expect it to do?
Sorry, I was hesitant to open the warc2zim issue, and I thought the same. But I think retrying can solve this when the server returns these codes. A real 502 is a transient server error that resolves after retries; a fake 502 persists after retries.
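(A minimal sketch of that retry idea, assuming plain HTTP fetches rather than the crawler's browser-based pipeline; the retryable status set and backoff values here are illustrative assumptions, not anything zimit or warc2zim currently implements.)

```python
import time
import urllib.error
import urllib.request

# Status codes worth retrying; 525 is Cloudflare's SSL-handshake error.
# This set and the backoff values are illustrative assumptions.
RETRYABLE = {500, 502, 503, 504, 525}


def fetch_with_retries(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
    # A "real" (persistent) error survives every attempt and is re-raised,
    # at which point storing the 502 as-is (as warc2zim does) is correct.
    # A transient one succeeds on a later attempt.
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or attempt == attempts:
                raise
            time.sleep(backoff ** attempt)  # exponential backoff before retrying
    raise AssertionError("unreachable")
```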
Browsertrix already has the […]
I'm using just zimit for now because I have limited disk space; maybe I'll try it later.
Zimit will support this as well soon.
Thank you, I have read about it and found this hash-based strategy; plain retries alone don't detect silent Cloudflare issues.
We are tracking retry + rate-limit work in #758; there could be multiple strategies, including error-code checks, hashes, etc.
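(To illustrate the hash-based strategy: if many distinct URLs yield byte-identical bodies, they likely all captured the same Cloudflare interstitial even when the status code was 200. The sketch below is a toy post-hoc check over already-captured pages, with hypothetical names; it is not the design tracked in #758.)

```python
import hashlib
from collections import defaultdict


def find_suspect_pages(pages: dict[str, bytes], threshold: int = 5) -> list[str]:
    # Group captured pages by the SHA-256 of their body. Many distinct URLs
    # sharing one digest usually means they all received the same
    # interstitial/challenge page instead of real content.
    by_digest: dict[str, list[str]] = defaultdict(list)
    for url, body in pages.items():
        by_digest[hashlib.sha256(body).hexdigest()].append(url)
    return [
        url
        for urls in by_digest.values()
        if len(urls) >= threshold
        for url in urls
    ]


if __name__ == "__main__":
    challenge = b"<html>Checking your browser...</html>"
    pages = {f"https://example.com/p/{i}": challenge for i in range(6)}
    pages["https://example.com/real"] = b"<html>actual content</html>"
    print(find_suspect_pages(pages))  # flags the six identical pages
```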
When I crawl the domain https://shamela.ws, which is a well-known Arabic library, I get warning messages like "Link extraction timed out" or sometimes "Cloudflare check timed out". I have noticed they show up when I'm crawling big books with a bit of lazy loading, or a whole category; it doesn't happen with light content. Besides, at the end of the crawl I get issues in the ZIM file path for some pages, and I found pages in the Kiwix app that don't work, not just the ones with those warc2zim "skipping record" warnings from the crawl. The strange thing is that there are no issues with them in the log file; I guess it comes from the server itself ("500 Internal Server Error", 502, 525, etc.), meaning the crawler triggers silent Cloudflare error codes. I'm using zimit 2.1.5 - Browsertrix-Crawler 1.3.4 (with warcio.js 2.3.1); I have also tried the latest version, zimit 2.1.7.
Refer also to: #441
Could you please fix this? I can't crawl it at all, though it's a simple website.
Some logs:
log-information.txt
No issues in the crawler log, but the issues show up with warc2zim: silent ones.
silent-errors.txt