Cloudflare security page is saved instead of real content #387
Comments
I believe this is the root cause: Cloudflare's new captcha requires Chrome >= 115. I'm looking forward to the Chrome upgrade in Browsertrix.
Thanks a lot for finding this upstream issue.
I'm struggling to reproduce this issue on my dev machine, so I can't really test the update to Chrome 117 locally. I've built a Docker image based on "google-chrome-stable 117.0.5938.88-1 amd64", and it is currently running on the machine that encountered this issue. I will update you once I have feedback.
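For anyone wanting to verify which Chrome build an image ships, the version requirement mentioned above could be checked roughly as follows. This is only a sketch: the threshold of 115 is taken from the comment above and not independently verified, and the helper names `chrome_major` and `meets_reported_minimum` are hypothetical.

```python
import re

# Threshold as reported in this thread; treat it as an assumption.
MIN_CHROME_MAJOR = 115


def chrome_major(version: str) -> int:
    """Extract the major version from a string such as '117.0.5938.88-1'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version)
    if match is None:
        raise ValueError(f"no Chrome version found in {version!r}")
    return int(match.group(1))


def meets_reported_minimum(version: str) -> bool:
    """Return True if the version is at least the reported minimum."""
    return chrome_major(version) >= MIN_CHROME_MAJOR
```

The input can be the full package line (for example, the string quoted in the comment above), since the regular expression picks out the first four-part version number it finds.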
Apparently you also need to create a profile, and the environment inside the container needs to have a proper locale and timezone.
@wsdookadr @benoit74 We've been held up by the lack of Linux Chromium updates for ARM64. We will probably switch to Brave for this reason: #189 (comment) For locale, we already have the --lang setting, and we can consider adding other settings (within reason) if that will help avoid Cloudflare blocking. My impression is that avoiding this block may be another cat-and-mouse kind of situation, so we haven't spent time on it yet. If you have suggestions that we can implement, let us know, and PRs are welcome too!
I haven't managed to reproduce this, with either Chrome 117 or Chrome 112.
On some sites (we encountered it on radiopaedia.org, see openzim/zim-requests#1016), the crawler saves the Cloudflare security page instead of the real content.
We have tons of recorded pages like this one:
![image](https://private-user-images.githubusercontent.com/7102089/268914907-4bad4f82-df3e-4ee7-b73c-a6059092a6ef.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0ODQ3NDQsIm5iZiI6MTczOTQ4NDQ0NCwicGF0aCI6Ii83MTAyMDg5LzI2ODkxNDkwNy00YmFkNGY4Mi1kZjNlLTRlZTctYjczYy1hNjA1OTA5MmE2ZWYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTNUMjIwNzI0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZjAyOTVlNGFkOWMzODAzNzczMWIzNTdlMWU1MmFhMzRhMzA2YWM4NmEwYmU1ZDkxM2JhOWIyNmIyM2E4YjU1NiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.FAOfq8B59mOmWNGz_m86mlOLAgSsQpTkFiTWzPgt5wA)
This is close to #372 but not identical (there is no error, and the crawler considers the job done).
We don't think much can be done technically, but we prefer to ask, and this should probably be documented somewhere as a known limitation.
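Since the crawl finishes without any error, one way to at least notice the problem is to flag captured responses that look like a Cloudflare challenge page. The following is a minimal sketch, not part of the crawler: the function name is hypothetical, and the markers it relies on (the `cf-mitigated: challenge` response header, a `Server: cloudflare` header, a 403 or 503 status, and a "Just a moment" page title) are assumptions about how the challenge page currently looks and may change at any time.

```python
import re

# Assumed markers of a Cloudflare challenge page; not guaranteed to be stable.
CHALLENGE_TITLE = re.compile(r"<title>\s*Just a moment", re.IGNORECASE)
CHALLENGE_STATUSES = {403, 503}


def looks_like_cloudflare_challenge(
    status: int, headers: dict[str, str], body: str
) -> bool:
    """Heuristically decide whether a response is a challenge page."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Explicit signal: the response says it was mitigated with a challenge.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Fallback: Cloudflare-served error status with the typical title.
    served_by_cloudflare = lowered.get("server", "").startswith("cloudflare")
    return (
        served_by_cloudflare
        and status in CHALLENGE_STATUSES
        and bool(CHALLENGE_TITLE.search(body))
    )
```

A check like this would not bypass the block; it would only let a job report that the recorded pages are not the real content instead of being considered done.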