Cloudflare security page is saved instead of real content #387

Closed
benoit74 opened this issue Sep 19, 2023 · 8 comments

@benoit74
Contributor

On some sites (we encountered this on radiopaedia.org, see openzim/zim-requests#1016), the crawler saves the Cloudflare security page instead of the real content.

We have tons of recorded pages like this one:
[image: screenshot of a recorded Cloudflare security page]

This is close to #372 but not identical: there is no error, and the crawler considers the job done.

We don't think much can be done technically, but we prefer to ask. At the very least, this should probably be documented somewhere as a known limitation.

@wsdookadr

wsdookadr commented Sep 21, 2023

I believe this is the root cause:

Cloudflare's new captcha requires Chrome >= 115.
At the present time, Browsertrix offers Chrome 112.

I'm looking forward to the upgrade of Chrome in Browsertrix.
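
To check a given image against the version requirement mentioned above, here is a minimal sketch that pulls the major version out of a version string, such as the output of `google-chrome --version` or a package version. The function names and the `MIN_MAJOR` constant are illustrative only, and the threshold of 115 is simply the figure quoted in this comment, not something verified independently.

```python
import re

# Threshold quoted in the comment above; treated here as an assumption.
MIN_MAJOR = 115


def chrome_major(version_output: str) -> int:
    """Return the major version from text like 'Google Chrome 117.0.5938.88'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no Chrome version found in {version_output!r}")
    return int(match.group(1))


def meets_requirement(version_output: str, minimum: int = MIN_MAJOR) -> bool:
    """True if the detected major version is at least `minimum`."""
    return chrome_major(version_output) >= minimum
```

With the versions discussed in this thread, `meets_requirement("google-chrome-stable 117.0.5938.88-1 amd64")` holds, while a 112.x version string does not.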

@benoit74
Contributor Author

Thanks a lot for finding this upstream issue.
FYI, I'm currently running a local test with Chrome 117.

@benoit74
Contributor Author

benoit74 commented Sep 21, 2023

I'm struggling to reproduce this issue on my dev machine, so I cannot really test the update to Chrome 117 locally.

I've built a Docker image based on "google-chrome-stable 117.0.5938.88-1 amd64", and it is currently running on the machine that encountered this issue. I will update you once I have feedback.

@wsdookadr

Apparently you also need to create a profile, and the environment inside the container needs to have a proper locale and timezone.

@ikreymer
Member

@wsdookadr @benoit74 We've been held up by the lack of Linux Chromium updates for ARM64. For this reason, we will probably switch to Brave, which is up to date with the latest Chromium: #189 (comment)

For locale, we already have the --lang setting, and we can consider adding other settings (within reason) if they help avoid Cloudflare blocking. My impression is that avoiding this block may be another cat-and-mouse situation, so we haven't spent time on it yet. If you have suggestions that we can implement, let us know, and PRs are welcome too!

@benoit74
Contributor Author

I haven't managed to reproduce this, with either Chrome 117 or Chrome 112.
I don't know exactly what led to this situation.
I will close this issue and reopen it if I manage to get a repro; it's impossible to make any progress otherwise. Do not hesitate to reopen it if you have a repro on your side.
Thank you all for the good pointers in any case.

@github-project-automation github-project-automation bot moved this from Triage to Done! in Webrecorder Projects Sep 25, 2023
@rgaudin
Contributor

rgaudin commented Sep 25, 2023

@benoit74 can you open an issue requesting that browsertrix handle 429 responses? In this case, Cloudflare was sending proper 429 responses, and it makes no sense for the crawler to ignore them.
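
To illustrate what handling 429 responses could mean in practice, here is a minimal sketch of how a crawler might decide how long to wait before retrying. It is not Browsertrix code: `retry_delay` and its `base` and `cap` parameters are hypothetical names chosen for this example. It assumes the server may send a standard `Retry-After` header (either a number of seconds or an HTTP date) and falls back to capped exponential backoff when it does not.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base: float = 5.0, cap: float = 300.0,
                now: Optional[datetime] = None) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 429."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("retry-after", "").strip()
    # Form 1: Retry-After given as a whole number of seconds.
    if value.isascii() and value.isdigit():
        return min(float(value), cap)
    # Form 2: Retry-After given as an HTTP date.
    if value:
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            current = now or datetime.now(timezone.utc)
            return min(max((when - current).total_seconds(), 0.0), cap)
    # No usable header: fall back to capped exponential backoff.
    return min(base * (2 ** attempt), cap)
```

A caller would sleep for the returned number of seconds and re-queue the page rather than recording the 429 body as if it were the real content. The `cap` keeps a very large `Retry-After` value from stalling the crawl indefinitely.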

@benoit74
Contributor Author

@rgaudin done: #392
