Cloudflare security page is saved instead of real content #387

benoit74 · 2023-09-19T09:56:40Z

On some sites (we encountered it on radiopaedia.org, see openzim/zim-requests#1016), the crawler saves the Cloudflare security page instead of the real content.

Lots of pages like this one end up recorded:

This is close to #372 but not identical (there is no error and the crawler considers the job done).

We don't think much can be done technically, but we prefer to ask anyway, and this should probably be documented somewhere as a known limitation.
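Editor's sketch, not part of the original report: one way a post-crawl check could flag pages that look like a Cloudflare challenge rather than real content. The `cf-mitigated: challenge` header, the `server: cloudflare` header, the status codes, and the page titles used here are assumptions about what Cloudflare typically sends, not something confirmed in this thread. The function name is hypothetical.

```python
import re

# Assumed heuristics; none of these values come from this issue.
CHALLENGE_STATUSES = {403, 429, 503}
CHALLENGE_TITLE = re.compile(
    r"<title>\s*(just a moment|attention required)", re.IGNORECASE
)


def looks_like_cloudflare_challenge(status, headers, body):
    """Return True if a response resembles a Cloudflare challenge page."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}

    # Strong signal (assumed): Cloudflare marks challenge responses explicitly.
    if h.get("cf-mitigated", "").lower() == "challenge":
        return True

    # Weaker signal: served by Cloudflare, blocking status, challenge-like title.
    served_by_cf = "cloudflare" in h.get("server", "").lower()
    return (
        served_by_cf
        and status in CHALLENGE_STATUSES
        and bool(CHALLENGE_TITLE.search(body))
    )
```

A check like this would only report suspicious records; it does nothing to get past the challenge itself.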

wsdookadr · 2023-09-21T03:34:21Z

I believe this is the root cause:

Cloudflare's new captcha requires Chrome >= 115.
At the present time, Browsertrix offers Chrome 112.

I'm looking forward to the upgrade of Chrome in Browsertrix.
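Editor's sketch: a small check of whether a browser version string meets the minimum quoted above. The threshold of 115 is taken from this comment. The helper names are hypothetical, and the parser assumes the usual four-part Chrome version number.

```python
import re

MIN_CHROME_MAJOR = 115  # threshold quoted in this comment


def chrome_major(version_text):
    """Extract the major version from text such as 'Google Chrome 117.0.5938.88'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    return int(match.group(1)) if match else None


def meets_minimum(version_text, minimum=MIN_CHROME_MAJOR):
    """Return True only when a version was found and it is at least `minimum`."""
    major = chrome_major(version_text)
    return major is not None and major >= minimum
```

For example, `meets_minimum("Google Chrome 112.0.5615.49")` is false, while a 117 build passes.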

benoit74 · 2023-09-21T07:40:07Z

Thanks a lot for finding this upstream issue.
FYI, I'm currently running a local test with Chrome 117.

benoit74 · 2023-09-21T11:49:41Z

I'm struggling to reproduce this issue on my dev machine, so I cannot really test the update to Chrome 117 locally.

I've built a Docker image based on "google-chrome-stable 117.0.5938.88-1 amd64" and it is currently running on the machine that encountered this issue. I will update you once I have feedback.

wsdookadr · 2023-09-22T09:18:47Z

Apparently you also need to create a profile, and the environment inside the container needs a proper locale and timezone.
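Editor's sketch: how a launch command with a dedicated profile, a locale, and a timezone might be assembled. This is not how Browsertrix starts its browser. `--user-data-dir` and `--lang` are Chromium command-line switches, `TZ` and `LANG` are standard environment variables, and the function name, paths, and default values are hypothetical.

```python
import os


def build_launch(chrome_path, profile_dir, lang="en-US", timezone="Europe/Paris"):
    """Return (argv, env) for starting Chrome with a profile, locale and timezone."""
    argv = [
        chrome_path,
        f"--user-data-dir={profile_dir}",  # dedicated browser profile
        f"--lang={lang}",                  # UI / Accept-Language locale
    ]
    env = dict(os.environ)                 # copy; do not mutate the real environment
    env["TZ"] = timezone                   # timezone seen by the browser process
    env["LANG"] = lang.replace("-", "_") + ".UTF-8"
    return argv, env
```

The result can be passed to `subprocess.Popen(argv, env=env)`. Whether these settings are enough to satisfy Cloudflare is not established in this thread.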

ikreymer · 2023-09-22T14:55:46Z

@wsdookadr @benoit74 We've been held up by the lack of Linux Chromium updates for ARM64. For this reason we will probably switch to Brave, which is up to date with the latest Chromium: #189 (comment)

For locale, we already have the --lang setting, and we can consider adding other settings (within reason) if that helps avoid Cloudflare blocking. My impression is that avoiding this block may turn into another cat-and-mouse situation, so we haven't spent time on it yet. If you have suggestions that we can implement, let us know, and PRs are welcome too!

benoit74 · 2023-09-25T06:32:28Z

I can't get a repro, neither with Chrome 117 nor with Chrome 112.
I don't know exactly what led to this situation.
I will close this issue and reopen it if I manage to get a repro; it's impossible to make any progress otherwise. Do not hesitate to reopen it if you have a repro on your side.
Thank you all for the good pointers in any case.

rgaudin · 2023-09-25T08:10:13Z

@benoit74 can you open an issue requesting that browsertrix handle 429 responses? In this case, Cloudflare was sending proper 429 responses, and it is not sensible for the crawler to ignore them.
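Editor's sketch: a minimal way to decide how long to wait after a 429 response, honoring `Retry-After` when it carries a number of seconds and falling back to exponential backoff otherwise. This is an illustration, not Browsertrix's behavior; the function name, base delay, and cap are hypothetical, and the HTTP-date form of `Retry-After` is deliberately not parsed.

```python
def retry_delay(status, headers, attempt, base=1.0, cap=300.0):
    """Return seconds to wait before retrying, or None if no retry is needed.

    `attempt` is the zero-based count of retries already made for this URL.
    """
    if status != 429:
        return None

    # HTTP header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    value = h.get("retry-after", "").strip()

    # Honor the server's hint when it is given in seconds (capped).
    if value.isdigit():
        return min(float(value), cap)

    # Otherwise back off exponentially: base, 2*base, 4*base, ... up to cap.
    return min(base * (2 ** attempt), cap)
```

A crawler loop would sleep for the returned delay and requeue the URL, giving up after some maximum number of attempts.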

benoit74 · 2023-09-25T08:43:06Z

@rgaudin done: #392

github-project-automation bot added this to Webrecorder Projects Sep 19, 2023

github-project-automation bot moved this to Triage in Webrecorder Projects Sep 19, 2023

benoit74 closed this as completed Sep 25, 2023

github-project-automation bot moved this from Triage to Done! in Webrecorder Projects Sep 25, 2023

benoit74 mentioned this issue Sep 25, 2023

Slow down + retry on HTTP 429 errors #392

Open
