RSS Feed is not crawled, when cloudflare bot detection is in place #3037

Closed
3 tasks done
jassi0001 opened this issue Jan 9, 2025 · 4 comments

jassi0001 commented Jan 9, 2025

IMPORTANT

Read and tick the following checkbox after you have created the issue or place an x inside the brackets ;)

  • I have read the CONTRIBUTING.md and followed the provided tips
  • I accept that the issue will be closed without comment if I do not check here
  • I accept that the issue will be closed without comment if I do not fill out all items in the issue template.

Explain the Problem

Cannot add the feed to the News app. Cloudflare bot detection is in place.

Steps to Reproduce

Explain what you did to encounter the issue

  1. Add the feed https://www.sparbote.de/rss
  2. Receive error 403 with a "Just a moment..." page from Cloudflare's bot detection
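
A fetcher can at least tell this kind of block apart from an ordinary error. The following is a minimal Python sketch, not code from the News app: the function name `looks_like_challenge` is hypothetical, and matching on the "Just a moment..." title is an assumption based only on the response reported above.

```python
def looks_like_challenge(status: int, body: str) -> bool:
    """Heuristic: True if a response resembles the block reported in this issue."""
    # "Just a moment..." is the page title quoted in the report; matching on it
    # is an assumption, and the challenge page may use different wording.
    has_challenge_title = "just a moment" in body.lower()
    return status == 403 and has_challenge_title


# A challenge-style page versus a normal feed response.
print(looks_like_challenge(403, "<title>Just a moment...</title>"))
print(looks_like_challenge(200, "<rss version='2.0'></rss>"))
```

Detecting the challenge does not get past it; it only allows a clearer error message than a bare 403.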

System Information

  • News app version: 25.1
  • Nextcloud version: 30.0.2
  • Cron type: (system cron/python updater/...)
  • PHP version: 8.3
  • Database and version: MariaDB
  • Browser and version: Edge
  • OS and version: W11 24H2
Contents of nextcloud/data/nextcloud.log
Paste output here
The problem seems to be the user agent: Cloudflare does not recognize the request as coming from something other than a bot. The same problem is described at https://www.zenrows.com/blog/curl-bypass-cloudflare#bypass-cloudflare-in-curl. This follows up on my post in issue #2966.
@jassi0001 jassi0001 added the bug label Jan 9, 2025
@SMillerDev
Contributor

This is not a bug, there is nothing News can do to force websites to accept the requests.

@SMillerDev SMillerDev closed this as not planned Jan 9, 2025
@jassi0001
Author

jassi0001 commented Jan 9, 2025

OK, thanks. I understand, but as described in the ZenRows article, the crawler has to look like a human. And there seems to be a way to do that with the right header information.

Here is the relevant thread from the Cloudflare community:
https://community.cloudflare.com/t/custom-bot-getting-403-from-cloudflare/342129

The document linked there leads to this form:
https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pROSc-uP6x65Xub9svD27mb8JChA_-XA/viewform

Can you tell me which user agent header the News crawler uses? Then I can try filling out Cloudflare's form to possibly get it whitelisted. I think it's "userAgent":"CloudNews/1776 CFNetwork/1568.300.101 Darwin/24.2.0", right?

@wofferl
Collaborator

wofferl commented Jan 9, 2025

> OK, thanks. I understand, but as described in the ZenRows article, the crawler has to look like a human. And there seems to be a way to do that with the right header information.
>
> Here is the relevant thread from the Cloudflare community: https://community.cloudflare.com/t/custom-bot-getting-403-from-cloudflare/342129
>
> The document linked there leads to this form: https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pROSc-uP6x65Xub9svD27mb8JChA_-XA/viewform
>
> Can you tell me which user agent header the News crawler uses? Then I can try filling out Cloudflare's form to possibly get it whitelisted. I think it's "userAgent":"CloudNews/1776 CFNetwork/1568.300.101 Darwin/24.2.0", right?

The News app uses NextCloud-News/VERSION (e.g. NextCloud-News/25.1.2) as its user agent.
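
For anyone who wants to reproduce the request outside Nextcloud, here is a minimal Python sketch that builds a request with a user agent in that format. It is an illustration only, not the News app's own code, and the helper name `build_feed_request` is hypothetical. No network call is made here.

```python
import urllib.request


def build_feed_request(url: str, app_version: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as NextCloud-News/VERSION."""
    user_agent = f"NextCloud-News/{app_version}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_feed_request("https://www.sparbote.de/rss", "25.1.2")
# urllib stores header names capitalised, hence "User-agent" in the lookup.
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` would send it; for the feed in this issue that is expected to raise an HTTP 403 error as long as the bot detection is active.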

But as you can see at the bottom of the Cloudflare form, you need some kind of verification, such as an IP list or reverse DNS, showing where the crawler comes from.
Since this is not a single service, it will be hard to get the News app whitelisted.
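
To illustrate why that is hard for self-hosted instances, here is a rough Python sketch of the two kinds of check mentioned above: an IP allowlist and a forward-confirmed reverse DNS lookup. This is an assumption about how such verification commonly works, not Cloudflare's actual implementation; all function names, networks, and hostnames are hypothetical.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def ip_in_allowlist(ip: str, networks: Iterable[str]) -> bool:
    """True if the address falls inside one of the published networks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(net) for net in networks)


def _reverse_lookup(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    # Forward lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def passes_reverse_dns(
    ip: str,
    allowed_suffix: str,
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS: PTR name must match and resolve back."""
    try:
        hostname = reverse(ip).lower().rstrip(".")
        # Use a suffix with a leading dot (".crawler.example") so that
        # look-alike domains do not match.
        if not hostname.endswith(allowed_suffix.lower()):
            return False
        return ip in forward(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

Both checks assume the crawler runs from a known set of addresses or hostnames. Every Nextcloud installation fetches feeds from its own server, so there is no single list to submit.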

Here is another good summary of the problem:
https://openrss.org/blog/using-cloudflare-on-your-website-could-be-blocking-rss-users

I think users should report the problem to the news providers and Cloudflare to get this right in the future.

@SMillerDev
Contributor

> OK, thanks. I understand, but as described in the ZenRows article, the crawler has to look like a human. And there seems to be a way to do that with the right header information.

It might be able to temporarily trick Cloudflare if it does, but that's not a solution to the problem that the RSS feed can't be fetched by a bot.
