Incorrect values in pages.jsonl for javascript redirects #760

Open
Mr0grog opened this issue Feb 9, 2025 · 0 comments

This is an unusual situation, and what’s right is probably debatable. On a few pages I’m crawling, the server responds with a 403 error, but the error page includes JavaScript that immediately navigates to a different URL, which returns a 200 status. The entry in pages.jsonl records values from the page that the JS navigated to, i.e. it lists a 200 status and the title of the target page.

This page is a good example: https://www.energy.gov/justice/no-fear-act-data. The page has been removed, but instead of stopping to show the error, it immediately sends the user’s browser to the DOE home page at https://www.energy.gov/. If you use a client that doesn’t run JS, you’ll see this snippet in the source:

<script type="text/javascript">
  window.location.href = "https://www.energy.gov/";
</script>

There’s probably room for debate as to what should be recorded in pages.jsonl here. In my case, the redirect target is not a meaningful equivalent of the original page and mostly serves to hide the error, so I’d like to clearly differentiate HTTP redirects from client-side redirects. But I can also imagine lots of sites on static file servers (e.g. GitHub Pages) using this technique to implement dynamic routing. Maybe the pages.jsonl entry could record info about both responses in this kind of case?
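
As a rough illustration of "record info about both responses", here is one possible shape for such an entry. This is only a sketch: `buildPageEntry`, `clientRedirect`, and `via` are hypothetical names made up for this example, not existing pages.jsonl fields, and the input is assumed to be a simple list of the responses seen while loading one page. A title could be carried along in the same way as the status.

```javascript
// Hypothetical sketch: `buildPageEntry`, `clientRedirect`, and `via` are
// invented for illustration and are not part of any existing pages.jsonl format.
function buildPageEntry(responses) {
  if (responses.length === 0) {
    throw new Error("need at least one response");
  }
  const first = responses[0];
  const last = responses[responses.length - 1];

  // The top-level fields describe what the server returned for the
  // URL that was actually requested.
  const entry = { url: first.url, status: first.status };

  // Only when the page then navigated elsewhere on the client side,
  // describe where it ended up and how it got there.
  if (responses.length > 1) {
    entry.clientRedirect = { url: last.url, status: last.status, via: last.via };
  }
  return entry;
}

// Usage with the example from this issue (one line of JSONL per page):
const exampleEntry = buildPageEntry([
  { url: "https://www.energy.gov/justice/no-fear-act-data", status: 403 },
  { url: "https://www.energy.gov/", status: 200, via: "script" },
]);
console.log(JSON.stringify(exampleEntry));
```

Keeping the original response at the top level and nesting the redirect target means the 403 stays visible, while anyone who cares about the final page can still find it.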

I imagine the same or similar issues exist with <meta http-equiv="refresh"> redirects, too.
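
For reference, both kinds of client-side redirect can be spotted in the raw HTML with a simple heuristic. The sketch below is an assumption-heavy illustration, not how the crawler works: `findClientRedirects` is a hypothetical helper, it uses regular expressions rather than a real HTML or JS parser, and it only catches a literal string assigned to `location` or `location.href` plus `<meta http-equiv="refresh">` tags. A crawler that drives a real browser would more likely observe the navigation itself.

```javascript
// Hypothetical helper: a regex-based heuristic, not a real HTML/JS parser.
function findClientRedirects(html, baseUrl) {
  const found = [];

  // Resolve a possibly relative target against the page URL; skip it if invalid.
  const add = (via, target) => {
    try {
      found.push({ via, url: new URL(target.trim(), baseUrl).href });
    } catch {
      // Unparseable target: ignore it rather than fail the whole scan.
    }
  };

  // <meta http-equiv="refresh" content="0; url=...">
  const metaTag = /<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/gi;
  for (const [tag] of html.matchAll(metaTag)) {
    const content = /content\s*=\s*(["'])(.*?)\1/i.exec(tag);
    const target = content && /url\s*=\s*['"]?([^'"]+)/i.exec(content[2]);
    if (target) add("meta-refresh", target[1]);
  }

  // A string literal assigned to location / location.href, with an optional
  // window. or document. prefix, anywhere in the markup.
  const assignment = /(?:window\.|document\.)?\blocation(?:\.href)?\s*=\s*(["'])(.*?)\1/g;
  for (const match of html.matchAll(assignment)) {
    add("script", match[2]);
  }

  return found;
}

// Usage with the snippet from the 403 page above:
const snippet = `<script type="text/javascript">
  window.location.href = "https://www.energy.gov/";
</script>`;
console.log(findClientRedirects(snippet, "https://www.energy.gov/justice/no-fear-act-data"));
// → [ { via: 'script', url: 'https://www.energy.gov/' } ]
```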
