This is an unusual situation, and what’s right is probably debatable. On a few pages I’m crawling, the server responds with a 403 error, but the error page includes JavaScript that immediately navigates to a different URL, which returns a 200 status. The entry in pages.jsonl records values from the page that the JS redirected to, i.e. it lists a 200 status and the title of the target page.
A good example is https://www.energy.gov/justice/no-fear-act-data: the page has been removed, but instead of stopping to show the error, it immediately directs the user’s browser to the DOE home page at https://www.energy.gov/. If you use a client that doesn’t run JS, you’ll see this snippet in the source:
There’s probably room for debate as to what should be recorded in pages.jsonl here. In my case, the redirect target is not a meaningful equivalent of the original page and mostly serves to hide the error, so I’d like to clearly differentiate HTTP redirects from client-side redirects. But I can also imagine lots of sites on static file servers (e.g. GitHub Pages) using this technique to implement dynamic routing. Maybe the pages.jsonl entry could record info about both responses in this kind of case?
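To make the "record both responses" idea a bit more concrete, here is a rough Python sketch of what such an entry could look like. It is purely illustrative: `ResponseInfo`, `build_page_entry`, and the `clientRedirect` field are names I made up for this example, not part of the existing pages.jsonl format or the crawler’s code.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ResponseInfo:
    """Hypothetical summary of one response seen while loading a page."""
    url: str
    status: int
    title: Optional[str] = None


def build_page_entry(initial: ResponseInfo,
                     final: ResponseInfo,
                     redirect_kind: Optional[str]) -> str:
    """Return one JSON line describing the page.

    The top-level fields describe the response for the URL that was
    actually requested. If a client-side redirect happened, the page the
    browser ended up on is nested under the hypothetical "clientRedirect"
    key instead of overwriting the original status and title.
    """
    entry = {
        "url": initial.url,
        "status": initial.status,
        "title": initial.title,
    }
    if redirect_kind is not None:
        # redirect_kind is an assumed label, e.g. "js" or "meta-refresh".
        entry["clientRedirect"] = {"kind": redirect_kind, **asdict(final)}
    return json.dumps(entry)


# Usage with the URLs and statuses from the example above.
line = build_page_entry(
    ResponseInfo("https://www.energy.gov/justice/no-fear-act-data", 403),
    ResponseInfo("https://www.energy.gov/", 200),
    redirect_kind="js",
)
```

With a shape like this, a consumer that only cares about the requested URL still sees the 403, while one that wants to follow dynamic routing can look at the nested target.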
I imagine the same or similar issues exist with <meta http-equiv="refresh"> redirects, too.
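For the meta refresh case, detection from the raw HTML is fairly mechanical. Below is a minimal Python sketch using only the standard library. The function name `find_meta_refresh` is hypothetical, and the parsing of the `content` attribute is a simplification of what browsers actually do, so treat it as an illustration, not a spec-complete implementation. It does not detect JS-driven navigation like the one in the example above.

```python
import re
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _MetaRefreshFinder(HTMLParser):
    """Collects the content attribute of the first <meta http-equiv="refresh">."""

    def __init__(self) -> None:
        super().__init__()
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").strip().lower() == "refresh":
            self.content = attr.get("content", "")


def find_meta_refresh(html: str, base_url: str) -> Optional[str]:
    """Return the absolute redirect target of a meta refresh, or None.

    Returns None when there is no meta refresh tag, or when the tag only
    reloads the current page (no "url=" part in its content attribute).
    """
    finder = _MetaRefreshFinder()
    finder.feed(html)
    if finder.content is None:
        return None
    # Simplified parse of values such as "0; url=/some/path".
    match = re.search(r"url\s*=\s*['\"]?([^'\";]+)", finder.content, re.IGNORECASE)
    if match is None:
        return None
    return urljoin(base_url, match.group(1).strip())
```

A crawler could run a check like this against the raw response body for any non-2xx page and flag the entry as a client-side redirect when it returns a URL.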