Crawl job stats and reports misleading when excluding PDF-Files (follow up to issue #453) #455
Comments
Since neither FetchHTTP choosing not to download the response body nor the WarcWriter choosing not to write the record changes the fetch status code of the CrawlURI, it's still considered a success for statistics purposes. As for fixing it, WorkQueueFrontier.processFinish() is where the decision gets made: a URI is treated as a success, as disregarded, or as a failure. I suppose either the definitions of CrawlURI.isSuccess() and WorkQueueFrontier.isDisregarded() could be changed so that URIs with the midFetchAbort annotation are considered disregarded, or the abort itself could be changed to call setFetchStatus(S_OUT_OF_SCOPE). This would have some side effects though: extractors wouldn't run, the record wouldn't be written to the WARC file, and the request wouldn't be charged to the queue's budget. In your case those are desirable, as the goal is for the PDF to be treated as out of scope. I guess the question is whether there are other use cases for the FetchHTTP shouldFetchBodyRule where those side effects would be undesirable.
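To make the first option concrete, here is a minimal, self-contained sketch of the three-way decision and how the tallies would shift if aborted fetches were disregarded. It is not Heritrix code: the `Uri` class, the `classify` and `tally` methods, and the "positive status means success" rule are simplified assumptions for illustration; only the `midFetchAbort` annotation name comes from the discussion above.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FinishDispositionSketch {
    enum Disposition { SUCCESS, DISREGARDED, FAILURE }

    // Annotation name taken from the discussion; everything else is hypothetical.
    static final String MID_FETCH_ABORT = "midFetchAbort";

    /** Hypothetical stand-in for a CrawlURI: just a status code and annotations. */
    static final class Uri {
        final int fetchStatus;
        final Set<String> annotations;

        Uri(int fetchStatus, Set<String> annotations) {
            this.fetchStatus = fetchStatus;
            this.annotations = annotations;
        }
    }

    /**
     * Classifies a finished URI. With abortedIsDisregarded == false, an aborted
     * fetch keeps its HTTP status and counts as a success. With true, it is
     * counted as disregarded instead.
     */
    static Disposition classify(Uri uri, boolean abortedIsDisregarded) {
        if (abortedIsDisregarded && uri.annotations.contains(MID_FETCH_ABORT)) {
            return Disposition.DISREGARDED;
        }
        // Assumption: a positive status is a success, anything else a failure.
        return uri.fetchStatus > 0 ? Disposition.SUCCESS : Disposition.FAILURE;
    }

    /** Counts how many URIs fall into each disposition. */
    static Map<Disposition, Integer> tally(List<Uri> uris, boolean abortedIsDisregarded) {
        Map<Disposition, Integer> counts = new EnumMap<>(Disposition.class);
        for (Disposition d : Disposition.values()) {
            counts.put(d, 0);
        }
        for (Uri uri : uris) {
            counts.merge(classify(uri, abortedIsDisregarded), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Uri> uris = List.of(
                new Uri(200, Set.of()),                 // ordinary page
                new Uri(200, Set.of(MID_FETCH_ABORT)),  // PDF whose body was skipped
                new Uri(-2, Set.of()));                 // some failed fetch
        System.out.println("aborts counted as success: " + tally(uris, false));
        System.out.println("aborts disregarded:        " + tally(uris, true));
    }
}
```

In this toy model the skipped PDF moves from the success tally to the disregarded tally, which is the effect the reports would need. The side effects mentioned above (extractors, WARC writing, queue budget) are not modelled here.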
Another idea: perhaps the full scope should be re-evaluated after the response header is received. Putting a content-type decide rule in the normal scope would then "just work", which might be less surprising to the operator.
Thanks for the information. This makes sense, even if it is not a perfect situation for our use case. But if I think about it, we can live with it. We do produce I am rather sceptical about your second idea, if only for performance and runtime reasons.
Using the advice in issue #453, I successfully excluded unwanted PDF documents from being fetched and written to the WARC. But this method seems to generate misleading reports and stats.
mimetype-report
The report shows PDF and ZIP files with counts and bytes, although both are excluded.
count of content-type from WARC-file
If I grep and count the Content-Type fields from the WARC, this is what I get. No pdf and zip:
Crawled Bytes
Problem
We use the reports and logs in our archive for an overview of the content. In this case, this is dangerous. Is there an explanation and maybe a fix to the problem?
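For anyone wanting to repeat the Content-Type count described under "count of content-type from WARC-file", a rough sketch follows. The class and method names are made up for illustration. Like a plain grep, it counts every line starting with `Content-Type:` in the uncompressed WARC text, so it includes the WARC record's own header (for example `application/http`) as well as the HTTP response header, and it can be fooled by payload lines that happen to start the same way.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class ContentTypeCountSketch {
    private static final String PREFIX = "Content-Type:";

    /**
     * Counts Content-Type header values line by line. Parameters such as
     * "; charset=UTF-8" are dropped and values are lower-cased, so that
     * "text/html" and "Text/HTML; charset=..." land in the same bucket.
     */
    static Map<String, Integer> countContentTypes(Reader in) {
        Map<String, Integer> counts = new TreeMap<>();
        new BufferedReader(in).lines().forEach(line -> {
            // Header names are case-insensitive, so compare ignoring case.
            if (!line.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                return;
            }
            String value = line.substring(PREFIX.length()).trim();
            int semicolon = value.indexOf(';');
            if (semicolon >= 0) {
                value = value.substring(0, semicolon).trim();
            }
            counts.merge(value.toLowerCase(Locale.ROOT), 1, Integer::sum);
        });
        return counts;
    }

    public static void main(String[] args) {
        // Reads uncompressed WARC text from standard input. ISO-8859-1 maps
        // every byte to a character, so binary payloads do not break decoding.
        Reader stdin = new InputStreamReader(System.in, StandardCharsets.ISO_8859_1);
        countContentTypes(stdin).forEach((type, n) -> System.out.println(n + "\t" + type));
    }
}
```

Assuming a gzip-compressed WARC, one way to run it is `zcat your-file.warc.gz | java ContentTypeCountSketch.java` (the file name is a placeholder). Comparing this output with the mimetype-report makes the discrepancy visible: types whose bodies were skipped appear in the report but not in the WARC.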