Does ipwb handle segmented response records? #374

Open
machawk1 opened this issue Jan 30, 2018 · 5 comments

Comments

@machawk1
Member

The WARC/1.1 spec (Section B.8) gives an example where a response record is segmented into multiple smaller records. Segmentation changes the hash digests of the records, both in the WARC-Block-Digest and WARC-Payload-Digest fields of the warc-response and continuation records and in ipwb, which likely calculates the multihash from the content of the initial response record alone and does not consider the other segments.
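To illustrate the digest concern, here is a minimal sketch using only the Python standard library. It assumes the `sha1:` plus Base32 digest form commonly seen in WARC headers; the helper names, the dummy block, and the 64-byte segment size are all made up for illustration, and this is not how ipwb or pywb compute their hashes.

```python
import base64
import hashlib
from typing import List


def warc_digest(data: bytes) -> str:
    """Return a labelled digest in the sha1:BASE32 form (an assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def split_into_segments(block: bytes, segment_size: int) -> List[bytes]:
    """Cut a record block into consecutive chunks of at most segment_size bytes."""
    return [block[i:i + segment_size] for i in range(0, len(block), segment_size)]


# Dummy HTTP response block (hypothetical data, 145 bytes in total).
block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n" + b"x" * 100
segments = split_into_segments(block, 64)

whole_digest = warc_digest(block)                      # digest of the unsegmented block
segment_digests = [warc_digest(s) for s in segments]   # one digest per segment
reassembled_digest = warc_digest(b"".join(segments))   # digest after joining segments
```

Hashing only the first segment gives a different value than hashing the whole block, while hashing the joined segments matches the original. A tool that looks only at the initial response record would therefore index a hash of partial content.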

Let's check the implementation of the module we are using to extract the warc-response records, using some dummy data that includes continuation records and a WARC-Segment-Number field in the initial (and potentially subsequent) records.
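As a rough idea of what such dummy data and the expected reassembly could look like, here is a sketch that models records as plain header dictionaries rather than going through pywb or warcio. The `DummyRecord` type, the `reassemble` helper, and the record IDs are hypothetical; only the header names (WARC-Record-ID, WARC-Segment-Number, WARC-Segment-Origin-ID) come from the WARC spec.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class DummyRecord(NamedTuple):
    """Stand-in for a parsed WARC record (hypothetical, not a pywb/warcio type)."""
    headers: Dict[str, str]
    block: bytes


def reassemble(records: List[DummyRecord]) -> Dict[str, bytes]:
    """Group segments by origin record ID and join blocks in segment-number order."""
    groups = defaultdict(list)
    for rec in records:
        # The first segment is identified by its own WARC-Record-ID;
        # continuation records point back to it via WARC-Segment-Origin-ID.
        origin = rec.headers.get("WARC-Segment-Origin-ID", rec.headers["WARC-Record-ID"])
        number = int(rec.headers.get("WARC-Segment-Number", "1"))
        groups[origin].append((number, rec.block))
    return {
        origin: b"".join(block for _, block in sorted(parts, key=lambda p: p[0]))
        for origin, parts in groups.items()
    }


# Dummy records, deliberately listed out of order; the IDs are placeholders.
records = [
    DummyRecord(
        {
            "WARC-Type": "continuation",
            "WARC-Record-ID": "<urn:uuid:dummy-2>",
            "WARC-Segment-Origin-ID": "<urn:uuid:dummy-1>",
            "WARC-Segment-Number": "2",
        },
        b"second half",
    ),
    DummyRecord(
        {
            "WARC-Type": "response",
            "WARC-Record-ID": "<urn:uuid:dummy-1>",
            "WARC-Segment-Number": "1",
        },
        b"first half, ",
    ),
]

full_blocks = reassemble(records)
```

A test for ipwb could compare the content it indexes for the initial response record against the joined block in `full_blocks`.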

Ideally, it would be useful to have a data set of WARCs exercising all of the features, a la a set of minimum working examples, but I have yet to come across such a data set. The key here would be MINIMAL examples without the cruft that may trip up other processes and produce false positives or false negatives.

@ibnesayeed
Member

Since we are relying on PyWB for WARC parsing, we can offload this responsibility there. In fact I would much prefer to move to the new warcio library for WARC parsing.

/cc @ikreymer

@machawk1
Member Author

@ibnesayeed Offloading the responsibility is one thing; verifying whether ipwb currently does the right thing is another. This ticket is about verifying correctness in ipwb.

As an aside, I believe there was an effort to move to warcio at one point, but something about the difference in the iteration approach used kept that from moving forward.

@ibnesayeed
Member

I thought the hiccup was due to the Python version, but I might be wrong.

@ikreymer

Yeah, you should use warcio directly for reading the WARC; the latest version of pywb just uses warcio as well.

@machawk1
Member Author

@ibnesayeed warcio reports being compatible with Python 2, so this might not have been the issue. Hopefully that will be moot when we finish #51. We discussed utilizing parts of warcio in #129 and #211.

@ikreymer Can you report on how warcio handles continuation record(s) chained from a warc-response record?

@machawk1 changed the title from "Does ipwb handle segmented response records" to "Does ipwb handle segmented response records?" on Jan 31, 2018