The WARC/1.1 spec (Section B.8) gives an example where a response record is segmented into multiple smaller records. Segmentation changes the hash digests of the records, both for the WARC-Block-Digest and WARC-Payload-Digest fields in the warc-response and continuation records and within ipwb, which likely calculates the multihash of the content in the initial response record only and does not consider the other segments.
Let's check the implementation of the module we are using to extract the warc-response records, using some dummy data that includes continuation records and a WARC-Segment-Number field in the initial (and potentially subsequent) records.
Ideally, it would be useful to have a data set of WARCs exercising all of the features, a la a set of minimum working examples, but I have yet to come across such a data set. The key here would be MINIMAL examples without the cruft that may trip up other processes and produce false positives or false negatives.
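As a starting point for such dummy data, here is a minimal sketch that splits one response block into a `response` record plus one `continuation` record and compares digests. It uses only the Python standard library and does not touch ipwb or PyWB. The record IDs, date, target URI, split point, and the helper names `sha1_digest`, `build_record`, and `segment_block` are all made up for illustration, and the header set is my reading of the segmentation fields, not a validated WARC.

```python
import base64
import hashlib


def sha1_digest(data: bytes) -> str:
    """Return a labelled digest in the common 'sha1:<Base32>' form."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def build_record(headers, block: bytes) -> bytes:
    """Serialize one record: version line, named fields, blank line, block, two CRLFs."""
    lines = ["WARC/1.1"] + [f"{name}: {value}" for name, value in headers]
    return "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"


def segment_block(block: bytes, split_at: int):
    """Split a response block into a first segment and a single continuation."""
    first, rest = block[:split_at], block[split_at:]
    # Hypothetical identifiers and date, chosen only for this example.
    origin_id = "<urn:uuid:11111111-1111-1111-1111-111111111111>"
    date = "2018-01-31T00:00:00Z"
    uri = "http://example.com/"
    first_headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", origin_id),
        ("WARC-Date", date),
        ("WARC-Target-URI", uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("WARC-Segment-Number", "1"),
        ("WARC-Block-Digest", sha1_digest(first)),  # digest of this segment only
        ("Content-Length", str(len(first))),
    ]
    continuation_headers = [
        ("WARC-Type", "continuation"),
        ("WARC-Record-ID", "<urn:uuid:22222222-2222-2222-2222-222222222222>"),
        ("WARC-Date", date),
        ("WARC-Target-URI", uri),
        ("WARC-Segment-Origin-ID", origin_id),
        ("WARC-Segment-Number", "2"),
        ("WARC-Segment-Total-Length", str(len(block))),
        ("WARC-Block-Digest", sha1_digest(rest)),
        ("Content-Length", str(len(rest))),
    ]
    return [(first_headers, first), (continuation_headers, rest)]


# Dummy HTTP response used as the unsegmented block.
http_block = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n" + b"dummy payload " * 4
)
segments = segment_block(http_block, split_at=60)
warc_bytes = b"".join(build_record(h, b) for h, b in segments)

whole_digest = sha1_digest(http_block)
first_digest = sha1_digest(segments[0][1])
print("whole block  :", whole_digest)
print("first segment:", first_digest)
```

Writing `warc_bytes` to a file should give a tiny fixture to feed through the indexer. If ipwb hashes only the block of the initial `response` record, its result would correspond to `first_digest` rather than `whole_digest`, which is the mismatch this ticket is asking about.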
Since we are relying on PyWB for WARC parsing, we can offload this responsibility there. In fact I would much prefer to move to the new warcio library for WARC parsing.
@ibnesayeed There is a difference between offloading the responsibility and verifying whether ipwb currently does the right thing. This ticket is about verifying the correctness in ipwb.
As an aside, I believe there was an effort to move to warcio at one point, but something about the difference in the iteration approach used kept that from moving forward.
@ibnesayeed warcio reports being compatible with Python 2, so this might not have been the issue. Hopefully that will be moot when we finish #51. We discussed utilizing parts of warcio in #129 and #211.
@ikreymer Can you report on how warcio handles continuation record(s) chained from a warc-response record?
machawk1 changed the title from "Does ipwb handle segmented response records" to "Does ipwb handle segmented response records?" on Jan 31, 2018