Does ipwb handle segmented response records? #374

machawk1 · 2018-01-30T01:33:21Z

The WARC/1.1 spec (Section B.8) gives an example where a response record is segmented into multiple smaller records. Segmentation changes the hash digests of the records, both in the WARC-Block-Digest and WARC-Payload-Digest fields of the warc-response and continuation records and in ipwb, which likely calculates the multihash of the content in the initial response record only and does not consider the other segments.
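To illustrate the digest concern, here is a minimal sketch using only the Python standard library. It is not ipwb's code and does not compute an IPFS multihash; it uses a SHA-1 digest with base32 encoding as a stand-in, and `warc_style_digest` and `split_into_segments` are hypothetical helper names. The point is only that hashing the first segment alone gives a different value than hashing the reassembled logical block.

```python
import base64
import hashlib


def warc_style_digest(data: bytes) -> str:
    """Return a 'sha1:<base32>' label (stand-in for whatever digest a tool computes)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def split_into_segments(block: bytes, size: int) -> list:
    """Split a record block into fixed-size pieces (hypothetical helper)."""
    return [block[i:i + size] for i in range(0, len(block), size)]


# Dummy HTTP response block, assumed here to be the content of one logical record.
block = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n"
    + b"hello segmented world " * 4
)
segments = split_into_segments(block, 40)

whole_digest = warc_style_digest(block)
first_segment_digest = warc_style_digest(segments[0])
reassembled_digest = warc_style_digest(b"".join(segments))

print("whole block:      ", whole_digest)
print("first segment:    ", first_segment_digest)
print("reassembled block:", reassembled_digest)
```

If ipwb only hashes the initial response record, it would be in the "first segment" situation above rather than the "reassembled block" one.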

Let's check the implementation of the module we are using to extract the warc-response records, using some dummy data that includes continuation records and a WARC-Segment-Number field in the initial (and potentially subsequent) records.
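As a rough sketch of what such a check could compare against, the snippet below models dummy records as `(headers, block)` pairs and joins the segments of each logical record in segment order. The record IDs and block contents are made up, `reassemble` is a hypothetical helper, and real records would come from a WARC parser rather than hand-written dicts. Only the field names (WARC-Record-ID, WARC-Segment-Number, WARC-Segment-Origin-ID) come from the WARC format.

```python
from collections import defaultdict


def reassemble(records):
    """Join segmented records into logical blocks keyed by the origin record ID."""
    chains = defaultdict(list)
    for headers, block in records:
        if "WARC-Segment-Number" not in headers:
            continue  # not part of a segmented record
        # The first segment has no origin ID field; it is the origin itself.
        origin = headers.get("WARC-Segment-Origin-ID", headers["WARC-Record-ID"])
        chains[origin].append((int(headers["WARC-Segment-Number"]), block))
    return {
        origin: b"".join(block for _, block in sorted(parts, key=lambda p: p[0]))
        for origin, parts in chains.items()
    }


# Dummy data: one response record split into three segments, stored out of order.
records = [
    ({"WARC-Type": "response",
      "WARC-Record-ID": "<urn:uuid:origin>",
      "WARC-Segment-Number": "1"},
     b"HTTP/1.1 200 OK\r\n\r\nfirst-"),
    ({"WARC-Type": "continuation",
      "WARC-Record-ID": "<urn:uuid:c3>",
      "WARC-Segment-Origin-ID": "<urn:uuid:origin>",
      "WARC-Segment-Number": "3"},
     b"third"),
    ({"WARC-Type": "continuation",
      "WARC-Record-ID": "<urn:uuid:c2>",
      "WARC-Segment-Origin-ID": "<urn:uuid:origin>",
      "WARC-Segment-Number": "2"},
     b"second-"),
]

logical = reassemble(records)
print(logical["<urn:uuid:origin>"])
```

The content that ipwb indexes for a segmented response could then be compared with the reassembled block to see whether the continuation records were taken into account.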

Ideally, it would be useful to have a data set of WARCs exercising all of the features, a la a set of minimum working examples, but I have yet to come across such a data set. The key here would be MINIMAL examples without the cruft that may trip up other processes and produce true/false positives/negatives.

ibnesayeed · 2018-01-30T16:45:39Z

Since we are relying on PyWB for WARC parsing, we can offload this responsibility there. In fact I would much prefer to move to the new warcio library for WARC parsing.

/cc @ikreymer

machawk1 · 2018-01-30T17:22:57Z

@ibnesayeed Offloading the responsibility is one thing; verifying whether ipwb currently does the right thing is another. This ticket is about verifying the correctness in ipwb.

As an aside, I believe there was an effort to move to warcio at one point, but something about the difference in the iteration approach used kept that from moving forward.

ibnesayeed · 2018-01-30T21:20:13Z

I thought the hiccup was due to Python version, but I might be wrong.

ikreymer · 2018-01-30T21:39:42Z

Yeah, you should use warcio directly for reading the WARC, the latest of pywb just uses warcio as well.

machawk1 · 2018-01-30T21:45:16Z

@ibnesayeed warcio reports being compatible with Python 2, so this might not have been the issue. Hopefully that will be moot when we finish #51. We discussed utilizing parts of warcio in #129 and #211.

@ikreymer Can you report on how warcio handles continuation record(s) chained from a warc-response record?

machawk1 added bug External project dependence labels Jan 30, 2018

machawk1 added this to the 2.0 (Extended more featureful implementation) milestone Jan 30, 2018

machawk1 changed the title ~~Does ipwb handle segmented response records~~ Does ipwb handle segmented response records? Jan 31, 2018

machawk1 mentioned this issue Jul 2, 2018

Support for WARCs based on version 1.1 of the spec? webrecorder/warcio#37

Closed
