
Account for chunk-extension values when parsing chunked responses #125

Closed
machawk1 opened this issue Mar 2, 2017 · 6 comments

Comments

@machawk1 (Member) commented Mar 2, 2017

When ipwb replay parses chunked responses, it does not account for potential chunk-extension values per https://tools.ietf.org/html/rfc2616#section-3.6.1 and the discussion at #124 (comment). Check whether a semicolon is present and, if so, parse out the chunk size as the first value rather than relying on the entirety of the line.

Via @ibnesayeed

@machawk1 (Member Author) commented Mar 2, 2017

Something like:

# The chunk size is the hex value before any optional chunk extension (";...")
chunkSizeHex = chunkDescriptor.split(';')[0].strip()
chunkSizeDec = int(chunkSizeHex, 16)

via @ibnesayeed

No need to check for ';' presence; the full string is returned if the delimiter is not present.

TODO: Fabricate a WARC file with variants of valid chunk-extension values for testing.
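For context, a minimal sketch of how that parsing could sit inside a dechunking loop; the dechunk helper and the BytesIO stream below are illustrative, not ipwb's actual code:

import io

def dechunk(stream):
    # Reassemble the payload of a chunked HTTP body (illustrative sketch;
    # trailer headers after the last-chunk are not handled here).
    payload = b''
    while True:
        # A chunk descriptor may carry extensions after a semicolon,
        # e.g. b'1a3; name=value\r\n' -- only the leading hex size matters.
        chunkDescriptor = stream.readline().decode('ascii')
        chunkSizeHex = chunkDescriptor.split(';')[0].strip()
        chunkSizeDec = int(chunkSizeHex, 16)
        if chunkSizeDec == 0:  # last-chunk marker
            break
        payload += stream.read(chunkSizeDec)
        stream.readline()  # consume the CRLF that terminates each chunk
    return payload

# Example: a chunk with an extension, then the zero-size last-chunk
assert dechunk(io.BytesIO(b'5; ext=1\r\nHello\r\n0\r\n\r\n')) == b'Hello'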

@ibnesayeed (Member)

I think it would be better if we shift this dechunking to the indexer and store the actual payload in IPFS. This will make the replay faster (and less memory intensive). Also, indexing happens only once while the same content is replayed much more often, so this is the better place to optimize. Another advantage of dechunking at indexing time is reliable content addressability. The server might split a response into chunks differently each time the same resource is requested, which would lead to poor deduplication.
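As a toy illustration of the deduplication point, with sha256 standing in for IPFS content addressing and made-up transfer data:

import hashlib

# Two captures of the same resource, chunked differently by the server
transferA = b'5\r\nHello\r\n6\r\n World\r\n0\r\n\r\n'
transferB = b'b\r\nHello World\r\n0\r\n\r\n'

# Stored as transferred, the two bodies hash to different identifiers ...
print(hashlib.sha256(transferA).hexdigest() == hashlib.sha256(transferB).hexdigest())  # False

# ... but the dechunked payload is identical, so it deduplicates to one object
print(hashlib.sha256(b'Hello World').hexdigest())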

@machawk1 (Member Author) commented Mar 2, 2017

I thought about this before implementing it as well but opted for decoding at time of replay so the true(r) capture is preserved in IPFS instead of one we have further manipulated.

Your final point is very important: if two HTML pages are the exact same content at different URI-Rs, their hash should be the same. Only the index (i.e., cdxj) will associate the different URIs.

See #126.

@ibnesayeed (Member)

I thought about this before implementing it as well but opted for decoding at time of replay so the true(r) capture is preserved in IPFS instead of one we have further manipulated.

Well, the "true"-rity of the essence of the response is preserved in Content-Encoding. In contrast, the Transfer-Encoding only deals with how the data is transported between the server and the client; or more precisely, any on-the-fly transformation to the data is done with the Transfer-Encoding which needs to be undone on the reception.

I see one logical issue with dechunking, though. If the original headers are utilized to understand the payload, the parser might fail as it would expect chunked-encoded data. For now this will be out-of-band knowledge, unless we find a more expressive way to convey the changes made. One way would be to mask the Transfer-Encoding and add a Content-Length header, but that would be too much fabrication.
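Purely to illustrate the masking option described above (which, as noted, may be too much fabrication), a hypothetical maskChunkedHeaders helper could look like:

def maskChunkedHeaders(headers, payload):
    # Illustrative only: rewrite headers to describe an already dechunked payload
    rewritten = {k: v for k, v in headers.items()
                 if k.lower() != 'transfer-encoding'}
    rewritten['Content-Length'] = str(len(payload))
    return rewritten

headers = {'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html'}
print(maskChunkedHeaders(headers, b'Hello World'))
# {'Content-Type': 'text/html', 'Content-Length': '11'}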

Your final point is very important: if two HTML pages are the exact same content at different URI-Rs, their hash should be the same. Only the index (i.e., cdxj) will associate the different URIs.

Not just the same content on different URIs: the same URI might return a differently chunked response each time it is requested.

@machawk1 (Member Author) commented Mar 2, 2017

If the original headers are utilized to understand the payload, the parser might fail as it would expect chunked-encoded data.

Should we remove the chunked header prior to pushing to IPFS? We could also add something to indicate this in the header block before it is pushed (this would be an expressive way to convey the changes).

and add a Content-Length header

What if this header already exists? When we manipulate the payload and re-set() it, the Content-Length is adjusted when served to the client.

@ibnesayeed (Member)

I think you might want to study chunked encoding a bit more. When chunked transfer encoding is present, the Content-Length header does not make sense and is ignored. If the transfer encoding is not chunked, then a Content-Length header is used to determine the length of the payload. If neither is present, closing the connection indicates the end of the entity to the client. Read more about message length calculation in RFC 2616 (section 4.4).
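A rough sketch of that precedence, simplified from RFC 2616 section 4.4 (it ignores responses that must not carry a body and the multipart/byteranges case):

def messageLengthStrategy(headers):
    # Return how the end of the message body is determined (simplified RFC 2616 4.4)
    te = headers.get('Transfer-Encoding', '').lower()
    if te and te != 'identity':
        # A non-identity transfer coding (e.g. chunked) wins;
        # any Content-Length header is ignored.
        return 'read chunks until the zero-length last-chunk'
    if 'Content-Length' in headers:
        return 'read exactly Content-Length bytes'
    return 'read until the server closes the connection'

print(messageLengthStrategy({'Transfer-Encoding': 'chunked', 'Content-Length': '42'}))
print(messageLengthStrategy({'Content-Length': '42'}))
print(messageLengthStrategy({}))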
