
Account for chunk-extension values when parsing chunked responses #125

Closed
machawk1 opened this issue Mar 2, 2017 · 6 comments

Comments

@machawk1 (Member) commented Mar 2, 2017

When ipwb replay parses chunked responses, it does not account for potential chunk-extension values per https://tools.ietf.org/html/rfc2616#section-3.6.1 and the discussion at #124 (comment). Check whether a semicolon is present and, if so, parse out the chunk size as the first value rather than relying on the entirety of the line.

Via @ibnesayeed

@machawk1 (Member Author) commented Mar 2, 2017

Something like:

# The chunk size is the hex value before any optional chunk extension (";...")
chunkSizeHex = chunkDescriptor.split(';')[0].strip()
chunkSizeDec = int(chunkSizeHex, 16)

via @ibnesayeed

No need to check for ';' presence; the full string is returned if the delimiter is not present.

TODO: Fabricate a WARC file with variants of valid chunk-extension values for testing.
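For context, a minimal sketch of how that parsing could sit inside a dechunking loop; the dechunk helper and the BytesIO stream below are illustrative, not ipwb's actual code:

import io

def dechunk(stream):
    # Reassemble the payload of a chunked HTTP body (illustrative sketch;
    # trailer headers after the last-chunk are not handled here).
    payload = b''
    while True:
        # A chunk descriptor may carry extensions after a semicolon,
        # e.g. b'1a3; name=value\r\n' -- only the leading hex size matters.
        chunkDescriptor = stream.readline().decode('ascii')
        chunkSizeHex = chunkDescriptor.split(';')[0].strip()
        chunkSizeDec = int(chunkSizeHex, 16)
        if chunkSizeDec == 0:  # last-chunk marker
            break
        payload += stream.read(chunkSizeDec)
        stream.readline()  # consume the CRLF that terminates each chunk
    return payload

# Example: a chunk with an extension, then the zero-size last-chunk
assert dechunk(io.BytesIO(b'5; ext=1\r\nHello\r\n0\r\n\r\n')) == b'Hello'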

@ibnesayeed (Member)

I think it would be better if we shift this dechunking to the indexer and store the actual payload in IPFS. This will make the replay faster (and less memory intensive). Also, indexing happens only once while the same content is replayed much more often, so this is the better place to optimize. Another advantage of dechunking at indexing time is reliable content addressability. The server might split a response into chunks differently each time the same resource is requested, which would lead to poor deduplication.
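As a toy illustration of the deduplication point, with sha256 standing in for IPFS content addressing and made-up transfer data:

import hashlib

# Two captures of the same resource, chunked differently by the server
transferA = b'5\r\nHello\r\n6\r\n World\r\n0\r\n\r\n'
transferB = b'b\r\nHello World\r\n0\r\n\r\n'

# Stored as transferred, the two bodies hash to different identifiers ...
print(hashlib.sha256(transferA).hexdigest() == hashlib.sha256(transferB).hexdigest())  # False

# ... but the dechunked payload is identical, so it deduplicates to one object
print(hashlib.sha256(b'Hello World').hexdigest())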

@machawk1 (Member Author) commented Mar 2, 2017

I thought about this before implementing it as well but opted for decoding at time of replay so the true(r) capture is preserved in IPFS instead of one we have further manipulated.

Your final point is very important: if two HTML pages are the exact same content at different URI-Rs, their hash should be the same. Only the index (i.e., cdxj) will associate the different URIs.

See #126.

@ibnesayeed (Member)

I thought about this before implementing it as well but opted for decoding at time of replay so the true(r) capture is preserved in IPFS instead of one we have further manipulated.

Well, the "true"-rity of the essence of the response is preserved in Content-Encoding. In contrast, the Transfer-Encoding only deals with how the data is transported between the server and the client; or more precisely, any on-the-fly transformation to the data is done with the Transfer-Encoding which needs to be undone on the reception.

I see one logical issue with dechunking, though. If the original headers are utilized to understand the payload, the parser might fail as it would expect chunked-encoded data. For now this will be out-of-band knowledge, unless we find a more expressive way to convey the changes made. One way would be to mask the Transfer-Encoding and add a Content-Length header, but that would be too much fabrication.
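Purely to illustrate the masking option described above (which, as noted, may be too much fabrication), a hypothetical maskChunkedHeaders helper could look like:

def maskChunkedHeaders(headers, payload):
    # Illustrative only: rewrite headers to describe an already dechunked payload
    rewritten = {k: v for k, v in headers.items()
                 if k.lower() != 'transfer-encoding'}
    rewritten['Content-Length'] = str(len(payload))
    return rewritten

headers = {'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html'}
print(maskChunkedHeaders(headers, b'Hello World'))
# {'Content-Type': 'text/html', 'Content-Length': '11'}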

Your final point is very important: if two HTML pages are the exact same content at different URI-Rs, their hash should be the same. Only the index (i.e., cdxj) will associate the different URIs.

Not just the same content on different URIs: the same URI might return a differently chunked response each time it is requested.

@machawk1 (Member Author) commented Mar 2, 2017

If the original headers are utilized to understand the payload, the parser might fail as it would expect chunked-encoded data.

Should we remove the chunked header prior to pushing to IPFS? We could also add something to indicate this in the header block before it is pushed (this would be an expressive way to convey the changes).

and add a Content-Length header

What if this header already exists? When we manipulate the payload and re-set() it, the Content-Length is adjusted when served to the client.

@ibnesayeed (Member)

I think you might want to study chunked encoding a bit more. When chunked transfer encoding is present, the Content-Length header does not make sense and is ignored. If the transfer encoding is not chunked, then a Content-Length header is used to determine the length of the payload. If neither is present, closing the connection indicates the end of the entity to the client. Read more about message length calculation in RFC 2616 (section 4.4).
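A rough sketch of that precedence, simplified from RFC 2616 section 4.4 (it ignores responses that must not carry a body and the multipart/byteranges case):

def messageLengthStrategy(headers):
    # Return how the end of the message body is determined (simplified RFC 2616 4.4)
    te = headers.get('Transfer-Encoding', '').lower()
    if te and te != 'identity':
        # A non-identity transfer coding (e.g. chunked) wins;
        # any Content-Length header is ignored.
        return 'read chunks until the zero-length last-chunk'
    if 'Content-Length' in headers:
        return 'read exactly Content-Length bytes'
    return 'read until the server closes the connection'

print(messageLengthStrategy({'Transfer-Encoding': 'chunked', 'Content-Length': '42'}))
print(messageLengthStrategy({'Content-Length': '42'}))
print(messageLengthStrategy({}))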
