Account for chunk-extension values when parsing chunked responses #125
Comments
Something like:
chunkSizeHex = chunkDescriptor.split(';')[0].strip()
chunkSizeDec = int(chunkSizeHex, 16)
via @ibnesayeed. No need to check for the presence of ';': the full string is returned if the delimiter is not present. TODO: Fabricate a WARC file with variants of valid chunk-extension values for testing.
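The suggestion above can be wrapped in a small helper. This is a minimal sketch, not the ipwb implementation; the function name `parse_chunk_size` is made up for illustration.

```python
def parse_chunk_size(chunk_descriptor: str) -> int:
    """Return the chunk size in bytes from a chunk-size line.

    Anything after the first ';' is a chunk extension and is ignored.
    """
    # str.split returns the whole string as the only element when ';' is
    # absent, so no explicit check for the delimiter is needed.
    chunk_size_hex = chunk_descriptor.split(';')[0].strip()
    return int(chunk_size_hex, 16)
```

For example, `parse_chunk_size("1a")` and `parse_chunk_size("1A; name=value\r\n")` both yield 26.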
I think it would be better to shift this dechunking to the indexer and store the actual payload in IPFS. This would make replay faster (and less memory intensive). Also, indexing happens only once while replay of the same content happens more often, so this is the better place to optimize. Another advantage of dechunking at indexing time is reliable content addressability. The server might split a response into chunks differently each time the same resource is requested, which would lead to poor deduplication.
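To illustrate the deduplication point, here is a rough sketch showing two differently chunked bodies that decode to the same payload. The `dechunk` helper is hypothetical, it ignores trailers, and SHA-256 is only a stand-in for however IPFS actually derives its content hash.

```python
import hashlib


def dechunk(body: bytes) -> bytes:
    """Decode a chunked body into its payload (trailers are ignored)."""
    payload = b""
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Drop any chunk extension after ';' before parsing the hex size.
        size_hex = body[pos:line_end].split(b";")[0].strip().decode("ascii")
        size = int(size_hex, 16)
        if size == 0:  # the last chunk has size zero
            break
        start = line_end + 2
        payload += body[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return payload


# The same payload, chunked two different ways (one with an extension).
first = b"5\r\nHello\r\n6\r\n World\r\n0\r\n\r\n"
second = b"b;foo=bar\r\nHello World\r\n0\r\n\r\n"

raw_hashes_match = (
    hashlib.sha256(first).digest() == hashlib.sha256(second).digest()
)
decoded_hashes_match = (
    hashlib.sha256(dechunk(first)).digest()
    == hashlib.sha256(dechunk(second)).digest()
)
```

Hashing the raw bodies gives two different digests, while hashing the decoded payloads gives one, which is the property that makes deduplication work.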
I thought about this before implementing it as well but opted for decoding at time of replay so the true(r) capture is preserved in IPFS instead of one we have further manipulated. Your final point is very important: if two HTML pages are the exact same content at different URI-Rs, their hash should be the same. Only the index (i.e., cdxj) will associate the different URIs. See #126.
Well, the "true"-rity of the essence of the response is preserved in I see one logical issue with dechunking though. If the original headers are used to understand the payload, the parser might fail, as it would expect chunked-encoded data. For now it will be out-of-band knowledge, unless we know a more expressive way to convey the changes made. One way would be to mask the
Not just the same content on different URIs, but the same URI might return a differently chunked response each time it is requested.
Should we remove the chunked header prior to pushing to IPFS? We could also add something to indicate this in the header block prior to it being pushed (this would be an expressive way to convey the changes).
What if this header already exists? When we manipulate the payload and re-set() it, the content-length is adjusted when served to the client.
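One way the header rewrite could look is sketched below. The marker header name `X-Orig-Transfer-Encoding` and the function `mark_dechunked` are invented for illustration and are not part of ipwb or any standard.

```python
# Hypothetical marker header recording the original Transfer-Encoding value.
MARKER = "X-Orig-Transfer-Encoding"


def mark_dechunked(headers):
    """Strip the 'chunked' coding and record the original header value.

    `headers` is a list of (name, value) tuples; a new list is returned.
    """
    out = []
    originals = []
    for name, value in headers:
        if name.lower() != "transfer-encoding":
            out.append((name, value))
            continue
        originals.append(value)
        # Keep any other transfer codings (e.g. gzip) that still apply.
        remaining = [
            coding.strip()
            for coding in value.split(",")
            if coding.strip().lower() != "chunked"
        ]
        if remaining:
            out.append((name, ", ".join(remaining)))
    if originals:
        out.append((MARKER, ", ".join(originals)))
    return out
```

This sketch does not settle the question of a marker header that already exists in the capture: it simply appends another one, so a real implementation would need a rule for that case.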
I think you might want to read up on chunked encoding a bit more. When chunked encoding is present, the content-length header does not make any sense and is ignored. If the transfer encoding is not chunked, then the content-length header is needed to determine the length of the payload. If neither is present, the closing of the connection indicates the end of the entity to the client. Read more about message length calculation in RFC 2616.
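The precedence described above can be sketched as a small decision function. It is a simplification of the message length rules in RFC 2616: it ignores responses that never carry a body (such as 204 and 304) and the multipart/byteranges case, and the name `body_length_strategy` is made up for this example.

```python
def body_length_strategy(headers: dict) -> str:
    """Pick how the end of the message body is determined.

    Returns 'chunked', 'content-length', or 'until-close'.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    transfer_encoding = lowered.get("transfer-encoding", "").strip().lower()
    # A non-identity Transfer-Encoding wins; Content-Length is then ignored.
    if transfer_encoding and transfer_encoding != "identity":
        return "chunked"
    if "content-length" in lowered:
        return "content-length"
    # Neither header is present: the server closing the connection ends the body.
    return "until-close"
```

For instance, a response carrying both `Transfer-Encoding: chunked` and `Content-Length: 10` is treated as chunked.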
Add handling of chunked response metadata, closes #125
When ipwb replay parses chunked responses, it does not account for potential chunk-extension values, per https://tools.ietf.org/html/rfc2616#section-3.6.1 and the discussion at #124 (comment). Check whether a semicolon is present and, if so, parse the chunk size from the first value rather than from the entire line.
Via @ibnesayeed