Support for WARCs based on version 1.1 of the spec? #37

Closed
machawk1 opened this issue Jul 2, 2018 · 5 comments
@machawk1
Contributor

machawk1 commented Jul 2, 2018

I am creating some test cases for https://github.com/oduwsdl/ipwb and want to use the feature of the WARC/1.1 specification that allows for WARC-Date precision on the sub-second scale.

The sample WARCs I have generated process fine with warcio unless I use the WARC/1.1 first line of a WARC record. Are there plans to allow records using this version of the spec to be processed by warcio?

@machawk1
Contributor Author

machawk1 commented Jul 2, 2018

By selectively importing parts of the warcio API, I can get the code below to process WARC/1.1 records like the ones described above:

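from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArcWarcRecordLoader

# Workaround: register 'WARC/1.1' as a known version line so the loader accepts it.
ArcWarcRecordLoader.WARC_TYPES.append('WARC/1.1')

warc11 = '(pathtomywarc)'  # placeholder path to a WARC/1.1 file

with open(warc11, 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Print the capture timestamp of each response record.
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Date'))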

...but this is a dirty hack and does not account for the other features of the 1.1 spec. On a related note, WARC-Dates that are invalid per the WARC/1.0 spec but legal per WARC/1.1 (e.g., 2014-02-10T00:00:01.000000002Z) do not raise any validation error when processed with warcio 95d5dcd.
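To illustrate the kind of version-aware WARC-Date check described above, here is a minimal sketch. It is not part of warcio; `is_valid_warc_date` is a hypothetical helper, and it assumes only the full `YYYY-MM-DDThh:mm:ssZ` form needs checking, with WARC/1.1 additionally allowing 1 to 9 fractional-second digits (my reading of the 1.1 spec, which also permits coarser granularities that this sketch ignores).

```python
import re

# Assumption: only full UTC timestamps are checked; reduced-precision
# forms (e.g. year only) that WARC/1.1 may also allow are not handled.
_WARC_DATE_10 = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", re.ASCII)
_WARC_DATE_11 = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,9})?Z", re.ASCII
)


def is_valid_warc_date(value, warc_version="WARC/1.0"):
    """Hypothetical helper: check a WARC-Date string against the given version."""
    pattern = _WARC_DATE_11 if warc_version == "WARC/1.1" else _WARC_DATE_10
    return pattern.fullmatch(value) is not None
```

With this sketch, the nanosecond-precision example date above would be rejected for a WARC/1.0 record and accepted for a WARC/1.1 record.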

My question above (are there plans?) still stands. I am hoping to finally get around to integrating warcio into ipwb for oduwsdl/ipwb#380 and oduwsdl/ipwb#374.

@sebastian-nagel

Similar issue: warcio index fails on a WARC file of version 1.1:

warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['WARC/1.1']

The mentioned work-around (add WARC/1.1 to WARC_TYPES) is not applicable to command-line tools.
The only 1.1 feature I plan to use is the WARC-Refers-To-Date header in revisit records. warcio does not seem to have issues with unknown headers. If there is already partial support for WARC/1.1 (simply because the differences from 1.0 are small), why not claim to support it?
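For anyone who wants to check which version line a file starts with before handing it to a command-line tool, here is a small standard-library sketch. `sniff_warc_version` is a hypothetical helper, not a warcio function, and it assumes a seekable binary stream that is either plain or gzip-compressed.

```python
import gzip
import io


def sniff_warc_version(stream):
    """Hypothetical helper: return the first line of a WARC stream, e.g. 'WARC/1.1'.

    Assumes `stream` is a seekable binary file object positioned at the start.
    """
    magic = stream.read(2)
    stream.seek(0)
    if magic == b"\x1f\x8b":  # gzip magic bytes
        stream = gzip.GzipFile(fileobj=stream, mode="rb")
    return stream.readline().strip().decode("ascii", errors="replace")


# Usage with an in-memory record header instead of a real file:
sample = io.BytesIO(b"WARC/1.1\r\nWARC-Type: warcinfo\r\n\r\n")
print(sniff_warc_version(sample))
```

A file whose first line is reported as WARC/1.1 is one that would trigger the "Unknown archive format" error quoted above.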

@ikreymer
Member

ikreymer commented Aug 2, 2018

@sebastian-nagel Yeah, I think you're right. While we've been cautious about starting to write 1.1 WARCs, we should definitely support reading WARC/1.1. We can look into this soon.

N0taN3rd added a commit to N0taN3rd/warcio that referenced this issue Aug 2, 2018

…sing of WARCs using the latest WARC spec

Bumped version to 1.5.4 per this change.
Fixes webrecorder#37
@nicholasamorim

Any news on WARC 1.1 support?

ikreymer added a commit that referenced this issue Oct 7, 2018
…addresses #37) (reading already possible)

- use full millis precision for WARC-Date when using WARC/1.1
- timeutils: iso_date_to_datetime() supports parsing millis param
- timeutils: datetime_to_iso_date() supports 'use_millis' param which includes a millis fraction (as per ISO 8601)
- record_http: pass extra args to base warcwriter, supports 'warc_version' param
- warc version: can be '1.0' or '1.1', converted to 'WARC/1.0' and 'WARC/1.1' respectively
- tests: test warc 1.1 writing directly, through record_http, also add test for utils.open()
- warcwriter: curr_warc_date() returns second precision (default) or millis precision based on current WARC version
ikreymer added a commit that referenced this issue Oct 9, 2018

* warc/1.1 support! add ability to more easily write WARC/1.1 records (addresses #37) (reading already possible)
- use microsecond precision for WARC-Date when using WARC/1.1
- timeutils: iso_date_to_datetime() supports parsing microsecond param
- timeutils: datetime_to_iso_date() supports 'use_micros' param which includes a microsecond fraction (as per ISO 8601)
- record_http: pass extra args to base warcwriter, supports 'warc_version' param
- warc version: can be '1.0' or '1.1', converted to 'WARC/1.0' and 'WARC/1.1' respectively
- tests: test warc 1.1 writing directly, through record_http, also add test for utils.open()
- warcwriter: curr_warc_date() returns second precision (default) or microsecond precision based on current WARC version
- Update README to mention WARC/1.1 support
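
The version-dependent date precision described in the commit notes above can be illustrated with a standard-library sketch. This is not warcio's implementation; `format_warc_date` is a hypothetical helper that assumes second precision for WARC/1.0 and microsecond precision for WARC/1.1.

```python
from datetime import datetime, timezone


def format_warc_date(dt, warc_version="WARC/1.0"):
    """Hypothetical helper: format a datetime as a WARC-Date string in UTC.

    Note: naive datetimes are interpreted as local time by astimezone().
    """
    dt = dt.astimezone(timezone.utc)
    if warc_version == "WARC/1.1":
        # Microsecond precision, e.g. 2018-10-09T12:30:45.123456Z
        return dt.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    # Second precision, e.g. 2018-10-09T12:30:45Z
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


now = datetime.now(timezone.utc)
print(format_warc_date(now, "WARC/1.0"))
print(format_warc_date(now, "WARC/1.1"))
```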
@ikreymer
Member

Support for reading and writing WARC 1.1 was added in warcio 1.6.0.
