Modify --pages option to copy pages files directly into WACZ #92
Conversation
`pages.jsonl` and `extraPages.jsonl` files will be copied; other files are ignored.
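(For illustration, a rough sketch of the behavior described above; the helper name and the directory argument are assumptions for the example, not the PR's actual code:)

```js
import { readdirSync } from 'node:fs'
import path from 'node:path'

// Only these two file names are picked up; anything else in the
// given directory is ignored.
const COPIED_FILES = new Set(['pages.jsonl', 'extraPages.jsonl'])

function selectPagesFiles (pagesDir) {
  return readdirSync(pagesDir)
    .filter((name) => COPIED_FILES.has(name))
    .map((name) => path.join(pagesDir, name))
}
```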
Hi @tw4l -- thanks for the PR! I'm on board with the idea 🏄 Here are two things I would suggest:
What do you think?
Those both make sense to me! Happy to push these changes tomorrow morning :)
- Replace existing `-p`/`--pages` implementation rather than adding another option
- Rather than hardcoding allowed names, check that JSONL files passed have the correct extension and are well-formed JSON lines
- Modify tests and fixtures to account for new logic
Good morning @matteocargnelutti! I've gone ahead and made the changes, as well as checking for a `.jsonl` extension. Let me know if anything else is needed!
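(The extension check might look something like this small sketch; the helper name and error handling are assumptions:)

```js
import path from 'node:path'

// Hypothetical guard: reject anything that isn't a .jsonl file before
// attempting to parse it line by line.
function assertJsonlExtension (pagesFile) {
  if (path.extname(pagesFile) !== '.jsonl') {
    throw new Error(`${pagesFile} must have a .jsonl extension.`)
  }
}
```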
index.js (Outdated)
```js
const rl = readline.createInterface({ input: createReadStream(pagesFile) })
for await (const line of rl) {
  try {
    JSON.parse(line)
```
Thanks for the edits @tw4l!
I think we're almost there. We should maybe add a little bit of validation against the spec, just to make sure those are indeed pages files.
We could check:

- That the first item contains `format` and `id` properties
- That subsequent entries contain `url` and `ts` properties
Maybe using Node's `assert`, since we're in a try / catch?
What do you think?
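(To make the suggestion concrete, a minimal sketch of that validation using Node's built-in `assert`; the function name and messages are assumed for illustration:)

```js
import assert from 'node:assert'
import { createReadStream } from 'node:fs'
import readline from 'node:readline'

// Hypothetical helper: walk a pages JSONL file and assert the minimal
// shape described above. Throws (via assert or JSON.parse) on the first
// problem, which a surrounding try / catch can treat as a validation
// failure.
async function assertValidPagesFile (pagesFile) {
  const rl = readline.createInterface({ input: createReadStream(pagesFile) })
  let index = 0

  for await (const line of rl) {
    const entry = JSON.parse(line) // throws on malformed JSON

    if (index === 0) {
      // Header line: must declare "format" and "id".
      assert(entry.format, 'first line must have a "format" property')
      assert(entry.id, 'first line must have an "id" property')
    } else {
      // Page entries: must carry "url" and "ts".
      assert(entry.url, `line ${index + 1} is missing "url"`)
      assert(entry.ts, `line ${index + 1} is missing "ts"`)
    }
    index += 1
  }
}
```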
Yeah, nice suggestion! May as well get this right while we're focused on it :)
Commit pushed!
Ilya raised a related point: in older versions of the crawler, the pages files occasionally included invalid lines, we think because of text extraction that wasn't truncated.

It may be safer, if less performant, to filter per line rather than per file, but write the valid lines as-is into the correct file in the WACZ. What do you think?
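(For comparison, a per-line filtering pass might look roughly like this sketch; the destination stream is an assumption, since in practice the lines would go through the WACZ writer itself:)

```js
import { createReadStream, createWriteStream } from 'node:fs'
import readline from 'node:readline'

// Hypothetical helper: copy only the lines that parse as JSON, writing
// them unchanged, and silently drop anything else (e.g. lines broken by
// untruncated text extraction).
async function copyValidPagesLines (pagesFile, destPath) {
  const rl = readline.createInterface({ input: createReadStream(pagesFile) })
  const out = createWriteStream(destPath)

  for await (const line of rl) {
    try {
      JSON.parse(line) // per-entry spec checks could also go here
      out.write(`${line}\n`) // valid lines are written as-is
    } catch {
      // skip invalid lines rather than failing the whole file
    }
  }
  out.end()
}
```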
I'm now inclined to say that we can generate valid `pages.jsonl` and just fail if it's invalid, but yeah, another option would be, on first failure, to skip the failing lines and reserialize only the valid ones (similar to the old behavior). But I'm not sure if it's needed at this point.
I think either approach works for me; in both cases, we are unlikely to end up with invalid `pages.jsonl` files added to the archive. I am also not concerned about performance for this step.

Let me know if you'd like to add line-by-line filtering or not 😄. Otherwise: this is great and I'm happy to test / approve / merge.
Thank you both!
Sounds like we're going to stick with per-file and just make sure the crawler isn't writing any invalid pages files to begin with, so feel free to test/approve/merge, thank you!
Co-authored-by: Matteo Cargnelutti <[email protected]>
Thanks again for a great PR, @tw4l
Fixes #91. Adds tests as well. Happy to make any changes you see fit. Thanks for the review!