Modify --pages option to copy pages files directly into WACZ #92
Conversation
`pages.jsonl` and `extraPages.jsonl` files will be copied; other files are ignored.
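(For illustration, a rough sketch of the behavior described above; the helper name and the directory argument are assumptions for the example, not the PR's actual code:)

```js
import { readdirSync } from 'node:fs'
import path from 'node:path'

// Only these two file names are picked up; anything else in the
// given directory is ignored.
const COPIED_FILES = new Set(['pages.jsonl', 'extraPages.jsonl'])

function selectPagesFiles (pagesDir) {
  return readdirSync(pagesDir)
    .filter((name) => COPIED_FILES.has(name))
    .map((name) => path.join(pagesDir, name))
}
```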
Hi @tw4l -- thanks for the PR! I'm on board with the idea 🏄 Here are two things I would suggest:
What do you think?
Those both make sense to me! Happy to push these changes tomorrow morning :)
- Replace existing `-p`/`--pages` implementation rather than adding another option
- Rather than hardcoding allowed names, check that JSONL files passed have the correct extension and are well-formed JSON lines
- Modify tests and fixtures to account for new logic
Good morning @matteocargnelutti! I've gone ahead and made the changes, as well as checking for a `.jsonl` extension. Let me know if anything else is needed!
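(The extension check might look something like this small sketch; the helper name and error handling are assumptions:)

```js
import path from 'node:path'

// Hypothetical guard: reject anything that isn't a .jsonl file before
// attempting to parse it line by line.
function assertJsonlExtension (pagesFile) {
  if (path.extname(pagesFile) !== '.jsonl') {
    throw new Error(`${pagesFile} must have a .jsonl extension.`)
  }
}
```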
index.js (Outdated)
```js
const rl = readline.createInterface({ input: createReadStream(pagesFile) })
for await (const line of rl) {
  try {
    JSON.parse(line)
```
Thanks for the edits @tw4l!
I think we're almost there. We should maybe add a little bit of validation against the spec, just to make sure those are indeed pages files.
We could check:

- That the first item contains `format` and `id` properties
- That subsequent entries contain `url` and `ts` properties
Maybe using Node's `assert`, since we're in a try / catch?
What do you think?
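(To make the suggestion concrete, a minimal sketch of that validation using Node's built-in `assert`; the function name and messages are assumed for illustration:)

```js
import assert from 'node:assert'
import { createReadStream } from 'node:fs'
import readline from 'node:readline'

// Hypothetical helper: walk a pages JSONL file and assert the minimal
// shape described above. Throws (via assert or JSON.parse) on the first
// problem, which a surrounding try / catch can treat as a validation
// failure.
async function assertValidPagesFile (pagesFile) {
  const rl = readline.createInterface({ input: createReadStream(pagesFile) })
  let index = 0

  for await (const line of rl) {
    const entry = JSON.parse(line) // throws on malformed JSON

    if (index === 0) {
      // Header line: must declare "format" and "id".
      assert(entry.format, 'first line must have a "format" property')
      assert(entry.id, 'first line must have an "id" property')
    } else {
      // Page entries: must carry "url" and "ts".
      assert(entry.url, `line ${index + 1} is missing "url"`)
      assert(entry.ts, `line ${index + 1} is missing "ts"`)
    }
    index += 1
  }
}
```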
Yeah, nice suggestion! May as well get this right while we're focused on it :)
Commit pushed!
Ilya raised a related point: in older versions of the crawler, the pages files occasionally included invalid lines, we think because of text extraction that wasn't truncated.

It may be safer, if less performant, to filter per line rather than per file, but write the valid lines as-is into the correct file in the WACZ. What do you think?
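(For comparison, a per-line filtering pass might look roughly like this sketch; the destination stream is an assumption, since in practice the lines would go through the WACZ writer itself:)

```js
import { createReadStream, createWriteStream } from 'node:fs'
import readline from 'node:readline'

// Hypothetical helper: copy only the lines that parse as JSON, writing
// them unchanged, and silently drop anything else (e.g. lines broken by
// untruncated text extraction).
async function copyValidPagesLines (pagesFile, destPath) {
  const rl = readline.createInterface({ input: createReadStream(pagesFile) })
  const out = createWriteStream(destPath)

  for await (const line of rl) {
    try {
      JSON.parse(line) // per-entry spec checks could also go here
      out.write(`${line}\n`) // valid lines are written as-is
    } catch {
      // skip invalid lines rather than failing the whole file
    }
  }
  out.end()
}
```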
I'm now inclined to say that we can generate valid `pages.jsonl` and just fail if it's invalid, but yeah, another option would be, on first failure, to skip the failing lines and reserialize only the valid ones (similar to the old behavior). But I'm not sure if it's needed at this point.
I think either approach works for me; in both cases, we are unlikely to end up with invalid `pages.jsonl` files added to the archive. I am also not concerned about performance for this step.

Let me know if you'd like to add line-by-line filtering or not 😄. Otherwise: this is great and I'm happy to test / approve / merge.
Thank you both!
Sounds like we're going to stick with per-file and just make sure the crawler isn't writing any invalid pages files to begin with, so feel free to test/approve/merge, thank you!
Co-authored-by: Matteo Cargnelutti <[email protected]>
Thanks again for a great PR, @tw4l
Fixes #91. Adds tests as well. Happy to make any changes you see fit. Thanks for the review!