Commit

0.1.0 RC
Added comments, fixed docs
matteocargnelutti committed Mar 22, 2024
1 parent ec3bac4 commit 62921ef
Showing 3 changed files with 11 additions and 8 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -92,12 +92,13 @@ js-wacz create --file cool-beans.warc --output cool-beans.wacz

### --pages, -p

- Pass a specific [pages.jsonl](https://specs.webrecorder.net/wacz/1.1.1/#pages-jsonl) file.
+ Path to a folder containing [pages.jsonl](https://specs.webrecorder.net/wacz/1.1.1/#pages-jsonl) files (`pages.jsonl`, `extraPages.jsonl` ...).

If not provided, **js-wacz** is going to attempt to detect pages in WARC records to build its own `pages.jsonl` index.

```bash
- js-wacz create -f "collection/*.warc.gz" --pages collection/pages.jsonl
+ # Assuming the following file exists: /collections/pages/pages.jsonl
+ js-wacz create -f "collection/*.warc.gz" --pages collection/pages/
```

### --cdxj
10 changes: 5 additions & 5 deletions index.js
@@ -600,7 +600,7 @@ export class WACZ {
}

/**
- * Copies pages.jsonl and extraPages.jsonl files in this.pagesDir into ZIP.
+ * Copies pages.jsonl and extraPages.jsonl files in `this.pagesDir` into ZIP.
* @returns {Promise<void>}
*/
copyPagesFilesToZip = async () => {
@@ -619,8 +619,9 @@ export class WACZ {
const filenameLower = filename.toLowerCase()
const pagesFile = resolve(this.pagesDir, filename)

+ // Ensure file is JSONL
if (!filenameLower.endsWith('.jsonl')) {
- log.warn(`Pages: Skipping file ${pagesFile}, does not end with jsonl extension`)
+ log.warn(`Pages: Skipping file ${basename(pagesFile)}: does not end with jsonl extension.`)
continue
}

@@ -644,7 +645,7 @@ export class WACZ {
} catch (err) {
isValidJSONL = false
log.trace(err)
- log.warn(`Pages: Skipping file ${pagesFile}, not valid JSONL`)
+ log.warn(`Pages: Skipping file ${basename(pagesFile)}: not valid JSONL / page entry.`)
break
}
}
@@ -656,7 +657,7 @@ export class WACZ {
}

/**
- * Streams all the files listes in `this.WARCs` to the output ZIP.
+ * Streams all the files listed in `this.WARCs` to the output ZIP.
* @returns {Promise<void>}
*/
writeWARCsToZip = async () => {
@@ -886,7 +887,6 @@ export class WACZ {
addCDXJ = (cdjx) => {
this.stateCheck()
this.indexFromWARCs = false
-
this.cdxTree.setIfNotPresent(cdjx, true)
}

4 changes: 3 additions & 1 deletion types.js
@@ -3,7 +3,9 @@
* @typedef {Object} WACZOptions
* @property {string|string[]} input - Required. Path(s) to input .warc or .warc.gz file(s). Glob-compatible.
* @property {string} output - Required. Path to output .wacz file. Will default to PWD + `archive.wacz` if not provided.
- * @property {boolean} [detectPages=true] - If true (default), will attempt to detect pages in WARC records.
+ * @property {boolean} [indexFromWARCs=true] - If true, will attempt to generate CDXJ indexes from processed WARCs. Automatically disabled if `addCDXJ()` is called.
+ * @property {boolean} [detectPages=true] - If true (default), will attempt to detect pages in WARC records. Automatically disabled if `pages` is provided or `addPages()` is called.
+ * @property {?string} pages - Path to a folder containing pages files (pages.jsonl, extraPages.jsonl ...).
* @property {?string} url - If set, will be added to datapackage.json as `mainPageUrl`.
* @property {?string} ts - If set, will be added to datapackage.json as `mainPageDate`. Can be any value that `Date()` can parse.
* @property {?string} title - If set, will be added to datapackage.json as `title`.
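
Note on the `copyPagesFilesToZip` hunks: the diff only shows the two warning paths (wrong extension, invalid JSONL / page entry), not the validation itself. As a rough illustration of what such a check could look like, here is a minimal sketch. `isPagesFilename` and `isValidPagesJSONL` are hypothetical helpers and not part of js-wacz; the rules that an empty file is rejected and that every line after the first must carry a string `url` are assumptions loosely based on the WACZ `pages.jsonl` layout, not on the code in this commit.

```javascript
// Hypothetical helpers for illustration only; js-wacz's real checks may differ.

// Mirrors the extension test visible in the diff (case-insensitive).
function isPagesFilename (filename) {
  return filename.toLowerCase().endsWith('.jsonl')
}

// Returns true if every non-empty line parses as a JSON object.
// Assumption: the first line is a header, later lines are page entries
// that must carry a string `url`.
function isValidPagesJSONL (text) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== '')

  // Assumption: an empty file is treated as invalid.
  if (lines.length === 0) return false

  for (const [index, line] of lines.entries()) {
    let entry
    try {
      entry = JSON.parse(line)
    } catch (err) {
      return false // Not valid JSON: reject the whole file.
    }

    // Each line must be a plain JSON object, not a scalar, null or array.
    if (entry === null || typeof entry !== 'object' || Array.isArray(entry)) {
      return false
    }

    // Page entries (everything after the header line) need a `url`.
    if (index > 0 && typeof entry.url !== 'string') return false
  }

  return true
}
```

Returning `false` on the first bad line matches the `break` seen in the diff: one invalid entry is enough to skip the file.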