Downloading Resource Files


Paco Nathan edited this page Oct 3, 2019 · 10 revisions

The download_corpus_resources.py script downloads all resource files. The script assumes that:

all publication open access URIs return PDF files
the dataset foaf:page URIs return either HTML or PDF files
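
The two assumptions above can be checked against a downloaded payload. The following is a minimal sketch, not code taken from download_corpus_resources.py: the function name classify_payload is hypothetical, and it assumes that looking at the leading bytes is enough to tell a PDF from an HTML page.

```python
def classify_payload(data: bytes) -> str:
    """Return 'pdf', 'html', or 'unknown' based on the leading bytes.

    Hypothetical helper, not part of download_corpus_resources.py.
    """
    # Only the first kilobyte is inspected; leading whitespace is ignored.
    head = data[:1024].lstrip()
    # PDF files begin with the "%PDF-" signature.
    if head.startswith(b"%PDF-"):
        return "pdf"
    lowered = head.lower()
    # HTML pages usually start with a doctype or contain an <html> tag early.
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "unknown"
```

A publication URI whose payload is classified as anything other than 'pdf' would violate the first assumption, and a foaf:page payload classified as 'unknown' would violate the second.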

To run using the default corpus file (corpus.jsonld) and default output directory (resources/):

python download_corpus_resources.py

Known issues when running this script on the v0.1.5 corpus file:

The script cannot download publication PDF files that are embedded in an HTML page (for example, epdf links on onlinelibrary.wiley.com and HTML pages served by reader.elsevier.com).
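
One possible way to approach this issue is to scan the returned HTML for references to a PDF. The sketch below is an assumption-laden illustration, not the script's behavior: PdfLinkFinder and find_pdf_candidates are hypothetical names, and it assumes the PDF is referenced by an a, embed, iframe, or object tag whose URL contains ".pdf". It has not been tested against the publisher pages named above, which may load the PDF by other means such as JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect URLs that look like PDF references (hypothetical helper)."""

    # Tag name mapped to the attribute that carries its URL.
    TAG_ATTRS = {"a": "href", "embed": "src", "iframe": "src", "object": "data"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        wanted = self.TAG_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Assumption: a PDF reference contains ".pdf" somewhere in its URL.
            if name == wanted and value and ".pdf" in value.lower():
                # Resolve relative references against the page URL.
                self.candidates.append(urljoin(self.base_url, value))


def find_pdf_candidates(html_text, base_url):
    """Return absolute URLs of possible embedded PDFs, in document order."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html_text)
    return finder.candidates
```

Each candidate URL would still need to be downloaded and verified as a PDF before being stored in the output directory.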

kudos: @philipskokoh