Downloading Resource Files
Paco Nathan edited this page Oct 3, 2019 · 10 revisions
The download_corpus_resources.py script downloads all resource files.
In this script, we assume that
- all publication open access URIs return PDF files
- the dataset foaf:page URIs return either HTML or PDF files
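To make the second assumption concrete, here is a minimal sketch of how a response could be classified as HTML or PDF from its HTTP Content-Type header. The function name guess_extension is hypothetical and is not taken from download_corpus_resources.py; the script may decide this differently.

```python
from typing import Optional


def guess_extension(content_type: Optional[str]) -> Optional[str]:
    """Map an HTTP Content-Type header value to a file extension.

    Hypothetical helper for illustration only; not part of
    download_corpus_resources.py.
    """
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return ".pdf"
    if media_type in ("text/html", "application/xhtml+xml"):
        return ".html"
    # Anything else falls outside the two assumptions listed above.
    return None
```

Returning None for other media types lets a caller log the URI for manual review rather than saving a file under the wrong extension.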
To run using the default corpus file (corpus.jsonld) and default output directory (resources/):
python download_corpus_resources.py
Known issues when running this script on the v0.1.5 corpus file:
- unable to download publication PDF files that are embedded in an HTML page (for example, epdf links on onlinelibrary.wiley.com and HTML pages served by reader.elsevier.com)
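One way to spot downloads affected by this issue is to check whether each saved file actually starts with the PDF signature. The sketch below assumes that publication files are saved with a .pdf extension directly inside the output directory, which this page does not confirm. The helper names are hypothetical.

```python
from pathlib import Path
from typing import List

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(head: bytes) -> bool:
    """Return True if the PDF signature appears in the first 1024 bytes."""
    return PDF_MAGIC in head[:1024]


def find_suspect_pdfs(directory: str) -> List[Path]:
    """List *.pdf files in `directory` whose content is not a PDF.

    Assumption: downloads are stored flat in `directory` with a .pdf
    extension. Adjust the glob pattern if the layout differs.
    """
    suspects = []
    for path in sorted(Path(directory).glob("*.pdf")):
        with path.open("rb") as handle:
            head = handle.read(1024)
        if not looks_like_pdf(head):
            suspects.append(path)
    return suspects
```

Calling find_suspect_pdfs("resources") after a run would return the files that most likely contain an HTML wrapper page instead of the publication itself.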
kudos: @philipskokoh