# Downloading Resource Files
To install the Python library dependencies:

```bash
pip install -r requirements.txt
python3 -m spacy download en_core_web_sm
```
Resource files for the corpus have already been downloaded and processed; the following describes how to access them.

First, you need to have your `.aws` directory configured with valid keys for S3 access before the following script will work. Although the bucket is publicly readable, the `boto3` library still requires keys for a valid AWS user account.
Then adapt the `bin/download_s3.py` script as example code to download the PDF files (open access publications) and TXT files (raw extracted text) from the public S3 bucket.
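As a rough sketch of what an adapted download script might look like, the following uses `boto3` to pull everything under a prefix. It assumes the `richcontext` bucket and the `corpus_docs/pub/...` key layout described later in this section; the helper names are hypothetical, so check `bin/download_s3.py` for the repo's actual code.

```python
"""Sketch of an adapted bin/download_s3.py (bucket layout assumed)."""
import os

BUCKET = "richcontext"

def s3_key_to_local_path(key, root="resources"):
    """Map a bucket key like 'corpus_docs/pub/pdf/x.pdf' to a local
    path like 'resources/pub/pdf/x.pdf' (hypothetical mapping)."""
    parts = key.split("/")
    if parts and parts[0] == "corpus_docs":
        parts = parts[1:]
    return os.path.join(root, *parts)

def download_prefix(prefix="corpus_docs/pub/pdf/"):
    """Download every object under the given prefix."""
    import boto3  # requires configured ~/.aws credentials

    bucket = boto3.resource("s3").Bucket(BUCKET)
    for obj in bucket.objects.filter(Prefix=prefix):
        local = s3_key_to_local_path(obj.key)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        bucket.download_file(obj.key, local)

if __name__ == "__main__":
    download_prefix()
```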
Download the corpus PDFs and other resource files:

```bash
python bin/download_resources.py --logger errors.txt
```

The PDF files are stored in the `resources/pub/pdf` subdirectory.
We use [Parsr](https://github.com/axa-group/Parsr) to extract text and JSON from research publications. The quickest way to install and run the Parsr API is through its Docker image:

```bash
docker pull axarev/parsr
```

To run the API, issue:

```bash
docker run -p 3001:3001 axarev/parsr
```
The advanced guide is available in the Parsr documentation.
Then run the `bin/parsr.py` script to extract text and JSON from the PDF files:

```bash
python bin/parsr.py localhost:3001
```

The outputs will be saved in the `json` and `text` folders. This step can be quite time-consuming, so please be patient.
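For reference, a minimal sketch of driving the Parsr API directly over HTTP is shown below. The endpoint paths (`document`, `queue/{id}`, `json/{id}`, `text/{id}`) follow the Parsr API guide but should be treated as assumptions; `bin/parsr.py` is the authoritative client for this repo.

```python
"""Minimal sketch of the Parsr HTTP workflow (endpoints assumed
from the Parsr API guide; bin/parsr.py is the repo's real client)."""
import time

def api_url(host, path):
    """Build an endpoint URL from a host[:port] string and a path."""
    if not host.startswith("http"):
        host = "http://" + host
    return host.rstrip("/") + "/api/v1/" + path.lstrip("/")

def extract(pdf_path, host="localhost:3001"):
    import requests  # third-party; pip install requests

    # 1. Queue the document for processing (Parsr may also accept a
    #    'config' file in the same multipart request).
    with open(pdf_path, "rb") as f:
        r = requests.post(api_url(host, "document"),
                          files={"file": (pdf_path, f, "application/pdf")})
    job_id = r.text

    # 2. Poll the queue until the job finishes.
    while True:
        status = requests.get(api_url(host, "queue/" + job_id))
        if status.status_code == 201:  # 201 = processing done
            break
        time.sleep(2)

    # 3. Fetch the JSON and plain-text outputs.
    doc_json = requests.get(api_url(host, "json/" + job_id)).json()
    doc_text = requests.get(api_url(host, "text/" + job_id)).text
    return doc_json, doc_text
```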
Also, Parsr has a known error rate. As a contingency, use the following script to extract text from the PDF files that Parsr does not handle:

```bash
python bin/extract_text.py
```
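The fallback step above can be sketched as follows: find the PDFs that have no corresponding `.txt` output yet, then extract their text. The use of `pdfminer.six` here is an assumption for illustration; see `bin/extract_text.py` for the repo's actual approach.

```python
"""Sketch of a fallback extractor: locate PDFs with no .txt output
and extract their text (pdfminer.six usage is an assumption)."""
from pathlib import Path

def pdfs_missing_text(pdf_dir, txt_dir):
    """Return PDF paths that have no matching .txt file yet."""
    txt_stems = {p.stem for p in Path(txt_dir).glob("*.txt")}
    return sorted(p for p in Path(pdf_dir).glob("*.pdf")
                  if p.stem not in txt_stems)

def extract_missing(pdf_dir="resources/pub/pdf", txt_dir="resources/pub/txt"):
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    for pdf in pdfs_missing_text(pdf_dir, txt_dir):
        out = Path(txt_dir) / (pdf.stem + ".txt")
        out.write_text(extract_text(str(pdf)), encoding="utf-8")
```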
For those on the NYU-CI team who update the corpus, upload the following to the public S3 bucket:

- PDF files (open access publications)
- JSON files (semi-structured extracted text)
- TXT files (raw text)

```bash
python bin/upload_s3.py
```
View the public AWS S3 bucket `richcontext` online:

- https://richcontext.s3.us-east-2.amazonaws.com/
- https://s3.console.aws.amazon.com/s3/buckets/richcontext/corpus_docs/
The directory structure of the public S3 bucket is similar to the directory structure used for resources in this repo:

- richcontext
  - corpus_docs
    - pub
      - pdf
      - json
      - txt
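Given that mirrored layout, a local resource path maps to its bucket key (and public URL) mechanically. This is a sketch under that assumption; the helper names are hypothetical.

```python
"""Sketch: map a local resources/ path to its public S3 key and URL,
assuming the mirrored directory layout described above."""
import posixpath

BASE_URL = "https://richcontext.s3.us-east-2.amazonaws.com"

def local_to_s3_key(local_path, root="resources"):
    """'resources/pub/pdf/x.pdf' -> 'corpus_docs/pub/pdf/x.pdf'."""
    parts = local_path.replace("\\", "/").split("/")
    if parts and parts[0] == root:
        parts = parts[1:]
    return posixpath.join("corpus_docs", *parts)

def public_url(local_path):
    """Public HTTPS URL for a local resource file."""
    return BASE_URL + "/" + local_to_s3_key(local_path)
```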
Known issues when running this script on the v0.1.5 corpus file:

- unable to download publication PDF files embedded in an HTML page (for example, epdf links on onlinelibrary.wiley.com and reader.elsevier.com)
kudos: @philipskokoh, @srand525, @JasonZhangzy1757