NLPdf

PDF Extractor using Natural Language Processing

Quickstart

Download the repository
Install the requirements:

pip3 install -r requirements.txt

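Download the English language model for spaCy (on spaCy 3.x the "en" shortcut is deprecated; use the full package name "en_core_web_sm" instead):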

python3 -m spacy download en

Copy the PDF files to be cleaned into the directory "PDFs"
Run the extraction tool:

python3 run.py

The output is written to the directory "output"
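
Before running the tool, it can help to confirm that the input directory is in place. The following is a minimal sketch, not part of the repository: the function name preflight is hypothetical, and it assumes only the directory names "PDFs" and "output" given above. It lists the PDF files that would be picked up and creates the output directory if it is missing. Whether run.py itself creates "output" or matches file extensions case-insensitively is not stated here, so treat both as assumptions.

```python
from pathlib import Path


def preflight(root="."):
    """Hypothetical helper: list PDFs under <root>/PDFs and ensure <root>/output exists."""
    root = Path(root)
    pdf_dir = root / "PDFs"      # input directory named in the quickstart
    out_dir = root / "output"    # output directory named in the quickstart

    if not pdf_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {pdf_dir}")

    # Match the extension case-insensitively so "report.PDF" is also listed
    # (an assumption; run.py may be stricter).
    pdfs = sorted(
        p for p in pdf_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

    # Harmless if the directory already exists or run.py creates it itself.
    out_dir.mkdir(exist_ok=True)
    return pdfs


if __name__ == "__main__":
    found = preflight()
    print(f"{len(found)} PDF file(s) ready for extraction")
    for pdf in found:
        print(" ", pdf.name)
```

Run it from the repository root, next to run.py. If it reports zero files, check that the PDFs were copied into "PDFs" and not into a subdirectory, since this sketch does not search recursively.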