Application for identifying documents containing specific strings.
The script is designed so that more Document subclasses can be added in the future.
Please see below a screenshot of the application user interface.
Please find below a screenshot of the output spreadsheet generated when running DocumentsFilter to analyse 200 resumes.
- Download "DocumentsFilter.zip" from https://github.com/lmponcio/DocumentsFilter
- Extract the "DocumentsFilter" folder onto your computer (right-click the zip file and choose "Extract All")
- Open "Filters.txt", write one filter (or keyword) per line and save the file
- Copy all documents you want to analyse into the "Files to Filter" folder (file extensions ".docx" and ".pdf")
- Double-click "DocumentsFilter.exe"
- Click "Run Filters"
- The results will be generated in a folder next to the exe file
- Clone the repository onto your computer
- In the same folder where main.py is located, create a "Files to Filter" directory and a "Filters.txt" file
- Open "Filters.txt", write one filter (or keyword) per line and save the file
- Copy all documents you want to analyse into the "Files to Filter" folder (file extensions ".docx" and ".pdf")
- Create a Python virtual environment and use requirements.txt to install all the required dependencies
- Run main.py
- Click "Run Filters"
- The results will be generated in a folder next to main.py
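To illustrate the "Filters.txt" format described in the steps above (one filter per line), here is a minimal sketch of how such a file could be read. The function name `load_filters` is hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions of this sketch; the actual script may handle the file differently.

```python
from pathlib import Path


def load_filters(path):
    """Return the filters in a text file, one per non-empty line.

    Hypothetical helper: blank lines are skipped and surrounding
    whitespace is trimmed (both are assumptions, not confirmed behaviour).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, a file containing the lines "Excel", "Python" and "project management" would yield a list with those three strings.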
- DocumentsFilter checks DOCX (not DOC) and PDF files. If files with other extensions are provided (DOC, JPEG, CSV, etc.) they will be ignored. This could change in future releases.
- The filters are not case-sensitive (writing "Excel" or "excel" in Filters.txt has the same effect). The script converts both the filters and the document content to lowercase, and checks whether each lower-cased filter is contained in the lower-cased document content.
- DocumentsFilter is quite accurate, but not perfect. For more information, check the libraries used for scanning the ".docx" files (python-docx) and the ".pdf" files (pypdf).
- Images are not checked, only text elements. In the future I might add optical character recognition so that text in images is also checked.
- There is no AI involved. It is a script that goes through the text elements in documents and checks if the filter strings provided are present or absent.
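The extension check and the case-insensitive matching described in the notes above can be sketched as follows. This is an illustration only, not the project's actual code: the names `SUPPORTED_EXTENSIONS`, `is_supported` and `match_filters` are hypothetical, and treating the file extension case-insensitively is an assumption. The text is assumed to have already been extracted from the document.

```python
from pathlib import Path

# Assumption: only these extensions are analysed; other files are ignored.
SUPPORTED_EXTENSIONS = {".docx", ".pdf"}


def is_supported(filename):
    """Return True if the file extension is one the tool would analyse."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def match_filters(text, filters):
    """Map each filter to True or False using lower-cased substring checks."""
    lowered = text.lower()
    return {f: f.lower() in lowered for f in filters}
```

With this sketch, `match_filters("Advanced EXCEL user", ["excel", "Python"])` reports the first filter as present and the second as absent, regardless of letter case.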
DocumentsFilter is a Python script that uses external libraries to do its job. Special thanks to the maintainers of python-docx, pypdf and openpyxl.
- Logo created using Canva
- UML Class Diagram created using http://draw.io/
- https://www.visual-paradigm.com/guide/uml-unified-modeling-language/uml-aggregation-vs-composition/