DocumentsFilter

Application for identifying documents containing specific strings.

UML Class Diagram

The script was written with future extension in mind, so that more Document subclasses can be added later.
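As a rough illustration of that extensibility, here is a minimal sketch. The names `Document`, `extract_text` and `PlainTextDocument` are hypothetical and are not taken from main.py; the only point is that supporting a new format could come down to adding a subclass that knows how to extract its text.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Document(ABC):
    """Hypothetical base class: one instance per file to analyse."""

    def __init__(self, path):
        self.path = Path(path)

    @abstractmethod
    def extract_text(self) -> str:
        """Return the text content; each subclass handles one file format."""


class PlainTextDocument(Document):
    """Hypothetical subclass showing how a new format might be added."""

    def extract_text(self) -> str:
        # Plain text needs no parsing library, so the file is read directly.
        return self.path.read_text(encoding="utf-8")
```

In this sketch the base class cannot be instantiated on its own, so every supported format has to provide its own `extract_text`.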

GUI

Please see below a screenshot of the application user interface.

Output spreadsheet

Please find below a screenshot of the output spreadsheet generated when running DocumentsFilter to analyse 200 resumes.

How to use it

Using the executable file (easy way)

Download "DocumentsFilter.zip" from https://github.com/lmponcio/DocumentsFilter
Extract the "DocumentsFilter" folder onto your computer (right-click the zip file and choose "Extract All")
Open "Filters.txt", write one filter (or keyword) per line and save the file
Copy all documents you want to analyse into "Files to Filter" folder (file extensions ".docx" and ".pdf")
Double-click "DocumentsFilter.exe"
Click "Run Filters"
The results will be generated in a folder next to the exe file

Using main.py (source code)

Clone the repository onto your computer
In the same folder where main.py is located, create a "Files to Filter" directory and a "Filters.txt" file
Open "Filters.txt", write one filter (or keyword) per line and save the file
Copy all documents you want to analyse into "Files to Filter" folder (file extensions ".docx" and ".pdf")
Create a Python virtual environment and use requirements.txt to install the required dependencies
Run main.py
Click "Run Filters"
The results will be generated in a folder next to main.py
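The input layout described in the steps above can be sketched as follows. This is an assumption-laden illustration, not the code in main.py: the helper names `load_filters` and `collect_documents` are hypothetical, and skipping blank lines in "Filters.txt" is an assumption.

```python
from pathlib import Path

# Extensions the README says are analysed; everything else is ignored.
SUPPORTED_EXTENSIONS = {".docx", ".pdf"}


def load_filters(filters_file):
    """Read one filter per line, skipping blank lines (an assumption)."""
    lines = Path(filters_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def collect_documents(folder):
    """Return the supported files found directly in `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

With the folder layout from the steps above, these helpers would be called as `load_filters("Filters.txt")` and `collect_documents("Files to Filter")`.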

Important information

DocumentsFilter checks DOCX (not DOC) and PDF files. If files with other extensions are provided (DOC, JPEG, CSV, etc.) they will be ignored. This could change in future releases.
The filters are not case-sensitive (writing "Excel" or "excel" in Filters.txt has the same effect). The script converts both the filters and the documents' content to lowercase, and checks whether each lower-cased filter is contained in the lower-cased content.
DocumentsFilter is quite accurate, but not perfect. For more information, check the libraries used for scanning the ".docx" files (python-docx) and the ".pdf" files (pypdf).
Images are not checked, only text. In the future I might add optical character recognition so text in images is also checked, but for now it is only checking text elements.
There is no AI involved. It is a script that goes through the text elements in documents and checks if the filter strings provided are present or absent.
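The case-insensitive check described above can be illustrated with a minimal sketch. The function name `check_filters` is hypothetical and the code is not taken from main.py; it assumes the document text has already been extracted into a string.

```python
def check_filters(text, filters):
    """Map each filter to True or False depending on whether it occurs in text.

    Both sides are lower-cased first, so the comparison ignores case.
    """
    lowered_text = text.lower()
    return {f: f.lower() in lowered_text for f in filters}


result = check_filters("Experienced with EXCEL and SQL.", ["Excel", "python"])
print(result)  # {'Excel': True, 'python': False}
```

Because this is a plain containment check, a filter such as "java" would also match a document that only mentions "JavaScript".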

Acknowledgments

DocumentsFilter is a Python script that uses external libraries to do its job. Special thanks to the maintainers of python-docx, pypdf and openpyxl.

Resources

Logo created using Canva
UML Class Diagram created using http://draw.io/
https://www.visual-paradigm.com/guide/uml-unified-modeling-language/uml-aggregation-vs-composition/
