Python Application for filtering Word and Pdf documents containing keywords. Outputs an Excel sheet of findings. UML | OOP | python-docx | pypdf | openpyxl | tkinter.

lmponcio/DocumentsFilter

DocumentsFilter

Application for identifying documents containing specific strings.

UML Class Diagram

The script is designed so that more Document subclasses can be added in the future.
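
As a rough illustration of that extensibility, the sketch below shows one way an abstract base class could keep the matching logic in one place while each subclass only supplies text extraction. The names `extract_text`, `find_filters` and `TextDocument` are assumptions made for this example and are not taken from the repository. The plain-text subclass is hypothetical and stands in for the real DOCX and PDF subclasses so that the example runs with the standard library only.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Document(ABC):
    """Base class: shared matching logic, format-specific text extraction."""

    def __init__(self, path):
        self.path = Path(path)

    @abstractmethod
    def extract_text(self):
        """Return the document's text as a single string."""

    def find_filters(self, filters):
        # Map each filter to True/False depending on whether it occurs in the text.
        text = self.extract_text().lower()
        return {f: f.lower() in text for f in filters}


class TextDocument(Document):
    """Hypothetical subclass for plain-text files (illustration only)."""

    def extract_text(self):
        return self.path.read_text(encoding="utf-8")
```

Under this design, supporting a new format would only require a new subclass that implements `extract_text`.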

GUI

Below is a screenshot of the application's user interface.

Output spreadsheet

Please find below a screenshot of the output spreadsheet generated when running DocumentsFilter to analyse 200 resumes.

How to use it

Using the executable file (easy way)

  • Download "DocumentsFilter.zip" from https://github.com/lmponcio/DocumentsFilter
  • Extract the "DocumentsFilter" folder onto your computer (right-click the zip file and choose "Extract All")
  • Open "Filters.txt", write one filter (or keyword) per line and save the file
  • Copy all documents you want to analyse into the "Files to Filter" folder (file extensions ".docx" and ".pdf")
  • Double-click "DocumentsFilter.exe"
  • Click "Run Filters"
  • The results will be generated in a folder next to the exe file

Using main.py (source code)

  • Clone the repository onto your computer
  • In the same folder where main.py is located, create a "Files to Filter" directory and a "Filters.txt" file
  • Open "Filters.txt", write one filter (or keyword) per line and save the file
  • Copy all documents you want to analyse into the "Files to Filter" folder (file extensions ".docx" and ".pdf")
  • Create a Python virtual environment and use requirements.txt to install all the required dependencies
  • Run main.py
  • Click "Run Filters"
  • The results will be generated in a folder next to main.py
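
The folder and file preparation described in the steps above can also be scripted. The following is a minimal sketch, assuming the folder and file names given in the steps. The helper names `prepare_workspace` and `read_filters` are hypothetical, and skipping blank lines when reading the filters back is an assumption about how the file is interpreted, not confirmed behaviour of main.py.

```python
from pathlib import Path

FILES_DIR_NAME = "Files to Filter"
FILTERS_FILE_NAME = "Filters.txt"


def prepare_workspace(base_dir, filters):
    """Create the documents folder and write one filter per line."""
    base = Path(base_dir)
    files_dir = base / FILES_DIR_NAME
    files_dir.mkdir(parents=True, exist_ok=True)
    filters_file = base / FILTERS_FILE_NAME
    filters_file.write_text("\n".join(filters) + "\n", encoding="utf-8")
    return files_dir, filters_file


def read_filters(filters_file):
    """Read the filters back, ignoring blank lines (assumed behaviour)."""
    lines = Path(filters_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, calling `prepare_workspace(".", ["Excel", "Python"])` from the folder that contains main.py would create both items there. The documents still need to be copied into the folder by hand.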

Important information

  • DocumentsFilter checks DOCX (not DOC) and PDF files. If files with other extensions are provided (DOC, JPEG, CSV, etc.) they will be ignored. This could change in future releases.
  • The filters are not case-sensitive (writing "Excel" or "excel" in Filters.txt has the same effect). The script converts both the filters and the document contents to lowercase, and checks whether each lower-cased filter is contained in the lower-cased document contents.
  • DocumentsFilter is not 100% accurate: it is quite accurate, but not perfect. For more information, check the documentation of the libraries used for scanning the ".docx" files (python-docx) and the ".pdf" files (pypdf).
  • Images are not checked, only text. In the future I might add optical character recognition so that text in images is also checked, but for now only text elements are checked.
  • There is no AI involved. It is a script that goes through the text elements in documents and checks if the filter strings provided are present or absent.
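
The extension rule and the case-insensitive matching described above can be sketched as follows. This is an illustration under stated assumptions and not the code in main.py: the names `is_supported` and `check_filters` are hypothetical, and treating the file extension itself as case-insensitive (so ".PDF" is accepted) is an assumption.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".docx", ".pdf"}


def is_supported(filename):
    """Return True for DOCX and PDF files; any other extension is ignored."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def check_filters(text, filters):
    """Lower-case both sides, then test whether each filter is contained in the text."""
    lowered = text.lower()
    return {f: f.lower() in lowered for f in filters}
```

Because this is a containment check, a filter such as "excel" would also match the word "excellent".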

Acknowledgments

DocumentsFilter is a Python application that relies on external libraries to do its job. Special thanks to the maintainers of python-docx, pypdf and openpyxl.
