comparison.md

| | Link | Demo Notebook | Deep? Reads image? / Detectron? / OCR included? / Seems to work / get pandas df? / get text? / get image? | throughput (cpu) |
| --- | --- | --- | --- | --- |
| Deep | | | | |
| nougat | github | Nougat eval | ✓✓ latex table (mmd) | ~330 s/page |
| gmft | github | gmft eval | ✓✓ | ~1.867 s/page |
| img2table | github | img2table eval | ✓✓ | ~1.45 s/page |
| unstructured | docs.unstructured.io | Unstructured eval | ✓ (html -> df) ? | ~15.35 s/page |
| open-parse (unitable) | github | openparse_quickstart.ipynb, open-parse eval | ✓ (html -> df) ✓ (custom) | ~126 s/page |
| open-parse (tatr) | github | open-parse eval | ✓ (html -> df) ✓ (custom) | ~4.992 s/page |
| open-parse (pymupdf) | github | open-parse eval | ✓ (custom) | ~0.67 s/page |
| deepdoctection, tatr | github | deepdoctection tatr eval | ✗ needs config ? | ~58 s/page |
| surya | github | surya eval | | ~60.679 s/page |
| paddleocr | github | https://medium.com/@malshanCS/automating-table-data-extraction-tools-and-techniques-for-efficiency-a29df313cbda#629d | ? | |
| alibaba/omniparser | github | | ? | |
| alibaba/DocXChain | github | | ? | |
| layoutparser (no commit in 2 yrs?) | github | https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb | unmaintained | |
| doctr (not table-focused) | github | https://huggingface.co/spaces/mindee/doctr | N/A N/A | |
| Non-deep | | | | |
| camelot | github | camelot eval | ✓ many false positives, needs config possible | ~1.82 s/page |
| pdfplumber | github | pdfplumber eval | ✗ or needs config possible | ~0.273 s/page |
| pymupdf | github | pymupdf eval | ✗ or needs config possible | ~0.250 s/page |
| pdfminer | github | | | |
| Proprietary | | | | |
| mathpix | | | | |
| Adobe Sensei | developer.adobe.com | | | |
| AWS Textract | | | | |
| Azure Document Intelligence | azure.microsoft.com | | | |
| Google Document AI | cloud.google.com | | | |
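
Several entries above are marked "(html -> df)", meaning the tool emits an HTML table that still has to be turned into a dataframe. A minimal sketch of that conversion step, using only the standard library, might look like the following. The names `TableRows` and `html_table_to_rows` are hypothetical helpers for illustration, not part of any listed tool, and the sketch assumes a simple table without nesting, `colspan`, or `rowspan`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from <tr>/<td>/<th> tags.

    Hypothetical helper: no handling of nested tables, colspan, or rowspan.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Return the table as a list of rows (header row first, if present)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = (
    "<table>"
    "<tr><th>name</th><th>qty</th></tr>"
    "<tr><td>apple</td><td>3</td></tr>"
    "</table>"
)
rows = html_table_to_rows(SAMPLE)
```

With pandas installed, `pandas.DataFrame(rows[1:], columns=rows[0])` would turn these rows into a dataframe. `pandas.read_html` is the more usual route, though it needs an HTML parser backend such as lxml.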