Tool | Link | Demo | Notebook | Deep? | Reads image? | Detectron? | OCR included? | Seems to work | Get pandas df? | Get text? | Get image? | Throughput (CPU) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Deep | ||||||||||||
nougat | github | Nougat eval | ✓ | ✓ | ✓ | ✓✓ | latex table (mmd) | ✓ | ✗ | ~330 s/page | ||
gmft | github | gmft eval | ✓ | ✓ | ✗ | ✓✓ | ✓ | ✓ | ✓ | ~1.867 s/page | ||
img2table | github | img2table eval | ✗ | ✓ | ✓ | ✓✓ | ✓ | ✓ | ✓ | ~1.45 s/page | ||
unstructured | docs.unstructured.io | Unstructured eval | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ (html -> df) | ✓ | ? | ~15.35 s/page | |
open-parse (unitable) | github | openparse_quickstart.ipynb | open-parse eval | ✓ | ✓ | ✓ | ✓ (html -> df) | ✓ | ✓ (custom) | ~126 s/page | ||
open-parse (tatr) | github | open-parse eval | ✓ | ✓ | ✓ | ✓ (html -> df) | ✓ | ✓ (custom) | ~4.992 s/page | |||
open-parse (pymupdf) | github | open-parse eval | ✗ | ✗ | ✗ | ✓ (custom) | ~0.67 s/page | |||||
deepdoctection, tatr | github | deepdoctection tatr eval | ✓ | ✓ | ✓ | ✓ | ✗ needs config | ? | ~58 s/page | |||
surya | github | surya eval | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ~60.679 s/page | ||
paddleocr | github | https://medium.com/@malshanCS/automating-table-data-extraction-tools-and-techniques-for-efficiency-a29df313cbda#629d | ✓ | ✓ | ? | |||||||
alibaba/omniparser | github | ✓ | ✓ | ? | ||||||||
alibaba/DocXChain | github | ✓ | ✓ | ? | ||||||||
layoutparser (no commits in 2 years?) | github | https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb | ✓ | ✓ | ✓ | unmaintained | ||||||
doctr (not table-focused) | github | https://huggingface.co/spaces/mindee/doctr | ✓ | ✓ | N/A | N/A | ||||||
Non-deep | ||||||||||||
camelot | github | camelot eval | ✗ | ✓ many false positives, needs config | ✓ | ✓ | possible | ~1.82 s/page | ||||
pdfplumber | github | pdfplumber eval | ✗ | ✗ or needs config | possible | ~0.273 s/page | ||||||
pymupdf | github | pymupdf eval | ✗ | ✗ or needs config | possible | ~0.250 s/page | ||||||
pdfminer | github | ✗ | ||||||||||
Proprietary | ||||||||||||
mathpix | ✓ | ✓ | ||||||||||
Adobe Sensei | developer.adobe.com | ✓ | ✓ | |||||||||
AWS Textract | ✓ | ✓ | ||||||||||
Azure Document Intelligence | azure.microsoft.com | ✓ | ✓ | |||||||||
Google Document AI | cloud.google.com | ✓ | ✓ |
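
Several rows above mark the pandas column as "html -> df", meaning the tool returns the table as HTML and the DataFrame is built in a second step. The sketch below shows one way that step could look, using only the Python standard library. The `TableRows` class, the `html_table_to_rows` helper, and the sample HTML are all hypothetical and made up for illustration; they are not taken from any of the listed tools, and real tool output may need extra handling (nested tables, `rowspan`/`colspan`).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text chunks of the current cell, or None when outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> or <th>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def html_table_to_rows(html):
    """Return the table as a list of rows; does not handle nested tables or spans."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample standing in for a tool's HTML output.
sample_html = """
<table>
  <tr><th>item</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

rows = html_table_to_rows(sample_html)
header, body = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in body]
print(records)
```

With pandas installed, `pandas.DataFrame(body, columns=header)` turns these rows into a DataFrame. `pandas.read_html` can also parse the HTML directly, but it needs an HTML parser backend such as lxml to be installed.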