forked from conjuncts/gmft
comparison.tsv
Deep Link Demo Notebook Deep? Reads image? Detectron? OCR included? "Seems to work"?* get pandas df? get text? get image? throughput (cpu)
nougat [github](https://github.com/facebookresearch/nougat) [Nougat eval](https://colab.research.google.com/drive/1B4agm6hwR-Ia-5AduEU-y7DteNAOxRhX) ✓ ✓ ✓ ✓✓ latex table (mmd) ✓ ✗ ~330 s/page
gmft [github](https://github.com/conjuncts/gmft) [gmft eval](https://colab.research.google.com/drive/1fEqsTdKcO5RNPV_b2v9cB4Y5We9Kv-hR) ✓ ✓ ✗ ✓✓ ✓ ✓ ✓ ~1.867 s/page
img2table [github](https://github.com/xavctn/img2table) [img2table eval](https://colab.research.google.com/drive/1_TD2U0JsaW0SqmuCUv7iSbAyJwvRuq_C) ✗ ✓ ✓ ✓✓ ✓ ✓ ✓ ~1.45 s/page
unstructured [docs.unstructured.io](https://docs.unstructured.io/examplecode/codesamples/apioss/table-extraction-from-pdf) [Unstructured eval](https://colab.research.google.com/drive/1k8IpVqyCW8DUZ8psRxHPCQSnE3XZBuOd) ✓ ✓ ✓ ✓ ✓ ✓ (html -> df) ✓ ? ~15.35 s/page
open-parse (unitable) [github](https://github.com/Filimoa/open-parse) [openparse_quickstart.ipynb](https://colab.research.google.com/drive/1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep) [open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758) ✓ ✓ ✓ ✓ (html -> df) ✓ ✓ (custom) ~126 s/page
open-parse (tatr) [github](https://github.com/Filimoa/open-parse) [open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758) ✓ ✓ ✓ ✓ (html -> df) ✓ ✓ (custom) ~4.992 s/page
open-parse (pymupdf) [github](https://github.com/Filimoa/open-parse) [open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758) ✗ ✗ ✗ ✓ (custom) ~0.67 s/page
deepdoctection, tatr [github](https://github.com/deepdoctection/deepdoctection) [deepdoctection tatr eval](https://colab.research.google.com/drive/19c7uMC0Ya2tfZw1r2itstmuX2wxun86L) ✓ ✓ ✓ ✓ ✗ needs config ? ~58 s/page
surya [github](https://github.com/VikParuchuri/surya) [surya eval](https://colab.research.google.com/drive/1LUqEIiiGt0EDK3jrypWQJKrrXW3nA9ty?usp=drive_link) ✓ ✓ ✓ ✓ ✗ ✗ ✓ ~60.679 s/page
paddleocr [github](https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md) https://medium.com/@malshanCS/automating-table-data-extraction-tools-and-techniques-for-efficiency-a29df313cbda#629d ✓ ✓ ?
alibaba/omniparser [github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/OmniParser) ✓ ✓ ?
alibaba/DocXChain [github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain) ✓ ✓ ?
layoutparser (no commit in 2 yrs?) [github](https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb) https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb ✓ ✓ ✓ unmaintained
doctr (not tbl focused) [github](https://github.com/mindee/doctr) https://huggingface.co/spaces/mindee/doctr ✓ ✓ N/A N/A
Non-deep
camelot [github](https://github.com/camelot-dev/camelot) [camelot eval](https://colab.research.google.com/drive/1ORQPURWJuLvTOeboU0-t4Xg9t6iqTIPO) ✗ ✓ many false positives, needs config ✓ ✓ possible ~1.82 s/page
pdfplumber [github](https://github.com/jsvine/pdfplumber) [pdfplumber eval](https://colab.research.google.com/drive/1DUmd_Sjzhp4ZrltxvXV0-F3fiBQhE8a6) ✗ ✗ or needs config possible ~0.273 s/page
pymupdf [github](https://github.com/pymupdf/PyMuPDF) [pymupdf eval](https://colab.research.google.com/drive/1ZBrAwrfOgDewXhyfDl5xN7mbGUM4idhW) ✗ ✗ or needs config possible ~0.250 s/page
pdfminer [github](https://github.com/pdfminer/pdfminer.six) ✗
Proprietary
mathpix ✓ ✓
Adobe Sensei [developer.adobe.com](https://developer.adobe.com/document-services/apis/pdf-extract/) ✓ ✓
AWS Textract ✓ ✓
Azure Document Intelligence [azure.microsoft.com](https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/) ✓ ✓
Google Document AI [cloud.google.com](https://cloud.google.com/document-ai?hl=en) ✓ ✓
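
The header cell `"Seems to work"?*` begins with a double quote, which quote-aware TSV readers may reject or silently alter. A minimal sketch of one way to load the table, assuming the file is tab-separated: disable quote handling, then pull the numeric part out of the throughput column. The sample rows are abbreviated (link columns omitted), and `seconds_per_page` is a hypothetical helper, not part of any library listed here.

```python
import csv
import io
import re

# Abbreviated sample in the same shape as the table (tab-separated, links omitted).
sample = (
    'Deep\t"Seems to work"?*\tthroughput (cpu)\n'
    'gmft\t✓✓\t~1.867 s/page\n'
    'pymupdf\t✗ or needs config\t~0.250 s/page\n'
)

# QUOTE_NONE makes the reader treat the double quote as an ordinary character,
# so the header cell is kept exactly as written.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
header, body = rows[0], rows[1:]


def seconds_per_page(cell):
    """Return the first number in a throughput cell as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", cell)
    return float(match.group()) if match else None


# Map each tool name (first column) to its seconds-per-page figure (last column).
throughput = {row[0]: seconds_per_page(row[-1]) for row in body}
```

Cells that hold only `?` or are empty come back as `None`, so they can be filtered out before sorting or comparing tools by speed.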