comparison.tsv

Deep	Link	Demo	Notebook	Deep?	Reads image?	Detectron?	OCR included?	"Seems to work"?*	get pandas df?	get text?	get image?	throughput (cpu)
nougat	[github](https://github.com/facebookresearch/nougat)		[Nougat eval](https://colab.research.google.com/drive/1B4agm6hwR-Ia-5AduEU-y7DteNAOxRhX)	✓	✓		✓	✓✓	latex table (mmd)	✓	✗	~330 s/page
gmft	[github](https://github.com/conjuncts/gmft)		[gmft eval](https://colab.research.google.com/drive/1fEqsTdKcO5RNPV_b2v9cB4Y5We9Kv-hR)	✓	✓		✗	✓✓	✓	✓	✓	~1.867 s/page
img2table	[github](https://github.com/xavctn/img2table)		[img2table eval](https://colab.research.google.com/drive/1_TD2U0JsaW0SqmuCUv7iSbAyJwvRuq_C)	✗	✓		✓	✓✓	✓	✓	✓	~1.45 s/page
unstructured	[docs.unstructured.io](https://docs.unstructured.io/examplecode/codesamples/apioss/table-extraction-from-pdf)		[Unstructured eval](https://colab.research.google.com/drive/1k8IpVqyCW8DUZ8psRxHPCQSnE3XZBuOd)	✓	✓	✓	✓	✓	✓ (html -> df)	✓	?	~15.35 s/page
open-parse (unitable)	[github](https://github.com/Filimoa/open-parse)	[openparse_quickstart.ipynb](https://colab.research.google.com/drive/1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep)	[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)	✓	✓			✓	✓ (html -> df)	✓	✓ (custom)	~126 s/page
open-parse (tatr)	[github](https://github.com/Filimoa/open-parse)		[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)	✓	✓			✓	✓ (html -> df)	✓	✓ (custom)	~4.992 s/page
open-parse (pymupdf)	[github](https://github.com/Filimoa/open-parse)		[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)	✗	✗			✗			✓ (custom)	~0.67 s/page
deepdoctection, tatr	[github](https://github.com/deepdoctection/deepdoctection)		[deepdoctection tatr eval](https://colab.research.google.com/drive/19c7uMC0Ya2tfZw1r2itstmuX2wxun86L)	✓	✓	✓	✓	✗ needs config			?	~58 s/page
surya	[github](https://github.com/VikParuchuri/surya)		[surya eval](https://colab.research.google.com/drive/1LUqEIiiGt0EDK3jrypWQJKrrXW3nA9ty?usp=drive_link)	✓	✓		✓	✓	✗	✗	✓	~60.679 s/page
paddleocr	[github](https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md)		https://medium.com/@malshanCS/automating-table-data-extraction-tools-and-techniques-for-efficiency-a29df313cbda#629d	✓	✓			?				
alibaba/omniparser	[github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/OmniParser)			✓	✓			?				
alibaba/DocXChain	[github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain)			✓	✓			?				
layoutparser (no commit in 2 yrs?)	[github](https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb)	https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb		✓	✓	✓		unmaintained				
												
doctr (not tbl focused)	[github](https://github.com/mindee/doctr)	https://huggingface.co/spaces/mindee/doctr		✓	✓			N/A	N/A			
												
Non-deep												
camelot	[github](https://github.com/camelot-dev/camelot)		[camelot eval](https://colab.research.google.com/drive/1ORQPURWJuLvTOeboU0-t4Xg9t6iqTIPO)	✗				✓ many false positives, needs config	✓	✓	possible	~1.82 s/page
pdfplumber	[github](https://github.com/jsvine/pdfplumber)		[pdfplumber eval](https://colab.research.google.com/drive/1DUmd_Sjzhp4ZrltxvXV0-F3fiBQhE8a6)	✗				✗ or needs config			possible	~0.273 s/page
pymupdf	[github](https://github.com/pymupdf/PyMuPDF)		[pymupdf eval](https://colab.research.google.com/drive/1ZBrAwrfOgDewXhyfDl5xN7mbGUM4idhW)	✗				✗ or needs config			possible	~0.250 s/page
pdfminer	[github](https://github.com/pdfminer/pdfminer.six)			✗								

Proprietary												
mathpix				✓				✓				
Adobe Sensei	[developer.adobe.com](https://developer.adobe.com/document-services/apis/pdf-extract/)			✓				✓				
AWS Textract				✓				✓				
Azure Document Intelligence	[azure.microsoft.com](https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/)			✓				✓				
Google Document AI	[cloud.google.com](https://cloud.google.com/document-ai?hl=en)			✓				✓