Repeated table extraction in Markdown output #168

Meaveryway · 2024-10-13T16:28:41Z

Hello there,

Thanks for the wonderful work! This outperforms even most commercial solutions out there!
I have a question about table extraction: when I convert a PDF page that contains a table to Markdown, the table's raw text seems to be extracted first and placed where the table was, and the formatted table then appears at the bottom of the page.

Is this the desired output? Why?
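
One way to check a page's Markdown for this pattern is sketched below. It assumes the formatted table shows up as pipe-delimited Markdown rows and that the raw text repeats the cell contents on ordinary lines. The helper names `split_table_rows` and `find_duplicated_cells` are hypothetical and are not part of pymupdf4llm, and the sample string is made up for illustration.

```python
def split_table_rows(markdown):
    """Separate pipe-delimited table rows from all other non-empty lines."""
    table_lines, other_lines = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            table_lines.append(stripped)
        elif stripped:
            other_lines.append(stripped)
    return table_lines, other_lines


def find_duplicated_cells(markdown):
    """Return table cell texts that also occur in the non-table text.

    Hypothetical helper: plain substring matching, so very short cells
    can match by coincidence.
    """
    table_lines, other_lines = split_table_rows(markdown)
    other_text = "\n".join(other_lines)
    duplicated = []
    for row in table_lines:
        for cell in row.strip("|").split("|"):
            cell = cell.strip()
            # Skip empty cells and header separator cells such as --- or :---:
            if not cell or set(cell) <= set("-: "):
                continue
            if cell in other_text and cell not in duplicated:
                duplicated.append(cell)
    return duplicated


# Made-up page output: raw table text first, formatted table afterwards.
sample = """Intro paragraph.

Name Price
Widget 4.50

| Name | Price |
|---|---|
| Widget | 4.50 |
"""
print(find_duplicated_cells(sample))
```

A non-empty result only suggests duplication; it does not prove the raw text came from the table.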

PiochU19 · 2024-11-05T15:44:55Z

I noticed the same issue while investigating performance improvements. I forked the repository, profiled the code, and found that by skipping Markdown table extraction at this line, I could process PDFs approximately five times faster, even though those PDFs didn't contain many tables.
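
For anyone who wants to repeat this kind of measurement, here is a minimal profiling sketch using only the standard library. It is not the profiling setup used above, which isn't shown in this thread. `profile_call` is a hypothetical helper and `fake_conversion` is a stand-in workload; in practice you would pass the real PDF-to-Markdown call instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show only the `top` entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def fake_conversion(n):
    # Stand-in workload (assumption); replace with the real conversion call.
    return sum(i * i for i in range(n))


result, report = profile_call(fake_conversion, 100_000)
print(report)
```

Functions that dominate the cumulative-time column are the candidates to look at first.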

@JorjMcKie

jamie-lemon (Maintainer) · 2024-11-05T16:13:54Z

I think this is a bug in the latest version. To avoid the duplicated table content, I had to go down a version with:

pip install -Iv pymupdf4llm==0.0.16
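
To confirm which version ended up installed after the downgrade, something like the following sketch could be used. It relies on the standard-library `importlib.metadata`; the helper names `parse_version` and `installed_version` are hypothetical, and `parse_version` assumes purely numeric dotted versions such as 0.0.16.

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted numeric version such as '0.0.16' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


PINNED = "0.0.16"  # the version named in the pip command above
current = installed_version("pymupdf4llm")
if current is None:
    print("pymupdf4llm is not installed")
elif current == PINNED:
    print("pymupdf4llm is pinned to", PINNED)
else:
    print("pymupdf4llm", current, "is installed, not", PINNED)
```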

Meaveryway (Author) · 2024-11-05T18:53:25Z

@PiochU19 @jamie-lemon This did turn out to be a bug. I reported it as an issue in #171.

tahitimoon · 2024-11-23T16:22:26Z

Version 0.0.17 does not seem to have fixed this issue.
