Extract Text from PDF without Including Table Context: PymuPDF4llm #184

VaheSahakyan23 · 2024-11-11T09:40:13Z

Hello,

I wanted to inquire if there is currently a feature, or if there are plans to introduce one, in PyMuPDF4LLM that allows text extraction from PDF files while ignoring table data (i.e., excluding table context from the extracted text).

This feature would be particularly useful for documents with complex table structures that may not be extracted correctly. Including such tables often results in messy or unreadable text in the output.

Looking forward to your thoughts or suggestions on this!

Thank you!

JorjMcKie · 2024-11-11T12:23:53Z

Maintainer

I understand what you mean. But allow me to state that this is logically impossible to do:
A table can only be excluded from extraction if it has been identified as a table by the table finder.

Or are you saying you want to exclude a table's content even when it has been accepted as a table?

6 replies

VaheSahakyan23 Nov 11, 2024
Author

I understand your point. What I’m suggesting is this: when a table is identified by the table finder, its content should not be included in the parsed data (i.e., exclude its content from the final text).

The reasoning is that, in cases where table data is not correctly extracted, including it in the final parsed data introduces noise. This makes it more challenging to clean the final text, as we would then have to remove partially or incorrectly parsed table content from the extracted data.

JorjMcKie Nov 11, 2024
Maintainer

Okay.
But you are aware that there can be cases where parts of a table are detected as a table (e.g. just 2 columns out of a total of 3+ etc.)?
Omitting text within the associated, identified table.bbox will leave behind the unrecognized table parts as text - which is now mutilated / incomplete.
The basic problem is that the table finder is unaware of its errors / gaps.
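
To make the partial-detection problem concrete, here is a minimal sketch in plain Python. It is not PyMuPDF4LLM code: the helper names `inside` and `drop_table_text` and all coordinates are invented for illustration. In PyMuPDF, the rectangles would presumably come from `page.get_text("blocks")` and from the `bbox` of each table returned by `page.find_tables()`.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]        # (x0, y0, x1, y1)
Block = Tuple[float, float, float, float, str]  # bbox followed by the block text


def inside(inner: BBox, outer: BBox) -> bool:
    """Return True if `inner` lies completely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def drop_table_text(blocks: List[Block], table_bboxes: List[BBox]) -> List[str]:
    """Keep the text of every block that is not inside a detected table bbox."""
    kept = []
    for x0, y0, x1, y1, text in blocks:
        if not any(inside((x0, y0, x1, y1), t) for t in table_bboxes):
            kept.append(text)
    return kept


# Hypothetical page: one paragraph followed by a three-column table row.
blocks = [
    (50, 50, 500, 80, "Intro paragraph"),
    (50, 100, 150, 120, "col A"),
    (160, 100, 260, 120, "col B"),
    (270, 100, 370, 120, "col C"),
]

# Assume the table finder recognised only the first two columns.
partial_table = [(45, 95, 265, 125)]

print(drop_table_text(blocks, partial_table))
# ['Intro paragraph', 'col C']
```

The third column survives as stray text because it lies outside the detected `bbox`, which is exactly the mutilated remainder described above.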

JorjMcKie Nov 11, 2024
Maintainer

I therefore think that the only guarantee for complete text is to suppress table finding altogether (by a new parameter).

VaheSahakyan23 Nov 11, 2024
Author

I agree with you that excluding the complete content of a table from the final parsed data requires the table to be fully and accurately identified, which can indeed be a challenging task depending on the style of the table or document.

Having a parameter to suppress table finding altogether could be a useful solution in certain cases, as it provides more control over how the data is parsed.

JorjMcKie Nov 11, 2024
Maintainer

OK, I see that we agree, thank you.
Let me put this on my list of possible enhancements.
