page extract with pdf4llm.to_markdown not extracting the first line of the page in specific scenarios #196

leelaraj72 · 2024-11-29T06:50:33Z


I am using the following simple Python code to extract a page (and the corresponding text) from a PDF file:

import pdf4llm
import pprint

data = pdf4llm.to_markdown("../docs/abc.pdf", pages=[13], page_chunks=True)
pprint.pprint(data[0])

The first line is not always extracted when it is plain text. For example, if the page has the following text:

promotional modifiers where a service item gets added automatically when a certain
finished good is added to the order. This helps in cutting down order creation time, adds
efficiency and accuracy of order creator, and enables companies to implement service
...

then data[0]['text'] has

'finished good is added to the order. This helps in cutting down order '
'creation time, adds\n'
'efficiency and accuracy of order creator, and enables companies to implement '
'service\n'
...

Did anyone else face the same problem?

Answered by JorjMcKie


JorjMcKie · 2024-11-29T09:21:22Z

JorjMcKie
Nov 29, 2024
Maintainer

Try setting the margins parameter. The default is margins=(0, 50, 0, 50), which ignores strips of height 50 at the top and bottom of the page.
Using margins=0 looks at the full page.
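
For anyone landing here later, this is a minimal sketch of that suggestion applied to the call from the question. The helper name extract_page is hypothetical and only there for illustration; pages, page_chunks and margins are the to_markdown parameters already mentioned in this thread, and the path and page number are the ones from the original post.

```python
def extract_page(path, page_no, margins=0):
    """Return the chunk dictionary for a single page.

    margins=0 asks to_markdown to look at the full page instead of
    skipping the top and bottom strips that the default margins ignore.
    """
    # Imported here so that defining the helper does not require the package.
    import pdf4llm

    chunks = pdf4llm.to_markdown(
        path,
        pages=[page_no],   # same page selection as in the question
        page_chunks=True,  # one dictionary per page, with a 'text' key
        margins=margins,
    )
    return chunks[0]


# Example call (assumes the file from the question exists):
# text = extract_page("../docs/abc.pdf", 13)["text"]
```

If the first line of the page sits inside the top strip, it should now show up at the start of the returned text. A small non-zero margin may still be useful on pages that have running headers or footers you want to drop.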

11 replies

Fianax Nov 29, 2024

With this code snippet it does return all the lines of code from the example PDF.

The code you gave me works for what I wanted (thanks for your code 💯), but hdr_info still does not give all the text in the 'span' value that is passed to its method.

Fianax Nov 29, 2024

I have also tried to use flag=TEXTFLAGS_TEXT, but I have not been able to import it from anywhere (fitz.).

What numerical value does it have, so that I can pass it directly?

JorjMcKie Nov 29, 2024
Maintainer

pymupdf.TEXTFLAGS_TEXT of course!!!

JorjMcKie Nov 29, 2024
Maintainer

The check in hdr_info will only be called for the first span of a line - not for every span.
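
To illustrate what that means for a custom header check, here is a rough sketch of a callable that could be passed as hdr_info. It assumes the callable receives a span dictionary plus a page keyword and returns either an empty string or a Markdown header prefix. The function name and the font-size thresholds are made up for the example and are not taken from the library.

```python
def size_based_header(span, page=None):
    """Return a Markdown header prefix for a span, or "" for body text.

    Only the span's font size is inspected. Because the check is made
    for the first span of a line only, the decision cannot depend on
    text that appears in later spans of the same line.
    """
    size = span["size"]
    if size >= 18:   # hypothetical threshold for top-level headings
        return "# "
    if size >= 14:   # hypothetical threshold for second-level headings
        return "## "
    return ""        # everything else is treated as body text


# Possible usage (file path taken from the question):
# pdf4llm.to_markdown("../docs/abc.pdf", hdr_info=size_based_header, margins=0)
```

A check that relies on span["text"] will therefore only ever see the text of the first span of a line, which would explain why the full line text never reaches the callback.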

Fianax Nov 29, 2024

oki, thanks for everything
