Markdown conversion stops when using pymupdf4llm #219

ozgurnsahin · 2025-01-20T11:11:08Z

I am actively using pymupdf4llm for my RAG-based product, and so far it has not caused any issues with my PDFs. For one PDF, however, to_markdown gets stuck on a single page and the process keeps running indefinitely. I am sharing the PDF and my code below.

Multiple-Input Variational Auto-Encoder.pdf

My code:
import pymupdf4llm
markdown_pages = pymupdf4llm.to_markdown(r"docs\domain1\Multiple-Input Variational Auto-Encoder.pdf", margins=0)

The process gets stuck at page 12. My pymupdf4llm version is 0.0.17.
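To confirm which page is responsible, one option is to convert the pages one at a time and time each call. The following is a minimal sketch, not part of pymupdf4llm: `time_pages` is a hypothetical helper that takes any per-page conversion callable, so it can be tried without the PDF at hand. It assumes the `pages` argument of `to_markdown`, which takes 0-based page numbers.

```python
import time
from typing import Callable, Dict, Iterable


def time_pages(convert: Callable[[int], str], page_numbers: Iterable[int]) -> Dict[int, float]:
    """Run `convert` once per page number and record the seconds each call takes."""
    timings: Dict[int, float] = {}
    for pno in page_numbers:
        # Announce the page before converting, so a slow page is visible while it runs.
        print(f"converting page index {pno} ...", flush=True)
        start = time.perf_counter()
        convert(pno)
        timings[pno] = time.perf_counter() - start
    return timings


# Hypothetical usage with pymupdf4llm (page numbers are 0-based, so page 12 is index 11):
#   import pymupdf4llm
#   path = r"docs\domain1\Multiple-Input Variational Auto-Encoder.pdf"
#   timings = time_pages(
#       lambda pno: pymupdf4llm.to_markdown(path, pages=[pno], margins=0),
#       range(12),
#   )
```

The last page index printed before the output stalls is the page that is slow to convert.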

robvandijk · 2025-01-21T14:27:20Z

Note: this is the same issue as #215 and is solved by PR #216. The conversion is not actually stuck, just very slow: if you keep it running long enough, it will finish. Once PR #216 has been merged, it will finish much sooner.

FYI, this occurs when the array path_rects in pymupdf4llm/pymupdf4llm/helpers/multi_column.py becomes very large. For your PDF, the length of this array on each page is:

651
0
37
4
1309
17
59
154
317
387
384
64960
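As a quick illustration of how these counts single out the problem page, here is a small sketch. It assumes the values above are listed in page order starting at page 1; `heavy_pages` and the threshold of 10,000 are hypothetical choices for this example, not anything defined by pymupdf4llm.

```python
from typing import List


def heavy_pages(counts: List[int], threshold: int = 10_000) -> List[int]:
    """Return the 1-based page numbers whose path_rects count exceeds `threshold`."""
    return [page for page, n in enumerate(counts, start=1) if n > threshold]


# Per-page lengths of path_rects as reported above (assumed to be pages 1 to 12).
path_rect_counts = [651, 0, 37, 4, 1309, 17, 59, 154, 317, 387, 384, 64960]

print(heavy_pages(path_rect_counts))  # [12]
```

With these numbers only page 12 is flagged, which matches the page where the conversion appears to hang.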

ozgurnsahin · 2025-01-21T14:35:48Z

Thank you for the information. I will close the issue with this comment and wait for the patch.

ozgurnsahin closed this as completed Jan 21, 2025
