Markdown conversion stops when using pymupdf4llm #219

Closed
ozgurnsahin opened this issue Jan 20, 2025 · 2 comments

@ozgurnsahin

ozgurnsahin commented Jan 20, 2025

I use pymupdf4llm actively in my RAG-based product, and so far it has not caused any problems. For one particular PDF, however, to_markdown gets stuck on a single page and the process keeps running without finishing. The PDF and my code are below.

Multiple-Input Variational Auto-Encoder.pdf

my code:
import pymupdf4llm
markdown_pages = pymupdf4llm.to_markdown(r"docs\domain1\Multiple-Input Variational Auto-Encoder.pdf", margins=0)

The process gets stuck at page 12. My pymupdf4llm version is 0.0.17.

@robvandijk

Note: this is the same issue as #215 and is solved by PR #216. The conversion is not actually stuck, just very slow: if you keep it running long enough, it will finish. Once PR #216 has been merged, it will finish much sooner.

FYI, this occurs when the array path_rects in pymupdf4llm/pymupdf4llm/helpers/multi_column.py becomes very large. For your PDF, the length of this array on each page (item number = page number) is:

  1. 651
  2. 0
  3. 37
  4. 4
  5. 1309
  6. 17
  7. 59
  8. 154
  9. 317
  10. 387
  11. 384
  12. 64960
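
To make the outlier concrete, here is a minimal sketch that flags unusually heavy pages from the counts listed above. The cutoff of 10,000 and the helper names are hypothetical, chosen only for illustration. The optional `count_drawings_per_page` helper assumes that the number of items returned by PyMuPDF's `page.get_drawings()` is a rough proxy for `len(path_rects)`; the thread does not confirm how `path_rects` is built.

```python
# Per-page lengths of path_rects, copied from the list above (pages 1..12).
PATH_RECT_COUNTS = [651, 0, 37, 4, 1309, 17, 59, 154, 317, 387, 384, 64960]


def flag_heavy_pages(counts, threshold=10_000):
    """Return 1-based page numbers whose count exceeds `threshold`.

    `threshold` is an arbitrary illustrative cutoff, not a pymupdf4llm setting.
    """
    return [page for page, n in enumerate(counts, start=1) if n > threshold]


def count_drawings_per_page(pdf_path):
    """Count vector drawing paths on each page of a PDF with PyMuPDF.

    Assumption: this is only a proxy for len(path_rects); multi_column.py
    may filter or merge paths before building that array.
    """
    # Imported lazily so the rest of the sketch runs without PyMuPDF installed.
    # Older PyMuPDF releases expose the module as `fitz` instead.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [len(page.get_drawings()) for page in doc]


heavy = flag_heavy_pages(PATH_RECT_COUNTS)
# Ratio of the largest count to the second largest (page 12 vs. page 5).
ratio = PATH_RECT_COUNTS[11] / sorted(PATH_RECT_COUNTS)[-2]
print(heavy)            # [12]
print(round(ratio, 1))  # 49.6
```

Page 12 holds roughly 50 times more path rectangles than the next heaviest page, which fits the observation that the conversion appears to hang on that page.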

@ozgurnsahin

Thank you for the information. I will close the issue with this comment and wait for the patch.
