Text Extraction from docx and pptx files #102

simonschoe · 2024-07-29T07:31:34Z

Hi there, the website states that text can be extracted from all sorts of documents (e.g., docx and pptx): https://pymupdf4llm.readthedocs.io/en/latest/. Are there any examples of how best to proceed if I have docx and/or pptx files instead of PDF files?

hewliyang · 2024-08-04T13:23:08Z

Convert them to PDFs.

JorjMcKie · 2024-08-04T15:13:05Z

Maintainer

You can use them directly by passing their filenames, like "document.docx".
The issue is that page sizes in these cases are fluid ("reflowable"), and no tables or text columns are recognized.
It is therefore recommended to treat the full document as one large page, which is the default: height is set to None.
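
A minimal sketch of the direct approach, assuming an installed pymupdf4llm whose to_markdown() accepts a page_height keyword and a PyMuPDF build that can open the file type. The helper names (is_office_file, office_to_markdown) and the suffix set are hypothetical and only serve the illustration.

```python
from pathlib import Path

# Suffixes this sketch treats as reflowable office formats (an illustrative
# assumption, taken from the two formats asked about in this thread).
OFFICE_SUFFIXES = {".docx", ".pptx"}


def is_office_file(filename: str) -> bool:
    """Return True if the filename ends in one of the office suffixes above."""
    return Path(filename).suffix.lower() in OFFICE_SUFFIXES


def office_to_markdown(filename: str) -> str:
    """Extract Markdown from a docx/pptx file passed by its filename."""
    if not is_office_file(filename):
        raise ValueError(f"not a docx/pptx file: {filename}")
    # Imported lazily so the suffix helper works without pymupdf4llm installed.
    import pymupdf4llm

    # page_height=None (assumed to be the default) treats the whole document
    # as one large page instead of splitting it into fixed-size pages.
    return pymupdf4llm.to_markdown(filename, page_height=None)


# Usage (needs pymupdf4llm and a real file on disk):
# md_text = office_to_markdown("document.docx")
```

Passing page_height=None explicitly only documents the intent; leaving it out should behave the same if None is indeed the default.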

simonschoe · 2024-08-07T07:05:14Z

Author

> You can directly use them by their filenames like "document.docx". The issue is that page sizes in these cases are fluid "reflowable", and no tables or text columns are recognized. It is recommended to therefore regard the full document as one large page, which is the default: height is set to None.

When I try reading the docx file directly, I currently obtain an AssertionError:

import pymupdf4llm
elements = pymupdf4llm.to_markdown("testfile.docx")

...

~\AppData\Roaming\Python\Python311\site-packages\pymupdf\__init__.py in ?(page, required)
    333         return page
    334     elif isinstance(page, mupdf.FzPage):
    335         ret = mupdf.pdf_page_from_fz_page(page)
    336         if required:
--> 337             assert ret.m_internal
    338         return ret
    339     elif page is None:
    340         assert 0, f'page is None'

AssertionError:

JorjMcKie · 2024-08-07T07:12:57Z

Maintainer

I just confirmed correct behavior using pymupdf4llm v0.0.10 and pymupdf v1.24.9.
Transferring this thread to Discussions tab.

simonschoe · 2024-08-08T06:43:03Z

Author

I can confirm that the error was specific to pymupdf v1.24.7 (pymupdf/PyMuPDF#3654). It is now fixed.
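
For anyone hitting the same AssertionError, a small sketch of how the installed version could be checked. The only version this thread reports as affected is 1.24.7; the helper names are hypothetical, and the parser assumes a plain dotted version string without pre-release suffixes.

```python
from __future__ import annotations

from importlib import metadata

# Version reported in this thread as raising the AssertionError.
AFFECTED_VERSION = (1, 24, 7)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '1.24.9' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected(version: str) -> bool:
    """Return True only for the version named in this thread."""
    return parse_version(version) == AFFECTED_VERSION


def installed_pymupdf_version() -> str | None:
    """Return the installed PyMuPDF version, or None if it is not installed."""
    try:
        return metadata.version("pymupdf")
    except metadata.PackageNotFoundError:
        return None


# Usage:
# version = installed_pymupdf_version()
# if version is not None and is_affected(version):
#     print("pymupdf", version, "is affected; consider upgrading")
```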

Filimoa · 2024-12-30T21:36:35Z

I think wanting to convert xlsx to PDF is a common use case, but I'm not sure of the best way to do this with pymupdfpro.

import pymupdf.pro
pymupdf.pro.unlock(KEY)

doc = pymupdf.open("./Collision Extract Layout.xls")
doc.save("./Collision Extract Layout.pdf")

Returns

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/var/folders/yw/680m_xnx3b9427r40r4hdzv40000gn/T/ipykernel_69191/605626560.py in ?()
      1 doc = pymupdf.open("/Users/sergey/Downloads/Collision Extract Layout.xls")
      2 
      3 # save as pdf
----> 4 doc.save("/Users/sergey/Downloads/aug22.pdf", clean=True)

~/Developer/business/filings-ai-website/backend/.venv/lib/python3.12/site-packages/pymupdf/__init__.py in ?(self, filename, garbage, clean, deflate, deflate_images, deflate_fonts, incremental, ascii, expand, linear, no_new_id, appearance, pretty, encryption, permissions, owner_pw, user_pw, preserve_metadata, use_objstms, compression_effort)
   5620                 raise ValueError("incremental needs original file")
   5621         if user_pw and len(user_pw) > 40 or owner_pw and len(owner_pw) > 40:
   5622             raise ValueError("password length must not exceed 40")
   5623 
-> 5624         pdf = _as_pdf_document(self)
   5625         opts = mupdf.PdfWriteOptions()
   5626         opts.do_incremental = incremental
   5627         opts.do_ascii = ascii

~/Developer/business/filings-ai-website/backend/.venv/lib/python3.12/site-packages/pymupdf/__init__.py in ?(document, required)
    469         return document
    470     elif isinstance(document, mupdf.FzDocument):
    471         ret = mupdf.PdfDocument(document)
    472         if required:
--> 473             assert ret.m_internal
    474         return ret
    475     elif document is None:
    476         assert 0, f'document is None'

AssertionError:

Docs are unhelpful.

JorjMcKie Dec 31, 2024
Maintainer

> Docs are very unhelpful.

Well, admittedly we haven't yet implemented this telepathy feature 😂!
You still have to enter "convert" in the documentation's search field. Scrolling down a bit leads you to this: [screenshot]

The Document.save() method's documentation on the other hand clearly says "PDF only".
So, what else could we have done?

Filimoa Dec 31, 2024

Raise a 'NotImplementedError' instead? I'm not sure your screenshot tells the user anything about this not being implemented.
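
A rough sketch of what this suggestion could look like on the user side. save_pdf_only is a hypothetical wrapper, not part of PyMuPDF, and it assumes the Document.is_pdf property is available on the opened document.

```python
def save_pdf_only(doc, filename: str) -> None:
    """Save `doc` via its save() method, but fail loudly for non-PDF input."""
    # is_pdf is expected to be True only when the opened document is a PDF.
    if not doc.is_pdf:
        raise NotImplementedError(
            "save() supports PDF documents only; "
            "convert the document to PDF first"
        )
    doc.save(filename)


# Usage (needs pymupdf and a real file on disk):
# import pymupdf
# doc = pymupdf.open("input.docx")
# save_pdf_only(doc, "output.pdf")  # raises NotImplementedError
```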

JorjMcKie Dec 31, 2024
Maintainer

As a new user, I would click on the link. Then I would see that every supported document type can be converted to a PDF.
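
As a sketch of that conversion route: Document.convert_to_pdf() returns the PDF as bytes, which can then be written to disk. The helper names and file names below are hypothetical, and whether a given xls file can be opened at all depends on the installed build (the post above unlocks PyMuPDF Pro first).

```python
from pathlib import Path


def pdf_path_for(source: str) -> Path:
    """Return the source path with its suffix replaced by '.pdf'."""
    return Path(source).with_suffix(".pdf")


def convert_and_save(doc, target) -> int:
    """Convert an opened document to PDF and write it to `target`.

    Returns the number of bytes written.
    """
    # convert_to_pdf() yields the whole document as PDF bytes in memory.
    pdf_bytes = doc.convert_to_pdf()
    return Path(target).write_bytes(pdf_bytes)


# Usage (needs pymupdf and a real file on disk):
# import pymupdf
# source = "Collision Extract Layout.xls"
# doc = pymupdf.open(source)
# convert_and_save(doc, pdf_path_for(source))
```

Writing the bytes directly avoids calling save() on the non-PDF document, which is the call that fails in the traceback above.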
