Text Extraction from docx and pptx files #102
Replies: 6 comments 3 replies
-
Convert them to PDFs.
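One way to do that conversion with PyMuPDF itself is sketched below. It assumes your installed PyMuPDF build can open the source format at all, which for Office files may depend on the version or on PyMuPDF Pro. The helper names `pdf_target_path` and `convert_to_pdf` are made up for this illustration. `Document.convert_to_pdf()` returns the PDF as bytes, which are then reopened as a PDF document and saved.

```python
from pathlib import Path


def pdf_target_path(source):
    """Return the source path with its extension replaced by .pdf (hypothetical helper)."""
    return Path(source).with_suffix(".pdf")


def convert_to_pdf(source):
    """Convert a document PyMuPDF can open into a PDF file next to it (hypothetical helper)."""
    # Imported here so the path helper above works even without PyMuPDF installed.
    import pymupdf

    target = pdf_target_path(source)
    # Assumption: this PyMuPDF build is able to open the given file type.
    with pymupdf.open(str(source)) as doc:
        pdf_bytes = doc.convert_to_pdf()  # in-memory PDF as bytes
    # Reopen the bytes as a real PDF document and write it to disk.
    with pymupdf.open("pdf", pdf_bytes) as pdf:
        pdf.save(str(target))
    return target
```

Calling `convert_to_pdf("testfile.docx")` would then write `testfile.pdf` next to the input, provided the file opens successfully.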
-
You can use them directly by passing their filenames, such as "document.docx".
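A minimal sketch of that direct usage follows. Whether it works appears to depend on the installed versions: a later reply in this thread reports correct behavior with pymupdf4llm v0.0.10 and pymupdf v1.24.9. The suffix set and the helper names `is_office_file` and `office_to_markdown` are assumptions made for this example and are not part of pymupdf4llm.

```python
from pathlib import Path

# Assumption: the two formats asked about in this thread.
OFFICE_SUFFIXES = {".docx", ".pptx"}


def is_office_file(path):
    """Return True if the filename has one of the Office suffixes above (hypothetical helper)."""
    return Path(path).suffix.lower() in OFFICE_SUFFIXES


def office_to_markdown(path):
    """Pass an Office filename straight to pymupdf4llm.to_markdown (hypothetical helper)."""
    # Imported here so the suffix check above works without pymupdf4llm installed.
    import pymupdf4llm

    if not is_office_file(path):
        raise ValueError(f"not a docx/pptx file: {path}")
    # to_markdown accepts a filename and returns the Markdown text as a string.
    return pymupdf4llm.to_markdown(str(path))
```

For example, `office_to_markdown("document.docx")` would return the Markdown text if the installed versions can read the file.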
-
When I try reading the
import pymupdf4llm
elements = pymupdf4llm.to_markdown("testfile.docx")
...
~\AppData\Roaming\Python\Python311\site-packages\pymupdf\__init__.py in ?(page, required)
333 return page
334 elif isinstance(page, mupdf.FzPage):
335 ret = mupdf.pdf_page_from_fz_page(page)
336 if required:
--> 337 assert ret.m_internal
338 return ret
339 elif page is None:
340 assert 0, f'page is None'
AssertionError:
-
I just confirmed correct behavior using pymupdf4llm v0.0.10 and pymupdf v1.24.9.
-
I can confirm that it was an error particular to |
-
I think converting xlsx to PDF is a common use case, but I'm not sure of the best way to do this with PyMuPDF Pro:
import pymupdf.pro
pymupdf.pro.unlock(KEY)
doc = pymupdf.open("./Collision Extract Layout.xls")
doc.save("./Collision Extract Layout.pdf")
Returns
Docs are unhelpful.
-
Hi there, on the website you state that text can be extracted from all sorts of documents (e.g., docx and pptx): https://pymupdf4llm.readthedocs.io/en/latest/. Are there any examples of how best to proceed if I have docx and/or pptx files instead of PDF files?