Markdown extraction bug #224

mik8142 · 2025-02-12T10:31:09Z

Description of the bug

Good afternoon!

Problem Description

I have a problem extracting Markdown from a PDF file.

If I extract text using page.get_text(), everything is fine.

If I extract text using pymupdf4llm.to_markdown(), one digit (“1”) is lost.
PDF file: example.pdf

Problem part (produced by pymupdf4llm):

**Humidity**

                  - 0% minimum (non-condensing)

                   - 95% maximum (non-condensing)

Same part produced by page.get_text():

Humidity
	
•10% minimum (non-condensing)
	
• 95% maximum (non-condensing)
Safety & Emissions

Text inside pdf:

Humidity
•10% minimum (non-condensing)
• 95% maximum (non-condensing)

Full returned text from pymupdf4llm:

>>> print(md_text)
**Standards**

    - IEEE 802.11n

    - IEEE 802.11g

    - IEEE 802.3

    - IEEE 802.3u

**Security**

     - WEP[TM ]

    - WPA[TM ]- Personal/Enterprise

    - WPA2[TM ]- Personal/Enterprise

**Wireless Signal Rates[1]**
**IEEE 802.11n:**

    - 300 Mbps (max)


# Technical Specifications

**Humidity**

                  - 0% minimum (non-condensing)

                   - 95% maximum (non-condensing)

**Safety & Emissions**

                       - CE

                     - FCC

                        - Wi-Fi Certified

**Dimensions**

                     - 175 x 150 x 31 mm
(6.89 x 5.9 x 1.22 inches)


**IEEE 802.11g:**

    - 54 Mbps    - 48 Mbps    - 36 Mbps

    - 24 Mbps    - 18 Mbps    - 12 Mbps

    - 11 Mbps    - 9 Mbps    - 6 Mbps

**Wireless Frequency Range[2] (Europe)**

    - 2.4 GHz to 2.4835 GHz (802.11g/n)

**Operating Temperature**

    - 0 °C to 40 °C (32 °F to 104 °F)

**Storage Temperature**

    - -20 °C to 65 °C (-4 °F to 149 °F)

1 Maximum wireless signal rate derived from IEEE Standard 802.11b, 802.11g and 802.11n specifications. Actual data throughput will vary. Network conditions and environmental factors, including
volume of network traffic, building materials and construction, and network overhead, lower actual data throughput rate. Environmental factors will adversely affect wireless signal range.
2 Frequency Range varies depending on country’s regulation


-----

Full returned text by page.get_text():

>>> print(d[0].get_text())
74
D-Link DIR-612 User Manual
Appendix C - Technical Specifications
Technical Specifications
Standards
	
• IEEE 802.11n
	
• IEEE 802.11g
	
• IEEE 802.3
	
• IEEE 802.3u
	
Security
	
• WEPTM 
	
• WPATM - Personal/Enterprise
	
• WPA2TM - Personal/Enterprise
	
Wireless Signal Rates1
	
IEEE 802.11n:
	
• 300 Mbps (max)
	
IEEE 802.11g:
	
• 54 Mbps	
• 48 Mbps	
• 36 Mbps
	
• 24 Mbps	
• 18 Mbps	
• 12 Mbps
	
• 11 Mbps	
• 9 Mbps	
• 6 Mbps
	
Wireless Frequency Range2 (Europe)
	
• 2.4 GHz to 2.4835 GHz (802.11g/n)
Operating Temperature
	
• 0 °C to 40 °C (32 °F to 104 °F)
Storage Temperature
	
• -20 °C to 65 °C (-4 °F to 149 °F)
Humidity
	
•10% minimum (non-condensing)
	
• 95% maximum (non-condensing)
Safety & Emissions
	
• CE
	
• FCC
	
• Wi-Fi Certified
Dimensions
	
• 175 x 150 x 31 mm 
                 (6.89 x 5.9 x 1.22 inches)
1 	Maximum wireless signal rate derived from IEEE Standard 802.11b, 802.11g and 802.11n specifications. Actual data throughput will vary. Network conditions and environmental factors, including 
volume of network traffic, building materials and construction, and network overhead, lower actual data throughput rate. Environmental factors will adversely affect wireless signal range.
2 Frequency Range varies depending on country’s regulation
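
The two dumps above can also be compared mechanically instead of by eye. The following is only a rough sketch: the helper names `content_key` and `lines_missing_from_markdown` are made up for this report and are not part of PyMuPDF or pymupdf4llm. It keeps only letters and digits on both sides, so markup, bullets and spacing are ignored, and lists every plain-text line whose content does not appear in the Markdown output.

```python
def content_key(text: str) -> str:
    """Keep only letters and digits, lowercased, so markup, bullets and spacing are ignored."""
    return "".join(ch for ch in text if ch.isalnum()).lower()


def lines_missing_from_markdown(plain_text: str, md_text: str) -> list[str]:
    """Return plain-text lines whose letters/digits do not occur in the Markdown text.

    This is a coarse substring check (hypothetical helper, not a library API):
    page headers that Markdown extraction drops on purpose are reported too.
    """
    md_key = content_key(md_text)
    missing = []
    for line in plain_text.splitlines():
        key = content_key(line)
        # Skip lines that contain only bullets, tabs or other non-alphanumerics.
        if key and key not in md_key:
            missing.append(line.strip())
    return missing


if __name__ == "__main__":
    # Excerpts copied from the two outputs in this report.
    plain = "Humidity\n\t\n•10% minimum (non-condensing)\n\t\n• 95% maximum (non-condensing)\n"
    md = "**Humidity**\n\n   - 0% minimum (non-condensing)\n\n   - 95% maximum (non-condensing)\n"
    for line in lines_missing_from_markdown(plain, md):
        print("missing:", line)
```

On the Humidity excerpt this flags only the "•10% minimum" line. On the full page it would presumably also flag the page header lines, since those appear only in the page.get_text() output.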

Expected vs Actual Behavior

Expected: The text extracted by pymupdf4llm.to_markdown() and by page.get_text() should be identical, apart from markup.
Actual: The text extracted by pymupdf4llm.to_markdown() is missing one character: the digit "1" in the string "•10% minimum (non-condensing)".

Environment Information

OS: Linux Mint 20.3 Una
Python: 3.10.16
fitz.version: '1.25.3'
pymupdf4llm.version: '0.0.17'

Thanks in advance!

How to reproduce the bug

Steps to Reproduce

Download example.pdf.
Run this simple code:

import fitz
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("example.pdf")

print("MD extracted by pymupdf4llm.to_markdown (with problems)")
print(md_text)

doc = fitz.open("example.pdf")
text = doc[0].get_text()

print("Text extracted by page.get_text() (without problems)")
print(text)
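
One detail that may or may not matter: in the raw text, the affected line is the only bullet with no space after "•" ("•10%" versus "• 95%"). Whether this is what triggers the lost digit is an assumption on my side, not something I have confirmed. A small sketch to list such lines from the output of page.get_text("dict") (the function name `bullets_without_space` is hypothetical, only for this report):

```python
def bullets_without_space(page_dict: dict, bullets: str = "•") -> list[str]:
    """Return text lines where a leading bullet is directly followed by a non-space character.

    `page_dict` is expected to have the layout returned by page.get_text("dict"):
    blocks -> lines -> spans, each span carrying a "text" key.
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            stripped = text.lstrip()
            if len(stripped) > 1 and stripped[0] in bullets and not stripped[1].isspace():
                hits.append(text)
    return hits


# Usage with the file from this report (requires PyMuPDF):
#   import fitz
#   doc = fitz.open("example.pdf")
#   print(bullets_without_space(doc[0].get_text("dict")))
```

If the bullet and the text are stored as separate lines in the PDF, this check will not find them, so an empty result does not rule the spacing out.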

PyMuPDF version

1.25.3

Operating system

Linux

Python version

3.10

JorjMcKie transferred this issue from pymupdf/PyMuPDF Feb 12, 2025
