Markdown extraction bug #224

Open
mik8142 opened this issue Feb 12, 2025 · 0 comments

mik8142 commented Feb 12, 2025

Description of the bug

Good afternoon!

Problem Description

I have a problem extracting Markdown from a PDF file.

If I extract text using page.get_text(), everything is fine.

If I extract text using pymupdf4llm.to_markdown(), one digit (“1”) is lost.
PDF file: example.pdf

Problem part (produced by pymupdf4llm):

**Humidity**

                  - 0% minimum (non-condensing)

                   - 95% maximum (non-condensing)

Same part produced by page.get_text():

Humidity
	
•10% minimum (non-condensing)
	
• 95% maximum (non-condensing)
Safety & Emissions

Text inside the PDF:

Humidity
•10% minimum (non-condensing)
• 95% maximum (non-condensing)

Full returned text from pymupdf4llm:

>>> print(md_text)
**Standards**

    - IEEE 802.11n

    - IEEE 802.11g

    - IEEE 802.3

    - IEEE 802.3u

**Security**

     - WEP[TM ]

    - WPA[TM ]- Personal/Enterprise

    - WPA2[TM ]- Personal/Enterprise

**Wireless Signal Rates[1]**
**IEEE 802.11n:**

    - 300 Mbps (max)


# Technical Specifications

**Humidity**

                  - 0% minimum (non-condensing)

                   - 95% maximum (non-condensing)

**Safety & Emissions**

                       - CE

                     - FCC

                        - Wi-Fi Certified

**Dimensions**

                     - 175 x 150 x 31 mm
(6.89 x 5.9 x 1.22 inches)


**IEEE 802.11g:**

    - 54 Mbps    - 48 Mbps    - 36 Mbps

    - 24 Mbps    - 18 Mbps    - 12 Mbps

    - 11 Mbps    - 9 Mbps    - 6 Mbps

**Wireless Frequency Range[2] (Europe)**

    - 2.4 GHz to 2.4835 GHz (802.11g/n)

**Operating Temperature**

    - 0 °C to 40 °C (32 °F to 104 °F)

**Storage Temperature**

    - -20 °C to 65 °C (-4 °F to 149 °F)

1 Maximum wireless signal rate derived from IEEE Standard 802.11b, 802.11g and 802.11n specifications. Actual data throughput will vary. Network conditions and environmental factors, including
volume of network traffic, building materials and construction, and network overhead, lower actual data throughput rate. Environmental factors will adversely affect wireless signal range.
2 Frequency Range varies depending on country’s regulation


-----

Full returned text by page.get_text():

>>> print(d[0].get_text())
74
D-Link DIR-612 User Manual
Appendix C - Technical Specifications
Technical Specifications
Standards
	
• IEEE 802.11n
	
• IEEE 802.11g
	
• IEEE 802.3
	
• IEEE 802.3u
	
Security
	
• WEPTM 
	
• WPATM - Personal/Enterprise
	
• WPA2TM - Personal/Enterprise
	
Wireless Signal Rates1
	
IEEE 802.11n:
	
• 300 Mbps (max)
	
IEEE 802.11g:
	
• 54 Mbps	
• 48 Mbps	
• 36 Mbps
	
• 24 Mbps	
• 18 Mbps	
• 12 Mbps
	
• 11 Mbps	
• 9 Mbps	
• 6 Mbps
	
Wireless Frequency Range2 (Europe)
	
• 2.4 GHz to 2.4835 GHz (802.11g/n)
Operating Temperature
	
• 0 °C to 40 °C (32 °F to 104 °F)
Storage Temperature
	
• -20 °C to 65 °C (-4 °F to 149 °F)
Humidity
	
•10% minimum (non-condensing)
	
• 95% maximum (non-condensing)
Safety & Emissions
	
• CE
	
• FCC
	
• Wi-Fi Certified
Dimensions
	
• 175 x 150 x 31 mm 
                 (6.89 x 5.9 x 1.22 inches)
1 	Maximum wireless signal rate derived from IEEE Standard 802.11b, 802.11g and 802.11n specifications. Actual data throughput will vary. Network conditions and environmental factors, including 
volume of network traffic, building materials and construction, and network overhead, lower actual data throughput rate. Environmental factors will adversely affect wireless signal range.
2 Frequency Range varies depending on country’s regulation

Expected vs Actual Behavior

Expected: The text extracted by pymupdf4llm.to_markdown() and by page.get_text() must be the same (excluding markup).
Actual: The text extracted by pymupdf4llm.to_markdown() is missing one character: the digit "1" in the string "•10% minimum (non-condensing)" is lost.
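
To make the "same text, excluding markup" expectation checkable, here is a minimal sketch of a comparison helper. It is not part of PyMuPDF or pymupdf4llm: the names `normalize` and `missing_lines` are hypothetical, and the set of characters treated as markup (asterisks, hashes, square brackets, bullets, leading list dashes) is an assumption based only on the output shown in this report. It works on plain strings, so it can be fed the results of page.get_text() and pymupdf4llm.to_markdown() directly.

```python
import re


def normalize(text: str) -> str:
    """Drop markup characters and all whitespace so only content remains.

    Assumption: '*', '#', '[', ']' and the bullet character are markup,
    and a dash that starts a line is a Markdown list marker.
    """
    text = re.sub(r"[*#\[\]\u2022]", "", text)
    # Remove leading list dashes ("- item"), but keep minus signs ("-20").
    text = re.sub(r"^\s*-\s+", "", text, flags=re.MULTILINE)
    return re.sub(r"\s+", "", text)


def missing_lines(plain: str, markdown: str) -> list[str]:
    """Return the lines of `plain` whose content is absent from `markdown`."""
    haystack = normalize(markdown)
    missing = []
    for line in plain.splitlines():
        needle = normalize(line)
        if needle and needle not in haystack:
            missing.append(line.strip())
    return missing


# The fragment quoted in this report.
plain = "Humidity\n\t\n\u202210% minimum (non-condensing)\n\t\n\u2022 95% maximum (non-condensing)\n"
markdown = (
    "**Humidity**\n\n"
    "                  - 0% minimum (non-condensing)\n\n"
    "                   - 95% maximum (non-condensing)\n"
)
print(missing_lines(plain, markdown))
```

On the full page this would also list the page header lines ("74", "D-Link DIR-612 User Manual", "Appendix C - Technical Specifications"), which appear only in the page.get_text() output. Because the comparison is a substring search over text with whitespace removed, it can miss a loss when the preceding line happens to end with the missing characters.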

Environment Information

  • OS: Linux Mint 20.3 Una
  • Python: 3.10.16
  • fitz.version: '1.25.3'
  • pymupdf4llm.version: '0.0.17'

Thanks in advance!

How to reproduce the bug

Steps to Reproduce

  1. Download example.pdf
  2. Run simple code:
import fitz
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("example.pdf")

print("MD extracted by pymupdf4llm.to_markdown (with problems)")
print(md_text)

doc = fitz.open("example.pdf")
text = doc[0].get_text()

print("Text extracted by page.get_text() (without problems)")
print(text)

PyMuPDF version

1.25.3

Operating system

Linux

Python version

3.10

@JorjMcKie transferred this issue from pymupdf/PyMuPDF Feb 12, 2025