
Missing text in parsed output. #9

Open
GoogleCodeExporter opened this issue Apr 11, 2015 · 0 comments

What steps will reproduce the problem?

1. Parse the page for the year 1979. Some of the text is missing from the
parsed output. For example, the text in the HTML shown below is only partly
extracted. The problem appears to involve certain anchor tags (possibly a
faulty regular expression?).

What is the expected output? What do you see instead?

For this HTML:

<li><a href="/wiki/May_27" title="May 27">May 27</a> – <a 
href="/wiki/1979_Indianapolis_500" title="1979 Indianapolis 500">Indianapolis 
500</a>: <a href="/wiki/Rick_Mears" title="Rick Mears">Rick Mears</a> wins the 
race for the first time, and car owner <a href="/wiki/Roger_Penske" 
title="Roger Penske">Roger Penske</a> for the second time.</li>

The extracted text is: *   wins the race for the first time, and car owner 
Roger Penske for the second time.
Instead of: * May 27 – Indianapolis 500: Rick Mears wins the race for the 
first time, and car owner Roger Penske for the second time.

And for this HTML:
...
<li>The <a href="/wiki/United_States" title="United States">United States</a> 
and the <a href="/wiki/People%27s_Republic_of_China" title="People's Republic 
of China">People's Republic of China</a> establish full <a 
href="/wiki/Sino-American_relations" title="Sino-American relations">diplomatic 
relations</a>.</li>
...

The extracted text is: diplomatic relations.
Instead of: * The United States and the People's Republic of China establish 
full diplomatic relations.
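
For comparison, here is a minimal sketch of how the expected text could be produced without regular expressions. It is not this project's parser: it uses Python's standard `html.parser` module, and the names `ListItemTextExtractor` and `extract_list_items` are hypothetical. It assumes list items are not nested and only shows that anchor text can be kept in place while the tags are dropped.

```python
from html.parser import HTMLParser


class ListItemTextExtractor(HTMLParser):
    """Collects the visible text of each <li>, keeping anchor text in place."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []      # finished list items, one string each
        self._parts = None   # text chunks of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        # Start collecting when a list item opens; <a> tags need no handling
        # because their text arrives through handle_data like any other text.
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            # Join the chunks, then collapse runs of whitespace (including
            # line breaks in the source) into single spaces.
            text = " ".join("".join(self._parts).split())
            self.items.append("* " + text)
            self._parts = None


def extract_list_items(html):
    """Return one '* ...' line per <li> element found in html."""
    parser = ListItemTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


sample = (
    '<li>The <a href="/wiki/United_States" title="United States">United States</a> \n'
    'and the <a href="/wiki/People%27s_Republic_of_China" title="People\'s Republic \n'
    'of China">People\'s Republic of China</a> establish full <a \n'
    'href="/wiki/Sino-American_relations" title="Sino-American relations">diplomatic \n'
    'relations</a>.</li>'
)
print(extract_list_items(sample)[0])
```

Because the text inside each anchor is treated as ordinary character data, the leading "The United States and the ..." portion is kept rather than dropped, which is the behavior expected above.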

Cheers

Original issue reported on code.google.com by [email protected] on 25 May 2011 at 2:21
