What steps will reproduce the problem?
1. Parse a page containing list items with several anchor tags. Some of the
text is missing from the output. For example, the text in the HTML shown
below (from the page for the year 1979) is not extracted correctly. The
problem appears to be in how certain anchor tags are handled (possibly a
faulty regular expression).
What is the expected output? What do you see instead?
For this code...
<li><a href="/wiki/May_27" title="May 27">May 27</a> – <a
href="/wiki/1979_Indianapolis_500" title="1979 Indianapolis 500">Indianapolis
500</a>: <a href="/wiki/Rick_Mears" title="Rick Mears">Rick Mears</a> wins the
race for the first time, and car owner <a href="/wiki/Roger_Penske"
title="Roger Penske">Roger Penske</a> for the second time.</li>
The extracted text is: * wins the race for the first time, and car owner
Roger Penske for the second time.
Instead of: * May 27 – Indianapolis 500: Rick Mears wins the race for the
first time, and car owner Roger Penske for the second time.
And for this code:
...
<li>The <a href="/wiki/United_States" title="United States">United States</a>
and the <a href="/wiki/People%27s_Republic_of_China" title="People's Republic
of China">People's Republic of China</a> establish full <a
href="/wiki/Sino-American_relations" title="Sino-American relations">diplomatic
relations</a>.</li>
...
The extracted text is: diplomatic relations.
Instead of: * The United States and the People's Republic of China establish
full diplomatic relations.
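For reference, the expected text can be reproduced independently of the parser in question. The sketch below is not the project's code. It assumes Python's standard-library `html.parser`, and the names `TextExtractor` and `extract_text` are hypothetical, chosen only for illustration. The `* ` prefix for list items mirrors the expected output quoted above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects all text content, prefixing each <li> with '* ' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Mark the start of a list item; every other tag is simply dropped.
        if tag == "li":
            self.parts.append("* ")

    def handle_data(self, data):
        # Text inside and between anchor tags arrives here unchanged.
        self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the line breaks and repeated spaces left over from the markup.
    return " ".join("".join(parser.parts).split())
```

Feeding either `<li>` snippet above to `extract_text` should give the full sentence, including the text inside and before every anchor, which is what the report expects from the parser.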
Cheers
Original issue reported on code.google.com by [email protected] on 25 May 2011 at 2:21