autolink for non-HTTP URIs, and other non-tag content, produces invalid XML #1244
Comments
First of all, Python-Markdown is NOT a Commonmark implementation, so I will ignore all references to that spec.
And in fact, according to Babelmark, Markdown.pl only recognizes HTTP URLs. Seems to me we are mostly in line with the reference implementation.
This seems to be the motivation behind your report. However, I have a different view. We cannot be responsible for invalid input; that is the responsibility of the document author. In fact, we do not validate any raw HTML for this reason. Any content wrapped in angle brackets which is not recognized as an auto link is just treated as raw HTML. Everything between (and including) the angle brackets is passed through unaltered in any way. I have no intention of changing that.
That said, if there are some additional forms of URLs that should clearly be recognized as auto links but currently are not, then I would be willing to review a PR. However, if those URLs are not also recognized by the reference implementation, then it would be better for support to be provided through a third party extension.
Speaking of third party extensions, you can always modify any part of the parser with an extension. If you want to avoid invalid output, then you are always free to use an extension to do that. Although, I would think a post processor would be better suited to the task. Simply pass the output from Python-Markdown into some HTML tidy library and then use the output of that in your HTML templates (or whatever you are doing with it).
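As a rough illustration of where such a post-processing step might start, the sketch below only detects whether a rendered fragment is well-formed XML, using Python's standard library. The function name `is_well_formed` and the sample fragments are hypothetical and not part of Python-Markdown; a real pipeline would presumably hand failing output to a tidy library rather than just report it. Note that this checks XML well-formedness, not HTML validity, so output containing named entities such as `&nbsp;` would also be rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        # A dummy root lets fragments with several top-level elements parse.
        ET.fromstring(f"<root>{fragment}</root>")
    except ET.ParseError:
        return False
    return True


# Illustrative fragments (assumed, not captured from a real run of the parser).
print(is_well_formed("<p><ssh://example.org></p>"))
print(is_well_formed("<p><a href='ssh://example.org'>ssh://example.org</a></p>"))
```

A check like this could sit between the Markdown conversion and the template step, so that only the failing documents are routed through the heavier clean-up.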
Thanks for looking at this.
And the wonderful thing about Markdown implementations is that there are so many to choose from! I only mentioned Commonmark because it was one that came to mind, and it seemed useful to mention some spec on top of Gruber's original. I didn't intend to suggest that python-markdown was, or should be, specifically a Commonmark implementation.
True, but I notice that, if we look at the other parsers in that useful list, which produce a result, it's only Markdown.pl and python-markdown that don't do at least something with
That is indeed the motivation. I want to consume what python-markdown produces, so I naturally have an interest in it being consumable.
That's an interesting point of view. Myself, I feel that if there's a ‘spirit of Markdown’, it's that the user is never wrong, and there is no such thing as ‘invalid input’, only input which the processor doesn't recognise, where the user therefore fails to communicate their intention. A Markdown processor is a fairly heuristic tool, and should always have sensible defaults. That is, Markdown is a DWIM application, and I think we can say, with some degree of confidence, that if the user types
I think it would be reasonable to regard essentially any URL inside angle brackets as an autolink candidate -- there's nothing special about HTTP and FTP from Markdown's point of view. That would pass a DWI(probably)M test, for me. In the case of Thus, in
Additionally, and whether or not you thought the above was a good idea (i.e., I can see at least some sort of case for restricting autolink to HTTP/FTP), there might be mileage in something like
That (along with a suitable
Indeed, but element
Let me restate that everything in angle brackets is considered to be raw HTML unless it is clearly an auto link. That is how the reference implementation works and that is how Python-Markdown will always work by default. If you want to alter that behavior, then you can do so in a third party extension. That said, if you want to explore expanding auto links, then that may be worth considering. As with everything else in Markdown, though, I don't think we want to be super strict and encode the entire requirements of some spec. Something simple like
Consider the following:
This renders as
I think items number 2 and 3 are incorrect, (a) because the behaviour doesn't match two significant Markdown specs, and (b) because they are both invalid XML (yes, `<urn:foo>` looks like an XML element with a namespace prefix; let's not go there...).
The autolink feature in the Daring Fireball spec is ‘for URLs and email addresses’ (though the only URL in that example is an HTTP URL). The corresponding section in the CommonMark spec says that the autolink should happen for an absolute URI. So the second case should be turned into `<a href='ssh://example.org'>ssh://example.org</a>`.
What appears to be happening, instead, is that this is being interpreted as literal HTML. The relevant section of Gruber's spec is rather vague, but the corresponding part of the CommonMark spec says that this should happen only to ‘[t]ext between < and > that looks like an HTML tag’, which of course `<ssh://example.org>` doesn't (CommonMark: ‘A tag name consists of an ASCII letter followed by zero or more ASCII letters, digits, or hyphens (-)’).
Independently of any spec, however, having `<ssh://example.org>` appear in the output means that that output is syntactically invalid, and I feel this shouldn't happen for any input, however insane.
Suggestion: if `<starttag>` consists of something other than `[a-zA-Z][a-zA-Z0-9-]*`, then it is either a URI, in which case it should be turned into an `<a>` element, or it is not, in which case it should be included literally in the output, as if the content were instead enclosed in backticks.
This would imply that item 3 should render as `<code>urn:foo</code>`.
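The suggestion can be sketched as a small standalone function. This is only an illustration of the decision rule, not Python-Markdown code: `render_angle_content` is a hypothetical name, and the "is it a URI" test (a scheme followed by `://`) is an assumption chosen so that the two outcomes stated above come out as described, with `ssh://example.org` becoming a link and `urn:foo` becoming code. A stricter or looser URI test would classify `urn:foo` differently.

```python
import re
from html import escape

# Tag-name rule quoted from the CommonMark spec above.
TAG_NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9-]*")
# Assumption: treat "scheme://something" as a URI; this is not taken from any spec.
URI_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://\S+")


def render_angle_content(content: str) -> str:
    """Render the text found between '<' and '>' according to the suggestion."""
    # The candidate tag name is everything up to the first whitespace or slash.
    name = re.split(r"[\s/]", content, maxsplit=1)[0]
    if TAG_NAME_RE.fullmatch(name):
        # Looks like a start tag: pass it through unaltered as raw HTML.
        return f"<{content}>"
    if URI_RE.fullmatch(content):
        # Looks like a URI: turn it into a link.
        url = escape(content, quote=True)
        return f'<a href="{url}">{url}</a>'
    # Neither a tag nor a URI: emit it literally, as if enclosed in backticks.
    return f"<code>{escape(content)}</code>"


print(render_angle_content("ssh://example.org"))
print(render_angle_content("urn:foo"))
```

Under these assumptions nothing that fails the tag-name test reaches the output with its angle brackets intact, which is the property the suggestion is after.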