Various thoughts from writing a parser #203
Replies: 4 comments 12 replies
-
Love this!
> This is at least required for getting URLs from link definitions and headers that may be referenced before the definition/heading.
why is this required? For a pull parser, I’d expect it to do a single pass, emit _unresolved_ link events, and leave link resolution to the consumer of the events.
> The first thing my implementation does is to traverse the whole file
> and parse the block tree structure
I guess a more general point is that djot is designed such that
two-phase parsing is not necessary (unlike markdown, where you have to
resolve links to be able to parse them). So it probably makes sense to
aim for a “real” single-pass pull parser for djot.
True, this is certainly possible. It would leave more up to the
consumer, though. Perhaps it is possible to provide enough helper
functions to make it not too cumbersome.
However, when one needs resolved links, one has to run the pull parser
and cache the parsed events or output until the link is resolved?[^a]
This may decrease runtime but increase memory requirements?
[^a]: Unless we want to use the latest link definition, which would
require looking until the end.
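Concretely, the buffering a consumer would have to do might look something like this sketch (hypothetical `Event` type and names, not jotdown's actual API):

```rust
use std::collections::HashMap;

// Hypothetical consumer-side resolver: events are held back while a
// link is unresolved and flushed once the matching definition arrives.
#[derive(Debug, PartialEq)]
enum Event {
    Str(String),
    Link { label: String, url: Option<String> },
    Definition { label: String, url: String },
}

#[derive(Default)]
struct Resolver {
    defs: HashMap<String, String>,
    pending: Vec<Event>,
}

impl Resolver {
    /// Feed one event in; get back the prefix of buffered events that
    /// is now fully resolved and can be emitted downstream.
    fn push(&mut self, ev: Event) -> Vec<Event> {
        if let Event::Definition { label, url } = &ev {
            self.defs.insert(label.clone(), url.clone());
        }
        self.pending.push(ev);
        // Fill in any pending links whose definition we now know.
        for ev in &mut self.pending {
            if let Event::Link { label, url } = ev {
                if url.is_none() {
                    *url = self.defs.get(label.as_str()).cloned();
                }
            }
        }
        // Flush the longest resolved prefix; anything behind an
        // unresolved link stays buffered (this is the memory cost).
        let ready = self
            .pending
            .iter()
            .take_while(|ev| !matches!(ev, Event::Link { url: None, .. }))
            .count();
        self.pending.drain(..ready).collect()
    }
}
```

An unresolved link at the front of the queue blocks everything behind it, which is exactly where the increased memory requirement comes from.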
> Another case is attributes.
I’d say the fancy right way to deal with attributes in Rust would be something like this:
```rust
pub struct Attr<'a> {
    raw: &'a str, // attribute as it appears in the source text, with escapes
}
impl Attr<'_> {
    /// private function to compute the value by unescaping chars on the fly
    fn value(&self) -> impl Iterator<Item = char> + '_;
}
// public API, which makes Attr behave like an abstract string,
// implemented internally via the zero-alloc value()
impl fmt::Display for Attr<'_> {}
impl PartialEq<str> for Attr<'_> {}
```
Hmm, this is a good idea! It should allow us to use escaping without any
intermediate string.
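For example, `value` could be a zero-alloc iterator roughly like this (a sketch assuming backslash escapes are the only transformation; djot's actual attribute rules are richer):

```rust
use std::fmt;

pub struct Attr<'a> {
    raw: &'a str, // attribute value as written in the source
}

impl Attr<'_> {
    // Unescape lazily: no intermediate String is ever allocated.
    fn value(&self) -> impl Iterator<Item = char> + '_ {
        let mut chars = self.raw.chars();
        std::iter::from_fn(move || match chars.next()? {
            '\\' => chars.next(), // emit the escaped character itself
            c => Some(c),
        })
    }
}

impl fmt::Display for Attr<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.value().try_for_each(|c| write!(f, "{c}"))
    }
}

impl PartialEq<str> for Attr<'_> {
    fn eq(&self, other: &str) -> bool {
        self.value().eq(other.chars())
    }
}
```

With this, an `Attr` can be written to the output and compared against strings without ever materializing the unescaped value.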
This approach probably won’t work for generating ids from headers (as
that’ll be tantamount to running the inline parser twice). One thing we
can do here is make the parser into a lending iterator: store a
scratch String buf inside the parser, accumulate the _current_ title
there, allow the generated reference event to borrow from this
internal buffer, and re-use the same buffer for all headers.
I guess this would require GATs; I haven't had a good use case to try
them out with before. In order to allow storing events for later use, I
guess an event would have to be reduced to an owned string when cloned,
like a Cow.
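A minimal sketch of what that lending iterator could look like (hypothetical names, not jotdown's actual API; the parser owns one scratch buffer and each yielded event may borrow from it, so only one event can be alive at a time):

```rust
pub struct Parser<'src> {
    headings: std::slice::Iter<'src, &'src str>,
    scratch: String, // one buffer re-used for every heading
}

pub enum Event<'buf> {
    ReferenceId(&'buf str), // borrows the parser's scratch buffer
}

pub trait LendingIterator {
    // The generic associated type (GAT) lets the item borrow from `self`.
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

impl<'src> LendingIterator for Parser<'src> {
    type Item<'a> = Event<'a> where Self: 'a;

    fn next(&mut self) -> Option<Event<'_>> {
        let heading = self.headings.next()?;
        self.scratch.clear();
        // Stand-in for "inline parse and strip the formatting":
        self.scratch.extend(heading.chars().filter(|&c| c != '*'));
        Some(Event::ReferenceId(&self.scratch))
    }
}
```

The borrow returned by `next` ends before the next call, so the same allocation serves every heading; a consumer that wants to keep an event around would clone it into an owned string, Cow-style.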
Interesting ideas! There are lots of things to experiment with.
-
> I'm glad to see this! Do you have comparison benchmarks? I imagine
> it should be much faster than djot.js?
I don't have any good benchmarks yet to compare with. But I am aware of
some inefficiencies in the current implementation and haven't really
done any optimization work so far. Next step is to set up proper
benchmarks so we can experiment with changes and measure the impact.
> I'm curious why you used the naive recursive approach to parsing
> blocks. That's what I did in pandoc's original markdown parser, but
> for commonmark and djot I used a different strategy (see the main loop
> of block.ts in djot.js ; the basic idea is also described at the end
> of spec.commonmark.org). I would think this approach would require
> fewer allocations and make it easier to track source positions (though
> I don't know if your parser aspires to do that). In the reference
> parser we also try to avoid recursion to avoid stack overflows, though
> that is really mostly of theoretical interest since you can always
> impose a reasonable limit on nesting.
Not sure if we did exactly the same naive approach, but my approach
needs a single deque/ring buffer to keep track of the start and end (in
byte position from start of file) of each line for the current block.
And whenever we enter a block, we simply modify the start value of the
lines of the block in order to strip the outer parts.
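A rough sketch of this bookkeeping (hypothetical names, not the actual jotdown internals; a uniform per-line prefix is assumed for brevity):

```rust
use std::collections::VecDeque;

// Each line of the current block is a (start, end) byte range into
// the source; entering an inner block strips the container prefix by
// bumping the start offsets.
struct BlockLines {
    lines: VecDeque<(usize, usize)>,
}

impl BlockLines {
    /// Strip `prefix_len` bytes (e.g. the "> " of a blockquote) from
    /// the start of every line of the current block.
    fn enter_block(&mut self, prefix_len: usize) {
        for (start, end) in self.lines.iter_mut() {
            *start = (*start + prefix_len).min(*end);
        }
    }

    /// The i-th line of the current block, as a slice of the source.
    fn line<'s>(&self, src: &'s str, i: usize) -> &'s str {
        let (start, end) = self.lines[i];
        &src[start..end]
    }
}
```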
We don't lose the source positions either; what we do lose is the
position of the original start of the line. So we can't know the line
or column location, only the byte position. And because I lose the
column information, the unaligned blockquotes become problematic.
I hadn't actually considered your approach before. One problem with my
approach is that the deque might become large if a very long block is
encountered. I think your approach is better in this regard, and is
probably more suitable for a pull parser.
I need to set up benchmarks and do some testing.
-
Hi, I have tried to implement a djot parser in Rust. I released an initial version the other day: jotdown; there is also a web demo at https://hllmn.net/projects/jotdown/demo/. It implements all the features in the syntax reference, but it currently has some known deviations from the reference implementation for some inputs. However, most output is identical; e.g. a typical file like `bench/readme.dj` has identical output.
I thought I would share some thoughts that occurred during the implementation so far and highlight some behavioral differences. I started around 3 months ago and took a month off, so it has been around 2 months of work as a side project to get to this initial draft version. When implementing, I mostly read the syntax reference, experimented with the reference implementation's output, and reused its unit tests; I haven't really looked at the implementation details of any of the existing djot parsers. The goal has been to maximize performance and minimize memory usage.
I also got in contact with @kmaasrud, who turns out to have tried to implement pretty much the same thing (https://sr.ht/~kmaasrud/djot/). We're currently discussing combining our efforts.
Initial inline parsing
The first thing my implementation does is to traverse the whole file and parse the block tree structure. This is at least required for getting URLs from link definitions and headers that may be referenced before the definition/heading. However, it is also necessary to do some inline parsing ahead of time, before beginning inline parsing of the whole document. This is specifically because of automatic headers. A header could be referenced before the actual heading, e.g.
When we encounter the inline `[Heading][]`, we must know if there is a header with a matching id, in order to know whether the link should be empty or "#Heading" (or some URL from a matching link definition). (Also, the heading's attributes must already have been parsed, as they can override the id.) I think the way it is now is reasonable, but it is kind of unfortunate that heading content must be parsed more than once.
This means that the content of every heading has to be known beforehand.
Additionally:
should also create a link to the heading. So we not only need the content, we also need to inline parse every header in advance so that we can determine their ids (by stripping away the formatting).
There are some other cases that require at least partial inline parsing during the block parsing:
Inline structure may take priority over block structure
Generally, the block structure can be discerned prior to any inline parsing. However, this does not seem to apply to tables.
Naively, you could try to identify a table row as simply a line of length >= 2 that starts and ends with `|`. However, in this case the inline syntax seems to take precedence, so one has to consider some inline elements, i.e. verbatim and backslash escapes. So here we either need a partial or full inline parser to determine the block structure.
I am guessing table cells are considered block structure also. Either way, they also require a partial inline parser:
To me, it would seem more consistent if the block structure took priority here. Though I can understand wanting to allow verbatim/escaped pipes in tables. A compromise might be to not allow escaping/verbatim the last pipe, making it possible to easily identify a table row, at least.
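Such a partial inline scan might look roughly like this (a sketch; it simplifies verbatim to single-backtick delimiters, whereas djot allows runs of backticks):

```rust
// Is this line a table row? A '|' inside verbatim (`...`) or escaped
// as \| must not count as the closing cell delimiter.
fn is_table_row(line: &str) -> bool {
    let line = line.trim_end();
    if !line.starts_with('|') || line.len() < 2 {
        return false;
    }
    let mut in_verbatim = false;
    let mut escaped = false;
    let mut last_pipe_at_end = false;
    for (i, c) in line.char_indices() {
        last_pipe_at_end = false;
        if escaped {
            escaped = false; // this char was escaped; skip it
            continue;
        }
        match c {
            '\\' if !in_verbatim => escaped = true,
            '`' => in_verbatim = !in_verbatim,
            // An unescaped, non-verbatim pipe only closes the row if
            // it is the last character of the line.
            '|' if !in_verbatim => {
                last_pipe_at_end = i + c.len_utf8() == line.len()
            }
            _ => {}
        }
    }
    last_pipe_at_end
}
```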
Absolute position of indent
I went for a quite simple and naive method to parse blocks recursively. When parsing an outer block, I simply strip the start of the line that belongs to the outer block or indentation of that block.
For example for
we see a blockquote, so we strip out the blockquote parts to parse the inner blocks and parse again as if we got input without the blockquote:
Then we encounter a list item (4 lines) so we strip out the outer parts again:
and now we only have paragraphs.
If I am not mistaken, this method works for all cases except one: unaligned blockquotes. For example,
the reference implementation parses as
while mine parses it as
causing a different list structure. So a limitation of my method is that we lose the absolute indent of the original full line. And as far as I can tell, an unaligned blockquote is the only case that requires it.
Not sure if it is ever useful to have unaligned blockquotes, though. A simplification of the language might be to simply disallow them, e.g. letting
parse as
<blockquote><p>a\n> b</p></blockquote>
(mostly) Unavoidable string allocations
I have tried to minimize the amount of copying and allocations. In the ideal case we simply copy from the input directly to the output. For example if we have the text
This is represented by some "events":
Here we simply copy the parts of the input specified by the three str events directly to the output (with HTML tags in between).
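For instance, for an input like `This is *emphasized* text`, the events could be `Str("This is ")`, `Start("em")`, `Str("emphasized")`, `End("em")`, `Str(" text")`, and rendering is just concatenation. A hypothetical sketch (illustrative names, not jotdown's actual event types):

```rust
// Str payloads are slices borrowed directly from the input.
enum Event<'a> {
    Start(&'static str), // tag name, e.g. "em"
    End(&'static str),
    Str(&'a str),
}

fn render(events: &[Event]) -> String {
    let mut out = String::new();
    for ev in events {
        match ev {
            Event::Start(tag) => {
                out.push('<');
                out.push_str(tag);
                out.push('>');
            }
            Event::End(tag) => {
                out.push_str("</");
                out.push_str(tag);
                out.push('>');
            }
            // The borrowed slice is copied straight into the output.
            Event::Str(s) => out.push_str(s),
        }
    }
    out
}
```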
However, there are cases when formatting or escaping kind of forces us to create new strings.
For example, a heading might contain formatting:
but the url used for referencing should be
#Formatted heading
without any of the formatting. So I guess we either have to parse the heading content more than once or create an intermediate string.
Another case is attributes. Attributes can in most cases be copied directly from the source code because there is no formatting or other things that we need to strip. There are, however, escapes, which mean that e.g. the value in `a="abc\"def"` should be `abc"def`. Newlines cause the same problem; they are stripped from attributes and URLs. Currently, I mostly try to copy any of these directly if contiguous, and fall back to creating an intermediate string only if e.g. formatting is in the way. Not really sure if we could change the syntax in any way to improve this.
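This copy-directly-or-fall-back strategy can be sketched with a `Cow` (a hypothetical helper, assuming backslash escapes are the only transformation needed):

```rust
use std::borrow::Cow;

// Borrow the source slice when it contains no escapes; allocate an
// unescaped String only when it does.
fn attr_value(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        return Cow::Borrowed(raw); // common case: zero allocations
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => out.extend(chars.next()), // keep the escaped char
            _ => out.push(c),
        }
    }
    Cow::Owned(out)
}
```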
Inline container precedence
The syntax reference states
I interpret this as meaning that this principle applies to any delimiter that may contain inline elements. And this matches the reference implementation most of the time: `*a*`, `_a_`, `:a:`, `[^a]`. However, the last two do not contain inline content but still follow the basic precedence principle. For example, from the unit tests:
I haven't tried implementing it this way yet, but it would complicate things in my implementation at least. Now it is quite simple to just look ahead for a closing parenthesis or a space when encountering the opening parenthesis. It is perhaps worth considering not using the precedence principle here, if that might simplify other implementations as well.
That's it for now at least, just thought I would share some things that I encountered. Implementing a djot parser has been pretty fun. There are still a lot of things that can be improved so I will continue working on it for now.