Various thoughts from writing a parser #203
Replies: 4 comments 12 replies
-
Love this!
> This is at least required for getting URLs from link definitions and headers that may be referenced before the definition/heading.
why is this required? For a pull parser, I’d expect it to do a single pass, emit _unresolved_ link events, and leave link resolution to the consumer of the events.
> The first thing my implementation does is to traverse the whole file
> and parse the block tree structure
I guess a more general point is that djot is designed such that
two-phase parsing is not necessary (unlike markdown, where you have to
resolve links to be able to parse them). So it probably makes sense to
aim for a “real” single-pass pull parser for djot.
True, this is certainly possible. It would leave more up to the
consumer, though. Perhaps it is possible to provide enough helper
functions to make it not too cumbersome.
However, when one needs resolved links, one has to run the pull parser
and cache the parsed events or output until the link is resolved?[^a]
This may decrease runtime but increase memory requirements?
[^a]: Unless we want to use the latest link definition, which would
require looking until the end.
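Concretely, the buffering a consumer would have to do might look something like this sketch (hypothetical `Event` type and names, not jotdown's actual API):

```rust
use std::collections::HashMap;

// Hypothetical consumer-side resolver: events are held back while a
// link is unresolved and flushed once the matching definition arrives.
#[derive(Debug, PartialEq)]
enum Event {
    Str(String),
    Link { label: String, url: Option<String> },
    Definition { label: String, url: String },
}

#[derive(Default)]
struct Resolver {
    defs: HashMap<String, String>,
    pending: Vec<Event>,
}

impl Resolver {
    /// Feed one event in; get back the prefix of buffered events that
    /// is now fully resolved and can be emitted downstream.
    fn push(&mut self, ev: Event) -> Vec<Event> {
        if let Event::Definition { label, url } = &ev {
            self.defs.insert(label.clone(), url.clone());
        }
        self.pending.push(ev);
        // Fill in any pending links whose definition we now know.
        for ev in &mut self.pending {
            if let Event::Link { label, url } = ev {
                if url.is_none() {
                    *url = self.defs.get(label.as_str()).cloned();
                }
            }
        }
        // Flush the longest resolved prefix; anything behind an
        // unresolved link stays buffered (this is the memory cost).
        let ready = self
            .pending
            .iter()
            .take_while(|ev| !matches!(ev, Event::Link { url: None, .. }))
            .count();
        self.pending.drain(..ready).collect()
    }
}
```

An unresolved link at the front of the queue blocks everything behind it, which is exactly where the increased memory requirement comes from.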
> Another case is attributes.
I’d say the fancy right way to deal with attributes in Rust would be something like this:
```rust
pub struct Attr<'a> {
    raw: &'a str, // attribute as it appears in the source text, with escapes
}
impl Attr<'_> {
    /// private function to compute the value by unescaping chars on the fly
    fn value(&self) -> impl Iterator<Item = char> + '_;
}
// public API, which makes Attr behave like an abstract string,
// implemented internally via the zero-alloc value()
impl fmt::Display for Attr<'_> {}
impl PartialEq<str> for Attr<'_> {}
```
Hmm, this is a good idea! It should allow us to use escaping without any
intermediate string.
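For example, `value` could be a zero-alloc iterator roughly like this (a sketch assuming backslash escapes are the only transformation; djot's actual attribute rules are richer):

```rust
use std::fmt;

pub struct Attr<'a> {
    raw: &'a str, // attribute value as written in the source
}

impl Attr<'_> {
    // Unescape lazily: no intermediate String is ever allocated.
    fn value(&self) -> impl Iterator<Item = char> + '_ {
        let mut chars = self.raw.chars();
        std::iter::from_fn(move || match chars.next()? {
            '\\' => chars.next(), // emit the escaped character itself
            c => Some(c),
        })
    }
}

impl fmt::Display for Attr<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.value().try_for_each(|c| write!(f, "{c}"))
    }
}

impl PartialEq<str> for Attr<'_> {
    fn eq(&self, other: &str) -> bool {
        self.value().eq(other.chars())
    }
}
```

With this, an `Attr` can be written to the output and compared against strings without ever materializing the unescaped value.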
This approach probably won’t work for generating ids from headers (as
that’ll be tantamount to running the inline parser twice). One thing we
can do here is make the parser into a lending iterator: store a
scratch String buf inside the parser, accumulate the _current_ title
there, allow the generated reference event to borrow from this
internal buffer, and re-use the same buffer for all headers.
I guess this would require GATs; I haven't had a good use case to try
them out with before. In order to allow storing events for later use, I
guess an event would have to be reduced to an owned string when cloned,
like a Cow.
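A minimal sketch of what that lending iterator could look like (hypothetical names, not jotdown's actual API; the parser owns one scratch buffer and each yielded event may borrow from it, so only one event can be alive at a time):

```rust
pub struct Parser<'src> {
    headings: std::slice::Iter<'src, &'src str>,
    scratch: String, // one buffer re-used for every heading
}

pub enum Event<'buf> {
    ReferenceId(&'buf str), // borrows the parser's scratch buffer
}

pub trait LendingIterator {
    // The generic associated type (GAT) lets the item borrow from `self`.
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

impl<'src> LendingIterator for Parser<'src> {
    type Item<'a> = Event<'a> where Self: 'a;

    fn next(&mut self) -> Option<Event<'_>> {
        let heading = self.headings.next()?;
        self.scratch.clear();
        // Stand-in for "inline parse and strip the formatting":
        self.scratch.extend(heading.chars().filter(|&c| c != '*'));
        Some(Event::ReferenceId(&self.scratch))
    }
}
```

The borrow returned by `next` ends before the next call, so the same allocation serves every heading; a consumer that wants to keep an event around would clone it into an owned string, Cow-style.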
Interesting ideas! There are lots of things to experiment with.
-
> I'm glad to see this! Do you have comparison benchmarks? I imagine
> it should be much faster than djot.js?
I don't have any good benchmarks yet to compare with. But I am aware of
some inefficiencies in the current implementation and haven't really
done any optimization work so far. Next step is to set up proper
benchmarks so we can experiment with changes and measure the impact.
> I'm curious why you used the naive recursive approach to parsing
> blocks. That's what I did in pandoc's original markdown parser, but
> for commonmark and djot I used a different strategy (see the main loop
> of block.ts in djot.js ; the basic idea is also described at the end
> of spec.commonmark.org). I would think this approach would require
> fewer allocations and make it easier to track source positions (though
> I don't know if your parser aspires to do that). In the reference
> parser we also try to avoid recursion to avoid stack overflows, though
> that is really mostly of theoretical interest since you can always
> impose a reasonable limit on nesting.
Not sure if we did exactly the same naive approach, but my approach
needs a single deque/ring buffer to keep track of the start and end (in
byte position from start of file) of each line for the current block.
And whenever we enter a block, we simply modify the start value of the
lines of the block in order to strip the outer parts.
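A rough sketch of this bookkeeping (hypothetical names, not the actual jotdown internals; a uniform per-line prefix is assumed for brevity):

```rust
use std::collections::VecDeque;

// Each line of the current block is a (start, end) byte range into
// the source; entering an inner block strips the container prefix by
// bumping the start offsets.
struct BlockLines {
    lines: VecDeque<(usize, usize)>,
}

impl BlockLines {
    /// Strip `prefix_len` bytes (e.g. the "> " of a blockquote) from
    /// the start of every line of the current block.
    fn enter_block(&mut self, prefix_len: usize) {
        for (start, end) in self.lines.iter_mut() {
            *start = (*start + prefix_len).min(*end);
        }
    }

    /// The i-th line of the current block, as a slice of the source.
    fn line<'s>(&self, src: &'s str, i: usize) -> &'s str {
        let (start, end) = self.lines[i];
        &src[start..end]
    }
}
```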
We don't lose the source positions either; what we do lose is the
position of the original start of the line. So we can't know the line
or column location, only the byte position. And because I lose the
column information, the unaligned blockquotes become problematic.
I hadn't actually considered your approach before. One problem with my
approach is that the deque might become large if a very long block is
encountered. I think your approach is better in this regard, and is
probably more suitable for a pull parser.
I need to set up benchmarks and do some testing.
-
Hi, I have tried to implement a djot parser in Rust. I released an initial version the other day: jotdown; there is also a web demo at https://hllmn.net/projects/jotdown/demo/. It implements all the features in the syntax reference, but it currently has some known deviations from the reference implementation for some inputs. However, most output is identical; e.g. a typical file like `bench/readme.dj` has identical output.
I thought I would share some thoughts that occurred during the implementation so far and highlight some behavioral differences. I started around 3 months ago and took a month off, so it has been around 2 months of work as a side project to get to this initial draft version. When implementing, I mostly read the syntax reference, experimented with the reference implementation's output, and reused its unit tests; I haven't really looked at the implementation details of any of the existing djot parsers. The goal has been to maximize performance and minimize memory usage.
I also got in contact with @kmaasrud, who turns out to have tried to implement pretty much the same thing (https://sr.ht/~kmaasrud/djot/). We're currently discussing combining our efforts.
Initial inline parsing
The first thing my implementation does is to traverse the whole file and parse the block tree structure. This is at least required for getting URLs from link definitions and headers that may be referenced before the definition/heading. However, it is also necessary to do some inline parsing ahead of time, before beginning inline parsing of the whole document. This is specifically because of automatic headers. A header could be referenced before the actual heading, e.g.
When we encounter the inline `[Heading][]`, we must know if there is a header with a matching id, in order to know whether the link should be empty or "#Heading" (or some URL from a matching link definition). (Also, the heading's attributes must already have been parsed, as they can override the id.) I think the way it is now is reasonable, but it is kind of unfortunate that heading content must be parsed more than once.
This means that the content of every heading has to be known beforehand.
Additionally:
should also create a link to the heading. So we not only need the content, we also need to inline parse every header in advance so that we can determine their ids (by stripping away the formatting).
There are some other cases that require at least partial inline parsing during the block parsing:
Inline structure may take priority over block structure
Generally, the block structure can be discerned prior to any inline parsing. However, this does not seem to apply to tables.
Naively, you could try to identify a table row as simply a line of length >= 2 that starts and ends with `|`. However, in this case the inline syntax seems to take precedence, so one has to consider some inline elements, i.e. verbatim and backslash escapes. So here we either need a partial or full inline parser to determine the block structure.
I am guessing table cells are considered block structure also. Either way, they also require a partial inline parser:
To me, it would seem more consistent if the block structure took priority here. Though I can understand wanting to allow verbatim/escaped pipes in tables. A compromise might be to not allow escaping/verbatim the last pipe, making it possible to easily identify a table row, at least.
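Such a partial inline scan might look roughly like this (a sketch; it simplifies verbatim to single-backtick delimiters, whereas djot allows runs of backticks):

```rust
// Is this line a table row? A '|' inside verbatim (`...`) or escaped
// as \| must not count as the closing cell delimiter.
fn is_table_row(line: &str) -> bool {
    let line = line.trim_end();
    if !line.starts_with('|') || line.len() < 2 {
        return false;
    }
    let mut in_verbatim = false;
    let mut escaped = false;
    let mut last_pipe_at_end = false;
    for (i, c) in line.char_indices() {
        last_pipe_at_end = false;
        if escaped {
            escaped = false; // this char was escaped; skip it
            continue;
        }
        match c {
            '\\' if !in_verbatim => escaped = true,
            '`' => in_verbatim = !in_verbatim,
            // An unescaped, non-verbatim pipe only closes the row if
            // it is the last character of the line.
            '|' if !in_verbatim => {
                last_pipe_at_end = i + c.len_utf8() == line.len()
            }
            _ => {}
        }
    }
    last_pipe_at_end
}
```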
Absolute position of indent
I went for a quite simple and naive method to parse blocks recursively. When parsing an outer block, I simply strip the start of the line that belongs to the outer block or indentation of that block.
For example for
we see a blockquote, so we strip out the blockquote parts to parse the inner blocks and parse again as if we got input without the blockquote:
Then we encounter a list item (4 lines) so we strip out the outer parts again:
and now we only have paragraphs.
If I am not mistaken, this method works for all cases except one: unaligned blockquotes. For example,
the reference implementation parses as
while mine parses it as
causing a different list structure. So a limitation of my method is that we lose the absolute indent of the original full line. And as far as I can tell, an unaligned blockquote is the only case that requires it.
Not sure if it is ever useful to have unaligned blockquotes, though. A simplification of the language might be to simply disallow them, e.g. letting
parse as
<blockquote><p>a\n> b</p></blockquote>
(mostly) Unavoidable string allocations
I have tried to minimize the amount of copying and allocations. In the ideal case we simply copy from the input directly to the output. For example if we have the text
This is represented by some "events":
Here we simply copy the parts of the input specified by the three str events directly to the output (with HTML tags in between).
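For instance, for an input like `This is *emphasized* text`, the events could be `Str("This is ")`, `Start("em")`, `Str("emphasized")`, `End("em")`, `Str(" text")`, and rendering is just concatenation. A hypothetical sketch (illustrative names, not jotdown's actual event types):

```rust
// Str payloads are slices borrowed directly from the input.
enum Event<'a> {
    Start(&'static str), // tag name, e.g. "em"
    End(&'static str),
    Str(&'a str),
}

fn render(events: &[Event]) -> String {
    let mut out = String::new();
    for ev in events {
        match ev {
            Event::Start(tag) => {
                out.push('<');
                out.push_str(tag);
                out.push('>');
            }
            Event::End(tag) => {
                out.push_str("</");
                out.push_str(tag);
                out.push('>');
            }
            // The borrowed slice is copied straight into the output.
            Event::Str(s) => out.push_str(s),
        }
    }
    out
}
```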
However, there are cases when formatting or escaping kind of forces us to create new strings.
For example, a heading might contain formatting:
but the url used for referencing should be
#Formatted heading
without any of the formatting. So I guess we either have to parse the heading content more than once or create an intermediate string.
Another case is attributes. Attributes can in most cases be copied directly from the source code because there is no formatting or other things that we need to strip. There are, however, escapes, which mean that e.g. the value in `a="abc\"def"` should be `abc"def`. Newlines cause the same problem; they are stripped from attributes and URLs. Currently, I mostly try to copy any of these directly if contiguous, and fall back to creating an intermediate string only if e.g. formatting is in the way. Not really sure if we could change the syntax in any way to improve this.
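This copy-directly-or-fall-back strategy can be sketched with a `Cow` (a hypothetical helper, assuming backslash escapes are the only transformation needed):

```rust
use std::borrow::Cow;

// Borrow the source slice when it contains no escapes; allocate an
// unescaped String only when it does.
fn attr_value(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        return Cow::Borrowed(raw); // common case: zero allocations
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => out.extend(chars.next()), // keep the escaped char
            _ => out.push(c),
        }
    }
    Cow::Owned(out)
}
```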
Inline container precedence
The syntax reference states
I interpret this as meaning that this principle applies to any delimiter that may contain inline elements. And this matches the reference implementation most of the time: `*a*`, `_a_`, `:a:`, `[^a]`. However, the last two do not contain inline content but still follow the basic precedence principle. For example, from the unit tests:
I haven't tried implementing it this way yet, but it would complicate things in my implementation at least. Now it is quite simple to just look ahead for a closing parenthesis or a space when encountering the opening parenthesis. It is perhaps worth considering not using the precedence principle here, if that might simplify other implementations as well.
That's it for now at least, just thought I would share some things that I encountered. Implementing a djot parser has been pretty fun. There are still a lot of things that can be improved so I will continue working on it for now.