Performance of using a lot of istring<>'s in a sor<> #140

WChrisK · 2019-01-05T22:10:12Z

Question: Suppose we have a definition like this:

sor<istring<'a', 'a', 'a'>, istring<'a', 'b', 'a'>, istring<'a', 'c', 'x'>>

Will the library test each istring<> in turn for a match, or does it notice that all 3 of them start with 'a'/'A' and build a prefix-tree-like (trie) matcher, so that the first letter is only examined once?

This is more of a curiosity question; however, I am implementing something that will unfortunately have 300+ istring<>'s in a place where performance is important (but not critical). I haven't profiled anything; I'm just exploring ideas and seeing what my bounds are.

ColinH (Maintainer) · 2019-01-06T09:20:10Z

The PEGTL does not optimise this case and, to be honest, it probably never will; it doesn't fit well with our more minimalist approach.

For the example we would probably pull the 'a' in front of the sor, but this would not be practical when you have several hundred different strings.

In that case I would probably write a custom rule. It might be as easy as finding the first non-ASCII-letter in the input and then looking up the run of letters before it in a std::map or sorted std::vector, either global or held in the states.
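A minimal sketch of the lookup part of such a rule, using only the standard library. The names `keywords` and `match_keyword` and the three sample strings are assumptions for illustration; the glue that wraps this in a PEGTL custom rule is deliberately left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// ASCII-only helpers; the keywords are assumed to be plain ASCII letters.
inline bool is_ascii_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
inline char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical keyword table: stored lowercase and sorted once, so that
// std::binary_search can be used on it.
inline const std::vector<std::string>& keywords() {
    static const std::vector<std::string> table = [] {
        std::vector<std::string> t{"acx", "aaa", "aba"};
        std::sort(t.begin(), t.end());
        return t;
    }();
    return table;
}

// Returns the number of bytes to consume if the leading run of ASCII letters
// is a known keyword (compared case-insensitively), or 0 otherwise.
inline std::size_t match_keyword(std::string_view input) {
    std::size_t n = 0;
    while (n < input.size() && is_ascii_letter(input[n])) {
        ++n;  // advance to the first non-letter
    }
    std::string word(input.substr(0, n));
    std::transform(word.begin(), word.end(), word.begin(), ascii_lower);
    return std::binary_search(keywords().begin(), keywords().end(), word) ? n : 0;
}
```

Note that this matches the whole run of letters, so an input like "aaab" is rejected, whereas an istring<'a', 'a', 'a'> would match its first three characters.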

To make it even faster one could try fiddling with trie data structures or similar, but so far we have always found the performance of simpler approaches more than sufficient for our requirements...
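For the curious, a trie along those lines could be sketched roughly like this. `KeywordTrie` is a hypothetical name, the alphabet is assumed to be ASCII letters only, and nothing here is tuned or measured.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string_view>

// Case-insensitive trie over ASCII letters; a sketch, not tuned for speed.
class KeywordTrie {
public:
    // Inserts a keyword; returns false if it contains a non-letter.
    bool insert(std::string_view word) {
        Node* node = &root_;
        for (char c : word) {
            const int i = index(c);
            if (i < 0) return false;
            if (!node->next[i]) node->next[i] = std::make_unique<Node>();
            node = node->next[i].get();
        }
        node->terminal = true;
        return true;
    }

    // Returns the length of the longest keyword that is a prefix of input,
    // or 0 if none matches. Each input character is examined at most once.
    std::size_t longest_match(std::string_view input) const {
        const Node* node = &root_;
        std::size_t best = 0;
        for (std::size_t pos = 0; pos < input.size(); ++pos) {
            const int i = index(input[pos]);
            if (i < 0 || !node->next[i]) break;
            node = node->next[i].get();
            if (node->terminal) best = pos + 1;
        }
        return best;
    }

private:
    struct Node {
        std::array<std::unique_ptr<Node>, 26> next{};
        bool terminal = false;
    };

    // Maps 'a'..'z' and 'A'..'Z' to 0..25, anything else to -1.
    static int index(char c) {
        if (c >= 'a' && c <= 'z') return c - 'a';
        if (c >= 'A' && c <= 'Z') return c - 'A';
        return -1;
    }

    Node root_;
};
```

One design point to be aware of: this returns the longest matching keyword, while a sor<> takes the first alternative that matches, so the two can differ when one keyword is a prefix of another.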

d-frey (Maintainer) · 2019-01-06T10:07:55Z

Another option: The rule in the grammar would match any sequence of characters (plus<alpha> or something similar), and the action's apply-method could look up the matched string in a case-insensitive map and return true/false accordingly. Note that this requires the action to be applied every time the rule is matched, so it will not work correctly inside at<>, which disables actions. But when this is not a problem for your use-case, an apply-method that returns a boolean result might be an option.
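The case-insensitive map itself needs nothing beyond the standard library. Here is a rough sketch of what such an apply-method could call; `AsciiCaseLess`, `Keyword`, `KeywordMap` and `lookup_keyword` are made-up names, and the PEGTL action around it is omitted.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <string_view>

// Orders strings while ignoring ASCII case, so "ABA" and "aba" are one key.
struct AsciiCaseLess {
    using is_transparent = void;  // allows lookup by std::string_view
    bool operator()(std::string_view a, std::string_view b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](char x, char y) { return lower(x) < lower(y); });
    }
    static char lower(char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
    }
};

// Hypothetical keyword identifiers, only for illustration.
enum class Keyword { aaa, aba, acx };

using KeywordMap = std::map<std::string, Keyword, AsciiCaseLess>;

// What a boolean-returning apply-method could do with the matched text:
// succeed (and report which keyword it was) only for known keywords.
inline bool lookup_keyword(const KeywordMap& map, std::string_view matched, Keyword& out) {
    const auto it = map.find(matched);
    if (it == map.end()) return false;
    out = it->second;
    return true;
}
```

Returning false from the lookup is what would make the rule fail for unknown words in this scheme.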

nlyan · 2019-01-07T23:34:10Z

You could look into mixing in a tool like Ragel, which will perform this optimization, and then call the generated code from a custom PEGTL rule. Ragel parsers match regular grammars and do not backtrack. Matching will be something like O(length of your longest string).

Another option, as the guys above recommended, is to scan ahead to a delimiter and then test the string against a set. If you have 300 strings in your set, and you know what they are all going to be at compile time, you could take a look at perfect hashing. A perfect-hash lookup is O(1) in the number of strings, regardless of how large your set is (you still pay for hashing the candidate string once). Besides the classic gperf, there are easy-to-use implementations out there that can help you with this approach.
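To give a feel for the idea, here is a toy sketch that brute-forces a seed at startup so that every keyword lands in its own slot. This is not how gperf works internally (gperf generates code ahead of time); `seeded_hash` and `PerfectKeywordSet` are hypothetical names, and the hash is a seeded FNV-1a variant picked purely for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

inline char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Seeded 64-bit FNV-1a variant over the ASCII-lowercased bytes of s.
inline std::uint64_t seeded_hash(std::string_view s, std::uint64_t seed) {
    std::uint64_t h = (14695981039346656037ull ^ seed) * 1099511628211ull;
    for (char c : s) {
        h ^= static_cast<unsigned char>(ascii_lower(c));
        h *= 1099511628211ull;
    }
    return h ^ (h >> 32);  // fold the high bits into the low bits
}

class PerfectKeywordSet {
public:
    // Tries seeds until every keyword hashes to a distinct slot. Keywords are
    // assumed to be lowercase and unique; returns std::nullopt on failure.
    static std::optional<PerfectKeywordSet> build(
        const std::vector<std::string>& keywords, std::uint64_t max_seeds = 100000) {
        const std::size_t slots = 2 * keywords.size() + 1;
        for (std::uint64_t seed = 0; seed < max_seeds; ++seed) {
            std::vector<std::string> table(slots);
            std::vector<bool> used(slots, false);
            bool ok = true;
            for (const auto& kw : keywords) {
                const std::size_t i = seeded_hash(kw, seed) % slots;
                if (used[i]) { ok = false; break; }  // collision: next seed
                used[i] = true;
                table[i] = kw;
            }
            if (ok) return PerfectKeywordSet(seed, std::move(table), std::move(used));
        }
        return std::nullopt;
    }

    // One hash and at most one string comparison, however many keywords exist.
    bool contains(std::string_view word) const {
        const std::size_t i = seeded_hash(word, seed_) % table_.size();
        if (!used_[i] || table_[i].size() != word.size()) return false;
        for (std::size_t k = 0; k < word.size(); ++k) {
            if (ascii_lower(word[k]) != table_[i][k]) return false;
        }
        return true;
    }

private:
    PerfectKeywordSet(std::uint64_t seed, std::vector<std::string> table, std::vector<bool> used)
        : seed_(seed), table_(std::move(table)), used_(std::move(used)) {}

    std::uint64_t seed_;
    std::vector<std::string> table_;
    std::vector<bool> used_;
};
```

With the set known at compile time, a generator like gperf moves the seed search out of the program entirely; the lookup side stays the same shape.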

Neither of these approaches will impose any additional runtime dependencies.

nlyan · 2019-01-11T01:54:04Z

An article just hit the top of Hacker News today from PostgreSQL: they are now using perfect hashing in their SQL parser to detect (a set of 400) keywords:

https://www.postgresql.org/message-id/flat/E1ghOVt-0007os-2V%40gemulon.postgresql.org

ColinH (Maintainer) · 2019-01-17T16:46:16Z

It looks like the original question was answered, and additional pointers given (thanks @nlyan) on how to work around the issue, so I'll close this for now. Feel free to re-open and/or continue the discussion if necessary!
