-
Notifications
You must be signed in to change notification settings - Fork 449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for equivalence classes #404
Comments
Researching for grep implementation. It uses libc regexp for this. Basically it looks for equivalence table defined on locale
locale knows what characters are equivalent of others, here equivalent for a. I design a simple test case:
which should result in:
trying to debug
I can't do it. libc as it has to be compiled with optimizations and can't see a single var if there's a way to retrieve that findidx function call result into rust, this would be just detect "[=" "=]" for bracket expression. I will try to find where rust regexp engine parse bracket expression and propose a WIP on this |
I'm little lost into rust locale approach, so rust-locale seems a decent choose for me (sorry if I'm wrong) I think all the pieces are put in place, but I have no skills on bindings to call findidx from libc and understand what is libc doing with that info (here info for es_ES):
to retrieve equivalence chars for a char
So for 'o' in spanish it would start with:
|
@albfan Thanks for doing this research! I don't have time to dig in myself at the moment, but I'd like to just say a couple things:
|
libc regex relay on localedata. Many other regex don't have such concept. I think suggestion on rust-locale is correct rust-locale/rust-locale#26 (comment) Relaying on unicode there's a concept of normalization form for characters. Searching for a character all combining characters on search text should be equivalent. That's a good design because it will not relay on predefined metadata but I guess it could be a slow implementation for rg |
Yeah the regex engine isn't getting support for normalization, sorry. In particular, the regex engine is normalization agnostic. If you care about it, then you need to normalize the regex and the text searched consistently. That means that if ripgrep wants to care about normalization, then it has to normalize the regex and every file before it searches. |
Yes. I think I can fork ripgrep and provide an implementation that does that and see how much it slows the searches (without any idea for PR) About regex, I think again, forking for my special needs can be a decent solution, preprocess all characters for a(2) e(2) i(2) o(2) u(3) n(2) A(2) E(2) I(2) O(2) U(3) = 24 equivalences is now worth. I see all this as refining my toolset, but not as a general helpful contribution I suggest to close this issue |
Normalizing the file is (relatively) straightforward. However, normalizing the regex, in general, can be very complicated. My main concern is with canonical equivalence. In many cases, a single code point is equivalent to a series of two or more code points (generally as part of a single grapheme). Because of the semantics of regex, a simple normalization of the string representation is insufficient, because depending on the position of a character composing or decomposing it would significantly change the meaning of the regex. As a concrete example, consider the regex "[a\u00e0-\u00e5]" which is in NFC form (composed). If this needs to be normalized to NFD (decomposed), the ideal representation would be something like: A more naive implementation could, of course, iterate over every code point in the range, and generate an alternation of all the normalized results. But that would perform poorly, both for normalizing the regex, and the search. Normalizing to "NFC" is even more problematic. For example consider something like Maybe it doesn't make sense for normalization or equivalence to be part of the regex crate, and there should be a separate crate to handle this. However, normalizing the regex and the input is not a trivial workaround. |
@tmccombs Indeed! The "Canonical Equivalents"' section of UTS#18 elaborates on this (and even brings up some of the same issues that you do). Their suggested work-around is pretty straight-forward though:
In other words, instead of trying to devise a general algorithm for turning a regex into a particular normalization form, you have to ask the end user to do it themselves. That is, you ask the user to be aware of normalization forms and craft their regex appropriately. Yes, I agree it isn't an ideal user experience, but I think it's the best that can probably be done in this space until someone comes up with a better idea. (And I wouldn't hold my breath on that front.) There is also a middle ground where you might be able to fix 80% of the cases. That is, you put all literal strings that occur in a regex into NFD form. We don't always need to focus on 100% correctness. For example, this approach would seemingly fix sharkdp/fd#1535. |
See "character equivalents" here: https://www.regular-expressions.info/posixbrackets.html#eq
The major change this seems to require (aside for actually determining character equivalents) is introducing the concept of locale into the regex engine. I'm not sure that's a good idea, but this issue should at least be investigated at some point.
The text was updated successfully, but these errors were encountered: