Add tests to ensure Unicode string comparison is by codepoint #570
Also, that two graphemes that look the same aren't identical (for example µ (MICRO SIGN, U+00B5) is not the same as μ (GREEK SMALL LETTER MU, U+03BC)).
This is actually somewhat unfortunate wording, as two graphemes can be identical yet differ by codepoint, because of the ordering of combining characters (underlining, diacritics, accent marks). See http://www.unicode.org/reports/tr15/ on normalization forms. Ideally we would specify that string comparisons are done on normalized strings, but I fear the overhead may be prohibitive for some implementations (especially when "most" strings are ASCII anyway).
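For illustration (a sketch of my own, not from the thread): the combining-character ordering issue TR15 addresses means the same grapheme can be spelled with its combining marks in different orders, so the raw codepoint sequences differ even though NFC normalizes them to the same string:

```python
import unicodedata

# "q" with dot below then dot above, vs. dot above then dot below:
# visually identical graphemes, different codepoint sequences.
s1 = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE
s2 = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW

print(s1 == s2)  # False: codepoint-for-codepoint they differ
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True
```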
I wrote the "codepoint-for-codepoint" text. In my basic understanding, there are multiple ways to compare strings for equality, and preserving codepoints exactly is the only strategy that doesn't lose information; picking any other comparison algorithm could break compatibility with JSON. Since JSON is Unicode, you can renormalize the whole document before validating it, and JSON Schema doesn't need to (normatively) mention this. You can also write your schema to use a particular normal form and tell applications "this schema expects [e.g.] NFC."
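As a concrete illustration of how the choice of comparison algorithm matters (my sketch, not part of the original discussion): MICRO SIGN and GREEK SMALL LETTER MU are distinct codepoints, stay distinct under canonical normalization (NFC), but are conflated by compatibility normalization (NFKC), so a validator that normalized with NFKC would lose a distinction that codepoint-for-codepoint comparison preserves:

```python
import unicodedata

micro = "\u00b5"  # µ MICRO SIGN
mu = "\u03bc"     # μ GREEK SMALL LETTER MU

print(micro == mu)                                 # False: different codepoints
print(unicodedata.normalize("NFC", micro) == mu)   # False: NFC keeps them distinct
print(unicodedata.normalize("NFKC", micro) == mu)  # True: NFKC folds µ into μ
```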
I am looking for a first issue, and I'd be happy to have a crack at this one. Would it be acceptable to put together a few cases? Something like:
Thanks! That'd be really helpful! Yep, that looks good I think (though it'd be better to say we're testing codepoint equivalence than to mention mu in the description of the test case; that's not particularly relevant). If we want to be extra nice, it would probably be good to have a second example for combining marks, as in the original description, i.e.
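A sketch of the kind of case being discussed (assuming the suite's usual description/schema/tests layout; the wording here is illustrative, not text from the eventual PR), with a toy codepoint-strict `const` check standing in for a real validator:

```python
# Hypothetical test case in the suite's usual shape (description/schema/tests).
case = {
    "description": "const validation is codepoint-for-codepoint",
    "schema": {"const": "\u03bc"},  # μ GREEK SMALL LETTER MU
    "tests": [
        {"description": "identical codepoints are valid", "data": "\u03bc", "valid": True},
        {"description": "lookalike MICRO SIGN is invalid", "data": "\u00b5", "valid": False},
    ],
}

# Python's == on str compares codepoint sequences exactly, so it models
# the codepoint-for-codepoint comparison the spec describes.
for t in case["tests"]:
    assert (t["data"] == case["schema"]["const"]) == t["valid"], t["description"]
print("all cases behave as expected")
```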
OK cool - thanks @Julian - point taken. I had a second thought after I sent the comment, though: is "data" in the test schema meant to be a stand-in for JSON instance data? Basically, what I'm getting at is that "\u03bc" is a perfectly valid string literal, and maybe we need to differentiate between cases where the characters that comprise a codepoint representation are being sent over the wire (or whatever) as opposed to the character identified by the codepoint. I'm not sure whether the last two tests are correct or not; I'll have a think and a read and raise a PR in due course. Comments appreciated in the meantime, though.
Yes. It's the job of whatever JSON parser someone uses to turn "\u03bc" into "μ" as part of deserializing (as opposed to JSON Schema's job), so really there isn't likely to be any difference there, but it's fine/good I think to have both representations.
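A quick sketch of that point (mine, not from the thread): the escape form and the literal character are the same string once the parser has done its job, so both spellings in a test file exercise the same comparison:

```python
import json

escaped = json.loads('"\\u03bc"')  # JSON text containing the six characters \u03bc
literal = json.loads('"\u03bc"')   # JSON text containing the character μ itself

print(escaped == literal)  # True: the parser decodes the escape during deserialization
```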
Opened a PR. The tests are on draft6, since that is the earliest draft of the spec with the "codepoint-for-codepoint" language. I'm not proposing to add the tests to the suites for drafts 3 & 4, since those drafts just mention that strings should have the "same value" and don't specify how those values are derived. If these pass the smell test, I'd be happy to copy the tests into the test suites for the other draft specs; I just didn't want to crowd the PR with too much duplicate text straight off the bat.
I think this is the key. Since a parser is already doing the conversion, any tests we write are checking the parser, not the validator. |
§4.2.2 of the specification says that string equality is defined by:
We should add some tests to specifically ensure that a particular grapheme (e.g. ä) is not equal to the same grapheme represented by different codepoint(s).
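For instance (a sketch of the case described above, not text from the issue): precomposed ä is one codepoint, while a + COMBINING DIAERESIS is two; a codepoint-for-codepoint comparison must treat them as unequal even though they normalize to the same string:

```python
import unicodedata

precomposed = "\u00e4"  # ä LATIN SMALL LETTER A WITH DIAERESIS (one codepoint)
decomposed = "a\u0308"  # a + COMBINING DIAERESIS (two codepoints)

print(precomposed == decomposed)  # False: different codepoint sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True after normalization
```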