ECMA-402 should stop accepting language subtags with more than 3 letters #951
Comments
Can you clarify what you mean by "language subtag"? I assume you mean primary language subtag? From the context and from @hsivonen's bug, I gather that this does mean the primary subtag. I'm not sure about the "implementation cost" when the subtags are already variable width ASCII strings and those longer than three characters would have to be in a very short (er, currently empty) lookup table to be valid.
I don't see how one could "canonicalize" long subtags (excepting those which are already grandfathered/deprecated and thus have mappings to something in the There currently aren't any registered 5*8ALPHA subtags and, while registration of such subtags is allowed, BCP47 and the community both push extremely hard against it (you have to get a rejection slip from ISO639 before a registration will even be considered). However, "forever is a long time" and disallowing well-formed, valid tags from being an Intl.Locale might become a problem in the future (we should stand on the bulwark against any such registration). Of greater concern is this part of BCP47:
While the ISO-639 standard making use of these never ended up happening, there is nothing to prevent such efforts from being revived in the future.
@zbraniecki @Manishearth See Addison's comment above. Does this impact your assessment?
It does not change it. @aphillips - ICU4X uses a 3-byte array for the Language subtag, which gives the LanguageIdentifier and Locale types in ICU4X a compact memory layout. We greatly benefit from being able to fit the language-script-region triplet in 3+4+3 bytes and would prefer to avoid having to switch Language to carry 8 bytes alone. Both examples you provided were written close to 20 years ago, and neither the 4-character nor the 5-8 character language subtags have ever materialized. I think it is extremely unlikely that this will change in the next 1-5-8 years, so having to pessimize the memory structure of a fundamental noun in internationalization for a theoretical use case seems like a bad architectural decision to me. Furthermore, removal is much harder than addition. If we were to execute this operation and limit the Language subtag to 2-3 bytes, we could still reverse it in the future if we find a compelling use case. It would require stronger conviction to execute, but we're in a much better position these days to coordinate such changes across all major systems, so I'm not worried about that barrier. I assess the addition of 5-8 as 2WD.
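To make the 3+4+3 figure concrete, here is a minimal sketch of the size difference. The type names `CompactTriplet` and `WideTriplet` are hypothetical and are not ICU4X's actual types; the sketch only assumes that each subtag is stored as a fixed-size ASCII byte array.

```rust
use std::mem::size_of;

/// Hypothetical compact layout (not ICU4X's real type):
/// 3-byte language, 4-byte script, 3-byte region.
#[allow(dead_code)]
pub struct CompactTriplet {
    language: [u8; 3],
    script: [u8; 4],
    region: [u8; 3],
}

/// Hypothetical layout if the language subtag had to hold up to 8 letters.
#[allow(dead_code)]
pub struct WideTriplet {
    language: [u8; 8],
    script: [u8; 4],
    region: [u8; 3],
}

/// Returns (compact size, wide size) in bytes.
/// All fields are byte arrays with alignment 1, so no padding is added
/// and each size is simply the sum of the field sizes.
pub fn triplet_sizes() -> (usize, usize) {
    (size_of::<CompactTriplet>(), size_of::<WideTriplet>())
}
```

Under these assumptions `triplet_sizes()` yields 10 and 15 bytes, so widening the language field alone adds 5 bytes per identifier before any wrapper types or extensions are counted.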
Thanks @zbraniecki. I understand about ICU4X's optimizations. I agree that there are no valid primary language subtags of 4 to 8 characters (even in the grandfathered set, although one could quibble there), nor, we all fervently hope, are there likely to be any (for a lot longer than 8 years, we hope). I do have a concern about whether
Yeah, the implementation concern is mostly that 3-character primary language tags are a good optimization that has benefits all over the codebase. 4-character language tags are unlikely to ever materialize due to UTS 35's tweaked locale identifier syntax that allows for identifiers to start with a script. I suspect Overall if we restricted primary language subtags to 3 characters, as Zibi said we could always expand it later. ICU4X would not need a major version update to do so, even. It feels prudent to design the spec conservatively with the understanding that it can be expanded if and when there are useful tags with 5-8 characters in them. Personally I think the most straightforward way to do this and leave the door open for this in the future is to make |
ECMA-402 uses Unicode BCP47 Locale Identifier which does not allow for |
@zbraniecki I'm beating a dead horse here, but... At least on some browsers, the tag @Manishearth suggests:
To be clear, I agree with this approach, but note that Intl.Locale is currently (with quirky exceptions like those above) a "well-formed" implementation by BCP47. Switching to "well-formed" by ULI introduces the possibility of some back compat issues (scripts that rely on invalid tags that have 4*8ALPHA primary language subtags) |
Not in the current spec, though? Current spec references Unicode, not BCP47, and it goes so far as to say that it is specifically choosing Unicode without Unicode's CLDR-specific quirks, which to me feels like a very deliberate choice.
My conclusion after seeing Apple's Locale.IdentifierType having three variants is that we're past the point of being able to bend the existing status into a cohesive narrative. ECMA-402 will have some reasonable behavior that aims to be lenient on the input (but rejects "_" as a separator for example), but strict on other items. It will mostly work well with BCP47 Unicode, and Unicode, but maybe not just BCP47? ICU4X is the new "chance" for us to settle on a reasonable intersection of the three types of identifiers. We can follow Apple and introduce three, or even four types, delicately different from each other, or we can introduce a single Our angle is that ICU4X should be a natural fit for ECMA-402, and this issue is the only place where, if we were to try to bend ICU4X to the current ECMA-402, we'd have to give up a meaningful performance win. Hence we ask ECMA-402 to bend to us. Knowing that the result is (still?) not a perfect one or the other "defined" type of Locale, but a new one - ECMA-402 Locale. And we commit to making ICU4X match it 1-1.
There can be only one response to that: 😁 In practice, there is no practical difference between all of these locale identifier regimes. BCP47 has quirks. ULIs have quirks. ECMA-402 locales will have quirks. Those quirks mostly pertain to edge case tags that are rare (if not downright extinct: To cite a historical example: this is how we ended up with multiple "Shift-JIS" encodings (where 99.9% of the characters are encoded identically "but..."). It didn't matter which of the SJIS encodings was most technically correct. The existence of the others made for Bad Things. I don't think ICU4X's performance should be the determining factor. Most programming languages won't garner the same benefits. This is not to say that we should ignore the possibility of efficiency. But I think we should be most concerned with end-to-end interoperability. Language tags that are well-formed and valid should work in HTML, CSS, ES, and any other part of the Web platform (not to mention non-Web platforms!). So I wholly agree that:
If that's "ECMA-402 Locales", I'm all for it. But, ideally, we should rope in CLDR and maybe WHATWG at the same time?
I think basically all programming languages used for implementing browsers and JS engines will. Stack size optimizations are common in this space because most things eventually end up allocated in other structures. The only exception is JS for polyfills, where perf is not usually important anyway. I don't particularly consider this a case of "14 competing standards". This is a case of minor tweaks between standards, and in particular, this is making ECMA402 locales a strict subset of ULIs. They already are a strict subset of ULIs, it's just now a bit more strict. This type of thing is extremely common in web standards in my experience, and not overall a huge deal as long as it's documented. I don't think this is just a matter of performance; leaving the door open in the future for other usages seems good too. Overcommitting early to a format to support with no legitimate usage seems suboptimal. |
What are the other examples of where ECMA-402 locales differ from ULIs? Is it Would it be productive for us to talk to CLDR to change the definition in UTS 35? |
The underscores and "root" and starts-with-script-tag bits. ULIs document these as additions, and ECMA402 uses ULIs with an exception for these additions, but I don't think that text in UTS 35 is normative: UTS 35 contains a single definition of ULIs, with a description of how it differs from UTS 35, and ECMA 402 says "use ULIs but without one of the bullet points in that description". |
@Manishearth That description seems to be of what UTS35 calls a unicode_bcp47_locale_id. The differences are found here. The primary leftovers if you use that instead of
The challenge, of course, is that the grammar for these is consistent with BCP47's, with the note:
This is where we entered this conversation: the proposal is to disallow these subtags in the primary language position. Zibi and Manish have made the argument that we can always re-inflate these values later, which is a reasonable assertion. The only question would be whether anyone is using such tags now for locales with no backing data (we'd be breaking them, but maybe they needed breaking?) |
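As a rough illustration of what the proposal changes, here is a sketch of a primary-language well-formedness check. It assumes the UTS 35 `unicode_language_subtag` shape (2-3 or 5-8 ASCII letters) as the current rule and 2-3 letters only as the proposed rule. The names `LanguageRule` and `is_well_formed_language` are hypothetical and do not come from ECMA-402 or ICU4X.

```rust
/// Which primary-language lengths a parser accepts (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LanguageRule {
    /// UTS 35 `unicode_language_subtag`: alpha{2,3} | alpha{5,8}.
    Uts35,
    /// The restriction discussed in this issue: alpha{2,3} only.
    ShortOnly,
}

/// Checks only the shape of a primary language subtag, not its validity
/// against any registry.
pub fn is_well_formed_language(subtag: &str, rule: LanguageRule) -> bool {
    // Every byte must be an ASCII letter; this also rejects non-ASCII input.
    if !subtag.bytes().all(|b| b.is_ascii_alphabetic()) {
        return false;
    }
    match (subtag.len(), rule) {
        // 2-3 letters are accepted under both rules.
        (2..=3, _) => true,
        // 5-8 letters are accepted only under the current UTS 35 shape.
        (5..=8, LanguageRule::Uts35) => true,
        // Everything else, including 4 letters (reserved for scripts), is rejected.
        _ => false,
    }
}
```

Keeping the rule as an explicit parameter mirrors the "we can re-inflate later" argument: widening the accepted range again would be a one-line change in the match, while callers that only ever see 2-3 letter subtags are unaffected.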
Yeah, the "BCP 47 conformance" section is what I was talking about, which does not appear normative. But it's still different from default ULIs, is what I mean.
Yeah, I think that's roughly where I stand. It's fine if these cases get broken now, and making the choice now gives us freedom to properly design in the future if we end up needing things here. |
I guess an interesting question is: is the purpose of Because when it comes to language identifiers, a non-ISO LI is all but guaranteed to cause Locale-using |
See unicode-org/icu4x#3989
Language subtags with more than 3 letters are uncommon in the wild, whereas supporting them comes with an implementation cost. We should look into removing the requirement that we support them.
Two ways to do this:
@hsivonen filed https://bugzilla.mozilla.org/show_bug.cgi?id=1938524 to gather telemetry in Firefox on how widely used this feature might be. It would be useful to have similar data from @FrankYFTang.