
ECMA-402 should stop accepting language subtags with more than 3 letters #951

Open
sffc opened this issue Dec 21, 2024 · 16 comments
Labels: c: locale (Component: locale identifiers) · s: in progress (Status: the issue has an active proposal)
Milestone: ES 2025
Assignee: sffc

Comments

@sffc
Contributor

sffc commented Dec 21, 2024

See unicode-org/icu4x#3989

Language subtags with more than 3 letters are uncommon in the wild, whereas supporting them comes with an implementation cost. We should look into removing the requirement that we support them.

Two ways to do this:

  1. Reject locale strings with long language subtags.
  2. Canonicalize long language subtags to something shorter.
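Option 1 could look something like the following sketch. This is a hypothetical, simplified check, not the actual spec algorithm or the full UTS 35 grammar:

```javascript
// Hypothetical sketch of option 1: reject locale strings whose primary
// language subtag is longer than 3 letters.
function checkPrimaryLanguageSubtag(tag) {
  const primary = String(tag).split("-")[0];
  // BCP 47 allows 2*8ALPHA in this position; the proposal narrows it to 2-3.
  if (!/^[a-zA-Z]{2,3}$/.test(primary)) {
    throw new RangeError(`unsupported primary language subtag: ${primary}`);
  }
  return tag;
}

checkPrimaryLanguageSubtag("yue-Hant-HK"); // accepted: 3-letter language
```

A real implementation would do this inside its locale parser rather than as a string preprocessing step, but the observable behavior would be the same: a RangeError for long subtags.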

@hsivonen filed https://bugzilla.mozilla.org/show_bug.cgi?id=1938524 to gather telemetry in Firefox on how widely used this feature might be. It would be useful to have similar data from @FrankYFTang.

@sffc sffc added c: locale Component: locale identifiers s: in progress Status: the issue has an active proposal labels Dec 21, 2024
@sffc sffc added this to the ES 2025 milestone Dec 21, 2024
@sffc sffc self-assigned this Dec 21, 2024
@aphillips

Can you clarify what you mean by "language subtag"? I assume you mean the primary language subtag? From the context and from @hsivonen's bug, I gather that it does mean the primary subtag.

I'm not sure about the "implementation cost" when the subtags are already variable-width ASCII strings, and those longer than three characters would have to be in a very short (er, currently empty) lookup table to be valid.

Canonicalize long language subtags to something shorter.

I don't see how one could "canonicalize" long subtags (excepting those which are already grandfathered/deprecated and thus already have mappings to something in the language subtag type). This would also disguise their invalidity. There are invalid language tags in the wild (strings like "english") which are just garbage, but Intl.Locale doesn't need a length restriction to reject those.
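To illustrate the point about mappings: the only "canonicalizations" of long subtags that exist today come from the Preferred-Value fields on legacy (grandfathered) tags in the IANA Language Subtag Registry. A sketch, using an illustrative subset of real registry entries (not the full registry):

```javascript
// Illustrative subset of IANA Language Subtag Registry Preferred-Value
// mappings for grandfathered tags. Note that i-enochian has no mapping.
const PREFERRED_VALUE = new Map([
  ["i-klingon", "tlh"],
  ["i-navajo", "nv"],
  ["art-lojban", "jbo"],
]);

// Returns the mapped tag, or null when no mapping exists.
function canonicalizeLegacyTag(tag) {
  return PREFERRED_VALUE.get(tag.toLowerCase()) ?? null;
}
```

For anything outside this closed legacy set, there is simply no shorter form to canonicalize to, which is the crux of the objection above.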

There currently aren't any registered 5*8ALPHA subtags and, while such registrations are allowed, BCP47 and the community both push extremely hard against them (you have to get a rejection slip from ISO 639 before a registration will even be considered). However, "forever is a long time", and disallowing well-formed, valid tags from being an Intl.Locale might become a problem in the future (we should stand on the bulwark against any such registration).

Of greater concern is this part of BCP47:

  1. Four-character language subtags are reserved for possible future standardization.

While the ISO-639 standard making use of these never ended up happening, there is nothing to prevent such efforts from being revived in the future.

@sffc
Contributor Author

sffc commented Dec 21, 2024

@zbraniecki @Manishearth See Addison's comment above. Does this impact your assessment?

@zbraniecki
Member

zbraniecki commented Dec 22, 2024

It does not change it.

@aphillips - ICU4X uses a 3-byte array for the Language subtag, which gives a nice memory property for the size of LanguageIdentifier and Locale in ICU4X. We greatly benefit from being able to fit the language-script-region triplet in 3+4+3 bytes and would prefer to avoid having the Language field carry 8 bytes on its own.

Both examples you provided were written close to 20 years ago, and neither the 4-character nor the 5-8 character language subtags have ever materialized. I think it is extremely unlikely that this will change in the next 5-8 years, so pessimizing the memory structure of a fundamental noun in internationalization for a theoretical use case seems like a bad architectural decision to me.

Furthermore, removal is much harder than addition. If we were to execute this operation and limit the Language subtag to 2-3 bytes, we could still reverse it in the future if we found a compelling use case. It would require stronger conviction to execute, but we're in a much better position these days to coordinate such changes across all major systems, so I'm not worried about that barrier. I assess the addition of 5-8 as 2WD.

@aphillips

Thanks @zbraniecki. I understand about ICU4X's optimizations. I agree that there are no valid primary language subtags of 4 to 8 characters (even in the grandfathered set, although one could quibble there), nor, we all fervently hope, are there likely to be any (for a lot longer than 8 years, we hope).

I do have a concern about whether Intl.Locale is strict or not. Other locale systems are tolerant of well-formed but not valid tags and these do exist in the wild (to everyone's dismay). Such locales do not produce useful behavior, of course, since they are not backed with locale data, but there is the matter of garbage-in/garbage-out roundtripping of such values. Such considerations would apply for eng-US (using the 639-2 code where the 639-1 code is required) or for english or for x-private or i-enochian (grandfathered). Each of these is well-formed. Of these, only x-private and i-enochian are valid (in BCP47's terms). Should I expect these tags to pass through Intl.Locale? What should ECMA-402 normatively require? Note that a number of these "work" currently in at least a few browsers.
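The well-formed/valid distinction in this paragraph can be sketched as follows. This is a heavily simplified, hypothetical check (the full BCP 47 grammar has extlangs, extensions, and singleton rules that are omitted here), and VALID_PRIMARY is an illustrative stand-in for the real IANA registry:

```javascript
// Simplified shape of a BCP 47 langtag: ordinary tags, plus x-/i- prefixed
// private-use and irregular forms. Not the full grammar.
const WELL_FORMED = /^(?:[a-z]{2,8}|[xi](?:-[a-z0-9]{1,8})+)(?:-[a-z0-9]{1,8})*$/i;

// Illustrative stand-in for the registry. Note it registers "en", not the
// ISO 639-2 code "eng", so eng-US is well-formed but not valid.
const VALID_PRIMARY = new Set(["en", "fr", "de"]);

function isWellFormed(tag) {
  return WELL_FORMED.test(tag);
}

function isValid(tag) {
  const primary = tag.split("-")[0].toLowerCase();
  // x-/i- tags are private-use/grandfathered; treated as valid in this sketch.
  return isWellFormed(tag) &&
    (primary === "x" || primary === "i" || VALID_PRIMARY.has(primary));
}
```

With this sketch, "eng-US" and "english" pass the well-formedness check but fail validity, while "x-private" and "i-enochian" pass both, matching the categorization in the comment above.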

@Manishearth

Yeah, the implementation concern is mostly that 3-character primary language subtags enable a good optimization that has benefits all over the codebase.

4-character language subtags are unlikely to ever materialize, due to UTS 35's tweaked locale identifier syntax that allows identifiers to start with a script. I suspect hant-HK and similar codes are probably more common (and more useful) in the wild than long language subtags.

Overall if we restricted primary language subtags to 3 characters, as Zibi said we could always expand it later. ICU4X would not need a major version update to do so, even. It feels prudent to design the spec conservatively with the understanding that it can be expanded if and when there are useful tags with 5-8 characters in them.

Personally I think the most straightforward way to do this and leave the door open for this in the future is to make Intl.Locale strict. I would expect longer tags to throw an error.

@zbraniecki
Member

ECMA-402 uses Unicode BCP47 Locale Identifier which does not allow for i-enochian or x-private. They require und-i-enochian and und-x-private.
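This difference can be observed directly. The sketch below reflects the current spec's structural-validity rules; exact engine behavior may vary:

```javascript
// Under the Unicode locale identifier grammar that ECMA-402 uses, a
// private-use-only tag needs the "und" anchor.
const anchored = new Intl.Locale("und-x-private");
console.log(anchored.toString());

// The bare BCP 47 private-use form is not structurally valid in ECMA-402,
// so the constructor is expected to throw a RangeError.
let rejectedBareForm = false;
try {
  new Intl.Locale("x-private");
} catch (e) {
  rejectedBareForm = e instanceof RangeError;
}
console.log(rejectedBareForm);
```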

@aphillips

aphillips commented Dec 24, 2024

@zbraniecki I'm beating a dead horse here, but...

At least on some browsers, the tag english-ohio-us works. This tag is quite obviously not valid, but it is well-formed. The tags i-enochian and x-private and several others that are valid by BCP47 produce errors and require the user to perform strange mutations. und-i-enochian makes no sense--not that i-enochian makes a lot of sense (other grandfathered tags are perhaps more useful, but the useful ones all have Preferred-Value mappings). So it violates my sense of good standardization.

@Manishearth suggests:

Personally I think the most straightforward way to do this and leave the door open for this in the future is to make Intl.Locale strict. I would expect longer tags to throw an error.

To be clear, I agree with this approach, but note that Intl.Locale is currently (with quirky exceptions like those above) a "well-formed" implementation by BCP47's definition. Switching to "well-formed" by the ULI definition introduces the possibility of some back-compat issues (scripts that rely on invalid tags that have 4*8ALPHA primary language subtags).

@Manishearth

but note that Intl.Locale is currently (with quirky exceptions like those above) a "well-formed" implementation by BCP47

Not in the current spec, though? Current spec references Unicode, not BCP47, and it goes so far as to say that it is specifically choosing Unicode without Unicode's CLDR-specific quirks, which to me feels like a very deliberate choice.

@zbraniecki
Member

So it violates my sense of good standardization.

My conclusion, after seeing Apple's Locale.IdentifierType having three variants, is that we're past the point of being able to bend the existing status quo into a cohesive narrative.

ECMA-402 will have some reasonable behavior that aims to be lenient on input (but rejects "_" as a separator, for example) and strict on other items. It will mostly work well with Unicode BCP47 Locale Identifiers and Unicode locale identifiers, but maybe not with plain BCP47?

ICU4X is the new "chance" for us to settle on a reasonable intersection of the three types of identifiers. We can follow Apple and introduce three, or even four, types, subtly different from each other, or we can introduce a single Locale and nudge the industry to, for all practical purposes, settle on it.

Our angle is that ICU4X should be a natural fit for ECMA-402, and this issue is the only place where, if we were to bend ICU4X to the current ECMA-402, we'd have to give up a meaningful performance win. Hence we ask ECMA-402 to bend to us, knowing that the result is (still?) not a perfect match for either "defined" type of Locale, but a new one: the ECMA-402 Locale. And we commit to making ICU4X match it 1-1.

@aphillips

There can be only one response to that:

https://xkcd.com/927/

😁

In practice, there is little practical difference between any of these locale identifier regimes. BCP47 has quirks. ULIs have quirks. ECMA-402 locales will have quirks. Those quirks mostly pertain to edge-case tags that are rare (if not downright extinct: i-enochian is a tag for a made-up language for talking to angels, and here I am using it as an example 😜). The quirks are the result of different groups making spot judgements about how to deal with specific problems, but which are disconnected in time from one another.

To cite a historical example: this is how we ended up with multiple "Shift-JIS" encodings (where 99.9% of the characters are encoded identically "but..."). It didn't matter which of the SJIS encodings was most technically correct. The existence of the others made for Bad Things.

I don't think ICU4X's performance should be the determining factor. Most programming languages won't garner the same benefits. This is not to say that we should ignore the possibility of efficiency. But I think we should be most concerned with end-to-end interoperability. Language tags that are well-formed and valid should work in HTML, CSS, ES, and any other part of the Web platform (not to mention non-Web platforms!). So I wholly agree that:

we can introduce a single Locale and nudge the industry to, for all practical reasons, settle on it.

If that's "ECMA-402 Locales", I'm all for it. But, ideally, we should rope in CLDR and maybe WHATWG at the same time?

@Manishearth

Most programming languages won't garner the same benefits

I think basically all programming languages used for implementing browsers and JS engines will. Stack size optimizations are common in this space because most things eventually end up allocated in other structures.

The only exception is JS for polyfills, where perf is not usually important anyway.


I don't particularly consider this a case of "14 competing standards". This is a case of minor tweaks between standards; in particular, this is making ECMA-402 locales a strict subset of ULIs. They are already a strict subset of ULIs; it's just now a bit more strict. This type of thing is extremely common in web standards in my experience, and not a huge deal overall as long as it's documented.

I don't think this is just a matter of performance; leaving the door open in the future for other usages seems good too. Overcommitting early to supporting a format with no legitimate usage seems suboptimal.

@sffc
Contributor Author

sffc commented Dec 26, 2024

[ECMA-402 locales] already are a strict subset of ULIs

What are the other examples of where ECMA-402 locales differ from ULIs? Is it i-enochian?

Would it be productive for us to talk to CLDR to change the definition in UTS 35?

@Manishearth

What are the other examples of where ECMA-402 locales differ from ULIs? Is it i-enochian?

The underscores and "root" and starts-with-script bits. ULIs document these as additions, and ECMA-402 uses ULIs with an exception for these additions, but I don't think that text in UTS 35 is normative: UTS 35 contains a single definition of ULIs, with a description of how it differs from BCP 47, and ECMA-402 says "use ULIs but without one of the bullet points in that description".

@aphillips

@Manishearth That description seems to be of what UTS35 calls a unicode_bcp47_locale_id. The differences are found here. The primary leftovers if you use that instead of unicode_locale_id are:

  • The grandfathered tags are not permitted, but all but two of these have mappings. i-enochian is one (and probably irrelevant: we shouldn't hold up the whole internet to chat with angels). i-default is the other non-deprecated tag. It has a special meaning in IETF stuff, similar to root/und but somehow distinct. Perhaps that tag should become deprecated?
  • Private use tags starting with x- are not permitted.
  • Extlangs are mapped away (BCP47 already permits this)

The challenge, of course, is that the grammar for these is consistent with BCP47's, with the note:

While theoretically the unicode_language_subtag may have more than 3 letters through the IANA registration process, in practice that has not occurred.

This is where we entered this conversation: the proposal is to disallow these subtags in the primary language position. Zibi and Manish have made the argument that we can always re-inflate these values later, which is a reasonable assertion. The only question would be whether anyone is using such tags now for locales with no backing data (we'd be breaking them, but maybe they needed breaking?)
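On the breakage question, existing code can already probe the current behavior. A hypothetical detection sketch ("enochian" here is just an arbitrary well-formed 8-letter subtag with no registered meaning):

```javascript
// Probe whether the engine accepts long primary language subtags.
// Returns true on engines implementing the current (pre-change) grammar,
// false on engines that adopt the proposed 2-3 letter restriction.
function acceptsLongPrimaryLanguageSubtags() {
  try {
    new Intl.Locale("enochian");
    return true;
  } catch (e) {
    return false;
  }
}

console.log(acceptsLongPrimaryLanguageSubtags());
```

Telemetry like the Firefox probe mentioned earlier would show how many real pages exercise the `true` branch today.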

@Manishearth

That description seems to be of what UTS35 calls a unicode_bcp47_locale_id. The differences are found here.

Yeah, the "BCP 47 conformance" section is what I was talking about, and it does not appear to be normative. But unicode_bcp47_locale_id does seem to be normative!

It's still different from default ULIs, is what I mean.

we'd be breaking them, but maybe they needed breaking

Yeah, I think that's roughly where I stand. It's fine if these cases get broken now, and making the choice now gives us freedom to properly design in the future if we end up needing things here.

@Manishearth

I guess an interesting question is: is the purpose of Intl.Locale to be fed to other Intl APIs, or do we consider it also to be a general-purpose locale abstraction that we wish everyone to use whenever they write internationalization code in JS, code that may load data from non-browser sources? If the former, then non-ISO LIs are basically useless; if the latter, then they do have some slight use.

Because when it comes to language identifiers, a non-ISO LI is all but guaranteed to cause Locale-using Intl APIs to throw. However, one can imagine a scenario where someone is using Intl.Locale with custom i18n APIs and custom i18n data and wishes to use non-ISO LIs. That seems to me to be a particularly niche subset of a niche use case, but others may disagree.
