Example of Bengali grapheme clusters out of date #150

Open
andjc opened this issue Jan 29, 2025 · 0 comments

andjc commented Jan 29, 2025

The current editor's draft has the following text:

For example, the Bangla user-perceived character kshī ক্ষী is composed of four characters: U+0995 BENGALI LETTER KA + U+09CD BENGALI SIGN VIRAMA + U+09B7 BENGALI LETTER SSA + U+09C0 BENGALI VOWEL SIGN II.
Unicode splits these into two grapheme clusters, unless language-specific tailoring is applied. For more information, see our article Character encodings: Essential concepts.

This describes the behaviour prior to Unicode 15.1. UAX29 was updated in the Unicode 15.1 release, which added rule GB9c:

Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker

For the example 'ক্ষী', UAX29 revision 41 (Unicode 15.0) and earlier produce two extended grapheme clusters ('ক্', 'ষী'), while UAX29 revision 43 (Unicode 15.1) onwards produces a single extended grapheme cluster ('ক্ষী'). The behaviour therefore depends on the version of UAX29, i.e. the version of Unicode that an implementation supports.
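
A minimal sketch of the difference, in case it helps when revising the text. This is not a UAX29 implementation: the property sets are hard-coded for the four code points in the example rather than read from the Unicode Character Database, only rules GB9, GB9a, GB9c and GB999 are modelled, and the names `clusters`, `use_gb9c` and `_follows_linked_consonant` are hypothetical.

```python
# Illustration only: property values are hard-coded for the four code points
# of the example (assumption), not loaded from the Unicode Character Database.
GCB_EXTEND = {"\u09CD"}                  # BENGALI SIGN VIRAMA
GCB_SPACING_MARK = {"\u09C0"}            # BENGALI VOWEL SIGN II
INCB_CONSONANT = {"\u0995", "\u09B7"}    # BENGALI LETTER KA, SSA
INCB_LINKER = {"\u09CD"}                 # BENGALI SIGN VIRAMA
INCB_EXTEND = set()                      # none needed for this example

KSHI = "\u0995\u09CD\u09B7\u09C0"        # ক্ষী


def _follows_linked_consonant(text, i):
    """True if text[i] is preceded by Consonant [Extend Linker]* with at
    least one Linker in that run (the left-hand side of GB9c)."""
    seen_linker = False
    j = i - 1
    while j >= 0 and (text[j] in INCB_LINKER or text[j] in INCB_EXTEND):
        seen_linker = seen_linker or text[j] in INCB_LINKER
        j -= 1
    return seen_linker and j >= 0 and text[j] in INCB_CONSONANT


def clusters(text, use_gb9c):
    """Split text into clusters using a simplified subset of the rules."""
    out = []
    for i, ch in enumerate(text):
        if i == 0:
            out.append(ch)                       # start of text
        elif ch in GCB_EXTEND or ch in GCB_SPACING_MARK:
            out[-1] += ch                        # GB9 / GB9a: no break before
        elif (use_gb9c and ch in INCB_CONSONANT
              and _follows_linked_consonant(text, i)):
            out[-1] += ch                        # GB9c: no break in conjunct
        else:
            out.append(ch)                       # GB999: break everywhere else
    return out


if __name__ == "__main__":
    print(clusters(KSHI, use_gb9c=False))  # models behaviour before Unicode 15.1
    print(clusters(KSHI, use_gb9c=True))   # models Unicode 15.1 onwards
```

With `use_gb9c=False` the sketch splits before U+09B7, giving two clusters; with `use_gb9c=True` the virama links the two consonants and the whole sequence stays together.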
