[WR/ARIB] Character Sets #544

Open
himorin opened this issue May 8, 2020 · 7 comments
Labels: i18n-tracker, imsc1.3, imscvNEXT, pr open, Wide Review Comment

@himorin
Contributor

himorin commented May 8, 2020

Per: w3c/ttwg#116
Comment 2

| Primary language subtag | Characters |
| --- | --- |
| Ja | Collection 285*: Basic Japanese |
| | Collection 286*: Japanese Non Ideographic Extension |
| | Collection 371*: JIS2004 Ideographics Extension |
| | (Fullwidth ASCII variants) U+FF01 – U+FF5E |
| | (Fullwidth Symbol variants) U+FFE3, U+FFE5 |
| | (Halfwidth Katakana variants) U+FF65 – U+FF9F |
| | (Halfwidth CJK punctuation) U+FF61 – U+FF64 |
| | (Additional ideographs and symbols defined in Table 5-2 in Vol.1, Part 2 of ARIB STD-B62) |

*: These collections are defined in Annex A to ISO/IEC 10646:2017
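
As an illustration only, the explicit code point ranges in the table above could be checked with a small Python sketch like the one below. The names `EXPLICIT_JA_RANGES`, `in_explicit_ja_ranges` and `outside_explicit_ranges` are hypothetical and not part of IMSC or ARIB STD-B62. The sketch assumes that only the ranges spelled out as `U+...` values are covered; Collections 285, 286 and 371 and the ARIB Table 5-2 additions are not enumerated in this thread, so they are deliberately left out.

```python
# Explicit ranges copied from the "Ja" row above (bounds are inclusive).
# Assumption: the ISO/IEC 10646 collections and the ARIB STD-B62 Table 5-2
# additions are NOT included, because their contents are not listed here.
EXPLICIT_JA_RANGES = [
    (0xFF01, 0xFF5E),  # Fullwidth ASCII variants
    (0xFFE3, 0xFFE3),  # Fullwidth Symbol variant: FULLWIDTH MACRON
    (0xFFE5, 0xFFE5),  # Fullwidth Symbol variant: FULLWIDTH YEN SIGN
    (0xFF65, 0xFF9F),  # Halfwidth Katakana variants
    (0xFF61, 0xFF64),  # Halfwidth CJK punctuation
]


def in_explicit_ja_ranges(ch: str) -> bool:
    """Return True if the single character ch lies in one of the explicit ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXPLICIT_JA_RANGES)


def outside_explicit_ranges(text: str) -> list:
    """Return the characters of text that fall outside the explicit ranges."""
    return [ch for ch in text if not in_explicit_ja_ranges(ch)]
```

A full check for the proposed "ja" row would also need the membership lists for the three collections and the ARIB table, which would have to come from ISO/IEC 10646 Annex A and ARIB STD-B62 themselves.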
@himorin added the Wide Review Comment, imsc1.2 and i18n-tracker labels May 8, 2020
@nigelmegitt
Contributor

My understanding of this comment is that ARIB is requesting that IMSC appendix B. Common Character Sets, Table 2, be extended by adding a row for language code "ja" as per the table in the first comment in the issue, and that the two references be added as source data.

Since we currently do not have a table for the "ja" language code, and ARIB-TT appears to be a good source of authority for captioning characters in Japan, this seems reasonable to me.

@nigelmegitt
Contributor

I wonder if @himorin, @xfq, @r12a or @aphillips might have views about this issue.

@aphillips

@nigelmegitt The point of IMSC appendix B, if I understand correctly, is to help implementations to identify minimum base character sets for font/rendering support in specific languages/countries, particularly for limited capacity devices/rendering platforms. In most cases, I am not sure that there is an underlying standard or character set behind the language-specific lists in the appendix. Instead the appendix provides information on minimal implementations. The situation with Japanese may be different, in that ARIB is a standard that defines such a character set.

There has been cooperation in the past between ARIB and Unicode/ISO10646, and occasionally new characters have been encoded to support evolving ARIB character sets. It seems reasonable to document or provide a link to what this character set is for the purposes of IMSC developers. Unlike the existing elements in Table 2, however, this list is extensive and would be more difficult to incorporate using the same methodology. Offhand, I don't recall whether Unicode maintains any documentation, either in the UCD or in CLDR, of which characters are in ARIB. Referencing ARIB directly might be a better solution, as I'm not sure it makes sense to try to have IMSC track ARIB instead of just pulling it in by reference. The other languages that appear in Appendix B do not have a ready reference.

@r12a may have better recollection of the status of ARIB's character set vs. Unicode.

@nigelmegitt
Contributor

nigelmegitt commented May 20, 2020

@aphillips we actually asked Unicode for a specific subset of characters per locale within CLDR for subtitle and caption purposes, a long time ago. That request is currently tracked at https://unicode-org.atlassian.net/browse/CLDR-8915 (it used to be on a different Unicode tracker, which no longer seems to be operational).

[Update to this comment:]
I just realised that you helped us with this, adding a comment to that CLDR tracking issue. I don't understand the status of it now though. There's a further comment that it has been moved to "UNSCH" which I cannot yet decode.

@aphillips

@nigelmegitt "UNSCH" is "unscheduled". It's in limbo and I'll action myself to follow up with them.

That said, you're kind of asking Unicode to define a "standard" (but which will more likely be a "recommendation" or "best practices"), whereas ARIB already is a standard. I think table 2 is more like guidance for implementers.

For this issue, I'd probably add text just above Table 2 along the lines of:

Table 2 specifies supplementary character sets that have proven useful in captioning and subtitling applications for a number of selected languages. Table 2 is non-exhaustive, and will be extended as needs arise. For Japanese, the standard ARIB STD-B62 defines a character set that is recommended as a reference.

@nigelmegitt
Contributor

Ah, thanks for that @aphillips . ARIB only defines the characters for Japanese if I understand correctly, whereas the request to CLDR was to define them for every language. We also did offer to contribute to the work for doing so, I believe.

Your proposal for Table 2 works okay; a tweak might be to add a "ja" row and put the text in the second column of that table.

@css-meeting-bot
Member

The Timed Text Working Group just discussed [WR/ARIB] Character Sets w3c/imsc#544, and agreed to the following:

  • SUMMARY: TTWG would like to adopt this change in a future version of IMSC.
The full IRC log of that discussion
<nigel> Topic: [WR/ARIB] Character Sets #544
<nigel> github: https://github.com//issues/544
<nigel> Nigel: I think the first question to ask is if this is normative/substantive.
<nigel> Pierre: We should try to avoid making substantial changes this far into the process, but
<nigel> .. we could formally because it is only informative.
<nigel> Pierre: That section, regardless of the normative language around it, is meant to inform
<cyril> " this section defines common character sets that authors are encouraged to use."
<nigel> .. implementations. You could conclude that it affects implementations.
<nigel> Cyril: "encouraged to use"
<nigel> Pierre: And the W3C Process definition.
<nigel> -> https://www.w3.org/2019/Process-20190301/#correction-classes Process 6.2.5 Classes of Changes
<nigel> Pierre: Section 8.2 says a document "SHOULD be authored using characters from" the common character sets.
<nigel> Cyril: There's a relationship between the reference fonts and the common character sets, right?
<nigel> Pierre: Also
<nigel> .. I think we should deal with these ARIB comments in the next version of IMSC otherwise
<nigel> .. we may make mistakes.
<nigel> Nigel: Back to Cyril's point, there does seem to be a substantive relationship between
<nigel> .. reference fonts and the common character sets and the §9.3 text on rendering rules.
<nigel> .. So it looks as though changing the common characters changes the code points in the
<nigel> .. reference fonts and therefore the rendering rules.
<nigel> .. (sorry that 9.3 is assuming the introduction is added, otherwise it's 8.3)
<nigel> Nigel: My conclusion is we cannot make this change now but should add it to vNext
<nigel> .. with appropriate care about reference fonts, and checking that the code points are
<nigel> .. indeed all available.
<nigel> .. Any other points to add before I summarise?
<nigel> Pierre: I think that the idea of converging ARIB-TTML and IMSC is really a great goal,
<nigel> .. and we should take the time to do it in collaboration with ARIB. I see that as a pretty
<nigel> .. extensive but worthwhile effort.
<nigel> SUMMARY: TTWG would like to adopt this change in a future version of IMSC.
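
The check mentioned in the log, that the code points of a common character set are all available in a reference font, might be sketched as a simple set difference. This is a minimal Python sketch under stated assumptions: `missing_code_points` and `format_code_point` are hypothetical helper names, and the `supported` input stands in for whatever coverage data a font actually exposes (for example its character map), which is not specified in this thread.

```python
def missing_code_points(required, supported):
    """Return the sorted code points in `required` that are absent from `supported`.

    Both arguments are iterables of integer code points. How `supported`
    is obtained from a real font is outside the scope of this sketch.
    """
    return sorted(set(required) - set(supported))


def format_code_point(cp: int) -> str:
    """Format an integer code point in the usual U+XXXX notation."""
    return f"U+{cp:04X}"


# Usage with made-up coverage data: the halfwidth CJK punctuation range
# U+FF61 to U+FF64 checked against a hypothetical font lacking one code point.
required = range(0xFF61, 0xFF65)
hypothetical_font_coverage = {0xFF61, 0xFF62, 0xFF64}
gaps = [format_code_point(cp) for cp in missing_code_points(required, hypothetical_font_coverage)]
```

An empty result would indicate that every required code point is covered by the font data supplied.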
