Definition of octal codes in literal strings related to UTF-16BE encoding #494
Comments
I cannot find the phrase "character code" anywhere in the 7.3.4 subclauses, and using it there would be incorrect. But I agree that the language is confusing, since the word "character" is used both for the bytes comprising the string in the input PDF and for what they mean once lexed/de-escaped: 7.3.4.1: "A string object shall consist of a series of zero or more bytes." The correct terminology should be that "characters" are what comprise the string in "raw PDF" (pre-lexing), but "bytes" are what they represent post-lexing. So an octal code |
Proposed solution:
and
|
I think that at least one change is also needed in 7.3.4.1, where the term "literal characters" is also used (meaning "bytes"):
• As a sequence of literal characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"
Also, maybe the final sentence of 7.3.4.1 could be expanded somewhat, to say that "7.9.1+2 explains the use of such "byte strings" to represent characters in string objects, using various character encodings including multi-byte schemes". Currently it is:
Subclause 7.9.2, "String object types" describes the encoding schemes used for the contents of string objects. |
I agree 7.3.4.1, 1st bullet should drop the word "literal" - it should just state "characters" so it is consistent with the terminology throughout 7.3.4.2:
I think other errata we have already applied sufficiently cover encodings and the fact that the lexical form of a string object is orthogonal to any character encoding in string data - see from this point down https://pdf-issues.pdfa.org/32000-2-2020/clause07.html#H7.9.1 |
The "character code" definition is in the last row of Table 3 (in ch. 7.3.4.2). |
That definition could be made more precise, and understandable, as follows: \ddd 8-bit Character code ddd (3 octal digits) (Assuming I interpreted it correctly 😄 !) |
I edited this last comment to make the following correction: |
I would suggest: \ddd Byte code ddd (3 octal digits) I feel that the word "character" is somewhat problematic here. When octal coding is used for a UTF-16BE encoded string, then an octal code \ddd does not map to any character, but a byte of a multiple-byte character, since UTF-16BE has only 2- and 4-byte characters. |
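To illustrate the point about UTF-16BE, here is a minimal Python sketch (not taken from the specification or from any PDF library). The helper name `unescape_octal` and the sample string are hypothetical, and the function handles only `\ddd` escapes; a real lexer must also process `\n`, `\\`, `\(` and the other escapes in Table 3.

```python
import re

# Hypothetical helper for illustration only: it resolves \ddd octal escapes
# and nothing else (no \n, \\, \( ... handling, so an escaped backslash
# followed by digits would be misread).
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")


def unescape_octal(raw: bytes) -> bytes:
    """Replace each \\ddd escape (1 to 3 octal digits) with the one byte it denotes."""
    return _OCTAL_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),  # overflow above 8 bits is dropped
        raw,
    )


# Contents of a literal string as written in the file: a UTF-16BE byte order
# mark (FE FF), then "A" (00 41) and the euro sign U+20AC (20 AC).
raw = rb"\376\377\000A\040\254"
data = unescape_octal(raw)

print(data.hex(" "))                 # the six bytes after de-escaping
print(data[2:].decode("utf-16-be"))  # the two characters they encode
```

Each escape yields exactly one byte, and only pairs of bytes form a character here: `\040` and `\254` are individually meaningless in UTF-16BE, but together they encode U+20AC.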
Thanks. The proposed fix for Table 3 is quite simple - it's a byte:
I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down. |
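The 1-3 digit rule can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not normative text: the function name `read_octal_escape` is hypothetical, and the test inputs mirror the `\53`, `\053` and `\0053` cases used as examples in 7.3.4.2.

```python
def read_octal_escape(text: bytes, pos: int):
    """Read the octal escape whose backslash is at text[pos].

    Returns (byte_value, index_after_escape). Hypothetical helper for
    illustration; it consumes at most three octal digits.
    """
    if text[pos:pos + 1] != b"\\":
        raise ValueError("expected a backslash at pos")
    digits = b""
    i = pos + 1
    # Take one, two, or three octal digits and stop at the first non-octal byte.
    while i < len(text) and len(digits) < 3 and text[i] in b"01234567":
        digits += text[i:i + 1]
        i += 1
    if not digits:
        raise ValueError("not an octal escape")
    # Values above 255 keep only their low 8 bits (high-order overflow ignored).
    return int(digits, 8) & 0xFF, i


for sample in (rb"\53", rb"\053", rb"\0053"):
    value, end = read_octal_escape(sample, 0)
    print(sample, hex(value), sample[end:])
```

With four digits, as in `\0053`, only the first three belong to the escape, so the result is the byte 05h followed by a literal digit `3`.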
OK with me. |
This is great. Thanks. And yes, "(3 octal digits)" would be incorrect. |
PDF TWG agree |
We feel that there is a small ambiguity regarding literal strings in ISO 32000-2:2020 (and previous versions). In ch. 7.3.4.2 "Literal strings", Table 3, a single "\ddd" octal code is defined as a "character code". Isn't a "character code" something that maps to a character in the code page in question? For strings encoded with UTF-16BE, a single octal code cannot really be used as a mapping character code (i.e. \ddd does not map to a Unicode character). Of course this can be done with multiple octal codes, but the definition is about a single octal code \ddd. As a result, it may be unclear to the reader whether octal coding can be used in a UTF-16BE encoded string with multiple-byte characters.
It is true that ch. 7.3.4.2 "Literal strings" also states that any 8-bit value can appear in a string when written in the octal "notation described". But this can still be read so that "notation described" refers to the definition of an octal code as a mapping character code (i.e. limits it to that), which leads back to the original ambiguity.
In future revisions, we suggest reconsidering or explaining the term "character code" used for octal codes, and adding a short sentence about its use with Unicode in the case where a single character requires multiple bytes.