ZnUTF8Encoder filters out any U+FEFF #134
Comments
Well, I am sure you can see that this was done by design: a BOM is a no-op for UTF-8, should not occur, and is not recommended. Hence this pragmatic decision. I know that it is used on Windows, which is why there are some provisions for writing a BOM. Should version 3.2 compatibility be a thing, given that it is 20+ years old? Up to now there have been no complaints about this behaviour. What practical issue did you encounter? What do you think should happen or change?
Shouldn’t it at most filter out an initial U+FEFF? The example for char input[] = { '\xEF','\xBB','\xBF', '\xEF','\xBB','\xBF', '\x41', '\xEF','\xBB','\xBF', '\x42' }; The output is as follows, so only the initial U+FEFF is filtered out:
In Python I see this: $ python3
Python 3.12.2 (main, Feb 6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes = b'\xEF\xBB\xBF\xEF\xBB\xBF\x41\xEF\xBB\xBF\x42'
>>> bytes.decode("utf-8")
'\ufeff\ufeffA\ufeffB'
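As a side note on the "at most an initial U+FEFF" behaviour asked about above: Python ships a second codec, `utf-8-sig`, that drops one leading BOM and nothing else. The following is a minimal sketch for comparison only; the helper name `scalars` is made up for this example.

```python
# Same eleven bytes as in the transcript above.
data = b"\xEF\xBB\xBF\xEF\xBB\xBFA\xEF\xBB\xBFB"


def scalars(text):
    """Return the code points of text as lowercase hex strings (illustrative helper)."""
    return [format(ord(ch), "x") for ch in text]


plain = data.decode("utf-8")       # keeps every U+FEFF
signed = data.decode("utf-8-sig")  # drops at most one leading U+FEFF

print(scalars(plain))   # ['feff', 'feff', '41', 'feff', '42']
print(scalars(signed))  # ['feff', '41', 'feff', '42']
```

Neither codec touches a U+FEFF that appears after the first character.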
In Swift I see this: $ swift repl
Welcome to Apple Swift version 5.10 (swiftlang-5.10.0.13 clang-1500.3.9.4).
Type :help for assistance.
1> import Foundation
2> let byteArray: [UInt8] = [239, 187, 191, 239, 187, 191, 65, 239, 187, 191, 66]
byteArray: [UInt8] = 11 values {
[0] = 239
[1] = 187
[2] = 191
[3] = 239
[4] = 187
[5] = 191
[6] = 65
[7] = 239
[8] = 187
[9] = 191
[10] = 66
}
3> String(data: Data(bytes: byteArray), encoding: .utf8)
$R0: String? = "AB"
We could look at Java and JavaScript as well. I am rather a fan of Swift's Unicode approach (in general, not specifically this); it is quite modern. Like I said, it was a design decision to skip BOM sequences like that. And my question remains: what practical issue did you encounter? I guess it would not be impossible to add an option to UTFEncoder (the superclass of all UTF encoders) to control this behaviour, but then the next question is what the default should be. (There is already a […]) Still, we would be adding complexity, and I currently do not see why.
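To make the design question concrete, here is a rough Python sketch of what such an option could look like. It is not Zinc's API: the function `decode_utf8`, the parameter `bom_policy` and its three values are all hypothetical names invented for this illustration, and the default merely mirrors the "filter every U+FEFF" behaviour described in this thread.

```python
BOM = "\ufeff"  # U+FEFF, BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE


def decode_utf8(data: bytes, bom_policy: str = "strip_all") -> str:
    """Decode UTF-8 and treat U+FEFF according to bom_policy (hypothetical option).

    "keep"          : return every U+FEFF unchanged
    "strip_initial" : drop a single leading U+FEFF, keep the rest
    "strip_all"     : drop every U+FEFF
    """
    text = data.decode("utf-8")
    if bom_policy == "keep":
        return text
    if bom_policy == "strip_initial":
        return text[1:] if text.startswith(BOM) else text
    if bom_policy == "strip_all":
        return text.replace(BOM, "")
    raise ValueError(f"unknown bom_policy: {bom_policy!r}")


sample = b"\xEF\xBB\xBFA\xEF\xBB\xBFB"
for policy in ("keep", "strip_initial", "strip_all"):
    print(policy, ascii(decode_utf8(sample, policy)))
```

Which of the three should be the default is exactly the open question raised above; the sketch only shows that the choices are cheap to express.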
Note that the Swift REPL doesn’t escape U+FEFF when printing a string. This is clearer:
4> String(data: Data(bytes: byteArray), encoding: .utf8)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R1: [String] = 5 values {
[0] = "feff"
[1] = "feff"
[2] = "41"
[3] = "feff"
[4] = "42"
}
In the following, the string is different depending on whether ‘utf16BigEndian’ or ‘utf16’ is used:
5> let byteArray2: [UInt8] = [254, 255, 254, 255, 0, 65, 254, 255, 0, 66]
byteArray2: [UInt8] = 10 values {
[…]
}
6> String(data: Data(bytes: byteArray2), encoding: .utf16BigEndian)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R2: [String] = 5 values {
[0] = "feff"
[1] = "feff"
[2] = "41"
[3] = "feff"
[4] = "42"
}
7> String(data: Data(bytes: byteArray2), encoding: .utf16)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R3: [String] = 4 values {
[0] = "feff"
[1] = "41"
[2] = "feff"
[3] = "42"
}
Section ‘3.10 Unicode Encoding Schemes’ in the Core Specification says that “in UTF-16BE, an initial byte sequence <FE FF> is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE”, while “in the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark”.
Deletion of U+FEFF when it is not used as a BOM may lead to a security problem, as discussed in section ‘3.5 Deletion of Code Points’ in Unicode Technical Report #36 ‘Unicode Security Considerations’, which is referred to in the section ‘Modification’ on conformance clause C7 within section ‘3.2 Conformance Requirements’ in the Core Specification.
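To illustrate the kind of problem that silent deletion can cause, here is a small Python sketch. It assumes a made-up pipeline in which text is checked first and U+FEFF is removed afterwards; the names `passes_filter`, `strip_all_bom` and the blocked word are invented for this example and are not taken from the report or from Zinc.

```python
BOM = "\ufeff"
BLOCKED = "script"  # hypothetical word that an upstream filter rejects


def passes_filter(text: str) -> bool:
    """Hypothetical gatekeeper: accept text that does not contain the blocked word."""
    return BLOCKED not in text.lower()


def strip_all_bom(text: str) -> str:
    """Later stage that deletes every U+FEFF, like a decoder that filters them all."""
    return text.replace(BOM, "")


payload = "scr" + BOM + "ipt"   # the blocked word, split by an invisible U+FEFF
cleaned = strip_all_bom(payload)

print(passes_filter(payload))   # True: the check does not see the blocked word
print(passes_filter(cleaned))   # False: deletion has reassembled it
```

The check and the deletion each look harmless on their own; the problem only appears when the deletion runs after the check.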
Add #testUTF8ByteOrderMarkSignificant #134
Maybe not 100% what you want, but at least there is now the option to see BOM occurrences. Thx for the discussion.
OK, thanks! It would seem better, though, for the option to also disable the endianness swapping after the U+FFFE in the following. I would expect the first, third and fifth characters to be the same, but they are not:
((ZnCharacterEncoder newForEncoding: 'utf16be')
	ignoreByteOrderMark: false;
	decodeBytes: (ByteArray readHexFrom: ('00FB FEFF 00FB FFFE 00FB' reject: #isSeparator)))
		asArray collect: [ :character |
			character codePoint printStringBase: 16 nDigits: 4 ]
"⇒ #('00FB' 'FEFF' '00FB' 'FFFE' 'FB00')"
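For comparison, the same ten bytes can be run through Python's `utf-16-be` codec, which fixes the byte order up front and therefore gives no special treatment to either FE FF or FF FE. This is only a sketch of how one other implementation behaves, not a statement of what Zinc must do.

```python
# Same bytes as in the Pharo snippet above; fromhex ignores the spaces.
data = bytes.fromhex("00FB FEFF 00FB FFFE 00FB")

# The byte order is given explicitly, so nothing is swapped mid-stream.
text = data.decode("utf-16-be")
code_points = [format(ord(ch), "04X") for ch in text]

print(code_points)  # ['00FB', 'FEFF', '00FB', 'FFFE', '00FB']
```

Here the first, third and fifth characters are all U+00FB.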
Metacello new baseline: 'GToolkitForPharo9'; repository: 'github://feenkcom/gtoolkit:v1.0.749/src'; load
All commits (including upstream repositories) since last build:
feenkcom/gtoolkit-utility@9f68de by John Brant: [#3764] adding query methods to RB ast nodes
feenkcom/gtoolkit-utility@3fff1c by Alistair Grant: Merge remote-tracking branch 'origin/main'
feenkcom/gtoolkit-utility@2bc080 by Alistair Grant: GtVmRuntimeStatisticsDiffReport>>reportSummaryValues handle 0 total duration
feenkcom/gt4pharo@5053cc by Tudor Girba: Merge 1c6aa3637b479bf5ada9e378ce6c4edd1150e8b6
feenkcom/gt4pharo@b658b2 by Tudor Girba: fix examples due to using small caps in tooltips #3757
feenkcom/gt4pharo@1c6aa3 by Juraj Kubelka: example fixes
feenkcom/gt4pharo@d86105 by Tudor Girba: add examples for expander in after: pragma #3757
feenkcom/gt4pharo@d8ff4c by Tudor Girba: Merge 11a5926e082b809995ccd5269dc616b0e05d0dac
feenkcom/gt4pharo@81ae8e by Tudor Girba: add expander in after: pragma
feenkcom/gtoolkit-inspector@431978 by Sven Van Caekenberghe: Merge cb4f6c6905f020e23e923833d9200540d730cc60
feenkcom/gtoolkit-inspector@f09d09 by Sven Van Caekenberghe: add utf8 lossy decoding to ByteArray Bytes gtView
feenkcom/zinc@2b61aa by svenvc: Merge 3f60e6d55077d9c601d4245d4c060a5d522f54de
feenkcom/zinc@9891ac by svenvc: added ZnLossyUTF8Encoder
feenkcom/zinc@3f60e6 by Sven Van Caekenberghe: Merge pull request #14 from svenvc/master Update Zinc: Add an option #ignoreByteOrderMark to ZnUTFEncoders (true by default)
feenkcom/zinc@fc5903 by svenvc: Add an option #ignoreByteOrderMark to ZnUTFEncoders (true by default) Add #testUTF8ByteOrderMarkSignificant svenvc/zinc#134
feenkcom/zinc@fd71ff by Sven Van Caekenberghe: Update README.md Add Pharo 12 badge/shield
feenkcom/gtoolkit-visualizer@92a6d2 by akevalion: [#3745] update barcharts to support a common parent and allow them to handle lines decorations
feenkcom/lepiter@13327d by Sven Van Caekenberghe: Shell script executor is now /bin/bash on macOS and Linux, and powershell on Windows
feenkcom/lepiter@465969 by Sven Van Caekenberghe: rename ShellCommand (Shell command) snippet to ShellScript (Shell script) snippet
feenkcom/gtoolkit-demos@000483 by Oscar Nierstrasz: fixed typo in smalltalk slide notes
feenkcom/gt4git@2044cb by Sven Van Caekenberghe: rewrote GtIceGitRepository>>#runGitWithArgs: to wait asynchronously on native subprocess completion/output
ZnUTF8Encoder filters out any U+FEFF. For example, the following answers '42' rather than 'FEFF':
This doesn’t quite seem to be in accordance with the Unicode standard, which says (in section ‘23.8 Specials’ of ‘The Unicode Standard, Version 15.0 – Core Specification’):