Editorial: revamp the way we deal with code points and bytes #247
This is WIP, mainly since I'm a little unsure we want to go this far, but I also kinda like it.
Seems reasonable to me.
Am I understanding correctly that the purpose is to disambiguate code-point and byte conversions? If so, my concern is that the extra rigor creates extra opportunities for errors and may not be pulling its weight. However, if you prefer it, that's good enough for me.
I like this generally as a direction, but made some "food for thought" comments below.
@@ -1915,7 +1916,7 @@ constructor steps are:
 <p class=note>{{DOMString}}, as well as an <a for=/>I/O queue</a> of code units rather than scalar
 values, are used here so that a surrogate pair that is split between chunks can be reassembled into
 the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular,
-lone surrogates will be replaced with U+FFFD.
+lone surrogates will be replaced with U+FFFD (�).
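To make the note above concrete, here is a minimal Python sketch of reassembling a surrogate pair that is split between chunks. It is an illustration only, not the algorithm from the standard: the class name `CodeUnitReassembler` and its `push`/`flush` methods are hypothetical, and chunks are modeled as lists of 16-bit integers.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


class CodeUnitReassembler:
    """Hypothetical helper: turns chunks of UTF-16 code units into scalar values."""

    def __init__(self):
        # Lead surrogate held over from earlier input, if any.
        self.pending_lead = None

    def push(self, code_units):
        out = []
        for unit in code_units:
            if self.pending_lead is not None:
                lead, self.pending_lead = self.pending_lead, None
                if 0xDC00 <= unit <= 0xDFFF:
                    # Lead followed by trail: combine into one scalar value.
                    out.append(0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00))
                    continue
                # The held lead was not followed by a trail: it is a lone surrogate.
                out.append(REPLACEMENT)
            if 0xD800 <= unit <= 0xDBFF:
                # Hold the lead; its trail may arrive in the next chunk.
                self.pending_lead = unit
            elif 0xDC00 <= unit <= 0xDFFF:
                # Trail with no preceding lead.
                out.append(REPLACEMENT)
            else:
                out.append(unit)
        return out

    def flush(self):
        """Call at end of input: a lead still waiting for its trail is lone."""
        if self.pending_lead is None:
            return []
        self.pending_lead = None
        return [REPLACEMENT]
```

Pushing `[0xD83D]` and then `[0xDE00]` as separate chunks yields nothing for the first call and `[0x1F600]` for the second, which is the behavior that an I/O queue of scalar values alone could not provide.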
In Charmod we often followed the convention: � [U+FFFD REPLACEMENT CHARACTER] (with the [U+xxxx character name] part styled distinctly). I say "often" because I willfully ignored the convention whenever it reduced clarity, particularly with long sequences used in this or that example. For examples like this, you might consider something similar, since it makes the text unambiguous?
OTOH, I find this pretty clear and am not sure that the Charmod style adds that much. I like quoting the character like this when it's printable.
We made up our own convention in https://infra.spec.whatwg.org/#code-points since we found the one in Charmod a bit too verbose, iirc.
-<li><p>If <var>byte</var> is an <a>ASCII byte</a>, return
-a code point whose value is <var>byte</var>.
+<li><p>Let <var>byteValue</var> be <var>byte</var>'s <a for=byte>value</a>.
Is byteValue really needed vs. just saying things like:
If byte is an ASCII byte, then return a code point whose value is byte's value.
I realize that "code point's value" is a different integer type than "byte's value", but we mean the number in any case.
+<a for="code point">value</a> is <var>byteValue</var>.
+<li><p>Return a <a>code point</a> whose <a for="code point">value</a> is
+0xF780 + <var>byteValue</var> − 0x80.
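As a rough illustration of the steps in this hunk, the byte-to-code-point mapping can be sketched in Python with plain integers standing in for both "byte's value" and "code point's value". The function name `decode_byte` is hypothetical, and the range check is an assumption added for the sketch, not something the spec text states.

```python
def decode_byte(byte_value: int) -> int:
    """Return the code point value for one byte value (hypothetical helper)."""
    # Assumption for this sketch: reject anything that is not a byte value.
    if not 0x00 <= byte_value <= 0xFF:
        raise ValueError("not a byte value")
    # ASCII bytes map to the code point with the same value.
    if byte_value <= 0x7F:
        return byte_value
    # Remaining bytes (0x80 to 0xFF) map into U+F780 to U+F7FF.
    return 0xF780 + byte_value - 0x80
```

With integers throughout, the byte/number distinction under discussion disappears entirely, which is the trade-off the comments in this thread are weighing.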
I see the problem. You don't want prose here. But can't we just say 0xF780 + byte - 0x80?
Is there a reason I'm not seeing for why we don't just make the number 0xF700? Is the reason to emphasize that we're trying to get to/from bytes >= 0x80?
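On the arithmetic alone, the two spellings are interchangeable. A quick check, written in Python purely for illustration (the variable names are made up for this sketch):

```python
# For every non-ASCII byte value, both spellings give the same number.
assert all(0xF780 + b - 0x80 == 0xF700 + b for b in range(0x80, 0x100))

# The longer spelling shows the anchor points of the mapping directly:
# byte 0x80 lands on U+F780 and byte 0xFF lands on U+F7FF.
first = 0xF780 + 0x80 - 0x80
last = 0xF780 + 0xFF - 0x80
```

The check only establishes that the results are equal; whether the longer form is kept for readability is an editorial choice that the arithmetic cannot settle.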
We've had some cases where we want to distinguish bytes from numbers. So the question is whether we want to do that here as well. And I guess in some sense we do since we want to return code points or bytes, but a lot of the calculations are on numbers.
I think we could use byte in the calculation directly (as we already did), but it wouldn't really be logically consistent with how we talk about bytes and numbers elsewhere in the web platform.
(I guess another way would be to say that in equations they are cast to their value.)
We could define implicit conversions code point → number and byte → number (whatwg/infra#319) and perhaps the other way around too. But even if we don't, we could use short algorithmic phrases inside the formula: "0xF780 + (byte's value) − 0x80".
There are other formulas in the standard that use byte or code point values directly, though, and they should be changed accordingly. (Interestingly, there are formulas dealing with code units around TextEncoder and TextEncoderStream, which don't have this problem because code units seem to be defined directly as a number type.)
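One way to picture the "(byte's value)" phrasing is with wrapper types that do not behave like numbers. The following Python sketch is only an analogy for the spec-level distinction: the `Byte` and `CodePoint` classes are hypothetical and are not how Infra defines these concepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Byte:
    value: int  # the number the byte carries, 0x00 to 0xFF

    def __post_init__(self):
        if not 0x00 <= self.value <= 0xFF:
            raise ValueError("byte value out of range")


@dataclass(frozen=True)
class CodePoint:
    value: int  # the number the code point carries, 0x0000 to 0x10FFFF

    def __post_init__(self):
        if not 0x0000 <= self.value <= 0x10FFFF:
            raise ValueError("code point value out of range")


byte = Byte(0x80)
# "0xF780 + (byte's value) - 0x80": the arithmetic runs on plain numbers,
# and the result is wrapped back into a code point explicitly.
code_point = CodePoint(0xF780 + byte.value - 0x80)
```

In this model, writing `0xF780 + byte` raises a `TypeError` because `Byte` defines no arithmetic. An implicit byte → number conversion of the kind suggested above would correspond to giving the wrapper that arithmetic behavior.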
FWIW I intuitively like making the code point <—> byte/number conversions explicit, and don't see as much of a need for distinguishing bytes and numbers. (I'd be OK defining bytes as a subtype of numbers, if we ever make progress on defining numbers.)
-<li><p>If <var>code point</var> is in the range U+F780 to U+F7FF, inclusive, return
-a byte whose value is <var>code point</var> − 0xF780 + 0x80.
+<li><p>If <var>codePointValue</var> is in the range 0xF780 to 0xF7FF, inclusive, then return a
+<a>byte</a> whose <a for=byte>value</a> is <var>codePointValue</var> − 0xF780 + 0x80.
etc.
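For the reverse direction in the last hunk, a small Python sketch of the code-point-to-byte step follows. Only the U+F780 to U+F7FF branch comes from the diff shown here. The function name `encode_code_point`, the ASCII branch, and the `None` return are assumptions made so that the example is self-contained.

```python
from typing import Optional


def encode_code_point(code_point_value: int) -> Optional[int]:
    """Return the byte value for one code point value, or None (hypothetical helper)."""
    # Assumption: ASCII code points pass through unchanged, mirroring the byte side.
    if 0x00 <= code_point_value <= 0x7F:
        return code_point_value
    # From the diff: U+F780 to U+F7FF map back to bytes 0x80 to 0xFF.
    if 0xF780 <= code_point_value <= 0xF7FF:
        return code_point_value - 0xF780 + 0x80
    # Assumption: None stands in for whatever error the surrounding algorithm signals.
    return None
```

Under these assumptions, every byte value from 0x80 to 0xFF is produced by exactly one code point value in the range, so the mapping inverts the formula `0xF780 + byteValue − 0x80` from the earlier hunk.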
@andreubotella @ricea @domenic @aphillips thoughts?