-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support National Language Shift Tables #16
base: master
Are you sure you want to change the base?
Conversation
Awesome! Thanks @juliangut, please bear with me as I'm traveling and will have to review this when I get time. |
Hello @Codesleuth, have you had time to review this PR? |
I'm currently looking at it - will post soon. |
Hi @Codesleuth how did you find the PR? does it feel right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to hold off accepting this because of the reasons outlined in the comment. I'm thinking of an API in the future that will do something like this:
splitter.split('∞')
{
"characterSet": "GSM",
"shiftTable": "Portuguese language (Latin script)",
"parts": [
{
"content": "∞",
"length": 1,
"bytes": 1
}
],
"bytes": 1,
"length": 1,
"remainingInPart": 159
}
And then in the case of mixed it would be:
splitter.split('∞Ø')
{
"characterSet": "Unicode",
"parts": [
{
"content": "∞Ø",
"length": 2,
"bytes": 4
}
],
"bytes": 4,
"length": 2,
"remainingInPart": 68
}
I think this will be a better API and won't give false positives for the mixed case being detected as GSM.
} | ||
function validateCharacterWithShiftTable(character) { | ||
var charCodes = GSM_charCodes.concat(GSM_TR_charCodes, GSM_ES_charCodes, GSM_PT_charCodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how this appears to make sense if all you need to do is count all possible valid characters that exist in any shift table, but this creates an invalid situation where the whole text doesn't fit any common shift table, and should be detected as Unicode. Take the following two messages:
∞
valid in only Portuguese language (Latin script)Ø
valid in either Spanish language (Latin script) or the Basic Character Set
If each example was a full message, they would each be valid. However, if we take a message containing both characters:
∞Ø
As far as we know, this is an invalid message because there is no common shift table that supports both at the same time, so it should be detected as Unicode.
This library's character set auto-detection mechanism is pretty important, and we should think about how that can be preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the PR, when validating a whole message with this method is not used anymore, review message below
var charCodes = [GSM_charCodes, GSM_TR_charCodes, GSM_ES_charCodes, GSM_PT_charCodes]; | ||
for (var i = 0; i < charCodes.length; i++) { | ||
if (validateMessageInCharCodesList(message, charCodes[i])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validating against each shift table char-table independently solves the problem of mixed shift table characters combined within the same message
Hi @Codesleuth I can't argue against a new unified API, that would be great. About the messages with chars of mixed shitf tables you are right, I've just updated the code and tests accordingly, please review my comments |
I'm so sorry for the delay! I will be working on this repo this weekend, which will give me time to review and try out this PR properly. This PR is missing functional tests that cover the options and shift table output, but I can add them as I work on it. |
Currently only supported Turkish, Spanish and Portuguese
Tests added only for charset validation. Message split should not be affected as National Language Shift Tables only affects charset and not number of bytes per char