Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support National Language Shift Tables #16

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

juliangut
Copy link

Currently only supported Turkish, Spanish and Portuguese

Tests added only for charset validation. Message split should not be affected as National Language Shift Tables only affects charset and not number of bytes per char

@Codesleuth
Copy link
Owner

Awesome! Thanks @juliangut, please bear with me as I'm traveling and will have to review this when I get time.

@juliangut
Copy link
Author

Hello @Codesleuth, have you had time to review this PR?

@Codesleuth
Copy link
Owner

I'm currently looking at it - will post soon.

@juliangut
Copy link
Author

Hi @Codesleuth how did you find the PR? does it feel right?

Copy link
Owner

@Codesleuth Codesleuth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to hold off accepting this because of the reasons outlined in the comment. I'm thinking of an API in the future that will do something like this:

splitter.split('∞')
{
  "characterSet": "GSM",
  "shiftTable": "Portuguese language (Latin script)",
  "parts": [
    {
      "content": "",
      "length": 1,
      "bytes": 1
    }
  ],
  "bytes": 1,
  "length": 1,
  "remainingInPart": 159
}

And then in the case of mixed it would be:

splitter.split('∞Ø')
{
  "characterSet": "Unicode",
  "parts": [
    {
      "content": "∞Ø",
      "length": 2,
      "bytes": 4
    }
  ],
  "bytes": 4,
  "length": 2,
  "remainingInPart": 68
}

I think this will be a better API and won't give false positives for the mixed case being detected as GSM.

}
function validateCharacterWithShiftTable(character) {
var charCodes = GSM_charCodes.concat(GSM_TR_charCodes, GSM_ES_charCodes, GSM_PT_charCodes);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this appears to make sense if all you need to do is count all possible valid characters that exist in any shift table, but this creates an invalid situation where the whole text doesn't fit any common shift table, and should be detected as Unicode. Take the following two messages:

  1. valid in only Portuguese language (Latin script)
  2. Ø valid in either Spanish language (Latin script) or the Basic Character Set

If each example was a full message, they would each be valid. However, if we take a message containing both characters:

∞Ø

As far as we know, this is an invalid message because there is no common shift table that supports both at the same time, so it should be detected as Unicode.

This library's character set auto-detection mechanism is pretty important, and we should think about how that can be preserved.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR, when validating a whole message with this method is not used anymore, review message below

var charCodes = [GSM_charCodes, GSM_TR_charCodes, GSM_ES_charCodes, GSM_PT_charCodes];
for (var i = 0; i < charCodes.length; i++) {
if (validateMessageInCharCodesList(message, charCodes[i]))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validating against each shift table char-table independently solves the problem of mixed shift table characters combined within the same message

@juliangut
Copy link
Author

Hi @Codesleuth I can't argue against a new unified API, that would be great. About the messages with chars of mixed shitf tables you are right, I've just updated the code and tests accordingly, please review my comments

@Codesleuth
Copy link
Owner

I'm so sorry for the delay! I will be working on this repo this weekend, which will give me time to review and try out this PR properly. This PR is missing functional tests that cover the options and shift table output, but I can add them as I work on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants