WordTokenizer, WordPunctTokenizer, and TreebankWordTokenizer all have similar unusual behavior on accented (tilde-ed?) characters:
```js
> var tokenizer = new natural.WordPunctTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'ã', 'o', 'Paulo' ]
> var tokenizer = new natural.TreebankWordTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'ã', 'o', 'Paulo' ]
> var tokenizer = new natural.WordTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'o', 'Paulo' ]
```
Is that intended? If so, what would be the ideal way to tokenize English text containing these characters (such as in city and person names)? Map every accented character to an un-accented English equivalent?
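One possible workaround (a sketch, not a documented natural feature) is to either tokenize with natural's `RegexpTokenizer` using a pattern that treats accented letters as word characters, or to strip the diacritics via Unicode normalization before tokenizing. The `\u00C0-\u017F` range below is an assumption meant to cover common Latin diacritics, and `String.prototype.normalize` requires an ES6-capable Node:

```js
var natural = require('natural');

// Option 1: a custom RegexpTokenizer whose delimiter pattern excludes
// accented letters, so 'São' stays a single token.
var accentTokenizer = new natural.RegexpTokenizer({
  pattern: /[^A-Za-z0-9\u00C0-\u017F]+/
});
console.log(accentTokenizer.tokenize('São Paulo')); // expect [ 'São', 'Paulo' ]

// Option 2: fold diacritics to ASCII first (NFD splits base letters
// from combining marks; the replace drops the marks), then use the
// stock WordTokenizer.
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
var wordTokenizer = new natural.WordTokenizer();
console.log(wordTokenizer.tokenize(stripDiacritics('São Paulo'))); // expect [ 'Sao', 'Paulo' ]
```

Note that the normalization route loses the original spelling ('São' becomes 'Sao'), so the custom pattern may be preferable when the accents need to survive.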
I'm not entirely sure if this is connected, but when using the WordPunctTokenizer... is there any way to avoid converting punctuation symbols such as dots or commas into tokens?
Right now I'm successfully tokenizing words with accented characters, but unfortunately I'm also getting tokens such as '.', '(', ')', ':'.
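One way to handle that (a sketch, not a built-in option of WordPunctTokenizer as far as I know) is to filter the token array afterwards, dropping any token that contains no letter or digit:

```js
var natural = require('natural');
var tokenizer = new natural.WordPunctTokenizer();

// Tokenize, then discard tokens made purely of punctuation such as
// '.', '(', ')', ':'. The \u00C0-\u017F range is an assumption
// covering common accented Latin letters.
function wordsOnly(text) {
  return tokenizer.tokenize(text).filter(function (t) {
    return /[A-Za-z0-9\u00C0-\u017F]/.test(t);
  });
}

console.log(wordsOnly('São Paulo (Brazil): a city.'));
// Given the accent-splitting behavior reported above, expect:
// [ 'S', 'ã', 'o', 'Paulo', 'Brazil', 'a', 'city' ]
```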