Tokenization behavior for accented characters #191

Open · mbc1990 opened this issue Oct 23, 2014 · 2 comments
@mbc1990 (Contributor) commented Oct 23, 2014

WordTokenizer, WordPunctTokenizer, and TreebankWordTokenizer all show the same unusual behavior on accented characters (ã and the like):

> var tokenizer = new natural.WordPunctTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'ã', 'o', 'Paulo' ]

> var tokenizer = new natural.TreebankWordTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'ã', 'o', 'Paulo' ]

> var tokenizer = new natural.WordTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'o', 'Paulo' ]

Is that intended? If so, what would be the ideal way to tokenize English text containing these characters (such as in city and person names)? Map every accented character to an un-accented English equivalent?
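For that last option, here is a minimal sketch, independent of natural: decompose the string to NFD and strip the Unicode combining marks, so 'São' becomes 'Sao' before tokenizing. This assumes a runtime with String.prototype.normalize (ES2015+):

// Sketch: map accented characters to unaccented equivalents by
// decomposing to NFD and stripping Unicode combining marks (U+0300–U+036F).
function stripDiacritics(s) {
    return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('São Paulo')); // 'Sao Paulo'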

@kkoch986
Copy link
Member

@mbc1990 I think you would just have to use the RegexpTokenizer; WordPunctTokenizer is a good example of this:

var WordPunctTokenizer = function(options) {
    this._pattern = new RegExp(/(\w+|\!|\'|\"\")/i);
    RegexpTokenizer.call(this, options);
};

util.inherits(WordPunctTokenizer, RegexpTokenizer);
exports.WordPunctTokenizer = WordPunctTokenizer;

You can just add the other characters you're interested in to the matching class; something like /(\w+|\!|\'|\"\"|ã)/i should work.

-Ken
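For reference, a minimal sketch of that suggestion, assuming RegexpTokenizer accepts a pattern option as the constructor above implies. The character ranges are an illustrative choice covering the Latin-1 accented letters:

var natural = require('natural');

// Extend the \w class with Latin-1 accented letters so 'São' stays
// one token. The ranges À-Ö, Ø-ö, ø-ÿ skip × (U+00D7) and ÷ (U+00F7).
var tokenizer = new natural.RegexpTokenizer({
    pattern: /([A-Za-zÀ-ÖØ-öø-ÿ0-9_]+|\!|\'|\"\")/
});

console.log(tokenizer.tokenize('São Paulo'));
// expected: [ 'São', 'Paulo' ]

By default the tokenizer splits on the pattern, but because the pattern is a capturing group, the matched words are kept and the empty/whitespace fragments are discarded.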

@rtyx commented Dec 20, 2018

I'm not entirely sure if this is connected, but when using the WordPunctTokenizer, is there any way to avoid converting punctuation symbols such as dots or commas into tokens?

Right now I'm successfully tokenizing words with accented characters, but unfortunately I'm also getting tokens such as '.', '(', ')', ':'.
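One possible sketch for this, assuming RegexpTokenizer supports a gaps option (gaps: false makes it return regex matches rather than splitting on the pattern), so only letter/digit runs ever become tokens:

var natural = require('natural');

// Match only word-like runs (including accented letters); punctuation
// such as '.', '(', ')', ':' is never matched, so it never appears.
var tokenizer = new natural.RegexpTokenizer({
    pattern: /[A-Za-zÀ-ÖØ-öø-ÿ0-9_]+/g,
    gaps: false
});

console.log(tokenizer.tokenize('São Paulo, Brasil (2018).'));
// expected: [ 'São', 'Paulo', 'Brasil', '2018' ]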
