Tokenization behavior for accented characters #191

Open · mbc1990 opened this issue Oct 23, 2014 · 2 comments
@mbc1990 (Contributor) commented Oct 23, 2014

WordTokenizer, WordPunctTokenizer, and TreebankWordTokenizer all show the same unusual behavior on accented characters (ã and the like):

> var tokenizer = new natural.WordPunctTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'ã', 'o', 'Paulo' ]

> var tokenizer = new natural.TreebankWordTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'ã', 'o', 'Paulo' ]

> var tokenizer = new natural.WordTokenizer();
> tokenizer.tokenize('São Paulo');
[ 'S', 'o', 'Paulo' ]

Is that intended? If so, what would be the ideal way to tokenize English text containing these characters (such as in city and person names)? Map every accented character to an un-accented English equivalent?
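For that last option, here is a minimal sketch, independent of natural: decompose the string to NFD and strip the Unicode combining marks, so 'São' becomes 'Sao' before tokenizing. This assumes a runtime with String.prototype.normalize (ES2015+):

// Sketch: map accented characters to unaccented equivalents by
// decomposing to NFD and stripping Unicode combining marks (U+0300–U+036F).
function stripDiacritics(s) {
    return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('São Paulo')); // 'Sao Paulo'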

@kkoch986
Copy link
Member

@mbc1990 I think you would just have to use the RegexpTokenizer; WordPunctTokenizer is a good example of this:

var WordPunctTokenizer = function(options) {
    this._pattern = new RegExp(/(\w+|\!|\'|\"\")/i);
    RegexpTokenizer.call(this, options);
};

util.inherits(WordPunctTokenizer, RegexpTokenizer);
exports.WordPunctTokenizer = WordPunctTokenizer;

You can just add the other characters you're interested in to the matching class; something like /(\w+|\!|\'|\"\"|ã)/i should work.

-Ken
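For reference, a minimal sketch of that suggestion, assuming RegexpTokenizer accepts a pattern option as the constructor above implies. The character ranges are an illustrative choice covering the Latin-1 accented letters:

var natural = require('natural');

// Extend the \w class with Latin-1 accented letters so 'São' stays
// one token. The ranges À-Ö, Ø-ö, ø-ÿ skip × (U+00D7) and ÷ (U+00F7).
var tokenizer = new natural.RegexpTokenizer({
    pattern: /([A-Za-zÀ-ÖØ-öø-ÿ0-9_]+|\!|\'|\"\")/
});

console.log(tokenizer.tokenize('São Paulo'));
// expected: [ 'São', 'Paulo' ]

By default the tokenizer splits on the pattern, but because the pattern is a capturing group, the matched words are kept and the empty/whitespace fragments are discarded.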

@rtyx commented Dec 20, 2018

I'm not entirely sure if this is connected, but when using the WordPunctTokenizer, is there any way to avoid converting punctuation symbols such as dots or commas into tokens?

Right now I'm successfully tokenizing words with accented characters, but unfortunately I'm also getting tokens such as '.', '(', ')', ':'.
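One possible sketch for this, assuming RegexpTokenizer supports a gaps option (gaps: false makes it return regex matches rather than splitting on the pattern), so only letter/digit runs ever become tokens:

var natural = require('natural');

// Match only word-like runs (including accented letters); punctuation
// such as '.', '(', ')', ':' is never matched, so it never appears.
var tokenizer = new natural.RegexpTokenizer({
    pattern: /[A-Za-zÀ-ÖØ-öø-ÿ0-9_]+/g,
    gaps: false
});

console.log(tokenizer.tokenize('São Paulo, Brasil (2018).'));
// expected: [ 'São', 'Paulo', 'Brasil', '2018' ]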
