
about chinese #177

Open
ahl5esoft opened this issue Jul 24, 2014 · 7 comments
Comments

@ahl5esoft

How do I use the classifier with Chinese text?

@ahl5esoft
Author

classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('优香岛桂皮', '干货');
classifier.addDocument('苗家辣妹辣椒', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.addDocument('三达SD-156A双重过滤烟嘴', '小五金');
classifier.addDocument('波斯BS-I3091测电笔', '小五金');
classifier.train();

classifier.classify('紫菜')
=> 干货

classifier.classify('双重过滤')
=> 干货

classifier.classify('波斯')
=> 干货

Why do these all come back as 干货?

@kkoch986 kkoch986 added Bugs and removed Bugs labels Jul 24, 2014
@kkoch986
Member

The classifier relies on a tokenizer and stemmer, so that could be part of the problem. I don't think we have a Chinese stemmer at the moment, and if you use the English one it will use the English tokenizer, which probably won't help much.

This is part of the reason why we need #159: it could help ensure that when a tokenizer is used, it's the correct one for the language.

@mike820324

I think the Chinese language doesn't need stemming at all, but tokenizing a Chinese document will be a very painful job.
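When no real segmenter is at hand, one crude workaround is to split the text into overlapping character bigrams. This is not linguistically correct segmentation, but bigram features are often good enough for bag-of-words classification. A minimal sketch (the function name `bigramTokenize` is illustrative, not part of natural's API):

```javascript
// Crude fallback tokenizer: overlapping character bigrams.
// Not real word segmentation, but usable as classifier features.
function bigramTokenize(text) {
  // Strip whitespace and punctuation; keep CJK characters and alphanumerics.
  var chars = Array.from(text.replace(/[\s\p{P}]/gu, ''));
  if (chars.length < 2) return chars;
  var tokens = [];
  for (var i = 0; i < chars.length - 1; i++) {
    tokens.push(chars[i] + chars[i + 1]);
  }
  return tokens;
}

console.log(bigramTokenize('南京市长江大桥'));
// ['南京', '京市', '市长', '长江', '江大', '大桥']
```

The resulting array can be fed straight to `classifier.addDocument`, which accepts pre-tokenized input.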

@smilechun

Not sure if it's possible, but I tried applying nodejieba to classification and it seems to work.

var nodejieba = require("nodejieba");
var natural = require('natural');
var classifier = new natural.BayesClassifier();

classifier.addDocument(nodejieba.cut("红掌拨清波"), 'poem');
classifier.addDocument(nodejieba.cut("想睇戲"), 'action');
classifier.addDocument(nodejieba.cut("南京市长江大桥"), 'place');
classifier.train();

console.log(classifier.classify(nodejieba.cut('红掌拨清波')));
console.log(classifier.classify(nodejieba.cut("想睇戲")));
console.log(classifier.classify(nodejieba.cut('南京市长江大桥睇戲')));
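This works because `addDocument` and `classify` accept pre-tokenized arrays, so the classifier never invokes its English tokenizer on the Chinese text. For readers without nodejieba installed, here is a self-contained sketch of the same idea: a tiny multinomial naive Bayes over token arrays. This is my own illustration, not natural's implementation:

```javascript
// Minimal multinomial naive Bayes over pre-tokenized documents.
// Shows why feeding token arrays (e.g. nodejieba.cut output)
// sidesteps the English tokenizer entirely.
function NaiveBayes() {
  this.counts = {};      // label -> token -> count
  this.totals = {};      // label -> total token count
  this.docs = {};        // label -> document count
  this.vocab = new Set();
  this.numDocs = 0;
}

NaiveBayes.prototype.addDocument = function (tokens, label) {
  this.counts[label] = this.counts[label] || {};
  this.totals[label] = this.totals[label] || 0;
  this.docs[label] = (this.docs[label] || 0) + 1;
  this.numDocs += 1;
  for (var t of tokens) {
    this.counts[label][t] = (this.counts[label][t] || 0) + 1;
    this.totals[label] += 1;
    this.vocab.add(t);
  }
};

NaiveBayes.prototype.classify = function (tokens) {
  var best = null, bestScore = -Infinity;
  for (var label in this.counts) {
    // log prior + log likelihoods with add-one smoothing
    var score = Math.log(this.docs[label] / this.numDocs);
    for (var t of tokens) {
      var c = this.counts[label][t] || 0;
      score += Math.log((c + 1) / (this.totals[label] + this.vocab.size));
    }
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
};

// Hand-segmented tokens stand in for a segmenter's output.
var nb = new NaiveBayes();
nb.addDocument(['红掌', '拨', '清波'], 'poem');
nb.addDocument(['南京', '市', '长江', '大桥'], 'place');
console.log(nb.classify(['长江', '大桥'])); // 'place'
```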

@loretoparisi

So basically, would it be possible to add a TokenizerZh by using nodejieba.cut as the tokenization function override?
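A sketch of what such a wrapper could look like, matching the general shape of natural's tokenizers (a class exposing tokenize(text) returning an array of strings). The segmenting function is injected so any backend works (nodejieba.cut, chinese-tokenizer, etc.); the class name `TokenizerZh` is hypothetical, not part of natural:

```javascript
// Hypothetical TokenizerZh: wraps any Chinese segmenting function
// behind the tokenize(text) interface natural's tokenizers expose.
function TokenizerZh(cut) {
  this.cut = cut; // e.g. nodejieba.cut
}

TokenizerZh.prototype.tokenize = function (text) {
  return this.cut(text).filter(function (t) {
    return t.trim().length > 0; // drop whitespace-only tokens
  });
};

// Usage with a stand-in segmenter (replace with nodejieba.cut):
var fakeCut = function (text) { return text.split(''); };
var tokenizer = new TokenizerZh(fakeCut);
console.log(tokenizer.tokenize('想睇戲')); // ['想', '睇', '戲']
```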

@titanism

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?

@Hugo-ter-Doest
Collaborator

Will look into this.
