About Chinese #177
Comments
`classifier.addDocument('五裕紫菜片', '干货'); classifier.classify('紫菜'); classifier.classify('双重过滤'); classifier.classify('波斯');` Why doesn't this work?
The classifier relies on a tokenizer and stemmer, so that could be part of the problem. I don't think we have a Chinese stemmer at the moment, and if you use the English one it will use the English tokenizer, which probably won't help much. This is part of the reason why we need #159: it could help ensure that when a tokenizer is used, it is the correct one for the language.
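For illustration, here is a minimal sketch of the underlying problem, using natural's WordTokenizer to stand in for the English tokenization step described above (the exact tokenizer the classifier uses internally may differ):

```js
// Sketch: an English-oriented tokenizer has no notion of Chinese word
// boundaries, since Chinese text contains no spaces between words.
const natural = require('natural');

const tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize('五裕紫菜片'));
// The output is not a meaningful word list for Chinese, so a query
// like '紫菜' never lines up with the tokens stored for the document.
```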
I think the Chinese language doesn't need stemming at all, but tokenizing a Chinese document will be a very painful job.
Not sure if it's possible, but I tried applying nodejieba to classification and it seems to work: `var nodejieba = require("nodejieba"); classifier.addDocument(nodejieba.cut("红掌拨清波"), 'poem'); console.log(classifier.classify(nodejieba.cut('红掌拨清波')));`
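Expanded into a self-contained script, that experiment might look something like this (a sketch, assuming the classifier accepts pre-tokenized arrays from `nodejieba.cut` and adding the `train()` call a runnable version needs):

```js
// Sketch: segment Chinese text with nodejieba, then pass the resulting
// word arrays straight to natural's Bayes classifier.
const natural = require('natural');
const nodejieba = require('nodejieba');

const classifier = new natural.BayesClassifier();

// Passing an array of tokens bypasses the classifier's built-in
// (English-oriented) tokenization and stemming.
classifier.addDocument(nodejieba.cut('五裕紫菜片'), '干货');
classifier.addDocument(nodejieba.cut('红掌拨清波'), 'poem');

classifier.train();

console.log(classifier.classify(nodejieba.cut('紫菜')));
console.log(classifier.classify(nodejieba.cut('红掌拨清波')));
```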
So basically, would it be possible to add a Chinese tokenizer?
You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package? This would be similar to the port done for the Japanese tokenizer.
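For comparison, here is a rough sketch of how chinese-tokenizer could be wired up the same way; it assumes the `loadFile` API and token objects with a `text` field shown in that project's README, plus a local copy of the CC-CEDICT dictionary:

```js
// Sketch: dictionary-based segmentation with chinese-tokenizer, then
// classification over the plain word strings, as with nodejieba above.
const natural = require('natural');
const chineseTokenizer = require('chinese-tokenizer');

// Assumes a CC-CEDICT dictionary file is available locally.
const tokenize = chineseTokenizer.loadFile('./cedict_ts.u8');
const toWords = (text) => tokenize(text).map((token) => token.text);

const classifier = new natural.BayesClassifier();
classifier.addDocument(toWords('五裕紫菜片'), '干货');
classifier.train();

console.log(classifier.classify(toWords('紫菜')));
```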
Will look into this.
How do I use the classifier with Chinese?