Can TfIdf be used with Chinese? #212
Comments
I haven't personally tried TfIdf with Chinese; at first glance it doesn't seem to work. You probably need to change the tokenizer, but I don't think we have a Chinese tokenizer yet. I'll leave this open for a while and see if anyone else has experience with TfIdf used this way.
Is it possible to use something like the following libraries to first tokenize the Chinese sentences or documents? They should separate a Chinese sentence into several Chinese tokens, for example the following string:
@mike820324 did you get any further with this? I'm also using nodejieba on some Chinese NLP projects, but I'm not sure whether I should move the project to Python for NLTK etc.
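A minimal sketch of the pre-tokenization idea discussed above, assuming nodejieba's cut() returns an array of tokens and that natural's TfIdf#addDocument also accepts an already-tokenized array (both appear to hold, but verify against the versions you have installed):

// Sketch: segment the Chinese text with nodejieba, then hand the tokens to TfIdf.
var natural = require('natural');
var nodejieba = require('nodejieba');

var tfidf = new natural.TfIdf();

// nodejieba.cut() splits a Chinese string into an array of word tokens.
var tokens = nodejieba.cut('中文測試');

// Assumption: addDocument accepts a token array as well as a raw string;
// if your version only takes strings, join the tokens with spaces instead.
tfidf.addDocument(tokens, 's1');

// Query with a pre-tokenized term array to avoid the default English tokenizer.
tfidf.tfidfs(['中文'], function (i, measure) {
  console.log('document #' + i + ' score: ' + measure);
});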
I have no problem with the Chinese tokenizer, but the code still doesn't work. When I checked the output I saw 我: 0. Is this a problem? How do I fix it?
What do the two values in the list of terms mean? Is this the raw frequency or the inverse frequency relative to the text? FWIW, frequency word lists are a mixed bag for Chinese. I think Jieba has its own built in which, while trained on material that isn't the most representative, would at least match the same tokens...
You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?
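For reference, a hedged sketch of what that might look like. The loadFile() call is taken from the chinese-tokenizer README (it builds a tokenizer from a CC-CEDICT dictionary file); the .text field on each result and TfIdf#addDocument accepting a token array are assumptions to verify:

// Sketch only: chinese-tokenizer works from a CC-CEDICT dictionary dump.
var natural = require('natural');
var tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8');

var tfidf = new natural.TfIdf();

// Assumption: each result object carries the matched surface form in .text.
var tokens = tokenize('中文測試').map(function (token) {
  return token.text;
});

tfidf.addDocument(tokens, 's1');
console.log(JSON.stringify(tfidf));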
My code is:
var natural = require('natural'),
    TfIdf = natural.TfIdf,
    tfidf = new TfIdf();
tfidf.addDocument('中文測試', 's1');
var s = JSON.stringify(tfidf);
console.log(s);
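One possible fix for the snippet above, sketched under the assumption that the installed natural exposes TfIdf#setTokenizer (newer versions appear to) and accepts any object with a tokenize() method; nodejieba stands in as the hypothetical Chinese segmenter:

var natural = require('natural');
var nodejieba = require('nodejieba');

var tfidf = new natural.TfIdf();

// Hypothetical adapter: expose nodejieba under the tokenize() interface
// that natural expects from a tokenizer.
tfidf.setTokenizer({
  tokenize: function (text) {
    return nodejieba.cut(text);
  }
});

tfidf.addDocument('中文測試', 's1');
console.log(JSON.stringify(tfidf));

With a segmenter in place, each Chinese word should show up as its own term with a non-zero count instead of the whole sentence being treated as a single (or empty) token.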