
Can TfIdf be used with Chinese? #212

Open
c941010623 opened this issue Jan 30, 2015 · 6 comments

Comments

@c941010623

My code is:

    // TfIdf from the natural package, given a Chinese string ("Chinese test").
    var natural = require('natural');
    var TfIdf = natural.TfIdf;
    var tfidf = new TfIdf();

    tfidf.addDocument('中文測試', 's1');
    var s = JSON.stringify(tfidf);
    console.log(s);

@kkoch986
Member

kkoch986 commented Feb 2, 2015

I haven't personally tried TfIdf with Chinese; at first glance it doesn't seem to work.

You probably need to change the tokenizer, but I don't think we have a Chinese tokenizer yet. I'll leave this open for a while to see whether anyone else has experience using TfIdf this way. A rough idea of the direction is sketched below.
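Something along these lines is what I have in mind (untested; it assumes the TfIdf instance exposes a setTokenizer() hook that accepts any object with a tokenize(text) method returning an array of strings — check the version of natural you have installed):

    // Plug a custom tokenizer into TfIdf.
    var natural = require('natural');
    var tfidf = new natural.TfIdf();

    // Hypothetical placeholder: a naive per-character splitter, only to illustrate the hook.
    // A real Chinese word segmenter should go here instead.
    var naiveChineseTokenizer = {
      tokenize: function (text) {
        return text.split('');
      }
    };

    tfidf.setTokenizer(naiveChineseTokenizer);
    tfidf.addDocument('中文測試', 's1');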

@mike820324

Is it possible to use something like the following libraries to first tokenize the Chinese sentence or document?
https://github.com/dotSlashLu/nodescws
https://github.com/yanyiwu/nodejieba

These should split a Chinese sentence into separate Chinese tokens.

For example, the string "中文測試", which means "Chinese test", would become the list ["中文", "測試"], i.e. ["Chinese", "test"]. A sketch of the idea follows.

@dcsan

dcsan commented Jun 24, 2017

@mike820324 did you get any further with this? I'm also using nodejieba on some Chinese NLP projects, but I'm not sure whether I should move the project to Python for NLTK etc.

@anton-bot
Contributor

I have no problem with the Chinese tokenizer, but the code still doesn't work. When I checked listTerms(), it assigned a tf-idf of zero to every term:

我: 0
搵: 0
緊: 0
游泳池: 0
你們: 0
喺邊度: 0

Is this a problem? How can I fix it?
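One possible explanation (it depends on how the installed version of natural computes idf): with only one document in the corpus, the inverse-document-frequency factor can come out as log(1/1) = 0, which zeroes every tf-idf score. Adding more documents should produce non-zero values. A minimal sketch, assuming the text has already been segmented into token arrays:

    var natural = require('natural');
    var tfidf = new natural.TfIdf();

    // Pre-segmented token arrays (segmentation assumed to have happened elsewhere).
    tfidf.addDocument(['我', '搵', '緊', '游泳池'], 'doc1');
    tfidf.addDocument(['你們', '喺邊度'], 'doc2');

    // With two documents, a term that appears in only one of them gets a non-zero idf,
    // so listTerms() should no longer report 0 for everything.
    tfidf.listTerms(0).forEach(function (item) {
      console.log(item.term + ': ' + item.tfidf);
    });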

@dcsan

dcsan commented Mar 19, 2018

What do the two values in the list of terms mean? Is this a basic term frequency, or the inverse document frequency relative to the text?

FWIW, frequency word lists are a mixed bag for Chinese. I think Jieba has its own built in, which, while not trained on the most representative material, would at least match the same tokens...

@titanism

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer? A possible shape is sketched below.
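An untested sketch of how it might be wired up (assumptions: chinese-tokenizer is loaded from a local CC-CEDICT dictionary file via loadFile(), each returned token has a text property, and the installed version of natural exposes TfIdf's setTokenizer()):

    var natural = require('natural');
    // The dictionary path is an assumption; point it at wherever your CC-CEDICT file lives.
    var chineseTokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8');

    // Adapter: natural expects an object with tokenize(text) -> array of strings.
    var chineseTokenizer = {
      tokenize: function (text) {
        return chineseTokenize(text).map(function (token) {
          return token.text;
        });
      }
    };

    var tfidf = new natural.TfIdf();
    tfidf.setTokenizer(chineseTokenizer);
    tfidf.addDocument('中文測試', 's1');
    console.log(JSON.stringify(tfidf));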
