
Can TfIdf be used with Chinese? #212

Open
c941010623 opened this issue Jan 30, 2015 · 6 comments

Comments

@c941010623

My code is:

    // TfIdf from the natural package, given a Chinese string ("Chinese test").
    var natural = require('natural');
    var TfIdf = natural.TfIdf;
    var tfidf = new TfIdf();

    tfidf.addDocument('中文測試', 's1');
    var s = JSON.stringify(tfidf);
    console.log(s);

@kkoch986
Member

kkoch986 commented Feb 2, 2015

I haven't personally tried TfIdf with Chinese; at first glance it doesn't seem to work.

You probably need to change the tokenizer, but I don't think we have a Chinese tokenizer yet. I'll leave this open for a while to see whether anyone else has experience using TfIdf this way. A rough idea of the direction is sketched below.
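Something along these lines is what I have in mind (untested; it assumes the TfIdf instance exposes a setTokenizer() hook that accepts any object with a tokenize(text) method returning an array of strings — check the version of natural you have installed):

    // Plug a custom tokenizer into TfIdf.
    var natural = require('natural');
    var tfidf = new natural.TfIdf();

    // Hypothetical placeholder: a naive per-character splitter, only to illustrate the hook.
    // A real Chinese word segmenter should go here instead.
    var naiveChineseTokenizer = {
      tokenize: function (text) {
        return text.split('');
      }
    };

    tfidf.setTokenizer(naiveChineseTokenizer);
    tfidf.addDocument('中文測試', 's1');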

@mike820324

Is it possible to use something like the following libraries to first tokenize the Chinese sentence or document?
https://github.com/dotSlashLu/nodescws
https://github.com/yanyiwu/nodejieba

These should split a Chinese sentence into separate Chinese tokens.

For example, the string "中文測試", which means "Chinese test", would become the list ["中文", "測試"], i.e. ["Chinese", "test"]. A sketch of the idea follows.

@dcsan

dcsan commented Jun 24, 2017

@mike820324 did you get any further with this? I'm also using nodejieba on some Chinese NLP projects, but I'm not sure whether I should move the project to Python for NLTK etc.

@anton-bot
Contributor

I have no problem with the Chinese tokenizer, but the code still doesn't work. When I checked listTerms(), it assigned a tf-idf of zero to every term:

我: 0
搵: 0
緊: 0
游泳池: 0
你們: 0
喺邊度: 0

Is this a problem? How can I fix it?
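One possible explanation (it depends on how the installed version of natural computes idf): with only one document in the corpus, the inverse-document-frequency factor can come out as log(1/1) = 0, which zeroes every tf-idf score. Adding more documents should produce non-zero values. A minimal sketch, assuming the text has already been segmented into token arrays:

    var natural = require('natural');
    var tfidf = new natural.TfIdf();

    // Pre-segmented token arrays (segmentation assumed to have happened elsewhere).
    tfidf.addDocument(['我', '搵', '緊', '游泳池'], 'doc1');
    tfidf.addDocument(['你們', '喺邊度'], 'doc2');

    // With two documents, a term that appears in only one of them gets a non-zero idf,
    // so listTerms() should no longer report 0 for everything.
    tfidf.listTerms(0).forEach(function (item) {
      console.log(item.term + ': ' + item.tfidf);
    });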

@dcsan

dcsan commented Mar 19, 2018

What do the two values in the list of terms mean? Is this a basic term frequency, or the inverse document frequency relative to the text?

FWIW, frequency word lists are a mixed bag for Chinese. I think Jieba has its own built in, which, while not trained on the most representative material, would at least match the same tokens...

@titanism

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer? A possible shape is sketched below.
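An untested sketch of how it might be wired up (assumptions: chinese-tokenizer is loaded from a local CC-CEDICT dictionary file via loadFile(), each returned token has a text property, and the installed version of natural exposes TfIdf's setTokenizer()):

    var natural = require('natural');
    // The dictionary path is an assumption; point it at wherever your CC-CEDICT file lives.
    var chineseTokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8');

    // Adapter: natural expects an object with tokenize(text) -> array of strings.
    var chineseTokenizer = {
      tokenize: function (text) {
        return chineseTokenize(text).map(function (token) {
          return token.text;
        });
      }
    };

    var tfidf = new natural.TfIdf();
    tfidf.setTokenizer(chineseTokenizer);
    tfidf.addDocument('中文測試', 's1');
    console.log(JSON.stringify(tfidf));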
