
about chinese #177

Open
ahl5esoft opened this issue Jul 24, 2014 · 7 comments
Comments

@ahl5esoft

How do I use the classifier with Chinese text?

@ahl5esoft
Author

classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('优香岛桂皮', '干货');
classifier.addDocument('苗家辣妹辣椒', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.addDocument('三达SD-156A双重过滤烟嘴', '小五金');
classifier.addDocument('波斯BS-I3091测电笔', '小五金');
classifier.train();

classifier.classify('紫菜')
=> 干货

classifier.classify('双重过滤')
=> 干货

classifier.classify('波斯')
=> 干货

Why do these all come back as 干货?

@kkoch986 kkoch986 added Bugs and removed Bugs labels Jul 24, 2014
@kkoch986
Member

The classifier relies on a tokenizer and stemmer, so that could be part of the problem. I don't think we have a Chinese stemmer at the moment, and if you use the English one it will use the English tokenizer, which probably won't help much.

This is part of the reason why we need #159: it could help ensure that when a tokenizer is used, it's the correct one for the language.

@mike820324

I think the Chinese language doesn't need stemming at all, but tokenizing a Chinese document will be a very painful job.
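When no real segmenter is at hand, one crude workaround is to split the text into overlapping character bigrams. This is not linguistically correct segmentation, but bigram features are often good enough for bag-of-words classification. A minimal sketch (the function name `bigramTokenize` is illustrative, not part of natural's API):

```javascript
// Crude fallback tokenizer: overlapping character bigrams.
// Not real word segmentation, but usable as classifier features.
function bigramTokenize(text) {
  // Strip whitespace and punctuation; keep CJK characters and alphanumerics.
  var chars = Array.from(text.replace(/[\s\p{P}]/gu, ''));
  if (chars.length < 2) return chars;
  var tokens = [];
  for (var i = 0; i < chars.length - 1; i++) {
    tokens.push(chars[i] + chars[i + 1]);
  }
  return tokens;
}

console.log(bigramTokenize('南京市长江大桥'));
// ['南京', '京市', '市长', '长江', '江大', '大桥']
```

The resulting array can be fed straight to `classifier.addDocument`, which accepts pre-tokenized input.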

@smilechun

Not sure if it's possible, but I tried applying nodejieba to classification and it seems to work.

var nodejieba = require("nodejieba");
var natural = require('natural');
var classifier = new natural.BayesClassifier();

classifier.addDocument(nodejieba.cut("红掌拨清波"), 'poem');
classifier.addDocument(nodejieba.cut("想睇戲"), 'action');
classifier.addDocument(nodejieba.cut("南京市长江大桥"), 'place');
classifier.train();

console.log(classifier.classify(nodejieba.cut('红掌拨清波')));
console.log(classifier.classify(nodejieba.cut("想睇戲")));
console.log(classifier.classify(nodejieba.cut('南京市长江大桥睇戲')));
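This works because `addDocument` and `classify` accept pre-tokenized arrays, so the classifier never invokes its English tokenizer on the Chinese text. For readers without nodejieba installed, here is a self-contained sketch of the same idea: a tiny multinomial naive Bayes over token arrays. This is my own illustration, not natural's implementation:

```javascript
// Minimal multinomial naive Bayes over pre-tokenized documents.
// Shows why feeding token arrays (e.g. nodejieba.cut output)
// sidesteps the English tokenizer entirely.
function NaiveBayes() {
  this.counts = {};      // label -> token -> count
  this.totals = {};      // label -> total token count
  this.docs = {};        // label -> document count
  this.vocab = new Set();
  this.numDocs = 0;
}

NaiveBayes.prototype.addDocument = function (tokens, label) {
  this.counts[label] = this.counts[label] || {};
  this.totals[label] = this.totals[label] || 0;
  this.docs[label] = (this.docs[label] || 0) + 1;
  this.numDocs += 1;
  for (var t of tokens) {
    this.counts[label][t] = (this.counts[label][t] || 0) + 1;
    this.totals[label] += 1;
    this.vocab.add(t);
  }
};

NaiveBayes.prototype.classify = function (tokens) {
  var best = null, bestScore = -Infinity;
  for (var label in this.counts) {
    // log prior + log likelihoods with add-one smoothing
    var score = Math.log(this.docs[label] / this.numDocs);
    for (var t of tokens) {
      var c = this.counts[label][t] || 0;
      score += Math.log((c + 1) / (this.totals[label] + this.vocab.size));
    }
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
};

// Hand-segmented tokens stand in for a segmenter's output.
var nb = new NaiveBayes();
nb.addDocument(['红掌', '拨', '清波'], 'poem');
nb.addDocument(['南京', '市', '长江', '大桥'], 'place');
console.log(nb.classify(['长江', '大桥'])); // 'place'
```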

@loretoparisi

So basically, would it be possible to add a TokenizerZh by using nodejieba.cut as the tokenization function override?
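A sketch of what such a wrapper could look like, matching the general shape of natural's tokenizers (a class exposing tokenize(text) returning an array of strings). The segmenting function is injected so any backend works (nodejieba.cut, chinese-tokenizer, etc.); the class name `TokenizerZh` is hypothetical, not part of natural:

```javascript
// Hypothetical TokenizerZh: wraps any Chinese segmenting function
// behind the tokenize(text) interface natural's tokenizers expose.
function TokenizerZh(cut) {
  this.cut = cut; // e.g. nodejieba.cut
}

TokenizerZh.prototype.tokenize = function (text) {
  return this.cut(text).filter(function (t) {
    return t.trim().length > 0; // drop whitespace-only tokens
  });
};

// Usage with a stand-in segmenter (replace with nodejieba.cut):
var fakeCut = function (text) { return text.split(''); };
var tokenizer = new TokenizerZh(fakeCut);
console.log(tokenizer.tokenize('想睇戲')); // ['想', '睇', '戲']
```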

@titanism

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?

@Hugo-ter-Doest
Collaborator

Will look into this.
