Replies: 3 comments
---
From chatting about this on Discord: the data/social issue with doing this now is that the data will be split, which will slow the take-off of collection in Chinese. Are there technical/data-quality issues around this too? For example, does one script map to fewer GPT tokens? Would transliterating Simplified to Traditional (or vice versa) exclude people or harm data collection more than having more users would benefit? Is it something we can do later on, or are we causing data pollution by not doing it now?
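To get a quick empirical read on the token question, here's a minimal sketch assuming OpenAI's `tiktoken` package and two illustrative parallel sentences (counts will differ across tokenizers, so this only probes one):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4

# Illustrative parallel sentences: "A mouse is a computer input device."
simplified = "鼠标是一种计算机输入设备。"   # zh-hans
traditional = "滑鼠是一種電腦輸入裝置。"    # zh-hant

for label, text in [("zh-hans", simplified), ("zh-hant", traditional)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
```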
---
It depends on our requirements for data quality. The relationship between Simplified and Traditional Chinese is more like translation than transliteration: a one-to-one mapping between characters cannot simply be relied upon, and the differences in word choice I mentioned above are also involved. I think it is totally fine to handle this in post-processing, as long as we have a reliable machine translation solution.
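As a concrete example of such post-processing, here's a rough sketch using OpenCC, a rule-based converter rather than the full machine-translation solution mentioned above; the package name `opencc-python-reimplemented` and the config names refer to one common Python binding. The `s2twp` config also rewrites regional vocabulary (covering cases like 鼠标 vs 滑鼠), though rule tables cannot resolve every ambiguity:

```python
# pip install opencc-python-reimplemented
from opencc import OpenCC

s2t = OpenCC("s2twp")  # Simplified -> Traditional (Taiwan), incl. phrase mapping
t2s = OpenCC("tw2sp")  # the reverse direction

print(s2t.convert("鼠标"))  # expected: 滑鼠
print(t2s.convert("滑鼠"))  # expected: 鼠标
```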
---
For better data consistency, I suggest subdividing Chinese into two datasets:

- `zh-hans` (Chinese written in the Simplified Chinese script)
- `zh-hant` (Chinese written in the Traditional Chinese script)

Although their grammars are similar, a large number of characters do not overlap. For example, 国 ("nation" in SC) and 國 ("nation" in TC) have the same meaning but different Unicode codepoints. When training the model, these different tokens will be mapped to different embeddings, yet there is no standard "normalization mapping" between Simplified and Traditional Chinese, because wording conventions also differ, e.g. 鼠标 vs. 滑鼠 (both meaning "computer mouse"). The relationship between Simplified and Traditional Chinese should therefore be seen as translation rather than simple case-folding.

As mentioned in IETF BCP 47, some other languages, such as Serbian, have a similar issue. I'm not sure whether the difference there is as significant as between the two Chinese writing systems, but I would guess the impact on the trained model is minor, since their alphabets are far smaller than the Chinese character inventory. In fact, Simplified and Traditional Chinese characters make up a large part of the vocabulary files of mBERT and XLM.
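A small sketch to make the embedding point concrete, assuming the `transformers` package and the public `bert-base-multilingual-cased` checkpoint (mBERT, mentioned above):

```python
# pip install transformers
from transformers import AutoTokenizer

# 国 (U+56FD) and 國 (U+570B) are distinct codepoints...
print(hex(ord("国")), hex(ord("國")))  # 0x56fd 0x570b

# ...and therefore distinct tokens with distinct embedding rows in mBERT;
# no case-folding-style normalization maps one onto the other.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.convert_tokens_to_ids("国"), tok.convert_tokens_to_ids("國"))
# (Assuming both single-character tokens are in the vocabulary; otherwise
# the [UNK] id would be returned for the missing one.)
```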
I have many years of experience in multilingual LMs and information retrieval, and I would really like to contribute to this project. In addition to technical contributions, I'm also willing to help with the Chinese translation and spread the word in the local NLP community.