Replies: 3 comments
---
From chatting about this on Discord: the data/social issue with doing this now is that the data will be split, which will slow the take-off of collection in Chinese. Are there technical/data-quality issues around this too? For example, does one script map to fewer GPT tokens? Would transliterating Simplified to Traditional (or vice versa) exclude people or harm data collection more than having more users would benefit? Is it something we can do later on, or are we causing data pollution by not doing it now?
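To get a quick empirical read on the token question, here's a minimal sketch assuming OpenAI's `tiktoken` package and two illustrative parallel sentences (counts will differ across tokenizers, so this only probes one):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4

# Illustrative parallel sentences: "A mouse is a computer input device."
simplified = "鼠标是一种计算机输入设备。"   # zh-hans
traditional = "滑鼠是一種電腦輸入裝置。"    # zh-hant

for label, text in [("zh-hans", simplified), ("zh-hant", traditional)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")
```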
---
It depends on our requirements for data quality. The relationship between Simplified and Traditional Chinese is more like translation than transliteration: a one-to-one mapping between characters cannot simply be relied upon, and the differences in word choice I mentioned above are also involved. I think it is totally fine to handle this in post-processing, as long as we have a reliable machine translation solution.
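As a concrete example of such post-processing, here's a rough sketch using OpenCC, a rule-based converter rather than the full machine-translation solution mentioned above; the package name `opencc-python-reimplemented` and the config names refer to one common Python binding. The `s2twp` config also rewrites regional vocabulary (covering cases like 鼠标 vs 滑鼠), though rule tables cannot resolve every ambiguity:

```python
# pip install opencc-python-reimplemented
from opencc import OpenCC

s2t = OpenCC("s2twp")  # Simplified -> Traditional (Taiwan), incl. phrase mapping
t2s = OpenCC("tw2sp")  # the reverse direction

print(s2t.convert("鼠标"))  # expected: 滑鼠
print(t2s.convert("滑鼠"))  # expected: 鼠标
```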
---
For better data consistency, I suggest subdividing Chinese into two datasets:

- `zh-hans` (Chinese written in the Simplified Chinese script)
- `zh-hant` (Chinese written in the Traditional Chinese script)

Although their grammars are similar, a large number of characters do not overlap. For example, 国 ("nation" in SC) and 國 ("nation" in TC) have the same meaning but different Unicode codepoints. When training the model, these different tokens will be mapped to different embeddings, yet there is no standard "normalization mapping" between Simplified and Traditional Chinese, because wording conventions also differ, e.g. 鼠标 vs. 滑鼠 (both meaning "computer mouse"). The relationship between Simplified and Traditional Chinese should therefore be seen as translation rather than simple case-folding.

As mentioned in IETF BCP 47, some other languages, such as Serbian, have a similar issue. I'm not sure whether the difference there is as significant as between the two Chinese writing systems, but I would guess the impact on the trained model is minor, since their alphabets are far smaller than the Chinese character inventory. In fact, Simplified and Traditional Chinese characters make up a large part of the vocabulary files of mBERT and XLM.
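A small sketch to make the embedding point concrete, assuming the `transformers` package and the public `bert-base-multilingual-cased` checkpoint (mBERT, mentioned above):

```python
# pip install transformers
from transformers import AutoTokenizer

# 国 (U+56FD) and 國 (U+570B) are distinct codepoints...
print(hex(ord("国")), hex(ord("國")))  # 0x56fd 0x570b

# ...and therefore distinct tokens with distinct embedding rows in mBERT;
# no case-folding-style normalization maps one onto the other.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.convert_tokens_to_ids("国"), tok.convert_tokens_to_ids("國"))
# (Assuming both single-character tokens are in the vocabulary; otherwise
# the [UNK] id would be returned for the missing one.)
```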
I have many years of experience in multilingual LMs and information retrieval, and I would really like to contribute to this project. In addition to technical contributions, I'm also willing to help with the Chinese translation and spread the word in the local NLP community.