I have the following feature request: Often one needs to finetune a model to add diacritics. Luckily, we can finetune with `--warmstart ... --codec.keep_loaded False`. In such cases the actual witnesses of the diacritics are usually still sparse in the GT. So it would likely be helpful if the weights of the additional characters / codepoints could be initialized from those of characters that are similar looking or similar in function. Perhaps as an option `--codec.init_new_from_old '["à": "a", "ś": "s" ...]'` ...
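To make the request concrete, here is a minimal sketch of what such an initialization could do. It assumes the final dense layer holds one weight row per codec character; the helper name `init_new_from_old`, the row-per-character layout, and the plain-list representation are all assumptions for illustration, not the existing implementation.

```python
import random


def init_new_from_old(old_chars, old_rows, new_chars, mapping, scale=0.01, seed=0):
    """Build output-layer rows for new_chars (hypothetical helper).

    Characters already in the old codec keep their trained row.
    New characters listed in `mapping` start from a copy of the row of
    their source character; everything else gets small random values.
    """
    rng = random.Random(seed)
    index = {c: i for i, c in enumerate(old_chars)}
    width = len(old_rows[0])
    new_rows = []
    for c in new_chars:
        # Prefer the character's own old row, else the mapped source character.
        source = c if c in index else mapping.get(c)
        if source in index:
            new_rows.append(list(old_rows[index[source]]))  # copy, do not alias
        else:
            new_rows.append([rng.uniform(-scale, scale) for _ in range(width)])
    return new_rows


# Toy example: two old characters, three added ones (one without a mapping).
old_chars = ["a", "s"]
old_rows = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
new_chars = ["a", "s", "à", "ś", "ø"]
rows = init_new_from_old(old_chars, old_rows, new_chars, {"à": "a", "ś": "s"})
```

Note that an exact copy makes "à" and "a" start with identical outputs, so the sparse GT still has to pull them apart during finetuning; adding a little noise to the copied row would be one possible variation.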
to automatically choose similar characters from the existing codec?
I would recommend against that. Those are purely visual confusions: the characters involved all have very different semantics. In contrast, what we usually want here are characters that differ only slightly, both visually and semantically. Notice how there are no diacritics in the Unicode confusions, for example. But if you init an a from an α (Greek) or an а (Cyrillic), then you give the system the wrong hints, making confusion of these pairs more likely at inference. I would say this is only warranted when the respective old characters, and the rest of their respective charset, can no longer appear together with the new characters.
Another experiment that might be worthwhile beyond the pure initialization: regularize the dense output weights such that these confusables stay close to each other.
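One way to read this regularization idea is as an extra loss term that penalizes the squared distance between the output rows of each new/old pair. The sketch below assumes the same one-row-per-character layout; `confusable_penalty` and its `strength` parameter are hypothetical names, and in a real training setup the term would have to be expressed in the framework's own tensors so that it contributes gradients.

```python
def confusable_penalty(rows, char_index, pairs, strength=1e-3):
    """Sum of squared distances between paired output rows (hypothetical helper).

    rows:       one weight row (list of floats) per codec character
    char_index: maps a character to its row position
    pairs:      (new_char, old_char) tuples that should stay close
    """
    total = 0.0
    for new_char, old_char in pairs:
        new_row = rows[char_index[new_char]]
        old_row = rows[char_index[old_char]]
        total += sum((n - o) ** 2 for n, o in zip(new_row, old_row))
    return strength * total


# Toy example: the rows differ by 2.0 in one component.
toy_rows = [[1.0, 0.0], [1.0, 2.0]]
toy_index = {"a": 0, "à": 1}
penalty = confusable_penalty(toy_rows, toy_index, [("à", "a")], strength=1.0)
```

Added to the training loss, a term like this lets the paired rows drift apart only as far as the data demands, with `strength` controlling the trade-off.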