
featreq: when warmstart-training, init weights of new chars from existing ones #319

Open · bertsky opened this issue Apr 27, 2022 · 2 comments
Labels: enhancement (New feature or request), training (Concerns how to achieve good model quality)

Comments

bertsky (Collaborator) commented Apr 27, 2022

I have the following feature request: Often one needs to finetune a model to add diacritics. Luckily, we can finetune with --warmstart ... --codec.keep_loaded False. In such cases the actual witnesses of the diacritics are usually still sparse in the GT. So it would likely be helpful if the weights of the additional characters / codepoints could be initialized from those of characters that are similar in appearance or function. Perhaps as an option --codec.init_new_from_old '{"à": "a", "ś": "s", ...}' ...
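For illustration, a minimal sketch of what such an initialization could look like on the final dense layer, in plain NumPy. The function name, the `codec` dict, and the weight shapes are assumptions made for this sketch, not Calamari's actual internals:

```python
import numpy as np

def init_new_outputs(W, b, codec, init_map, rng=None):
    """Extend a dense output layer (W: [hidden, n_old], b: [n_old])
    with one column per new character, copying the weights of a
    similar existing character where a mapping is given.

    codec:    dict mapping existing chars to their output indices
    init_map: dict mapping each new char to an existing one,
              e.g. {"à": "a", "ś": "s"}
    """
    rng = rng or np.random.default_rng(0)
    new_chars = list(init_map)
    hidden, n_old = W.shape
    # default: small random init for all new columns
    W_new = np.concatenate(
        [W, rng.normal(0.0, 0.01, (hidden, len(new_chars)))], axis=1)
    b_new = np.concatenate([b, np.zeros(len(new_chars))])
    for i, char in enumerate(new_chars, start=n_old):
        src = init_map[char]
        if src in codec:
            j = codec[src]
            # copy the old character's weights, plus a small jitter
            # so the two outputs can diverge during finetuning
            W_new[:, i] = W[:, j] + rng.normal(0.0, 1e-3, hidden)
            b_new[i] = b[j]
        codec[char] = i
    return W_new, b_new, codec
```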

@andbue added the enhancement (New feature or request) label Apr 27, 2022
andbue (Member) commented Apr 28, 2022

Great idea! Maybe we could integrate Unicode confusables (for example: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=calamari&r=None – data files are available as well: http://www.unicode.org/reports/tr39/#Data_Files) to automatically choose similar characters from the existing codec? Would be interesting to see how this affects training time and accuracy!
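For reference, a rough sketch of loading that data file, assuming the `confusables.txt` layout documented in UTS #39 (`source ; target ; type # comment`, hex codepoints separated by spaces):

```python
def load_confusables(path="confusables.txt"):
    """Parse the UTS #39 confusables.txt into a dict mapping each
    source character/sequence to its confusable target sequence."""
    table = {}
    with open(path, encoding="utf-8-sig") as f:  # file starts with a BOM
        for line in f:
            line = line.split("#")[0].strip()    # drop comments
            if not line:
                continue
            parts = [fld.strip() for fld in line.split(";")]
            if len(parts) < 2:
                continue
            src = "".join(chr(int(cp, 16)) for cp in parts[0].split())
            tgt = "".join(chr(int(cp, 16)) for cp in parts[1].split())
            table[src] = tgt
    return table
```

One could then, for each new character missing from the codec, look up a confusable that is already present and use it as the initialization source.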

bertsky (Collaborator, author) commented Apr 28, 2022

> Maybe we could integrate unicode confusables (for example: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=calamari&r=None – data files

Oh, what a nice resource!

> to automatically choose similar characters from the existing codec?

I would recommend against that. Those are purely visual confusions – the characters all have very different semantics. In contrast, what we usually want here are confusions that differ only slightly, both visually and semantically. Notice how there are no diacritics among the Unicode confusables, for example. But if you init an a from an α or an а, then you give the system the wrong hints (making inference confusion of these pairs more likely). I would say this is only warranted when the respective old characters cannot reappear together with the new characters anymore (and neither can any other character from their respective charset).

Another experiment that might be worthwhile beyond the pure initialization: regularize the dense output weights such that these confusables stay close to each other.
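A minimal sketch of what such a regularizer could look like as an auxiliary loss term in TensorFlow; the function, the pair list, and the `strength` parameter are hypothetical, not an existing Calamari option:

```python
import tensorflow as tf

def confusable_tie_loss(W, pairs, strength=1e-3):
    """L2 penalty pulling the output-weight columns of confusable
    character pairs towards each other.

    W:     output-layer kernel of shape [hidden, n_chars]
    pairs: list of (new_index, old_index) tuples from the codec
    """
    loss = tf.constant(0.0)
    for i, j in pairs:
        loss += tf.reduce_sum(tf.square(W[:, i] - W[:, j]))
    return strength * loss
```

This term would simply be added to the CTC training loss, so the new outputs stay near their initialization source unless the GT provides evidence to pull them apart.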

@bertsky added the training (Concerns how to achieve good model quality) label Oct 2, 2024