
featreq: when warmstart-training, init weights of new chars from existing ones #319

Open · bertsky opened this issue Apr 27, 2022 · 2 comments
Labels: enhancement (New feature or request), training (Concerns how to achieve good model quality)

Comments

bertsky (Collaborator) commented Apr 27, 2022

I have the following feature request: Often one needs to finetune a model to add diacritics. Luckily, we can finetune with --warmstart ... --codec.keep_loaded False. In such cases the actual witnesses of the diacritics are usually still sparse in the GT. So it would likely be helpful if the weights of the additional characters / codepoints could be initialized from those of characters that are similar in appearance or function. Perhaps as an option --codec.init_new_from_old '{"à": "a", "ś": "s", ...}' ...
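For illustration, a minimal sketch of what such an initialization could look like on the final dense layer, in plain NumPy. The function name, the `codec` dict, and the weight shapes are assumptions made for this sketch, not Calamari's actual internals:

```python
import numpy as np

def init_new_outputs(W, b, codec, init_map, rng=None):
    """Extend a dense output layer (W: [hidden, n_old], b: [n_old])
    with one column per new character, copying the weights of a
    similar existing character where a mapping is given.

    codec:    dict mapping existing chars to their output indices
    init_map: dict mapping each new char to an existing one,
              e.g. {"à": "a", "ś": "s"}
    """
    rng = rng or np.random.default_rng(0)
    new_chars = list(init_map)
    hidden, n_old = W.shape
    # default: small random init for all new columns
    W_new = np.concatenate(
        [W, rng.normal(0.0, 0.01, (hidden, len(new_chars)))], axis=1)
    b_new = np.concatenate([b, np.zeros(len(new_chars))])
    for i, char in enumerate(new_chars, start=n_old):
        src = init_map[char]
        if src in codec:
            j = codec[src]
            # copy the old character's weights, plus a small jitter
            # so the two outputs can diverge during finetuning
            W_new[:, i] = W[:, j] + rng.normal(0.0, 1e-3, hidden)
            b_new[i] = b[j]
        codec[char] = i
    return W_new, b_new, codec
```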

@andbue added the enhancement (New feature or request) label Apr 27, 2022
andbue (Member) commented Apr 28, 2022

Great idea! Maybe we could integrate Unicode confusables (for example: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=calamari&r=None – data files are available as well: http://www.unicode.org/reports/tr39/#Data_Files) to automatically choose similar characters from the existing codec? Would be interesting to see how this affects training time and accuracy!
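For reference, a rough sketch of loading that data file, assuming the `confusables.txt` layout documented in UTS #39 (`source ; target ; type # comment`, hex codepoints separated by spaces):

```python
def load_confusables(path="confusables.txt"):
    """Parse the UTS #39 confusables.txt into a dict mapping each
    source character/sequence to its confusable target sequence."""
    table = {}
    with open(path, encoding="utf-8-sig") as f:  # file starts with a BOM
        for line in f:
            line = line.split("#")[0].strip()    # drop comments
            if not line:
                continue
            parts = [fld.strip() for fld in line.split(";")]
            if len(parts) < 2:
                continue
            src = "".join(chr(int(cp, 16)) for cp in parts[0].split())
            tgt = "".join(chr(int(cp, 16)) for cp in parts[1].split())
            table[src] = tgt
    return table
```

One could then, for each new character missing from the codec, look up a confusable that is already present and use it as the initialization source.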

bertsky (Collaborator, author) commented Apr 28, 2022

> Maybe we could integrate unicode confusables (for example: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=calamari&r=None – data files

Oh, what a nice resource!

> to automatically choose similar characters from the existing codec?

I would recommend against that. Those are purely visual confusions – the characters all have very different semantics. In contrast, what we usually want here are confusions that differ only slightly, both visually and semantically. Notice how there are no diacritics among the Unicode confusables, for example. But if you init an a from an α or an а, then you give the system the wrong hints (making inference confusion of these pairs more likely). I would say this is only warranted when the respective old characters cannot reappear together with the new characters anymore (and neither can any other character from their respective charset).

Another experiment that might be worthwhile beyond the pure initialization: regularize the dense output weights such that these confusables stay close to each other.
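A minimal sketch of what such a regularizer could look like as an auxiliary loss term in TensorFlow; the function, the pair list, and the `strength` parameter are hypothetical, not an existing Calamari option:

```python
import tensorflow as tf

def confusable_tie_loss(W, pairs, strength=1e-3):
    """L2 penalty pulling the output-weight columns of confusable
    character pairs towards each other.

    W:     output-layer kernel of shape [hidden, n_chars]
    pairs: list of (new_index, old_index) tuples from the codec
    """
    loss = tf.constant(0.0)
    for i, j in pairs:
        loss += tf.reduce_sum(tf.square(W[:, i] - W[:, j]))
    return strength * loss
```

This term would simply be added to the CTC training loss, so the new outputs stay near their initialization source unless the GT provides evidence to pull them apart.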

@bertsky added the training (Concerns how to achieve good model quality) label Oct 2, 2024