I trained a model for Spanish spellchecking and I'm using it to correct OCR output (I'm digitizing some very old typewritten documents and want to improve the resulting text). The problem is that the model shows some candidates when I use GetCandidates, but it doesn't apply them when I use FixFragment. I wonder if it has something to do with context (n-grams and so on) or with the symbols in the sentence.
Here is an example:
text = 'posee respectívámente en las localí&ades de Carlos M. Naón'
corrector.GetCandidates(['localí&ades'],0) -> ('localidades', 'localí&ades', 'localicades', 'localizades')
corrector.FixFragment(text) -> 'posee respectívamente en las local&ades de Carlos M. Naón'
It corrects "respectívámente" (to "respectívamente"), but it only deletes the 'í' in "localí&ades". Maybe it's related to the tokens it uses to decide where a word starts and ends.
Another example without special characters:
text='con anterioridad a la sanción de la ley'
corrector.FixFragment(text) -> 'con anterioridad la la canción de la ley'
It replaces "sanción" with "canción", but when I call GetCandidates, "sanción" is ranked first.
corrector.GetCandidates(['sanción'],0) -> ('sanción', 'canción', 'sanación', 'sanchón', 'sención', 'anción', 'sancion', 'sanión', 'kanción', 'sunción', 'sandión', 'sancián', 'sansión', 'sanció')
Currently it doesn't expect special tokens inside words. You can try replacing all such tokens inside words with some character; in that case it should start correcting them. I'll think about how to handle this case better.
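A minimal sketch of that workaround, assuming the preprocessing is done in Python before the text reaches FixFragment. The helper name `mask_inner_symbols` and the placeholder `x` are hypothetical choices, not part of the library; any character from the model's alphabet could be used instead. It replaces only symbols that sit directly between two letters, so sentence punctuation such as the period in "M." is left alone. Note that it would also replace legitimate intra-word punctuation (hyphens, apostrophes), so those may need to be excluded.

```python
import re
import unicodedata

# A symbol (not a letter, digit, underscore, or whitespace) that has a
# letter immediately before and after it. [^\W\d_] matches any Unicode letter.
INNER_SYMBOL = re.compile(r"(?<=[^\W\d_])[^\w\s](?=[^\W\d_])")


def mask_inner_symbols(text: str, placeholder: str = "x") -> str:
    """Replace symbols found inside words with a placeholder character.

    `placeholder` is an arbitrary choice for illustration.
    """
    # Normalize so that accented letters are single code points.
    text = unicodedata.normalize("NFC", text)
    return INNER_SYMBOL.sub(placeholder, text)


if __name__ == "__main__":
    text = "posee respectívámente en las localí&ades de Carlos M. Naón"
    print(mask_inner_symbols(text))
```

The masked string can then be passed to corrector.FixFragment as usual; whether the corrector then picks the intended word still depends on the model.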
Thanks, I'll try that.
Another workaround I'm considering, though I don't know whether it will work, is to retrain the model with the symbols added to the alphabet .txt file, so that it recognizes them during spellchecking.
Yes, it should work even better. You can try on a small corpus first and let me know if it helps.
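To try this on a small corpus, one option is to collect the symbols that actually occur inside words and append them to the alphabet before retraining. The sketch below works on plain strings and assumes the alphabet file is simply a sequence of allowed characters; the function names and the `min_count` threshold are hypothetical, and reading or writing the actual files is left out.

```python
import re
from collections import Counter

# A symbol (not a letter, digit, underscore, or whitespace) that has a
# letter immediately before and after it.
INNER_SYMBOL = re.compile(r"(?<=[^\W\d_])[^\w\s](?=[^\W\d_])")


def inner_symbols(corpus: str) -> Counter:
    """Count every symbol that appears between two letters in the corpus."""
    return Counter(INNER_SYMBOL.findall(corpus))


def extend_alphabet(alphabet: str, corpus: str, min_count: int = 1) -> str:
    """Append intra-word symbols seen at least `min_count` times.

    Symbols already present in the alphabet are skipped; new ones are
    appended in order of decreasing frequency.
    """
    counts = inner_symbols(corpus)
    extra = [
        symbol
        for symbol, count in counts.most_common()
        if count >= min_count and symbol not in alphabet
    ]
    return alphabet + "".join(extra)


if __name__ == "__main__":
    sample = "localí&ades loca&idad pe#ro"
    print(extend_alphabet("abc", sample))
```

Raising `min_count` keeps rare OCR noise out of the alphabet, at the cost of leaving those symbols unrecognized.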
Hi everybody,
I think this issue is similar to #85
Thanks for the help!