add support for romanisation #267
Comments
Note that I am unsure how simple/hard romanisation is depending on the language, since I have zero experience with languages that need this sort of preprocessing. So any solution making it into RapidFuzz would need to be:
Depending on the amount of work this requires, it might make sense to make this a separate project. This is really not an integral step of the matching but a preprocessing step, which is likely helpful to users in and of itself (probably some projects for this already exist). I would be happy to mention these solutions in my documentation to help users coming from a language benefiting from romanisation.
This feels out-of-scope for RapidFuzz, because transcribing non-Roman languages is a totally separate problem-space. I think users should just do it separately and pass in the inputs to RapidFuzz, because then they will have complete freedom of implementation – there are many ways to transcribe, each with different tradeoffs, and none are perfect. I'll give an example for Japanese, but a similar approach could be taken for Chinese.
Getting the pronunciation of Japanese text
Getting the phonetic transcriptions for Japanese is a straightforward process, but you'll need some pretty heavy dependencies for it.
Installation
pip install fugashi
pip install unidic
# Warning: the download for UniDic is around 770 MB!
python -m unidic download
Usage
from fugashi import GenericTagger
import unidic
tagger = GenericTagger('-d "{}"'.format(unidic.DICDIR))
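def get_pronunciation(text, tagger):
    # Concatenate the pronunciation of every token in the text.
    acc = ""
    pron_index = 9  # index of the pronunciation field in the token features
    for word in tagger(text):
        # Fall back to the surface form when the token has no pronunciation field.
        pron = (
            word.feature[pron_index]
            if len(word.feature) > pron_index
            else word.surface
        )
        # "*" marks a missing value, so use the surface form here as well.
        if pron == "*":
            pron = word.surface
        acc = acc + pron
    return acc

print(get_pronunciation("東京に住む。", tagger))
# "トーキョーニスム。"
From there, you'd need a separate library to map the phonetic (katakana) characters to Roman characters – but actually just getting them as far as phonetic characters could be enough for your purposes.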
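To give a rough idea of what that last katakana-to-Roman step involves, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: `KATAKANA_TO_ROMAJI` and `katakana_to_romaji` are hypothetical names, the table covers only the kana needed for the example string, and writing the long vowel mark as a repeated vowel is just one of several conventions. A dedicated library would be the sensible choice for real input.

```python
# Hypothetical, deliberately tiny table: only the kana used in the demo below.
KATAKANA_TO_ROMAJI = {
    "キョ": "kyo",  # two-character combination, checked before single kana
    "ト": "to",
    "キ": "ki",
    "ニ": "ni",
    "ス": "su",
    "ム": "mu",
}
LONG_VOWEL_MARK = "ー"
VOWELS = "aiueo"


def katakana_to_romaji(text: str) -> str:
    """Map katakana to Roman characters; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        pair = text[i:i + 2]
        if ch == LONG_VOWEL_MARK and out and out[-1][-1] in VOWELS:
            # Assumed convention: lengthen by repeating the previous vowel.
            out.append(out[-1][-1])
            i += 1
        elif pair in KATAKANA_TO_ROMAJI:
            out.append(KATAKANA_TO_ROMAJI[pair])
            i += 2
        elif ch in KATAKANA_TO_ROMAJI:
            out.append(KATAKANA_TO_ROMAJI[ch])
            i += 1
        else:
            # Punctuation and anything outside the toy table is kept as-is.
            out.append(ch)
            i += 1
    return "".join(out)


print(katakana_to_romaji("トーキョーニスム。"))  # tookyoonisumu。
```

Even this toy version has to make decisions about long vowels and combined kana, which is part of why there is no single correct transcription.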
For Japanese, cutlet runs on top of fugashi and could probably be used in a preprocessing function. It's a bit heavy, needing unidic or unidic-lite, but maybe an example in the documentation would be enough?
I think a documentation section on options for romanisation for different languages would make sense. It is a fairly common thing people run into when matching non-Roman languages, so having some documentation for this would be useful.
As described in #7, metrics like the Levenshtein distance only make much sense for languages like Chinese if there is support for romanisation.
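A small sketch of why that is, assuming a hand-written toy pinyin table (`TOY_PINYIN` and `toy_romanise` are hypothetical and ignore tones) and a plain textbook Levenshtein implementation rather than RapidFuzz itself. The two words below are written with entirely different characters but sound alike, so the character-level distance treats them as completely different while the romanised distance does not.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical toy table with toneless pinyin for four characters only.
TOY_PINYIN = {"北": "bei", "京": "jing", "背": "bei", "景": "jing"}


def toy_romanise(text: str) -> str:
    """Replace known characters with pinyin; leave everything else untouched."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


a, b = "北京", "背景"
print(levenshtein(a, b))                              # 2
print(levenshtein(toy_romanise(a), toy_romanise(b)))  # 0
```

Dropping tones makes the two strings identical here, which may or may not be what a given application wants; that choice belongs to whichever romanisation scheme is used.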
@mrtolkien @lingvisa I opened this new issue to track support for romanisation. Note that:
This should be implemented as a separate preprocessing function for the current default_process method.
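One way to picture such a preprocessing function is as a step that runs before the usual normalisation. The sketch below is an assumption about the shape, not a proposed implementation: `make_processor`, `normalise`, and `toy_romanise` are hypothetical names, `normalise` is only a stand-in loosely modelled on what a default preprocessor tends to do (lowercase, drop non-alphanumerics, trim), and the romaniser is a two-entry toy table.

```python
from typing import Callable

# Hypothetical toy table; a real romaniser would come from a dedicated library.
TOY_PINYIN = {"上": "shang", "海": "hai"}


def toy_romanise(text: str) -> str:
    """Replace known characters with toneless pinyin; keep the rest."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


def normalise(text: str) -> str:
    """Stand-in normaliser: non-alphanumerics become spaces, then lowercase and trim."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return cleaned.lower().strip()


def make_processor(romanise: Callable[[str], str]) -> Callable[[str], str]:
    """Build a single preprocessing callable: romanise first, then normalise."""
    def processor(text: str) -> str:
        return normalise(romanise(text))
    return processor


process = make_processor(toy_romanise)
print(process("上海!"))       # shanghai
print(process("  Shanghai"))  # shanghai
```

Because the result is an ordinary string-to-string callable, it could be handed to any matching function that accepts a preprocessing callable, and the romaniser can be swapped per language without touching the matching code.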