Transformer models for Language Detection #138
---
Hi @ArtanisTheOne, thank you for your valuable input, very much appreciated. :) I'm always interested in alternative approaches to language detection, and I'm curious about your results. If you are worried about latency, you can try the Lingua implementations in Go or Rust. The Go one has exactly the same feature set as the Python one; the Rust one is a bit behind, but I'm working on catching it up.
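For reference, the Python API looks roughly like this (the Go port mirrors it); restricting the candidate language set and enabling low-accuracy mode are the main knobs for reducing latency. A minimal sketch, with an arbitrary four-language example:

```python
from lingua import Language, LanguageDetectorBuilder

# Restricting the candidate languages and enabling low-accuracy mode
# both reduce latency and memory consumption.
detector = (
    LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
    )
    .with_low_accuracy_mode()
    .build()
)

print(detector.detect_language_of("languages are awesome"))  # Language.ENGLISH
```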
---
Here's the folder for the broken version of the detector. It completed the script in 267 s (I commented out all detectors except the model and fastText, and implemented batching for CTranslate2 to speed it up).
---
I've been experimenting with language detection for a few months because a translation project of mine requires accurate detection: detecting the wrong language can send text down an incorrect pipeline and return nonsense to the person who requested the translation. Because of this, I've been looking into language detection libraries such as Lingua, but balancing accuracy with latency is incredibly hard, as you guys are well aware.
Lingua is amazing, and I thank the maintainers/developers for it, but in many cases it isn't usable due to detection latency, especially in a production environment where people expect results instantly (the downside of the internet, I guess).
So to solve this issue for myself, I fine-tuned mT5 (I've only used the small version so far), a pre-trained model from Google whose unsupervised pretraining phase covered 101 languages. It's still training right now, but early results (a day into training) show accuracy similar to Lingua's low-accuracy mode (using Lingua's three classes of test sets). I still need to conduct proper testing by incorporating the model's execution into your accuracy reporter (thanks for that, btw).
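For context, the fine-tuning setup looks roughly like the following. This is a minimal sketch, not my actual configuration: the two-sentence dataset is a toy stand-in for a real multilingual corpus, and the hyperparameters are illustrative.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Toy stand-in for a real multilingual corpus: text paired with an ISO code.
raw = Dataset.from_dict({
    "text": ["languages are awesome", "les langues sont géniales"],
    "lang": ["en", "fr"],
})

def preprocess(batch):
    # Treat language ID as text-to-text: the target sequence is the ISO code.
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(
        text_target=batch["lang"], truncation=True, max_length=4
    )["input_ids"]
    return enc

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-langid",  # placeholder output path
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_ds,
)
trainer.train()
```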
Once fine-tuned with the Hugging Face Trainer API, this model can be converted to CTranslate2, a library that provides outstanding support for Transformer inference and which I already use for my translation projects. This allows fast inference on CPU where a GPU may not be accessible (bringing optimized-CPU throughput close to unoptimized-GPU throughput). Thanks to CTranslate2's efficiency, latency is low for what you'd expect of a large machine-learning model pipeline, as its README itself states. And since it can run on CPU or GPU, those with a GPU can use it to speed up detections even more. I need to conduct further testing on throughput and accuracy (training is still running, so I can't take accurate throughput measurements yet).
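For anyone curious, the conversion and batched inference path looks roughly like this. A minimal sketch, assuming the fine-tuned checkpoint is saved at `./mt5-langid` (a placeholder path) and a recent CTranslate2 version:

```python
# Conversion (shell):
#   ct2-transformers-converter --model ./mt5-langid \
#       --output_dir ./mt5-langid-ct2 --quantization int8
import ctranslate2
import transformers

# inter_threads controls how many batches run in parallel on CPU.
translator = ctranslate2.Translator("./mt5-langid-ct2", device="cpu", inter_threads=4)
tokenizer = transformers.AutoTokenizer.from_pretrained("google/mt5-small")

texts = ["languages are awesome", "les langues sont géniales"]
batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]

# translate_batch runs the whole batch through the model at once,
# which is where most of the CPU throughput win comes from.
results = translator.translate_batch(batch, max_decoding_length=4)
print([tokenizer.convert_tokens_to_string(r.hypotheses[0]) for r in results])  # e.g. ['en', 'fr']
```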
To sum up:

**Pros**

- Batched inference is supported (via CTranslate2's `translate_batch` method)

**Cons**

- Getting confidence values is not straightforward (CTranslate2 does offer `score_batch` [this returns a per-token perplexity log score, not scores that sum to 1], but some limited testing of mine found some issues; see the sketch after this list)

**Neutral** (couldn't choose if it's a con or a pro)
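Here's one way a confidence-like value could be derived from `score_batch`. A minimal sketch, assuming the converted model lives at `./mt5-langid-ct2` (a placeholder path) and that the model was trained to emit ISO codes as targets; the exponentiated mean token log-probability lands between 0 and 1, but as noted above it is not a normalized distribution over languages.

```python
import math

import ctranslate2
import transformers

translator = ctranslate2.Translator("./mt5-langid-ct2", device="cpu")  # placeholder path
tokenizer = transformers.AutoTokenizer.from_pretrained("google/mt5-small")

def confidence(text: str, lang_code: str) -> float:
    # Score the language code as the target sequence given the input text.
    source = [tokenizer.convert_ids_to_tokens(tokenizer.encode(text))]
    target = [tokenizer.convert_ids_to_tokens(tokenizer.encode(lang_code))]
    result = translator.score_batch(source, target)[0]
    # Mean token log-prob, exponentiated into a 0..1 value. Scores for
    # different candidate codes will NOT sum to 1.
    return math.exp(sum(result.log_probs) / len(result.log_probs))

print(confidence("languages are awesome", "en"))
```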
Let me know if there's any interest in the results or the model; I just thought this is something that should be shared.
- mT5 paper
- CTranslate2 docs