Transformer models for Language Detection #138
---
Hi @ArtanisTheOne, thank you for your valuable input, very much appreciated. :) I'm always interested in alternative approaches to language detection, and I'm curious about your results. If you are worried about latency, you can try the Lingua implementations in Go or Rust. The Go one has exactly the same feature set as the Python one; the Rust one is a bit behind, but I'm working on catching it up.
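For reference, the Python API looks roughly like this (the Go port mirrors it); restricting the candidate language set and enabling low-accuracy mode are the main knobs for reducing latency. A minimal sketch, with an arbitrary four-language example:

```python
from lingua import Language, LanguageDetectorBuilder

# Restricting the candidate languages and enabling low-accuracy mode
# both reduce latency and memory consumption.
detector = (
    LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
    )
    .with_low_accuracy_mode()
    .build()
)

print(detector.detect_language_of("languages are awesome"))  # Language.ENGLISH
```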
---
Here's the folder for the broken version of the detector. It completed the script in 267 s (I commented out all detectors except the model and fastText, and implemented batching for CTranslate2 to speed it up).
---
I've been experimenting with language detection for a few months because a translation project of mine requires accurate detection: detecting the wrong language can send text down an incorrect pipeline and return nonsense to the person who requested the translation. Because of this, I've been looking into language detection libraries such as Lingua, but balancing accuracy with latency is incredibly hard, as you guys are well aware.
Lingua is amazing, and I thank the maintainers/developers for it, but in many cases it isn't usable due to detection latency, especially in a production environment where people expect results instantly (the downside of the internet, I guess).
So to solve this issue for myself, I fine-tuned mT5 (I've only used the small version so far), a pre-trained model from Google whose unsupervised pretraining phase covered 101 languages. It's still training right now, but early results (a day into training) show accuracy similar to Lingua's low-accuracy mode (using Lingua's three classes of test sets). I still need to conduct proper testing by incorporating the model's execution into your accuracy reporter (thanks for that, btw).
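For context, the fine-tuning setup looks roughly like the following. This is a minimal sketch, not my actual configuration: the two-sentence dataset is a toy stand-in for a real multilingual corpus, and the hyperparameters are illustrative.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Toy stand-in for a real multilingual corpus: text paired with an ISO code.
raw = Dataset.from_dict({
    "text": ["languages are awesome", "les langues sont géniales"],
    "lang": ["en", "fr"],
})

def preprocess(batch):
    # Treat language ID as text-to-text: the target sequence is the ISO code.
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(
        text_target=batch["lang"], truncation=True, max_length=4
    )["input_ids"]
    return enc

train_ds = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-langid",  # placeholder output path
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_ds,
)
trainer.train()
```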
Once fine-tuned with the Hugging Face Trainer API, this model can be converted to CTranslate2, a library that provides outstanding support for Transformer inference and which I already use for my translation projects. This allows fast inference on CPU where a GPU may not be accessible (bringing optimized-CPU throughput close to unoptimized-GPU throughput). Thanks to CTranslate2's efficiency, latency is low for what you'd expect of a large machine-learning model pipeline, as its README itself states. And since it can run on CPU or GPU, those with a GPU can use it to speed up detections even more. I need to conduct further testing on throughput and accuracy (training is still running, so I can't take accurate throughput measurements yet).
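For anyone curious, the conversion and batched inference path looks roughly like this. A minimal sketch, assuming the fine-tuned checkpoint is saved at `./mt5-langid` (a placeholder path) and a recent CTranslate2 version:

```python
# Conversion (shell):
#   ct2-transformers-converter --model ./mt5-langid \
#       --output_dir ./mt5-langid-ct2 --quantization int8
import ctranslate2
import transformers

# inter_threads controls how many batches run in parallel on CPU.
translator = ctranslate2.Translator("./mt5-langid-ct2", device="cpu", inter_threads=4)
tokenizer = transformers.AutoTokenizer.from_pretrained("google/mt5-small")

texts = ["languages are awesome", "les langues sont géniales"]
batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]

# translate_batch runs the whole batch through the model at once,
# which is where most of the CPU throughput win comes from.
results = translator.translate_batch(batch, max_decoding_length=4)
print([tokenizer.convert_tokens_to_string(r.hypotheses[0]) for r in results])  # e.g. ['en', 'fr']
```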
To sum up:

**Pros**

- Batched inference is supported (via CTranslate2's `translate_batch` method)

**Cons**

- Getting confidence values is not straightforward (CTranslate2 does offer `score_batch` [this returns a per-token perplexity log score, not scores that sum to 1], but some limited testing of mine found some issues; see the sketch after this list)

**Neutral** (couldn't choose if it's a con or a pro)
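Here's one way a confidence-like value could be derived from `score_batch`. A minimal sketch, assuming the converted model lives at `./mt5-langid-ct2` (a placeholder path) and that the model was trained to emit ISO codes as targets; the exponentiated mean token log-probability lands between 0 and 1, but as noted above it is not a normalized distribution over languages.

```python
import math

import ctranslate2
import transformers

translator = ctranslate2.Translator("./mt5-langid-ct2", device="cpu")  # placeholder path
tokenizer = transformers.AutoTokenizer.from_pretrained("google/mt5-small")

def confidence(text: str, lang_code: str) -> float:
    # Score the language code as the target sequence given the input text.
    source = [tokenizer.convert_ids_to_tokens(tokenizer.encode(text))]
    target = [tokenizer.convert_ids_to_tokens(tokenizer.encode(lang_code))]
    result = translator.score_batch(source, target)[0]
    # Mean token log-prob, exponentiated into a 0..1 value. Scores for
    # different candidate codes will NOT sum to 1.
    return math.exp(sum(result.log_probs) / len(result.log_probs))

print(confidence("languages are awesome", "en"))
```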
Let me know if there's any interest in the results or the model; I just thought this is something that should be shared.
- mT5 paper
- CTranslate2 docs