-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Greek is identified as Catalan when no Greek model is loaded #64
Comments
Changed phrase to Hebrew and it's still detected as Catalan. Looks like it just returns first loaded model (ca) as 99% confidence of detection... |
Yes, it's a documented shortcoming of the current state of this library. I'll cite the readme:
|
I ended up loading all models even that I don't need them all. This way I get more or less predictable outcome at least in the case as described here. @fabiankessler do you want to keep this ticket open or close it and track that improvement somewhere else? |
Lingua library loads models as necessary based on detected scripts (as an example implementation of a feature like this). However, loading all those models could use a few gigabytes - whereas Optimaize readme states: @edudar your primary goal for not loading all languages was to increase speed. I think text of that length would be ~few ms timescale for detection. Did you benchmark detection and find it to be slow when using all language models? |
To be honest, I don’t recall all the details from that time and I haven’t been working on that of mine project for probably 3 years already... on the high level, it was for search application and we did detection in real-time and with p99 latency target around 15-20ms even a few ms make a difference. |
I don't load all models to speed up detection process for languages that we use and testing out mis-detections revealed that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan while Catalan is rominized language with A-Z alphabet.
The text was updated successfully, but these errors were encountered: