Greek is identified as Catalan when no Greek model is loaded #64

Open
edudar opened this issue Sep 12, 2016 · 5 comments

Comments


edudar commented Sep 12, 2016

I don't load all models, in order to speed up detection for the languages we actually use. While testing mis-detections I found that text like "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν περίπλοκα συστήματα επικοινωνίας , ιδιαίτερα την ανθρώπινη ικανότητα να το πράξουν , και μια γλώσσα είναι κάθε συγκεκριμένο παράδειγμα ενός τέτοιου συστήματος . Η επιστημονική μελέτη της γλώσσας ονομάζεται γλωσσολογία ." is identified as 99% Catalan, even though Catalan is written in the Latin (A-Z) alphabet.
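
For reference, a minimal sketch of the kind of setup that reproduces this, following the builder and profile classes from the readme (the locale subset here is illustrative, and the `readBuiltIn(...)` overload for loading a subset is from memory, so treat it as an assumption):

```java
import java.util.Arrays;
import java.util.List;

import com.optimaize.langdetect.DetectedLanguage;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

public class PartialProfilesRepro {

    public static void main(String[] args) throws Exception {
        // Load only a subset of the built-in profiles; note there is no Greek ("el") model here.
        // readBuiltIn(locales) is assumed from memory; readAllBuiltIn() is the documented call for everything.
        List<LanguageProfile> profiles = new LanguageProfileReader().readBuiltIn(Arrays.asList(
                LdLocale.fromString("ca"), LdLocale.fromString("en"), LdLocale.fromString("de")));

        LanguageDetector detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
        TextObject greek = textObjectFactory.forText(
                "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν "
                + "περίπλοκα συστήματα επικοινωνίας ..."); // the sentence from this issue, truncated here

        // Expected: no confident match (Greek script, no Greek profile loaded).
        // Observed: Catalan with ~0.99 probability.
        for (DetectedLanguage candidate : detector.getProbabilities(greek)) {
            System.out.println(candidate);
        }
    }
}
```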


edudar commented Sep 12, 2016

I changed the phrase to Hebrew and it is still detected as Catalan. It looks like it just returns the first loaded model (ca) with 99% detection confidence...

fabiankessler (Contributor) commented

Yes, it's a documented shortcoming of the current state of this library. I'll cite the readme:

This software cannot handle it well when the input text is in none of the expected (and supported) languages. For example if you only load the language profiles from English and German, but the text is written in French, the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that it's unlikely one of the supported languages.)
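
To make those two outcomes concrete for callers, a small sketch of what detect() (which, per the readme, returns a Guava Optional) gives you today:

```java
import com.google.common.base.Optional;

import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.text.TextObject;

public class DetectOutcome {

    // With a LanguageDetector and TextObject built as in the readme, detect() either
    // picks one of the loaded languages or reports that it doesn't know.
    static void report(LanguageDetector detector, TextObject text) {
        Optional<LdLocale> lang = detector.detect(text);
        if (lang.isPresent()) {
            // May be a "likely" but wrong language when the real one isn't loaded,
            // which is exactly the case described in this issue.
            System.out.println("Detected: " + lang.get());
        } else {
            // The "says it doesn't know" case from the readme.
            System.out.println("No confident match among the loaded profiles.");
        }
    }
}
```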


edudar commented Oct 7, 2016

I ended up loading all models even though I don't need them all. This way I get a more or less predictable outcome, at least for the case described here. @fabiankessler do you want to keep this ticket open, or close it and track that improvement somewhere else?
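
For reference, the all-profiles setup is essentially the example from the readme (a sketch, with class and method names as documented there):

```java
import java.util.List;

import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

public class AllProfilesSetup {

    public static void main(String[] args) throws Exception {
        // Load all built-in language profiles (about 74MB of RAM according to the readme).
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();

        // Build once and reuse; construction is the expensive part.
        LanguageDetector detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

        // With the Greek ("el") profile now loaded, the sample text from this issue
        // should no longer fall back to Catalan.
        System.out.println(detector.detect(textObjectFactory.forText(
                "Η γλώσσα είναι η ικανότητα να αποκτήσουν και να χρησιμοποιήσουν "
                + "περίπλοκα συστήματα επικοινωνίας ...")));
    }
}
```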

james-s-w-clark commented

The Lingua library loads models as needed based on the detected script (an example implementation of this kind of feature). However, loading all of its models can take a few gigabytes, whereas the Optimaize readme states:

Loading all 71 language profiles uses 74MB RAM to store the data in memory.

@edudar your primary goal for not loading all languages was to increase speed. For text of that length I would expect detection to take on the order of a few milliseconds. Did you benchmark detection and find it to be slow when using all language models?
Did you also make sure the models are not reloaded into memory on each use?
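
For the timing question, a rough sketch of how one might measure per-call latency with a detector that is built once and reused (plain System.nanoTime, not a proper JMH benchmark):

```java
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

public class DetectionTiming {

    // Measures only detect(): the detector (with its profiles) and the text factory
    // are created once elsewhere and passed in, so nothing is reloaded per call.
    static void time(LanguageDetector detector, TextObjectFactory factory, String sample, int iterations) {
        TextObject text = factory.forText(sample);
        detector.detect(text); // warm-up

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            detector.detect(text);
        }
        double avgMs = (System.nanoTime() - start) / 1_000_000.0 / iterations;
        System.out.println("avg per detection: " + avgMs + " ms");
    }
}
```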


edudar commented Jun 4, 2020

To be honest, I don't recall all the details from that time, and I haven't worked on that project of mine for probably 3 years already... At a high level, it was a search application where we did detection in real time, and with a p99 latency target of around 15-20ms, even a few ms make a difference.
