I have documents that contain URLs and added the UrlTextFilter to remove them so that language detection works well. On some test data, however, the detected language was wrong, or at least had a very low probability.
The test document (German text) with the UrlTextFilter shows a probability of 0.15 for German and 0.7 for Dutch (nl).
The URLs are rather complex and contain special characters (brackets and so on). After removing the URLs with a more complex regexp before sending the text to the language detector, the probability for the same text is 0.99 for German.
So I suggest improving the regular expressions used by the UrlTextFilter.
I'll try to provide a PR, but I have to check this first...
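To illustrate the workaround described above, here is a minimal sketch of a pre-filter that strips URLs before the text reaches the language detector. The class name `UrlStripper`, the method `stripUrls`, and the pattern itself are assumptions made for illustration; this is not the regexp used in the test above and not the one inside UrlTextFilter. The idea is simply to treat everything up to the next whitespace as part of the URL, so brackets and other special characters are removed too.

```java
import java.util.regex.Pattern;

public final class UrlStripper {
    // Hypothetical pattern (not the one used by UrlTextFilter): a scheme or
    // "www." followed by every non-whitespace character, so brackets, query
    // strings and other special characters inside the URL are removed as well.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|ftp://|www\\.)\\S+");

    // Used to collapse the gaps left behind after the URLs are removed.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private UrlStripper() {}

    /** Returns the text with URLs removed and whitespace normalized. */
    public static String stripUrls(String text) {
        String withoutUrls = URL.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(withoutUrls).replaceAll(" ").trim();
    }
}
```

The cleaned string would then be passed to the language detector instead of the raw text. Note that this pattern also swallows punctuation directly attached to the end of a URL, which should not matter for language detection but would matter if the text were displayed afterwards.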
nbartels added a commit to nbartels/language-detector that referenced this issue on Jul 8, 2017