UrlTextFilter does not find all URLs, which results in wrong detection #77

nbartels · 2017-07-04T08:47:15Z

I have documents containing URLs and added the UrlTextFilter to remove them so that language detection works well. On some test data, however, the detected language was wrong, or at least was reported with a very low probability.

With the UrlTextFilter, the test document (German text) gets a probability of 0.15 for German and 0.7 for Dutch (nl).

The URLs are rather complex and contain special characters (brackets and so on). After removing the URLs with a more complex regular expression before passing the text to the language detector, the probability for the same text is 0.99 for German.
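
To illustrate the kind of pre-filtering described above, here is a minimal sketch using only `java.util.regex`. The class name `UrlStripper` and the pattern are assumptions made for illustration; this is not the expression used inside UrlTextFilter, nor necessarily the one from the eventual fix. It treats everything from a scheme or `www.` prefix up to the next whitespace as part of the URL, so brackets, braces and query strings are removed along with it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the language-detector library.
public final class UrlStripper {

    // A scheme (http, https, ftp) or a "www." prefix, followed by any run of
    // non-whitespace characters. Deliberately greedy: brackets, braces and
    // other special characters inside the URL are swallowed as well.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:(?:https?|ftp)://|www\\.)\\S+");

    // Used to collapse the gaps left behind after URLs are removed.
    private static final Pattern SPACES = Pattern.compile("\\s{2,}");

    private UrlStripper() {
    }

    // Replaces every URL with a space, then normalizes whitespace.
    public static String stripUrls(String text) {
        String withoutUrls = URL.matcher(text).replaceAll(" ");
        return SPACES.matcher(withoutUrls).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String input = "Siehe https://example.com/a_(b)?x=[1]&y={2} hier.";
        System.out.println(stripUrls(input));
    }
}
```

The trade-off is that punctuation directly attached to the end of a URL (a trailing period or closing parenthesis) is removed too. For language detection that is probably acceptable, since losing a punctuation mark costs far less than leaving URL fragments in the text.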

So I suggest improving the regular expressions used by the UrlTextFilter.

I'll try to provide a PR, but I have to check this first...

nbartels added a commit to nbartels/language-detector that referenced this issue Jul 8, 2017

Issue optimaize#77 - better link detection for UrlTextFilter added

ce6fa91
