
Enhance default tokenizer. #227

Merged
1 commit merged into teamtnt:master on Oct 26, 2020

Conversation

ViliusS (Contributor) commented on Oct 26, 2020:

I would like to propose improvements to the default TNTSearch Tokenizer. I've always felt that TNTSearch produces "unexpected" results from the user's standpoint, especially when searching for more complex word forms in large text blocks. At first I thought about improving ProductTokenizer by splitting on "/[\s,.:;"']/" instead, but even that simple regex doesn't cover the most common cases, hence my suggestion to enhance the default tokenizer.

This patch adds the Unicode groups for dashes and connectors to the ignore list of split characters, since they are mostly used inside words rather than as word-splitting characters. This enables searching for product models, SKU or UUID numbers, dates in their short form, function/constant names in programming manuals, etc.

While at it, the patch also adds the @ character to the same list. Strictly speaking it belongs to the Other Punctuation Unicode group, but nowadays it is used more as a connector in email addresses and social network handles than as a separating character.

At least for me, this improves search results a lot.
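For illustration, here is a minimal PHP sketch of the kind of split pattern described above. It is a hypothetical example, not the actual TNTSearch source: letters, numbers, connector punctuation (`\p{Pc}`), dashes (`\p{Pd}`) and `@` are kept inside tokens, and everything else splits.

```php
<?php
// Hypothetical sketch of the described split pattern (not the exact TNTSearch
// code): keep letters, numbers, connector punctuation (\p{Pc}, e.g. "_"),
// dashes (\p{Pd}) and "@" inside tokens; split on everything else, including
// whitespace and the remaining punctuation.
$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

function tokenize(string $text, string $pattern): array
{
    $text = mb_strtolower($text);
    // PREG_SPLIT_NO_EMPTY drops the empty strings produced by leading,
    // trailing or consecutive separators.
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// "sku-1042_rev2" and "2020-10-26" now survive as single tokens instead of
// being broken on "-" and "_"; "support@example.com" keeps its "@" (the "."
// still splits, yielding "support@example" and "com").
print_r(tokenize('Order SKU-1042_rev2 shipped on 2020-10-26, ask support@example.com', $pattern));
```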

nticaric (Contributor) commented:
This proposal makes sense. Thanks for the PR!

nticaric merged commit 8cf3879 into teamtnt:master on Oct 26, 2020
ViliusS deleted the enhance-default-tokenizer branch on February 17, 2021