
Enhance default tokenizer. #227

Merged
1 commit merged into teamtnt:master on Oct 26, 2020

Conversation

ViliusS (Contributor) commented on Oct 26, 2020:

I would like to propose improvements to the default TNTSearch Tokenizer. I've always felt that TNTSearch produces "unexpected" results from the user's standpoint, especially when searching for more complex word forms in large text blocks. At first I thought about improving ProductTokenizer by splitting on "/[\s,.:;"']/" instead, but even that simple regex doesn't cover the most common cases, hence my suggestion to enhance the default tokenizer.

This patch adds the Unicode groups for dashes and connectors to the ignore list of split characters, since they are mostly used inside words rather than as word-splitting characters. This enables searching for product models, SKU or UUID numbers, dates in their short form, function/constant names in programming manuals, etc.

While at it, the patch also adds the @ character to the same list. Strictly speaking it belongs to the Other Punctuation Unicode group, but nowadays it is used more as a connector in email addresses and social network handles than as a separating character.

At least for me, this improves search results a lot.
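For illustration, here is a minimal PHP sketch of the kind of split pattern described above. It is a hypothetical example, not the actual TNTSearch source: letters, numbers, connector punctuation (`\p{Pc}`), dashes (`\p{Pd}`) and `@` are kept inside tokens, and everything else splits.

```php
<?php
// Hypothetical sketch of the described split pattern (not the exact TNTSearch
// code): keep letters, numbers, connector punctuation (\p{Pc}, e.g. "_"),
// dashes (\p{Pd}) and "@" inside tokens; split on everything else, including
// whitespace and the remaining punctuation.
$pattern = '/[^\p{L}\p{N}\p{Pc}\p{Pd}@]+/u';

function tokenize(string $text, string $pattern): array
{
    $text = mb_strtolower($text);
    // PREG_SPLIT_NO_EMPTY drops the empty strings produced by leading,
    // trailing or consecutive separators.
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// "sku-1042_rev2" and "2020-10-26" now survive as single tokens instead of
// being broken on "-" and "_"; "support@example.com" keeps its "@" (the "."
// still splits, yielding "support@example" and "com").
print_r(tokenize('Order SKU-1042_rev2 shipped on 2020-10-26, ask support@example.com', $pattern));
```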

nticaric (Contributor) commented:
This proposal makes sense. Thanks for the PR!

nticaric merged commit 8cf3879 into teamtnt:master on Oct 26, 2020
ViliusS deleted the enhance-default-tokenizer branch on February 17, 2021