I would like to propose improvements to the default TNTSearch Tokenizer. I've always felt that TNTSearch produces "unexpected" results from the user's standpoint, especially when searching for more complex word forms in large text blocks. At first I considered improving ProductTokenizer by having it split on "/[\s,.:;"']/" instead, but even this simple regex doesn't cover the most common cases, hence my suggestion to enhance the default tokenizer.
This patch adds the Unicode groups for dashes and connectors to the set of characters the tokenizer does not split on, since they are mostly used inside words rather than as word separators. This enables searching for product models, SKUs or UUIDs, dates in their short form, function/constant names in programming manuals, etc.
While at it, this patch also adds the @ character to the same list. It formally belongs to the Other Punctuation Unicode group, but nowadays it is used more as a connector in email addresses and social network names than as a separating character.
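
To illustrate the effect, here is a minimal sketch in Python rather than the library's PHP. It is not the TNTSearch code: the function names are hypothetical, and it assumes the default tokenizer lowercases the text and treats only Unicode letters and numbers as word characters. The "after" variant additionally keeps dash punctuation (category `Pd`), connector punctuation (category `Pc`) and `@` inside tokens.

```python
import unicodedata


def tokenize(text, keep_categories=(), keep_chars=""):
    """Split text into lowercase tokens.

    A character stays inside a token if it is a Unicode letter or number,
    if its general category is listed in keep_categories, or if it is one
    of keep_chars. Every other character acts as a separator.
    """
    tokens, current = [], []
    for ch in text.lower():
        category = unicodedata.category(ch)  # e.g. "Ll", "Nd", "Pd", "Po"
        if category[0] in "LN" or category in keep_categories or ch in keep_chars:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def tokenize_before(text):
    # Assumed previous behaviour: split on everything except letters and numbers.
    return tokenize(text)


def tokenize_after(text):
    # Proposed behaviour: dashes (Pd), connectors (Pc) and "@" no longer split words.
    return tokenize(text, keep_categories=("Pd", "Pc"), keep_chars="@")


if __name__ == "__main__":
    for sample in ("SKU AB-1234_X", "released 2021-03-15", "mail user@example.com"):
        print(tokenize_before(sample), "->", tokenize_after(sample))
```

With this sketch, `AB-1234_X` and `2021-03-15` each become a single token instead of three or four fragments. Note that `.` is still a separator, so `user@example.com` becomes `user@example` and `com`.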
At least for me, this improves search results a lot.