Add CJK (Chinese, Japanese, Korean) search support #2299
Merged
+174
−5
Summary
Problem
The search functionality was broken for CJK languages because:
- `\w` regex which only matches ASCII word characters, removing all CJK text
- `\b` word boundaries that don't apply to CJK characters
Solution
- `\p{L}\p{N}` (Unicode letters/numbers) instead of `\w`
- `stem_content` callback to pre-process content before indexing, and stem queries before matching
Also extracts `CJK_PATTERN` constant to `Search` module to avoid duplication.
How it works
For CJK text like "日本語" (Japanese), the stemmer splits it into "日 本 語" (three separate tokens). This allows SQLite FTS5 with `unicode61` tokenizer to match partial searches like "日本" or even single characters like "日".
English words are still processed normally with the Mittens stemmer (e.g., "running" → "run").
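The behaviour described above can be illustrated with a minimal Ruby sketch. It is not the PR's code: the `CJK_PATTERN` definition shown here and the helper names `strip_ascii_only`, `strip_unicode_aware`, and `split_cjk` are assumptions made for illustration, and the Mittens stemming step for English words is left out so the example depends only on the standard library.

```ruby
# Illustrative sketch only; the pattern and helper names are assumptions,
# not the definitions used in the PR.

# Assumed definition: one character from the Han, Hiragana, Katakana,
# or Hangul scripts.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Old-style cleanup: in Ruby, \w matches only [a-zA-Z0-9_],
# so every CJK character is deleted.
def strip_ascii_only(text)
  text.gsub(/[^\w\s]/, "")
end

# Unicode-aware cleanup: \p{L} and \p{N} match letters and numbers
# in any script, so CJK text survives.
def strip_unicode_aware(text)
  text.gsub(/[^\p{L}\p{N}\s]/, "")
end

# Surround each CJK character with spaces, then collapse whitespace,
# so a whitespace/punctuation tokenizer sees one token per character.
def split_cjk(text)
  text.gsub(CJK_PATTERN) { |ch| " #{ch} " }.split.join(" ")
end

puts strip_ascii_only("日本語 abc").inspect
puts strip_unicode_aware("日本語 abc").inspect
puts split_cjk("日本語")
puts split_cjk("running 日本")
```

Applying the same splitting to both the indexed content and the query is what lets a query such as "日本" become the tokens "日 本", which can then match the leading tokens of the indexed "日 本 語".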
Test plan
🤖 Generated with Claude Code