Contributor

@Maklu Maklu commented Jan 6, 2026

Summary

  • Fix search to support CJK (Chinese, Japanese, Korean) characters
  • Previously, CJK characters were silently dropped during query sanitization

Problem

The search functionality was broken for CJK languages because:

  1. Query sanitization used the \w regex class, which matches only ASCII word characters, so all CJK text was removed
  2. The stemmer split text on whitespace, which doesn't work for CJK (these languages don't put spaces between words)
  3. The highlighter used \b word boundaries, which don't apply to CJK characters
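
The first two failure modes can be reproduced in plain Ruby. This is a minimal sketch: the method names `sanitize_ascii` and `sanitize_unicode` are hypothetical and the exact sanitization regex in the codebase may differ.

```ruby
# Hypothetical sanitizers; the names are illustrative, not taken from the PR.
def sanitize_ascii(query)
  # \w in Ruby matches only [a-zA-Z0-9_], so CJK characters are stripped.
  query.gsub(/[^\w\s]/, "")
end

def sanitize_unicode(query)
  # \p{L} and \p{N} match letters and numbers in any script.
  query.gsub(/[^\p{L}\p{N}\s]/, "")
end

query = "日本語 search!"
puts sanitize_ascii(query).inspect   # CJK text and "!" are removed
puts sanitize_unicode(query).inspect # only "!" is removed

# Whitespace splitting treats a run of CJK text as one token.
puts "日本語".split(/\s+/).length
```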

Solution

  • Query: Use \p{L}\p{N} (Unicode letters/numbers) instead of \w
  • Stemmer: Detect CJK characters and split each character into individual tokens for FTS indexing (CJK languages don't have word boundaries like English)
  • Highlighter: Skip word boundary matching for terms containing CJK characters
  • SQLite adapter: Add stem_content callback to pre-process content before indexing, and stem queries before matching

Also extracts CJK_PATTERN constant to Search module to avoid duplication.
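
As a rough illustration of the highlighter change, the sketch below applies \b only to terms without CJK characters. The contents of `CJK_PATTERN` and the helper names `highlight_pattern` and `highlight` are assumptions made for this example; the real constant may cover different ranges, and a production highlighter would also need HTML escaping.

```ruby
# Assumed pattern: Han, Hiragana, Katakana, and Hangul scripts.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Build the regex used to find a term in the text (hypothetical helper).
def highlight_pattern(term)
  escaped = Regexp.escape(term)
  if term.match?(CJK_PATTERN)
    # No \b here: CJK words are not delimited by word boundaries.
    /#{escaped}/i
  else
    /\b#{escaped}\b/i
  end
end

# Wrap every match in <mark> tags (no HTML escaping in this sketch).
def highlight(text, term)
  text.gsub(highlight_pattern(term)) { |match| "<mark>#{match}</mark>" }
end

puts highlight("日本語を学ぶ", "日本")
puts highlight("a run here", "run")
```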

How it works

For CJK text such as "日本語" (Japanese), the stemmer emits "日 本 語" (three separate tokens). This lets SQLite FTS5 with the unicode61 tokenizer match partial searches like "日本" or even single characters like "日".

English words are still processed normally with the Mittens stemmer (e.g., "running" → "run").
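
The character-splitting step described above could look roughly like this. It is a sketch under assumptions, not the PR's code: `StemmerSketch`, its `CJK_PATTERN`, and the `word_stemmer` parameter are hypothetical, and the identity lambda stands in for the Mittens stemmer so the example has no gem dependencies.

```ruby
module StemmerSketch
  # Assumed pattern: Han, Hiragana, Katakana, and Hangul scripts.
  CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

  # word_stemmer is a stand-in for the English stemmer; by default it
  # returns each word unchanged.
  def self.stem(text, word_stemmer: ->(word) { word })
    # Surround every CJK character with spaces so it becomes its own token.
    spaced = text.gsub(CJK_PATTERN) { |char| " #{char} " }

    tokens = spaced.split.map do |token|
      token.match?(CJK_PATTERN) ? token : word_stemmer.call(token)
    end
    tokens.join(" ")
  end
end

puts StemmerSketch.stem("日本語")           # => 日 本 語
puts StemmerSketch.stem("Rubyで日本語検索") # => Ruby で 日 本 語 検 索
```

Running the same transformation over both indexed content and incoming queries keeps the two sides tokenized identically, which is what the stem_content callback and query stemming in the SQLite adapter are described as doing.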

Test plan

  • Added tests for Chinese character search
  • Added tests for Japanese character search
  • Added tests for Korean character search
  • Added tests for mixed CJK and English search
  • Added tests for CJK punctuation handling
  • All existing search tests pass
  • Verified English search still works correctly (stemming preserved)

🤖 Generated with Claude Code

The search functionality was silently dropping CJK characters because:

1. Query sanitization used `\w` which only matches ASCII word characters
2. Stemmer split by whitespace, which doesn't work for CJK languages
3. Highlighter used `\b` word boundaries that don't apply to CJK

This commit fixes all three issues:

- Query: Use `\p{L}\p{N}` (Unicode letters/numbers) instead of `\w`
- Stemmer: Preserve CJK characters as-is without stemming, since CJK
  languages don't have stemming rules like English
- Highlighter: Skip word boundary matching for CJK terms

Also extracts `CJK_PATTERN` to `Search` module to avoid duplication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@Maklu Maklu marked this pull request as draft January 6, 2026 08:30
- Split each CJK character into individual tokens for FTS5 indexing
- Add stem_content callback to SQLite adapter to pre-process content
- Stem search queries before matching against FTS5 index
- Update stemmer tests to reflect character-splitting behavior

This enables CJK search on SQLite installations where the FTS5
tokenizer cannot natively segment CJK text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Maklu Maklu marked this pull request as ready for review January 6, 2026 08:52
Member

@jorgemanrubia jorgemanrubia left a comment

This one looks great to me @Maklu, thanks a lot.

@monorkin I would love your eyes before merging.

Contributor

@monorkin monorkin left a comment

Looks good to me too 👍


3 participants