Add CJK (Chinese, Japanese, Korean) search support #2299

Maklu · 2026-01-06T08:23:45Z

Summary

Fix search to support CJK (Chinese, Japanese, Korean) characters
Previously, CJK characters were silently dropped during query sanitization

Problem

The search functionality was broken for CJK languages because:

1. Query sanitization used a `\w` regex, which only matches ASCII word characters, removing all CJK text
2. Stemmer split text by whitespace, which doesn't work for CJK (these languages don't use spaces between words)
3. Highlighter used `\b` word boundaries that don't apply to CJK characters
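
To make the first two failure modes concrete, here is a minimal sketch in plain Ruby. The two helper methods are hypothetical stand-ins written for illustration, not the project's actual sanitizer or stemmer; they only assume that the old code whitelisted `\w` and split on whitespace, as described above.

```ruby
# Hypothetical stand-in for the old sanitizer: keep only \w and whitespace.
# In Ruby regexes, \w is ASCII-only ([a-zA-Z0-9_]), so CJK text is removed.
def ascii_sanitize(query)
  query.gsub(/[^\w\s]/, "")
end

# Hypothetical stand-in for the old tokenizer: split on whitespace only.
# A CJK phrase has no spaces, so it comes back as a single opaque token.
def whitespace_tokens(text)
  text.split
end

p ascii_sanitize("日本語 search")       # => " search"
p whitespace_tokens("日本語を検索する")    # => ["日本語を検索する"]
```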

Solution

- Query: Use `\p{L}\p{N}` (Unicode letters/numbers) instead of `\w`
- Stemmer: Detect CJK characters and split each character into individual tokens for FTS indexing (CJK languages don't have word boundaries like English)
- Highlighter: Skip word boundary matching for terms containing CJK characters
- SQLite adapter: Add `stem_content` callback to pre-process content before indexing, and stem queries before matching

Also extracts CJK_PATTERN constant to Search module to avoid duplication.
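
The query and highlighter changes can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the `CJK_PATTERN` definition below is a guess at what such a constant might contain (Han, Hiragana, Katakana and Hangul scripts), and `sanitize_query`, `highlight` and the `<mark>` wrapper are hypothetical names chosen for the example. HTML escaping is left out for brevity.

```ruby
# Assumed definition; the real CJK_PATTERN in the Search module may differ.
CJK_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Keep Unicode letters, numbers and whitespace; replace everything else
# with a space, then collapse runs of spaces.
def sanitize_query(query)
  query.gsub(/[^\p{L}\p{N}\s]/, " ").squeeze(" ").strip
end

# Wrap matches of term in <mark> tags. Terms containing CJK characters are
# matched anywhere; other terms must sit on word boundaries.
def highlight(text, term)
  escaped = Regexp.escape(term)
  pattern = term.match?(CJK_PATTERN) ? /#{escaped}/i : /\b#{escaped}\b/i
  text.gsub(pattern) { |match| "<mark>#{match}</mark>" }
end

puts sanitize_query("日本語!? search")      # 日本語 search
puts highlight("日本語のテスト", "日本")       # <mark>日本</mark>語のテスト
puts highlight("run fast", "run")          # <mark>run</mark> fast
```

Branching on the term rather than on the text keeps the English behaviour unchanged: a search for "run" still does not highlight the middle of "running".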

How it works

For CJK text like "日本語" (Japanese), the stemmer splits it into "日 本 語" (three separate tokens). This allows SQLite FTS5 with the unicode61 tokenizer to match partial searches like "日本" or even single characters like "日".

English words are still processed normally with the Mittens stemmer (e.g., "running" → "run").
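
The splitting described above could look something like the sketch below. It is a simplified approximation, not the PR's stemmer: `CJK_CHARS` and `tokenize_for_fts` are hypothetical names, the script list is an assumption, and the English stemmer is passed in as a lambda (defaulting to a no-op) so the example does not depend on the Mittens gem.

```ruby
# Assumed set of scripts treated as CJK; the real pattern may differ.
CJK_CHARS = '\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}'
CJK_CHAR = /[#{CJK_CHARS}]/
# Matches either a single CJK character or a run of non-CJK characters.
PIECES = /[#{CJK_CHARS}]|[^#{CJK_CHARS}]+/

# Turn text into a space-separated token string for full-text indexing.
# Each CJK character becomes its own token; other pieces go through the
# supplied stemmer (a no-op by default).
def tokenize_for_fts(text, stemmer: ->(word) { word })
  text.split.flat_map do |chunk|
    chunk.scan(PIECES).map do |piece|
      piece.match?(CJK_CHAR) ? piece : stemmer.call(piece)
    end
  end.join(" ")
end

puts tokenize_for_fts("日本語")            # 日 本 語
puts tokenize_for_fts("Tokyo東京tower")    # Tokyo 東 京 tower

# Toy stemmer for illustration only; it just strips a trailing "ning".
toy_stemmer = ->(word) { word.sub(/ning\z/, "") }
puts tokenize_for_fts("running 日本語", stemmer: toy_stemmer)  # run 日 本 語
```

Applying the same function to both the indexed content and the query is what lets a query like "日本" (tokenized as "日 本") line up with the tokens stored for "日本語".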

Test plan

Added tests for Chinese character search
Added tests for Japanese character search
Added tests for Korean character search
Added tests for mixed CJK and English search
Added tests for CJK punctuation handling
All existing search tests pass
Verified English search still works correctly (stemming preserved)

🤖 Generated with Claude Code

The search functionality was silently dropping CJK characters because:

1. Query sanitization used `\w` which only matches ASCII word characters
2. Stemmer split by whitespace, which doesn't work for CJK languages
3. Highlighter used `\b` word boundaries that don't apply to CJK

This commit fixes all three issues:

- Query: Use `\p{L}\p{N}` (Unicode letters/numbers) instead of `\w`
- Stemmer: Preserve CJK characters as-is without stemming, since CJK languages don't have stemming rules like English
- Highlighter: Skip word boundary matching for CJK terms

Also extracts `CJK_PATTERN` to `Search` module to avoid duplication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Split each CJK character into individual tokens for FTS5 indexing
- Add stem_content callback to SQLite adapter to pre-process content
- Stem search queries before matching against FTS5 index
- Update stemmer tests to reflect character-splitting behavior

This enables CJK search on SQLite installations where the FTS5 tokenizer cannot natively segment CJK text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

jorgemanrubia

This one looks great to me @Maklu, thanks a lot.

@monorkin I would love your eyes before merging.

monorkin

Looks good to me too 👍

Maklu marked this pull request as draft January 6, 2026 08:30

Maklu marked this pull request as ready for review January 6, 2026 08:52

jorgemanrubia approved these changes Jan 7, 2026

monorkin approved these changes Jan 8, 2026

jorgemanrubia merged commit db804b8 into basecamp:main Jan 8, 2026

jorgemanrubia mentioned this pull request Jan 8, 2026

Revert "Add CJK (Chinese, Japanese, Korean) search support" #2321

Merged

Maklu mentioned this pull request Jan 10, 2026

Add CJK (Chinese, Japanese, Korean) search support #2337

Open

5 tasks
