@Maklu commented Jan 10, 2026

Summary

This PR re-implements CJK search support, fixing the SQLite FTS5 display issue that caused the previous PR (#2299) to be reverted.

Problem

In the original implementation, SQLite FTS5's highlight() and snippet() functions returned the stemmed content stored in the FTS table instead of the original text. For example:

  • "Card to delete" displayed as "card to delet"
  • "日本語" displayed as "日 本 語"

Solution

Key fix: Separate storage from display.

  1. FTS table: Stores stemmed content for search matching
  2. Main table: Stores original content for display
  3. Highlighting: Use Search::Highlighter instead of FTS5's highlight() function

This ensures search works correctly while displaying original text to users.
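
As a rough illustration of the split between matching and display, here is a minimal sketch. `Row`, `fts_match?`, and `highlight_original` are hypothetical names invented for this example, and the `<mark>` wrapper is an assumption; the PR's actual Search::Highlighter may work differently.

```ruby
# Illustrative sketch only: Row, fts_match?, and highlight_original are
# hypothetical names, not the PR's actual API.
Row = Struct.new(:original, :indexed)

# Search matching runs against the stemmed copy (the FTS table's role).
def fts_match?(row, stemmed_term)
  row.indexed.split.include?(stemmed_term)
end

# Display wraps matches in the original text (the main table's role),
# so the user never sees stemmed output. The <mark> tag is an assumption.
def highlight_original(row, query_term)
  pattern = Regexp.new(Regexp.escape(query_term), Regexp::IGNORECASE)
  row.original.gsub(pattern) { |match| "<mark>#{match}</mark>" }
end

row = Row.new("Card to delete", "card to delet")
puts highlight_original(row, "delete") if fts_match?(row, "delet")
```

The point of the sketch is that the stemmed string is only ever used to decide whether a row matches, while the highlighted output is always built from the original string.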

Changes

| File | Change |
| --- | --- |
| search.rb | Add shared CJK_PATTERN constant |
| highlighter.rb | Add CJK-aware highlight() and snippet() with cjk_dominant? detection |
| query.rb | Use \p{L}\p{N} for Unicode character support |
| stemmer.rb | Add character-level tokenization for CJK |
| record/sqlite.rb | Use Search::Highlighter instead of FTS5 functions |
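
The PR does not show how CJK_PATTERN or cjk_dominant? are defined. One plausible shape, sketched under assumptions: the script list in `CJK_SKETCH_PATTERN` and the "more than half of the letters and digits" threshold are guesses made for this example, not taken from the diff.

```ruby
# Hypothetical stand-in for the shared CJK_PATTERN constant.
# The choice of scripts is an assumption.
CJK_SKETCH_PATTERN = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Treat text as CJK-dominant when more than half of its letters and
# digits are CJK characters. The majority threshold is an assumption.
def cjk_dominant?(text)
  chars = text.scan(/[\p{L}\p{N}]/)
  return false if chars.empty?

  cjk_count = chars.count { |ch| ch.match?(CJK_SKETCH_PATTERN) }
  cjk_count * 2 > chars.size
end

puts cjk_dominant?("日本語のテスト")
puts cjk_dominant?("Card to delete")
```

A check like this would let a highlighter choose between character-level matching for CJK text and word-boundary matching for space-delimited text.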

Test plan

  • All 919 tests pass
  • Added tests for Chinese, Japanese, and Korean text
  • Added tests for mixed CJK/English content
  • Added tests for CJK snippet truncation
  • Verified search results display original text (not stemmed)

Fix search functionality for CJK languages by addressing three issues:

1. Query sanitization: Switch from ASCII-only \w to Unicode-aware \p{L}\p{N}
2. Stemmer: Add character-level tokenization for CJK (e.g., "日本語" → "日 本 語")
3. Highlighter: Add CJK-aware highlighting without word boundaries

For SQLite FTS5: Store original content in main table, stemmed content in
FTS table. Use Search::Highlighter instead of FTS5's highlight() function
to display original text in search results.
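
The first two points can be sketched as follows. `sanitize_query`, `tokenize_cjk`, and `CJK_CHAR` are hypothetical names for this example, and replacing disallowed characters with spaces is an assumption about how sanitization behaves; only the \p{L}\p{N} character classes and the "日本語" → "日 本 語" behavior come from the commit message.

```ruby
# Hypothetical per-character CJK matcher; the script list is an assumption.
CJK_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Keep Unicode letters, digits, and whitespace; replace everything else
# with a space. Ruby's \w is ASCII-only, so it would not keep CJK text.
def sanitize_query(query)
  query.gsub(/[^\p{L}\p{N}\s]/, " ").squeeze(" ").strip
end

# Surround each CJK character with spaces so a whitespace-based
# tokenizer indexes it as its own token.
def tokenize_cjk(text)
  text.gsub(CJK_CHAR) { |ch| " #{ch} " }.squeeze(" ").strip
end

puts sanitize_query('日本語 "card"*')
puts tokenize_cjk("日本語")
puts tokenize_cjk("日本語 card")
```

Splitting CJK text into single characters trades precision for recall: any substring of characters can match without a dictionary-based segmenter, at the cost of a larger index.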

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13e2a7efc1
