
Fix incorrect char_interval for non-ASCII text due to tokenizer merging (Fixes #334) #350

Draft
ambicuity wants to merge 2 commits into google:main from ambicuity:fix/issue-334-regex-tokenizer

Conversation


ambicuity commented Feb 10, 2026

Description

This PR fixes Issue #334, in which RegexTokenizer merged Latin characters with adjacent non-ASCII characters (e.g., 'Hello世界') into a single token. Because of the merge, WordAligner could not identify entities that are substrings of these merged tokens, which led to incorrect char_interval values.
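To illustrate the failure mode, here is a minimal sketch using only the standard library. The greedy letters pattern and the `aligns_to_tokens` helper are assumptions made for illustration; they are not copied from langextract, and WordAligner's real matching logic is more involved than a boundary check.

```python
import re

# Assumption: a greedy "any Unicode letter" pattern standing in for the
# pre-fix letters pattern. It is not copied from langextract.
GREEDY_LETTERS = re.compile(r"[^\W\d_]+")


def token_spans(text):
    """Return (start, end) character spans for each matched token."""
    return [m.span() for m in GREEDY_LETTERS.finditer(text)]


def aligns_to_tokens(text, entity):
    """Return True if the entity starts and ends on token boundaries."""
    start = text.find(entity)
    if start < 0:
        return False
    end = start + len(entity)
    spans = token_spans(text)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return start in starts and end in ends


text = "Hello世界"
print(token_spans(text))               # a single merged token
print(aligns_to_tokens(text, "世界"))  # the entity begins mid-token
```

With one token covering the whole string, the entity '世界' has no token boundary at its start, so any token-level aligner has nothing to anchor it to.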

The fix involves:

  • Updating langextract/core/tokenizer.py to use the regex library's V1 features.
  • Refining _LETTERS_PATTERN to strictly exclude CJK scripts using set subtraction.
  • Explicitly handling CJK characters as their own token group.
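The idea behind these three steps can be sketched with the standard library alone. The PR itself relies on the third-party regex module's V1 set subtraction; the sketch below only emulates that subtraction with a negative lookahead in `re`. The code point ranges and all names (`CJK_CLASS`, `TOKEN_PATTERN`, `tokenize`) are hypothetical simplifications, not the PR's actual constants.

```python
import re

# Assumption: simplified, non-exhaustive code point ranges standing in for
# the Hiragana, Katakana, Han, and Hangul scripts.
CJK_CLASS = "[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uac00-\ud7af]"

# Letters that are NOT CJK: the negative lookahead emulates set subtraction.
LETTERS = rf"(?:(?!{CJK_CLASS})[^\W\d_])+"
# Each CJK character is matched on its own, so it becomes its own token.
CJK = CJK_CLASS
DIGITS = r"\d+"
SYMBOLS = r"[^\w\s]+"

TOKEN_PATTERN = re.compile(rf"{LETTERS}|{CJK}|{DIGITS}|{SYMBOLS}")


def tokenize(text):
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


for token in tokenize("Hello世界"):
    print(token)
```

Because the Latin run now stops at the first CJK character, '世' and '界' each get their own character span, and an entity such as '世界' falls on token boundaries.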

Fixes #334

Bug fix

How Has This Been Tested?

I verified the fix with a reproduction script that specifically tests mixed-script inputs (Latin + CJK, emoji).
I also ran the full test suite locally:

$ pytest tests/

All tests passed.

Checklist:

  • I have read and acknowledged Google's Open Source Code of Conduct.
  • I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
  • I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
  • I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
  • I have added tests, or I have ensured existing tests cover the changes.
  • I have followed Google's Python Style Guide and ran pylint over the affected code.

github-actions bot added the size/XS label (Pull request with less than 50 lines changed) Feb 10, 2026
ambicuity marked this pull request as draft February 10, 2026 17:36
IgnatG added a commit to IgnatG/langextract that referenced this pull request Feb 17, 2026
PR google#350 — Fix incorrect char_interval for non-ASCII text:
- Add _CJK_SCRIPTS constant for Han, Hiragana, Katakana, Hangul detection
- Modify _LETTERS_PATTERN with regex V1 set subtraction to exclude CJK
- Add _CJK_PATTERN for standalone CJK character matching
- Update _TOKEN_PATTERN and _WORD_PATTERN with flags=regex.V1
- Fix trailing whitespace in japanese_extraction.md example

PR google#257 — Add retry mechanism for transient API errors:
- New retry_utils.py: is_transient_error(), retry_on_transient_errors(),
  retry_chunk_processing() decorators with exponential backoff + jitter
- annotation.py: _process_batch_with_retry(), retry params threaded through
  annotate_documents/text and single/sequential pass methods
- extraction.py: retry params in extract() signature, passed via retry_kwargs
- gemini.py: @retry_chunk_processing() decorator on _process_single_prompt
- New retry_utils_test.py + AnnotatorRetryPolicyTest in annotation_test.py

Upstream: google#350, google#257
Fixes: google#334
IgnatG added a commit to IgnatG/langextract that referenced this pull request Feb 17, 2026
- annotation.py: add retry_utils import, retry params to annotate_text()
  signature and docstring, pass-through to annotate_documents()
- extraction.py: retry params in extract() signature + retry_kwargs dict
- gemini.py: retry_chunk_processing decorator import
- CHERRY_PICK_TRACKER.md: mark PRs google#350 and google#257 as applied, add log entries

Upstream: google#257

Development

Successfully merging this pull request may close these issues.

Bug / unexpected behavior: char_interval wrong for non-ASCII text when using certain providers in v1.1.1
