Fix incorrect char_interval for non-ASCII text due to tokenizer merging (Fixes #334)#350
Draft

ambicuity wants to merge 2 commits into google:main from
Conversation
IgnatG added a commit to IgnatG/langextract that referenced this pull request on Feb 17, 2026
PR google#350 — Fix incorrect char_interval for non-ASCII text:
- Add _CJK_SCRIPTS constant for Han, Hiragana, Katakana, Hangul detection
- Modify _LETTERS_PATTERN with regex V1 set subtraction to exclude CJK
- Add _CJK_PATTERN for standalone CJK character matching
- Update _TOKEN_PATTERN and _WORD_PATTERN with flags=regex.V1
- Fix trailing whitespace in japanese_extraction.md example

PR google#257 — Add retry mechanism for transient API errors:
- New retry_utils.py: is_transient_error(), retry_on_transient_errors(), retry_chunk_processing() decorators with exponential backoff + jitter
- annotation.py: _process_batch_with_retry(), retry params threaded through annotate_documents/text and single/sequential pass methods
- extraction.py: retry params in extract() signature, passed via retry_kwargs
- gemini.py: @retry_chunk_processing() decorator on _process_single_prompt
- New retry_utils_test.py + AnnotatorRetryPolicyTest in annotation_test.py

Upstream: google#350, google#257
Fixes: google#334
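The commit message above describes the retry mechanism only at a high level. As a rough sketch of an "exponential backoff + jitter" retry decorator: the function name matches the commit message, but the body, parameters, and exception types here are assumptions, not the actual retry_utils.py code.

```python
import functools
import random
import time

def retry_on_transient_errors(max_retries=3, base_delay=1.0, max_delay=30.0,
                              transient=(ConnectionError, TimeoutError)):
    """Hypothetical sketch: retry a callable on transient errors with
    exponential backoff capped at max_delay, plus random jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except transient:
                    if attempt == max_retries:
                        raise  # out of retries: surface the error
                    # Double the delay each attempt, cap it, and add jitter
                    # so many clients don't retry in lockstep.
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    time.sleep(delay + random.uniform(0, delay))
        return wrapper
    return decorator
```

The jitter term spreads concurrent retries over time, which matters when many chunk-processing workers hit the same rate-limited API.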
IgnatG added a commit to IgnatG/langextract that referenced this pull request on Feb 17, 2026
- annotation.py: add retry_utils import; retry params added to annotate_text() signature and docstring, passed through to annotate_documents()
- extraction.py: retry params in extract() signature + retry_kwargs dict
- gemini.py: retry_chunk_processing decorator import
- CHERRY_PICK_TRACKER.md: mark PRs google#350 and google#257 as applied, add log entries

Upstream: google#257
Description
This PR fixes Issue #334, where RegexTokenizer was incorrectly merging Latin characters with adjacent non-ASCII characters (e.g., 'Hello世界') into a single token. This caused WordAligner to fail when identifying entities that are substrings of these merged tokens.

The fix involves:
- Updating langextract/core/tokenizer.py to use the regex library's V1 features.
- Modifying _LETTERS_PATTERN to strictly exclude CJK scripts using set subtraction.

Fixes #334
Type of change: Bug fix
How Has This Been Tested?
I verified the fix using a reproduction script that specifically tests mixed-script inputs (Latin + CJK, emojis).
I also ran the full test suite locally:
All tests passed.
Checklist:
- [x] I have run pylint over the affected code.