Fix incorrect char_interval for non-ASCII text due to tokenizer merging (Fixes #334) by ambicuity · Pull Request #350 · google/langextract

ambicuity · 2026-02-10T17:20:43Z

Description

This PR fixes Issue #334 where RegexTokenizer was incorrectly merging Latin characters with adjacent non-ASCII characters (e.g., 'Hello世界') into a single token. This caused WordAligner to fail when identifying entities that are substrings of these merged tokens.
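The merging behaviour described above can be reproduced with a short sketch. It uses the standard-library `re` module and a simplified letters pattern, `NAIVE_LETTERS`, which is an assumption for illustration and not the library's actual pattern. It also assumes alignment compares whole tokens, so an entity that is only a substring of a token is never found.

```python
import re

# Simplified stand-in for a letters pattern that does not exclude CJK.
# This is a hypothetical pattern for illustration, not langextract's own.
NAIVE_LETTERS = re.compile(r"[^\W\d_]+")


def naive_tokens(text):
    """Return (token, start, end) triples for each run of Unicode letters."""
    return [(m.group(), m.start(), m.end()) for m in NAIVE_LETTERS.finditer(text)]


tokens = naive_tokens("Hello世界")

# Token-level lookup: an entity is found only if it equals a whole token.
found = [t for t in tokens if t[0] == "世界"]

print(tokens)  # one merged token covering the whole string
print(found)   # empty: "世界" is only a substring of the merged token
```

Because Latin letters and CJK ideographs both count as Unicode letters, the whole string becomes one token and no character interval can be assigned to `世界` on its own.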

The fix involves:

Updating langextract/core/tokenizer.py to use the regex library's V1 features.
Refining _LETTERS_PATTERN to strictly exclude CJK scripts using set subtraction.
Explicitly handling CJK characters as their own token group.
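A minimal sketch of the idea behind these changes follows. It is not the PR's code: the PR relies on the third-party `regex` library's V1 set subtraction, whereas this sketch uses only the standard-library `re` module and emulates the subtraction with a negative lookahead. The names `_CJK_CLASS` and `tokenize`, the abbreviated code point ranges, and the choice of one token per CJK character are all assumptions made for illustration.

```python
import re

# Hypothetical, abbreviated ranges: Han, Hiragana, Katakana, Hangul syllables.
_CJK_CLASS = r"[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]"

# Letters that are NOT CJK. The negative lookahead stands in for the
# set subtraction that the regex library offers in V1 mode.
_LETTERS = rf"(?:(?!{_CJK_CLASS})[^\W\d_])+"
_CJK = _CJK_CLASS            # assumption: each CJK character is its own token
_DIGITS = r"\d+"
_SYMBOLS = r"(?:[^\w\s]|_)+"  # punctuation, emoji, underscores

TOKEN_PATTERN = re.compile(rf"{_LETTERS}|{_CJK}|{_DIGITS}|{_SYMBOLS}")


def tokenize(text):
    """Return (token, start, end) triples; whitespace is skipped."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


print(tokenize("Hello世界"))
# [('Hello', 0, 5), ('世', 5, 6), ('界', 6, 7)]
```

With the Latin run and the CJK characters split into separate tokens, an entity such as `世界` maps onto whole tokens (characters 5 to 7) instead of being buried inside a merged one.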

Fixes #334

Bug fix

How Has This Been Tested?

I verified the fix using a reproduction script that specifically tests mixed script inputs (Latin + CJK, Emojis).
I also ran the full test suite locally:

$ pytest tests/

All tests passed.

Checklist:

I have read and acknowledged Google's Open Source Code of conduct.
I have read the Contributing page, and I either signed the Google Individual CLA or am covered by my company's Corporate CLA.
I have discussed my proposed solution with code owners in the linked issue(s) and we have agreed upon the general approach.
I have made any needed documentation changes, or noted in the linked issue(s) that documentation elsewhere needs updating.
I have added tests, or I have ensured existing tests cover the changes.
I have followed Google's Python Style Guide and run pylint over the affected code.

…ng (Fixes google#334)

PR google#350 (Fix incorrect char_interval for non-ASCII text):
- Add _CJK_SCRIPTS constant for Han, Hiragana, Katakana, Hangul detection
- Modify _LETTERS_PATTERN with regex V1 set subtraction to exclude CJK
- Add _CJK_PATTERN for standalone CJK character matching
- Update _TOKEN_PATTERN and _WORD_PATTERN with flags=regex.V1
- Fix trailing whitespace in japanese_extraction.md example

PR google#257 (Add retry mechanism for transient API errors):
- New retry_utils.py: is_transient_error(), retry_on_transient_errors(), retry_chunk_processing() decorators with exponential backoff + jitter
- annotation.py: _process_batch_with_retry(), retry params threaded through annotate_documents/text and single/sequential pass methods
- extraction.py: retry params in extract() signature, passed via retry_kwargs
- gemini.py: @retry_chunk_processing() decorator on _process_single_prompt
- New retry_utils_test.py + AnnotatorRetryPolicyTest in annotation_test.py

Upstream: google#350, google#257
Fixes: google#334

- annotation.py: add retry_utils import, retry params to annotate_text() signature and docstring, pass-through to annotate_documents()
- extraction.py: retry params in extract() signature + retry_kwargs dict
- gemini.py: retry_chunk_processing decorator import
- CHERRY_PICK_TRACKER.md: mark PRs google#350 and google#257 as applied, add log entries

Upstream: google#257

Fix incorrect char_interval for non-ASCII text due to tokenizer merging (Fixes google#334)

64df81a

github-actions bot added the size/XS label (pull request with less than 50 lines changed) on Feb 10, 2026

Fix formatting issues via autoformat.sh

ba45db2

ambicuity marked this pull request as draft February 10, 2026 17:36

ambicuity mentioned this pull request Feb 10, 2026

Bug / unexpected behavior: char_interval wrong for non-ASCII text when using certain providers in v1.1.1 #334

Open

ambicuity wants to merge 2 commits into google:main from ambicuity:fix/issue-334-regex-tokenizer
