
Bug / unexpected behavior: char_interval wrong for non-ASCII text when using certain providers in v1.1.1 #334

@frankomondo

Description

What happened
When extracting entities from text containing non-ASCII characters (Chinese, Japanese, accented Latin, emojis, etc.), the char_interval values returned are incorrect — they often point to the wrong positions or are shifted.

This seems to happen more consistently with some providers/models than others.

Minimal example to reproduce

from langextract import Extractor

extractor = Extractor(
    provider="gemini",                # also happens with "ollama" + llama3.1
    model="gemini-1.5-flash-001",     # try also "ollama/llama3.1:8b"
)

text = "这是一个测试句子。Hello 世界!This is a test with 价格 $99.99 and こんにちは."

result = extractor.extract(
    text=text,
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "entity": {"type": "string"},
                "char_interval": {"type": "array", "items": {"type": "integer"}}
            },
            "required": ["entity", "char_interval"]
        }
    },
    prompt="Extract every named entity (person, location, price, language word) and give its exact character start-end interval in the original text."
)

print(result)
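Not confirmed, but the symptom matches intervals being computed over the UTF-8 byte encoding of the text rather than over Unicode code points: every CJK character adds two extra bytes, so offsets drift further the more non-ASCII text precedes an entity. A minimal sketch of that mismatch (the byte-offset hypothesis is my assumption, not verified against the library internals), plus a conversion that recovers the code-point offset:

```python
# Suspected cause (assumption, not confirmed): char_interval holds byte
# offsets into the UTF-8 encoding, while callers index the Python string
# by Unicode code points.
text = "这是一个测试句子。Hello 世界!"
target = "Hello"

char_start = text.index(target)  # code-point offset
byte_start = text.encode("utf-8").index(target.encode("utf-8"))  # byte offset

# Nine CJK characters precede "Hello", each 3 bytes in UTF-8, so the
# byte offset is three times the code-point offset here.
print(char_start, byte_start)  # 9 27

# Workaround: map a byte offset back to a code-point offset by decoding
# the prefix of the encoded text up to that byte position.
recovered = len(text.encode("utf-8")[:byte_start].decode("utf-8"))
assert recovered == char_start
```

If the shifted intervals from the repro above always line up after this conversion, that would point at a byte-vs-code-point mix-up somewhere in the provider integration rather than at the model itself.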
