
Bug / unexpected behavior: char_interval wrong for non-ASCII text when using certain providers in v1.1.1 #334

@frankomondo

Description

What happened
When extracting entities from text containing non-ASCII characters (Chinese, Japanese, accented Latin, emojis, etc.), the char_interval values returned are incorrect — they often point to the wrong positions or are shifted.

This seems to happen more consistently with some providers/models than others.

Minimal example to reproduce

from langextract import Extractor

extractor = Extractor(
    provider="gemini",                # also happens with "ollama" + llama3.1
    model="gemini-1.5-flash-001",     # try also "ollama/llama3.1:8b"
)

text = "这是一个测试句子。Hello 世界!This is a test with 价格 $99.99 and こんにちは."

result = extractor.extract(
    text=text,
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "entity": {"type": "string"},
                "char_interval": {"type": "array", "items": {"type": "integer"}}
            },
            "required": ["entity", "char_interval"]
        }
    },
    prompt="Extract every named entity (person, location, price, language word) and give its exact character start-end interval in the original text."
)

print(result)
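Not confirmed, but the symptom matches intervals being computed over the UTF-8 byte encoding of the text rather than over Unicode code points: every CJK character adds two extra bytes, so offsets drift further the more non-ASCII text precedes an entity. A minimal sketch of that mismatch (the byte-offset hypothesis is my assumption, not verified against the library internals), plus a conversion that recovers the code-point offset:

```python
# Suspected cause (assumption, not confirmed): char_interval holds byte
# offsets into the UTF-8 encoding, while callers index the Python string
# by Unicode code points.
text = "这是一个测试句子。Hello 世界!"
target = "Hello"

char_start = text.index(target)  # code-point offset
byte_start = text.encode("utf-8").index(target.encode("utf-8"))  # byte offset

# Nine CJK characters precede "Hello", each 3 bytes in UTF-8, so the
# byte offset is three times the code-point offset here.
print(char_start, byte_start)  # 9 27

# Workaround: map a byte offset back to a code-point offset by decoding
# the prefix of the encoded text up to that byte position.
recovered = len(text.encode("utf-8")[:byte_start].decode("utf-8"))
assert recovered == char_start
```

If the shifted intervals from the repro above always line up after this conversion, that would point at a byte-vs-code-point mix-up somewhere in the provider integration rather than at the model itself.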
