What happened
When extracting entities from text containing non-ASCII characters (Chinese, Japanese, accented Latin, emoji, etc.), the char_interval values returned are incorrect: they often point to the wrong positions or are shifted. The problem reproduces more consistently with some providers/models than with others.
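One pattern that would be consistent with the shifts (an assumption on my part, not something I have confirmed in the library): offsets being counted in UTF-8 bytes rather than Unicode code points, so every multi-byte character before an entity inflates its reported start position. A quick way to see how far the two counts diverge on the sample text:

```python
# Compare the code-point offset of an entity with its UTF-8 byte offset.
# If the returned intervals track the byte offsets, that would explain the shift.
text = "这是一个测试句子。Hello 世界!This is a test with 价格 $99.99 and こんにちは."
entity = "$99.99"

char_start = text.index(entity)  # offset in code points (Python string indexing)
byte_start = text.encode("utf-8").index(entity.encode("utf-8"))  # offset in UTF-8 bytes

# Each CJK character before the entity occupies 3 bytes in UTF-8 but only
# 1 code point, so byte_start is noticeably larger than char_start.
print(char_start, byte_start)
```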
Minimal example to reproduce
```python
from langextract import Extractor

extractor = Extractor(
    provider="gemini",             # also happens with "ollama" + llama3.1
    model="gemini-1.5-flash-001",  # try also "ollama/llama3.1:8b"
)

text = "这是一个测试句子。Hello 世界!This is a test with 价格 $99.99 and こんにちは."

result = extractor.extract(
    text=text,
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "entity": {"type": "string"},
                "char_interval": {"type": "array", "items": {"type": "integer"}}
            },
            "required": ["entity", "char_interval"]
        }
    },
    prompt="Extract every named entity (person, location, price, language word) and give its exact character start-end interval in the original text."
)
print(result)
```
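For reference, this is the check I run against the output (a sketch; the items below are illustrative, not actual model output). It slices the original text by Python code points, which is what I would expect char_interval to count:

```python
text = "这是一个测试句子。Hello 世界!This is a test with 价格 $99.99 and こんにちは."

# Illustrative items shaped like the schema above (hypothetical, not real output).
items = [
    {"entity": "$99.99",
     "char_interval": [text.index("$99.99"), text.index("$99.99") + len("$99.99")]},
]

# A correct interval must satisfy: text[start:end] == entity.
# On ASCII-only inputs this holds for the real output; on the mixed-script
# text above it frequently does not.
for item in items:
    start, end = item["char_interval"]
    piece = text[start:end]
    status = "ok" if piece == item["entity"] else "MISMATCH"
    print(f"{status}: {item['entity']!r} -> text[{start}:{end}] = {piece!r}")
```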