Description
Thank you for developing/maintaining this tool 🙏
I'm seeing seemingly wrong entity span offsets when a text contains emojis.
How to reproduce the issue
- First, in a Jupyter notebook:

      import jupyterannotate

      TEXTS = [
          "Hi John!",
          "👍 John!",
          "👍🏿 John!",
      ]
      annotation_widget = jupyterannotate.AnnotateWidget(
          docs=TEXTS,
          labels=["NAME"]
      )
      annotation_widget
- Annotate "John" as NAME in all 3 texts.
- The spans are set like this:

      spans = annotation_widget.spans
      spans
      # [[{'start': 3, 'end': 7, 'text': 'John', 'label': 'NAME'}],
      #  [{'start': 3, 'end': 7, 'text': 'John', 'label': 'NAME'}],
      #  [{'start': 5, 'end': 9, 'text': 'John', 'label': 'NAME'}]]
- Slicing each text with its span offsets should give "John" in every case, but it does not when an emoji precedes "John":

      for i in range(len(TEXTS)):
          print(f'{i+1}. "{TEXTS[i][spans[i][0]["start"] : spans[i][0]["end"]]}"')
      # 1. "John"
      # 2. "ohn!"
      # 3. "hn!"
Expected behaviour
It should print the following (all the same):
1. "John"
2. "John"
3. "John"
Possible cause
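It seems to be related to the difference in how Python and JavaScript count string length (cf. "JavaScript vs Python emoji length"). JavaScript's `String.length` counts UTF-16 code units, so each of the emoji characters above counts as 2, while Python's `len` counts code points, so each counts as 1.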
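To illustrate the counting difference, here is a small sketch that computes both lengths for the three texts. The helper name `utf16_length` is made up for this example; it only mimics what JavaScript would report as `length` and is not part of jupyterannotate.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's String.length reports."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids adding a byte-order mark.
    return len(text.encode("utf-16-le")) // 2


TEXTS = ["Hi John!", "👍 John!", "👍🏿 John!"]

for text in TEXTS:
    # len() counts code points; utf16_length() counts UTF-16 code units.
    print(repr(text), len(text), utf16_length(text))
```

The two counts agree only for the first text, which matches the texts where the slices go wrong.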
Since this library runs in Jupyter notebooks (and the JavaScript used behind the scenes is an implementation detail), it would be great if span positions were reported in a Python-friendly way, so that we can do further processing in the same notebook without any issues.
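In the meantime, a possible workaround on the Python side is sketched below. It assumes the reported `start`/`end` values are UTF-16 code unit offsets, which is only my guess from the numbers above, and the names `utf16_to_codepoint_offset` and `fix_span` are hypothetical helpers, not part of jupyterannotate.

```python
def utf16_to_codepoint_offset(text: str, utf16_offset: int) -> int:
    """Convert an offset in UTF-16 code units to a Python str index."""
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane take two UTF-16 code units.
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


def fix_span(text: str, span: dict) -> dict:
    """Return a copy of the span with start/end converted to Python indices."""
    return {
        **span,
        "start": utf16_to_codepoint_offset(text, span["start"]),
        "end": utf16_to_codepoint_offset(text, span["end"]),
    }


# Assumed input: the third text and the span reported for it in this issue.
text = "👍🏿 John!"
span = {"start": 5, "end": 9, "text": "John", "label": "NAME"}
fixed = fix_span(text, span)
print(text[fixed["start"]:fixed["end"]])
```

With the spans from this report, slicing with the converted offsets gives "John" for all three texts, but a fix inside the widget would still be preferable to doing this in every notebook.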