Entity position information goes wrong when emojis appear before entities #3

Open
@ryu1kn

Thank you for developing/maintaining this tool 🙏

I've encountered seemingly wrong entity span information when a text contains emojis.

How to reproduce the issue

  1. First, in a Jupyter notebook, run:

    import jupyterannotate
    
    TEXTS = [
        "Hi John!",
        "👍 John!",
        "👍🏿 John!",
    ]
    
    annotation_widget = jupyterannotate.AnnotateWidget(
        docs=TEXTS,
        labels=["NAME"]
    )
    annotation_widget
  2. Annotate "John" as NAME in all 3 texts.

  3. The spans are then set like this:

    spans = annotation_widget.spans
    spans
    
    # [[{'start': 3, 'end': 7, 'text': 'John', 'label': 'NAME'}],
    #  [{'start': 3, 'end': 7, 'text': 'John', 'label': 'NAME'}],
    #  [{'start': 5, 'end': 9, 'text': 'John', 'label': 'NAME'}]]
  4. Slicing the texts with this position information should give "John" in every case, but it does not when emojis appear before "John".

    for i in range(len(TEXTS)):
        print(f'{i+1}. "{TEXTS[i][spans[i][0]["start"] : spans[i][0]["end"]]}"')
    
    # 1. "John"
    # 2. "ohn!"
    # 3. "hn!"

Expected behaviour

It should print the following (all the same):

1. "John"
2. "John"
3. "John"

Possible cause

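It seems to be related to the difference in how Python and JavaScript count string length: JavaScript strings are indexed by UTF-16 code units, while Python strings are indexed by Unicode code points, so an emoji outside the Basic Multilingual Plane counts as 2 in JavaScript but 1 in Python. cf. JavaScript vs Python emoji length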
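
The mismatch can be checked from Python alone. This is a minimal sketch, assuming the widget reports offsets in UTF-16 code units (which is how JavaScript indexes strings); `utf16_len` is a hypothetical helper, not part of jupyterannotate. It compares both counts for the text that precedes "John" in each example:

```python
def utf16_len(text: str) -> int:
    """Count UTF-16 code units, i.e. what JavaScript's `.length` reports."""
    # Every UTF-16 code unit is exactly 2 bytes in the utf-16-le encoding.
    return len(text.encode("utf-16-le")) // 2


for prefix in ["Hi ", "👍 ", "👍🏿 "]:
    # len() counts Unicode code points; utf16_len() counts UTF-16 code units.
    print(repr(prefix), len(prefix), utf16_len(prefix))

# 'Hi ' 3 3
# '👍 ' 2 3
# '👍🏿 ' 3 5
```

The last column matches the `start` values reported by the widget (3, 3, 5), while the middle column is where "John" actually starts for Python slicing.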

Since this library runs in Jupyter notebooks (and the JavaScript used behind the scenes is an implementation detail), it would be great if span positions were reported in a Python-friendly way, so that further processing can be done in the same notebook without any issues.
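
Until then, a possible workaround on the notebook side is to convert the reported offsets before slicing. This is only a sketch under the assumption that `start` and `end` are UTF-16 code-unit offsets; `utf16_to_codepoint_offset` and `fix_span` are hypothetical helpers, not part of jupyterannotate:

```python
def utf16_to_codepoint_offset(text: str, utf16_offset: int) -> int:
    """Convert a UTF-16 code-unit offset into a Python str index."""
    encoded = text.encode("utf-16-le")
    # Each UTF-16 code unit is 2 bytes; decoding the prefix counts code points.
    # Raises UnicodeDecodeError if the offset falls inside a surrogate pair.
    return len(encoded[: utf16_offset * 2].decode("utf-16-le"))


def fix_span(text: str, span: dict) -> dict:
    """Return a copy of `span` whose offsets work with Python slicing."""
    return {
        **span,
        "start": utf16_to_codepoint_offset(text, span["start"]),
        "end": utf16_to_codepoint_offset(text, span["end"]),
    }


text = "👍🏿 John!"
span = {"start": 5, "end": 9, "text": "John", "label": "NAME"}
fixed = fix_span(text, span)
print(text[fixed["start"] : fixed["end"]])  # John
```

For pure-ASCII texts the conversion leaves the offsets unchanged, so it can be applied to every span uniformly.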
