Entity position information goes wrong when emojis appear before entities #3

Open
ryu1kn opened this issue Dec 7, 2023 · 1 comment


ryu1kn commented Dec 7, 2023

Thank you for developing/maintaining this tool 🙏

The entity span positions appear to be wrong when a text contains emojis before the entity.

How to reproduce the issue

  1. In a Jupyter notebook, create the widget:

    import jupyterannotate
    
    TEXTS = [
        "Hi John!",
        "👍 John!",
        "👍🏿 John!",
    ]
    
    annotation_widget = jupyterannotate.AnnotateWidget(
        docs=TEXTS,
        labels=["NAME"]
    )
    annotation_widget
  2. Annotate "John" as NAME in all 3 texts.

  3. The spans are then set as follows:

    spans = annotation_widget.spans
    spans
    
    # [[{'start': 3, 'end': 7, 'text': 'John', 'label': 'NAME'}],
    #  [{'start': 3, 'end': 7, 'text': 'John', 'label': 'NAME'}],
    #  [{'start': 5, 'end': 9, 'text': 'John', 'label': 'NAME'}]]
  4. Slicing each text with its span positions should give "John" every time, but it does not when emojis appear before "John".

    for i in range(len(TEXTS)):
        print(f'{i+1}. "{TEXTS[i][spans[i][0]["start"] : spans[i][0]["end"]]}"')
    
    # 1. "John"
    # 2. "ohn!"
    # 3. "hn!"

Expected behaviour

It should print the following (the same result for all three texts):

1. "John"
2. "John"
3. "John"

Possible cause

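It seems to be related to the difference in how Python and JavaScript count string length: JavaScript counts UTF-16 code units, so a character outside the Basic Multilingual Plane (such as most emojis) counts as 2, while Python counts Unicode code points, so the same character counts as 1. c.f. JavaScript vs Python emoji length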
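To illustrate the counting difference without the widget, here is a minimal sketch. The helper `utf16_len` is a hypothetical name introduced here, not part of jupyterannotate; it assumes that the number of UTF-16 code units matches what JavaScript reports for string length. The emojis are written as escapes so the code points are unambiguous.

```python
# Widget-free illustration; utf16_len is a hypothetical helper, not library API.
THUMBS_UP = "\U0001F44D"       # thumbs-up emoji (outside the BMP)
DARK_SKIN_TONE = "\U0001F3FF"  # skin tone modifier (also outside the BMP)

TEXTS = [
    "Hi John!",
    f"{THUMBS_UP} John!",
    f"{THUMBS_UP}{DARK_SKIN_TONE} John!",
]


def utf16_len(s: str) -> int:
    """Count UTF-16 code units (2 bytes each in UTF-16-LE, no BOM)."""
    return len(s.encode("utf-16-le")) // 2


offsets = []
for text in TEXTS:
    py_start = text.index("John")             # offset in Unicode code points
    utf16_start = utf16_len(text[:py_start])  # offset in UTF-16 code units
    offsets.append((py_start, utf16_start))
    print(f"{text!r}: Python start={py_start}, UTF-16 start={utf16_start}")
```

The UTF-16 starts computed this way (3, 3, 5) match the `start` values the widget reported above, which is consistent with the positions being counted on the JavaScript side.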

Since this library runs in a Jupyter notebook (and the JavaScript used behind the scenes is an implementation detail), it would be great if span positions were reported in a Python-friendly way, so that further processing can be done in the same notebook without any issues.


ryu1kn commented Dec 11, 2023

This is my workaround for now: I interact with the library through a wrapper that adjusts the spans.

from itertools import accumulate

class JupyterAnnotateWrap:
    def __init__(self, jupyter_annotate_widget):
        self._widget = jupyter_annotate_widget

    @property
    def spans(self):
        return self._shift_spans(JupyterAnnotateWrap._to_py_pos, self._widget.spans)

    @spans.setter
    def spans(self, py_spans):
        self._widget.spans = self._shift_spans(JupyterAnnotateWrap._to_js_pos, py_spans)

    def _shift_spans(self, pos_shift_fn, new_spans_for_docs):
        def shift_span(doc, span):
            # Dict union with | requires Python 3.9+
            return span | {
                'start': pos_shift_fn(doc, span['start']),
                'end': pos_shift_fn(doc, span['end'])
            }

        return [[shift_span(self._widget.docs[idx], s) for s in spans_for_doc]
                for idx, spans_for_doc in enumerate(new_spans_for_docs)]

    @property
    def widget(self):
        return self._widget

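    @staticmethod
    def _to_py_pos(text, pos_js):
        # Width of each character in UTF-16 code units (what JavaScript counts)
        char_lengths = [1 if JupyterAnnotateWrap._is_single_utf16(c) else 2 for c in text]
        # char_positions[i] is the UTF-16 offset at which Python character i starts
        char_positions = list(accumulate([0] + char_lengths))
        # Raises ValueError if pos_js falls inside a surrogate pair
        return char_positions.index(pos_js)

    @staticmethod
    def _to_js_pos(text, pos_py):
        # Each character outside the BMP before pos_py adds one extra UTF-16 code unit
        return pos_py + sum(not JupyterAnnotateWrap._is_single_utf16(c) for c in text[0:pos_py])

    # c.f. https://stackoverflow.com/questions/65971218/python3-counting-utf-16-code-points-in-a-string#answer-65972323
    @staticmethod
    def _is_single_utf16(c: str): return ord(c) < 2**16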

Create a wrapper around the widget. The widget itself is accessible through the widget property.

wrapper = JupyterAnnotateWrap(annotation_widget)
wrapper.widget

With the wrapper, I now get this:

spans = wrapper.spans

for i in range(len(TEXTS)):
    print(f'{i+1}. "{TEXTS[i]}" --> "{TEXTS[i][spans[i][0]["start"] : spans[i][0]["end"]]}"')

# 1. "Hi John!" --> "John"
# 2. "👍 John!" --> "John"
# 3. "👍🏿 John!" --> "John"
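For anyone who only needs the offset conversion without the wrapper class, the same adjustment could be sketched by round-tripping through a UTF-16 encoding. The function names `js_to_py_offset` and `py_to_js_offset` are hypothetical, and the span values are copied from the widget output reported earlier in this issue; this is an alternative sketch, not what the library does.

```python
# Hypothetical standalone helpers; not part of jupyterannotate.
THUMBS_UP = "\U0001F44D"       # thumbs-up emoji
DARK_SKIN_TONE = "\U0001F3FF"  # skin tone modifier

TEXTS = [
    "Hi John!",
    f"{THUMBS_UP} John!",
    f"{THUMBS_UP}{DARK_SKIN_TONE} John!",
]

# (start, end) pairs as reported by the widget in the original report
WIDGET_SPANS = [(3, 7), (3, 7), (5, 9)]


def js_to_py_offset(text: str, js_offset: int) -> int:
    """Convert a UTF-16 code-unit offset to a Python code-point offset."""
    # Each UTF-16 code unit is 2 bytes in UTF-16-LE.
    prefix = text.encode("utf-16-le")[: 2 * js_offset]
    # An offset inside a surrogate pair is not a valid character boundary,
    # so decoding such a prefix is expected to fail rather than return a value.
    return len(prefix.decode("utf-16-le"))


def py_to_js_offset(text: str, py_offset: int) -> int:
    """Convert a Python code-point offset to a UTF-16 code-unit offset."""
    return len(text[:py_offset].encode("utf-16-le")) // 2


sliced = []
for text, (start, end) in zip(TEXTS, WIDGET_SPANS):
    py_start = js_to_py_offset(text, start)
    py_end = js_to_py_offset(text, end)
    sliced.append(text[py_start:py_end])

print(sliced)
```

Encoding the prefix is simpler than accumulating per-character widths, at the cost of re-encoding the text for every offset; for short annotation texts that should rarely matter.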
