Description
Currently, the sentencizer starts a new sentence whenever a "." character is followed by a capital letter.
This is problematic for some codes and acronyms that follow this pattern (for example "V.I.H"), since they get split across several sentences.
ADICAP codes, which the eds.adicap pipeline analyses, can appear in text in the form "code ADICAP : B.H.HP.A7A0". Because the code is split into several sentences, the eds.contextual-matcher used under the hood does not capture it.
A solution would be to start a new sentence only when a "." is followed by a space, a newline, or another separator, and then a capital letter.
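The proposed rule can be sketched in plain Python, independently of EDS-NLP. This is only an illustration of the idea, not the library's implementation: the `SENTENCE_BOUNDARY` pattern and the `split_sentences` helper are hypothetical names, and the uppercase character class is an assumption meant to cover common French capitals.

```python
import re

# Hypothetical rule: a period ends a sentence only when it is followed by
# whitespace (space, newline, ...) and then an uppercase letter.
# The class [A-ZÀ-ÖØ-Þ] is an assumption covering accented Latin capitals.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-ZÀ-ÖØ-Þ])")


def split_sentences(text: str) -> list[str]:
    """Split text on the boundary rule above, dropping empty pieces."""
    return [piece for piece in SENTENCE_BOUNDARY.split(text) if piece]


# The ADICAP code has no whitespace after its periods, so it stays whole.
print(split_sentences("code ADICAP : B.H.HP.A7A0"))
# A period followed by a space and a capital letter still ends a sentence.
print(split_sentences("Le patient est suivi pour V.I.H. Il est stable."))
```

Such a rule would still split after abbreviations like "Dr. Martin", so it likely needs to be combined with an abbreviation list or tokenizer-level handling.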
How to reproduce the bug
import spacy
nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
code = "B.H.HP.A7A0"
for sent in nlp(code).sents:
    print(sent.text)
B.
H.
HP.
A7A0
Your Environment
Operating System: Ubuntu 22.04.1 LTS
Python Version Used: 3.10.6
spaCy Version Used: 3.4.1
EDS-NLP Version Used: 0.7.4
Environment Information:
etienneguevel changed the title from "Feature request: Modification of the way the sentencizer cuts the document" to "Sentencizer cut codes in different sentences while it's the same token" on Jan 11, 2023
Thanks for this issue! This has been solved in #192 by changing the tokenization rules to distinguish "real" end-of-sentence periods from abbreviation periods.