Fixes on tokeniser, normalisation, qualifiers and CI #329
base: master
Span in BaseQualifier.process
Coverage Report
Files without new missing coverage
263 files skipped due to complete coverage. Coverage failure: total of 97.77% is less than 97.78% ❌
    assert not (max_steps and max_epochs), "Use only steps or epochs"
    if max_epochs:
        max_steps = int(0.9 * (4464 / batch_size[0]))
This looks oddly specific 🤔
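One way the hard-coded value could be avoided is to pass the numbers in explicitly. This is only a sketch: it assumes that 4464 is a number of documents and 0.9 a training fraction, which the diff does not state, and `steps_per_epoch`, `n_docs` and `train_fraction` are hypothetical names that do not appear in the PR.

```python
def steps_per_epoch(n_docs: int, batch_size: int, train_fraction: float = 0.9) -> int:
    """Number of optimisation steps needed to see the training split once.

    Mirrors the expression in the diff, with the constants turned into
    parameters (assumed meaning: dataset size and training fraction).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return int(train_fraction * (n_docs / batch_size))


# Same value as the hard-coded expression for a batch size of 32.
example_steps = steps_per_epoch(n_docs=4464, batch_size=32)
```

With the dataset size passed as an argument, the same code keeps working when the corpus changes.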


Description
Regarding tokenization:
In texts, a word that is too long for its line can be split with "-" at the line break. This can impede matching: `dia-\nbete` won't be matched by a simple "diabete" regex. To this end:
- `EDS.Tokenizer` now treats `-\n` as a token by itself
- `eds.pollution` can tag this token as to-be-discarded

Regarding `ignore_space_tokens`
With `ignore_space_tokens=True`, using `edsnlp.utils.doc_to_text.get_text` (which is used under the hood by e.g. the regex matcher) will remove linebreaks, which can be problematic in texts with enumerations without trailing spaces. E.g., `get_text("Tabac\nAlcool\nSport", "TEXT", ignore_space_tokens=True)` would output `"TabacAlcoolSport"`. Now, we replace this `\n` with a space when necessary.

Regarding the status mapping of behavior/disorder pipes
For entities matched by those pipes, there is:
- a `_.status` attribute, by default set to 1, but that can take the value 2
- a `_.detailed_status` attribute, which is actually a getter that uses a mapping dictionary to get the human-readable status

When loading already-annotated docs, it can occur that a status is automatically set to None. To avoid a `KeyError`, we now handle this `status=None` case.

Regarding CI
`ubuntu-latest` doesn't support Python 3.7 anymore, so we should use `ubuntu-22`

Checklist
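To make the tokenization, `ignore_space_tokens` and status-mapping points of the description concrete, here is a small standalone sketch in plain Python. It does not call edsnlp: `join_tokens`, `detailed_status` and the labels in `STATUS_MAPPING` are hypothetical stand-ins, and tokens are modelled as `(text, trailing_whitespace)` pairs, which is an assumption about how the real tokenizer splits `"Tabac\nAlcool\nSport"`.

```python
import re

# 1. A hyphenated line break hides the word from a plain regex.
text = "dia-\nbete"
pattern = re.compile(r"diabete")
found_raw = pattern.search(text) is not None
# Removing "-\n" stands in for discarding that token before matching.
found_clean = pattern.search(text.replace("-\n", "")) is not None


# 2. Joining tokens while skipping whitespace-only tokens.
def join_tokens(tokens, ignore_space_tokens=True):
    """Join (text, trailing_whitespace) pairs into a single string.

    A skipped whitespace token is replaced by one space when the text
    built so far does not already end with a space.
    """
    parts = []
    for tok_text, trailing_ws in tokens:
        if ignore_space_tokens and tok_text.isspace():
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
            continue
        parts.append(tok_text + trailing_ws)
    return "".join(parts)


tokens = [("Tabac", ""), ("\n", ""), ("Alcool", ""), ("\n", ""), ("Sport", "")]
joined = join_tokens(tokens)

# 3. Status lookup that tolerates status=None.
STATUS_MAPPING = {1: "PRESENT", 2: "ABSTINENCE"}  # placeholder labels


def detailed_status(status):
    # dict.get returns None for an unknown or None status instead of raising KeyError.
    return STATUS_MAPPING.get(status)
```

Here `joined` keeps the three enumerated words separated, and `detailed_status(None)` returns `None` rather than raising.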