Commit
Bug fixes, table pipe is more robust, Updated doc
Christel Gérardin committed May 24, 2023
1 parent fed2a02 commit 07871ed
Showing 5 changed files with 172 additions and 18 deletions.
9 changes: 6 additions & 3 deletions docs/pipelines/misc/measurements.md
@@ -8,14 +8,15 @@ We use simple regular expressions to extract and normalize measurements, and use
By default, the `eds.measurements` pipeline lets you match all measurements, i.e. measurements in most units as well as unitless measurements. If a unit is not in our register,
you can add it manually. Otherwise, the measurement will be matched without its unit.

If you prefer matching specific measurements only, you can create your own measurement config and set the `all_measurements` parameter to `False`. Nevertheless, some default measurement configs are already provided out of the box:

| Measurement name | Example |
| ---------------- | ---------------------- |
| `eds.size` | `1m50`, `1.50m` |
| `eds.weight` | `12kg`, `1kg300` |
| `eds.bmi` | `BMI: 24`, `24 kg.m-2` |
| `eds.volume` | `2 cac`, `8ml` |
| `eds.bool` | `positive`, `negatif` |

The normalized value can then be accessed via the `span._.value` attribute and converted on the fly to a desired unit (eg `span._.value.g_per_cl` or `span._.value.kg_per_m3` for a density).
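A minimal sketch of how such attribute-based, on-the-fly conversion can work (an illustration only, not the edsnlp implementation; the class name and conversion table are hypothetical):

```python
# Hypothetical sketch: a value stored in a base unit exposes derived
# units through __getattr__, in the spirit of `span._.value.cm`.
class SimpleValue:
    # conversion factors from the base unit (metres here)
    FACTORS = {"m": 1.0, "cm": 100.0, "mm": 1000.0}

    def __init__(self, value, unit="m"):
        # store everything in the base unit
        self.base = value / self.FACTORS[unit]

    def __getattr__(self, unit):
        # called only for attributes not found normally,
        # so `base` and `FACTORS` are unaffected
        try:
            return self.base * self.FACTORS[unit]
        except KeyError:
            raise AttributeError(unit)


height = SimpleValue(1.50, unit="m")
print(height.cm)  # 150.0
print(height.mm)  # 1500.0
```

The same pattern extends to compound units (e.g. densities) by registering one factor per unit name.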

@@ -25,15 +26,17 @@ The measurements that can be extracted can have one or many of the following characteristics:
- Measurements with range indication (especially `<` or `>`)
- Measurements with power

The measurement can be written in many complex forms. Among them, this pipe can detect:
- Measurements with range indication, numerical value, power and units in many different orders and separated by customizable stop words
- Composed units (eg `1m50`)
- Measurements with "unitless patterns", i.e. some textual information next to a numerical value which allows us to retrieve a unit even if it is not written (eg in the text `Height: 80`, this pipe will detect the numerical value `80` and match it to the unit `kg`)
- Elliptic enumerations (eg `32, 33 et 34mol`) of measurements of the same type, splitting the measurements accordingly
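The elliptic-enumeration case can be sketched standalone (a hypothetical helper, not the edsnlp implementation; it ignores decimal commas for simplicity):

```python
import re


def split_enumeration(text):
    """Split an elliptic enumeration like '32, 33 et 34mol':
    values listed before the final 'value + unit' inherit that unit."""
    match = re.fullmatch(
        r"((?:\d+(?:[.,]\d+)?\s*,\s*)*\d+(?:[.,]\d+)?)"  # leading values
        r"\s*et\s*"                                      # French 'and'
        r"(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)",              # last value + unit
        text.strip(),
    )
    if match is None:
        return []
    values = [v.strip() for v in match.group(1).split(",")] + [match.group(2)]
    unit = match.group(3)
    return [(float(v.replace(",", ".")), unit) for v in values]


print(split_enumeration("32, 33 et 34mol"))
# [(32.0, 'mol'), (33.0, 'mol'), (34.0, 'mol')]
```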

## Usage

This pipe works better with the `eds.dates` and `eds.tables` pipes enabled at the same time. These pipes let `eds.measurements` skip dates and run a dedicated matching on each table, benefiting from the structured data.

The matched measurements are labeled with a default measurement name when one is available (eg `eds.size`); otherwise, when `all_measurements` is set to `True`, measurements whose unit is linked to a known dimension are labeled `eds.measurement`.

As said before, each matched measurement can be accessed via `span._.value`. This gives you a `SimpleMeasurement` object with the following attributes:
- `value_range` ("<", "=" or ">")
105 changes: 105 additions & 0 deletions docs/pipelines/misc/tables.md
@@ -0,0 +1,105 @@
# Tables

The `eds.tables` pipeline's role is to detect tables present in a medical document.
We use simple regular expressions to extract table-like text.

## Usage

This pipe lets you match different forms of tables: they may or may not have a frame, and rows can be spread over several consecutive lines (for instance after a bad parsing). You can also indicate the presence of headers with the `col_names` and `row_names` boolean parameters.

Each matched table is returned as a `Span` object. You can then access an equivalent dictionary-formatted table through the `table` extension, or use `to_pd_table()` to get the equivalent pandas DataFrame. The dictionary keys are determined as follows:
- If `col_names` is `True`, the dictionary keys are the column names (str).
- Else, if `row_names` is `True`, the dictionary keys are the row names (str).
- Otherwise, the dictionary keys are the column indexes (int).
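The key-selection rules above can be sketched as follows (a hypothetical helper illustrating the rules, not the pipeline's actual code):

```python
def table_keys(rows, col_names=False, row_names=False):
    """Choose dictionary keys for a parsed table, following the rules above."""
    if col_names:
        return rows[0]                       # header row values (str)
    elif row_names:
        return [row[0] for row in rows]      # first-column values (str)
    else:
        return list(range(len(rows[0])))     # column indexes (int)


rows = [["Leucocytes", "x10*9/L", "4.97"], ["Hématies", "x10*12/L", "4.68"]]
print(table_keys(rows))                      # [0, 1, 2]
print(table_keys(rows, row_names=True))      # ['Leucocytes', 'Hématies']
```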

`to_pd_table()` can be customised with the `as_spans` parameter: if set to `True`, the pandas DataFrame contains the cells as spans; otherwise it contains the cells as raw strings.

```python
import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.tables")

text = """
SERVICE
MEDECINE INTENSIVE –
REANIMATION
Réanimation / Surveillance Continue
Médicale
COMPTE RENDU D'HOSPITALISATION du 05/06/2020 au 10/06/2020
Madame DUPONT Marie, née le 16/05/1900, âgée de 20 ans, a été hospitalisée en réanimation du
05/06/1920 au 10/06/1920 pour intoxication médicamenteuse volontaire.
Examens complémentaires
Hématologie
Numération
Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
Hématocrite ¦% ¦44.2 ¦39.2-48.6
VGM ¦fL ¦94.4 + ¦79.6-94
TCMH ¦pg ¦31.6 ¦27.3-32.8
CCMH ¦g/dL ¦33.5 ¦32.4-36.3
Plaquettes ¦x10*9/L ¦191 ¦172-398
VMP ¦fL ¦11.5 + ¦7.4-10.8
Sur le plan neurologique : Devant la persistance d'une confusion à distance de l'intoxication au
...
2/2Pat : <NOM> <Prenom>|F |<date> | <ipp> |Intitulé RCP
"""

doc = nlp(text)

# A table span
table = doc.spans["tables"][0]
# Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
# Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
# Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
# Hématocrite ¦% ¦44.2 ¦39.2-48.6
# VGM ¦fL ¦94.4 + ¦79.6-94
# TCMH ¦pg ¦31.6 ¦27.3-32.8
# CCMH ¦g/dL ¦33.5 ¦32.4-36.3
# Plaquettes ¦x10*9/L ¦191 ¦172-398
# VMP ¦fL ¦11.5 + ¦7.4-10.8

# Convert span to Pandas table
df = table._.to_pd_table(as_spans=False)
type(df)
# >> pandas.core.frame.DataFrame
```
The resulting pandas DataFrame:
| | 0 | 1 | 2 | 3 |
| ---: | :---------- | :------- | :----- | :-------- |
| 0 | Leucocytes | x10*9/L | 4.97 | 4.09-11 |
| 1 | Hématies | x10*12/L | 4.68 | 4.53-5.79 |
| 2 | Hémoglobine | g/dL | 14.8 | 13.4-16.7 |
| 3 | Hématocrite | % | 44.2 | 39.2-48.6 |
| 4 | VGM | fL | 94.4 + | 79.6-94 |
| 5 | TCMH | pg | 31.6 | 27.3-32.8 |
| 6 | CCMH | g/dL | 33.5 | 32.4-36.3 |
| 7 | Plaquettes | x10*9/L | 191 | 172-398 |
| 8 | VMP | fL | 11.5 + | 7.4-10.8 |

## Declared extensions

The `eds.tables` pipeline declares two [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes) on the `Span` object. The first one is the `to_pd_table()` method, which returns a parsed pandas version of the table. The second one is `table`, which contains the table stored as a dictionary with cells as `Span` objects.

## Configuration

The pipeline can be configured using the following parameters:

| Parameter | Explanation | Default |
| ----------------- | ------------------------------------------------ | ---------------------- |
| `tables_pattern` | Pattern to identify table spans | `rf"(\b.*{sep}.*\n)+"` |
| `sep_pattern` | Pattern to identify column separation | `r"¦"` |
| `ignore_excluded` | Ignore excluded tokens | `True` |
| `attr` | spaCy attribute to match on, eg `NORM` or `TEXT` | `"TEXT"` |

## Authors and citation

The `eds.tables` pipeline was developed by AP-HP's Data Science team.
6 changes: 3 additions & 3 deletions edsnlp/pipelines/misc/tables/factory.py
@@ -1,4 +1,4 @@
from typing import Dict, List, Optional, Union
from typing import List, Optional

from spacy.language import Language

@@ -20,8 +20,8 @@
def create_component(
nlp: Language,
name: str,
tables_pattern: Optional[Dict[str, Union[List[str], str]]],
sep_pattern: Optional[str],
tables_pattern: Optional[List[str]],
sep_pattern: Optional[List[str]],
attr: str,
ignore_excluded: bool,
col_names: Optional[bool] = False,
4 changes: 2 additions & 2 deletions edsnlp/pipelines/misc/tables/patterns.py
@@ -1,2 +1,2 @@
sep = r"¦"
regex = rf"(?:{sep}?(?:[^{sep}\n]*{sep})+[^{sep}\n]*{sep}?\n)+"
sep = [r"¦", r"|"]
regex = [r"(?:¦?(?:[^¦\n]*¦)+[^¦\n]*¦?\n)+", r"(?:\|?(?:[^\|\n]*\|)+[^\|\n]*\|?\n)+"]
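The updated patterns can be exercised directly with Python's `re` module; here the `¦`-separator pattern is applied to two rows from the documentation example (a standalone sketch, independent of the pipeline):

```python
import re

# First pattern from patterns.py: lines made of cells separated by "¦"
table_regex = r"(?:¦?(?:[^¦\n]*¦)+[^¦\n]*¦?\n)+"

text = (
    "Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11\n"
    "Hématies ¦x10*12/L¦4.68 ¦4.53-5.79\n"
    "Sur le plan neurologique : ...\n"
)

match = re.search(table_regex, text)
# Both table rows match as one block; the prose line has no "¦",
# so it is excluded from the span.
print(match.group(0).count("\n"))  # 2
```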
66 changes: 56 additions & 10 deletions edsnlp/pipelines/misc/tables/tables.py
@@ -1,4 +1,4 @@
from typing import Dict, Optional, Union
from typing import List, Optional

import pandas as pd
from spacy.language import Language
@@ -19,10 +19,10 @@ class TablesMatcher:
----------
nlp : Language
spaCy nlp pipeline to use for matching.
tables_pattern : Optional[str]
The regex pattern to identify tables.
sep_pattern : Optional[str]
The regex pattern to identify separators
tables_pattern : Optional[List[str]]
The regex patterns to identify tables.
sep_pattern : Optional[List[str]]
The regex patterns to identify separators
in the detected tables
col_names : Optional[bool]
Whether the tables_pattern matches column names
@@ -39,9 +39,9 @@ class TablesMatcher:
def __init__(
self,
nlp: Language,
tables_pattern: Optional[str],
sep_pattern: Optional[str],
attr: Union[Dict[str, str], str],
tables_pattern: Optional[List[str]],
sep_pattern: Optional[List[str]],
attr: str,
ignore_excluded: bool,
col_names: Optional[bool] = False,
row_names: Optional[bool] = False,
@@ -54,7 +54,7 @@ def __init__(
sep_pattern = patterns.sep

self.regex_matcher = RegexMatcher(attr=attr, ignore_excluded=True)
self.regex_matcher.add("table", [tables_pattern])
self.regex_matcher.add("table", tables_pattern)

self.term_matcher = EDSPhraseMatcher(nlp.vocab, attr=attr, ignore_excluded=True)
self.term_matcher.build_patterns(
@@ -138,7 +138,53 @@ def get_tables(self, matches):
if all(row[-1].start == row[-1].end for row in processed_table):
processed_table = [row[:-1] for row in processed_table]

tables_list.append(processed_table)
# Check if all rows have the same dimension.
# If not, try to merge neighbour rows
# to find a new table
row_len = len(processed_table[0])
if not all(len(row) == row_len for row in processed_table):

# Method to find all possible lengths of the rows
def divisors(n):
result = set()
for i in range(1, int(n**0.5) + 1):
if n % i == 0:
result.add(i)
result.add(n // i)
return sorted(list(result))

if self.col_names:
n_rows = len(processed_table) - 1
else:
n_rows = len(processed_table)

for n_rows_to_merge in divisors(n_rows):
row_len = sum(len(row) for row in processed_table[:n_rows_to_merge])
if all(
sum(
len(row)
for row in processed_table[
i * n_rows_to_merge : (i + 1) * n_rows_to_merge
]
)
== row_len
for i in range(n_rows // n_rows_to_merge)
):
processed_table = [
[
cell
for subrow in processed_table[
i * n_rows_to_merge : (i + 1) * n_rows_to_merge
]
for cell in subrow
]
for i in range(n_rows // n_rows_to_merge)
]
tables_list.append(processed_table)
break
continue
else:
tables_list.append(processed_table)

# Convert to dictionaries according to self.col_names
# and self.row_names
