Commit 4c1378f

📝 Add tutorials in the documentation
1 parent b9b7496 commit 4c1378f

9 files changed

+573
-0
lines changed

docs/tutorials/00_GettingStarted.md

+134
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,134 @@
1+
# Getting started with Melusine
2+
3+
Let's run **emergency detection** with melusine :
4+
5+
* Load a fake email dataset
6+
* Load a demonstration pipeline
7+
* Run the pipeline
8+
* Apply email cleaning transformations
9+
* Apply emergency detection
10+
11+
## Input data
12+
13+
Email datasets typically contain information about:
14+
15+
- Email sender
16+
- Email recipients
17+
- Email subject/header
18+
- Email body
19+
- Attachments data
20+
21+
The present tutorial only makes use of the **body** and **header** data.
22+
23+
| | body | header |
24+
|:---|:---------------------------------|:------------|
25+
| 0 | This is an ëmèrgénçy | Help |
26+
| 1 | How is life ? | Hey ! |
27+
| 2 | Urgent update about Mr. Annoying | Latest news |
28+
| 3 | Please call me now | URGENT |
29+
30+
## Code
31+
32+
Typical code for a melusine-based application looks like this:
33+
34+
```Python
35+
--8<--
36+
docs_src/GettingStarted/tutorial001.py:simple_pipeline
37+
--8<--
38+
```
39+
40+
1. This tutorial uses one of the default pipeline configurations, `demo_pipeline`. Melusine users will typically define their own pipeline configuration.
41+
See more in the [Configurations tutorial](06_Configurations.md){target=_blank}
42+
43+
## Output data
44+
45+
The pipeline created extra columns in the dataset.
46+
Some columns are temporary variables required by detectors (ex: `normalized_body`)
47+
and some are detection results with direct business value (ex: `emergency_result`).
48+
49+
| | body | header | normalized_body | emergency_result |
50+
|:---|:---------------------------------|:------------|:---------------------------------|:-------------------|
51+
| 0 | This is an ëmèrgénçy | Help | This is an emergency | True |
52+
| 1 | How is life ? | Hey ! | How is life ? | False |
53+
| 2 | Urgent update about Mr. Annoying | Latest news | Urgent update about Mr. Annoying | False |
54+
| 3 | Please call me now | URGENT | Please call me now | True |
55+
56+
## Pipeline steps
57+
58+
Illustration of the pipeline used in the present tutorial :
59+
60+
``` mermaid
61+
---
62+
title: Demonstration pipeline
63+
---
64+
flowchart LR
65+
Input[[Email]] --> A(Cleaner)
66+
A(Cleaner) --> C(Normalizer)
67+
C --> F(Emergency\nDetector)
68+
F --> Output[[Qualified Email]]
69+
```
70+
71+
* `Cleaner` : Cleaning transformations such as uniformization of line breaks (`\r\n` -> `\n`)
72+
* `Normalizer` : Text normalisation to delete/replace non-ASCII characters (`éöà` -> `eoa`)
73+
* `EmergencyDetector` : Detection of urgent emails
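
The three steps above can be approximated in plain Python. This is only a minimal sketch, not melusine's implementation: the function names (`clean`, `normalize`, `detect_emergency`) and the two regex patterns are assumptions chosen to reproduce the output table of this tutorial.

```python
import re
import unicodedata

# Hypothetical patterns, chosen only to reproduce the tutorial's output table
POSITIVE = re.compile(r"urgent|emergency", re.IGNORECASE)
NEGATIVE = re.compile(r"Mr\. Annoying")


def clean(text: str) -> str:
    # Cleaner: uniformize line breaks
    return text.replace("\r\n", "\n")


def normalize(text: str) -> str:
    # Normalizer: decompose accented characters, then drop the non-ASCII marks
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def detect_emergency(header: str, body: str) -> bool:
    # EmergencyDetector: a positive pattern is required, a negative one is forbidden
    text = f"{header}\n{body}"
    return bool(POSITIVE.search(text)) and not NEGATIVE.search(text)


emails = [
    {"header": "Help", "body": "This is an ëmèrgénçy"},
    {"header": "Hey !", "body": "How is life ?"},
    {"header": "Latest news", "body": "Urgent update about Mr. Annoying"},
    {"header": "URGENT", "body": "Please call me now"},
]

for email in emails:
    email["normalized_body"] = normalize(clean(email["body"]))
    email["emergency_result"] = detect_emergency(
        email["header"], email["normalized_body"]
    )
    print(email["normalized_body"], "->", email["emergency_result"])
```

Running the steps in this order matters: the accented word in the first email only matches the positive pattern after normalization.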
74+
75+
76+
!!! info
77+
This demonstration pipeline is kept minimal but typical pipelines include more complex preprocessing and a variety of detectors.
78+
For example, pipelines may contain:
79+
80+
- Email Segmentation : Split email conversation into unitary messages
81+
- ContentTagging : Associate tags (SIGNATURE, FOOTER, BODY) to parts of messages
82+
- Appointment detection : For example, detect "construction work will take place on 01/01/2024" as an appointment email.
83+
- More on preprocessing in the [MelusineTransformers tutorial](02_MelusineTransformers.md){target=_blank}
84+
- More on detectors in the [MelusineDetector tutorial](05a_MelusineDetectors.md){target=_blank}
85+
86+
87+
## Debug mode
88+
89+
End users typically want to know what led melusine to a specific detection result. The debug mode generates additional explainability info.
90+
91+
```Python
92+
--8<--
93+
docs_src/GettingStarted/tutorial002.py:debug_pipeline
94+
--8<--
95+
```
96+
97+
98+
A new column `debug_emergency` is created.
99+
100+
| | ... | emergency_result | debug_emergency |
101+
|:---|:----|:-------------------|:------------------|
102+
| 0 | ... | True | [details_below] |
103+
| 1 | ... | False | [details_below] |
104+
| 2 | ... | False | [details_below] |
105+
| 3 | ... | True | [details_below] |
106+
107+
Inspecting the debug data gives a lot of info:
108+
109+
- `text` : Effective text considered for detection.
110+
- `EmergencyRegex` : melusine used an `EmergencyRegex` object to run detection.
111+
- `match_result` : The `EmergencyRegex` did not match the text
112+
- `positive_match_data` : The `EmergencyRegex` matched **positively** the text pattern "Urgent" (Required condition)
113+
- `negative_match_data` : The `EmergencyRegex` matched **negatively** the text pattern "Mr. Annoying" (Forbidden condition)
114+
- `BLACKLIST` : Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.
115+
116+
117+
```Python
118+
# print(df.iloc[2]["debug_emergency"])
119+
{
120+
    'text': 'Latest news\nUrgent update about Mr. Annoying',
121+
    'EmergencyRegex': {
122+
        'match_result': False,
123+
        'negative_match_data': {
124+
            'BLACKLIST': [
125+
                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
126+
            ]},
127+
        'neutral_match_data': {},
128+
        'positive_match_data': {
129+
            'DEFAULT': [
130+
                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
131+
            ]
132+
        }
133+
    }}
134+
```
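
The `start` and `stop` offsets in the debug data are character positions in the `text` field. As a minimal sketch (using only Python's standard `re` module, not melusine's `EmergencyRegex`; the helper name `match_data` and both patterns are assumptions), the same entries can be reproduced like this:

```python
import re

text = "Latest news\nUrgent update about Mr. Annoying"


def match_data(pattern: str, text: str) -> list:
    # One entry per match, with the matched text and its character offsets
    return [
        {"match_text": m.group(), "start": m.start(), "stop": m.end()}
        for m in re.finditer(pattern, text)
    ]


positive = match_data(r"Urgent", text)
negative = match_data(r"Mr\. Annoying", text)

print(positive)
print(negative)
```

Note that the offsets count the header and the line break, since detection runs on the header and body combined.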

docs/tutorials/01_MelusinePipeline.md

+10
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
# MelusinePipeline
2+
3+
The `MelusinePipeline` class is at the core of melusine. It inherits from the `sklearn.pipeline.Pipeline` class and adds extra functionality such as:
4+
5+
- Instantiation from configurations
6+
- Input/output coherence check
7+
- Debug mode
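
One plausible reading of the input/output coherence check is sketched below in plain Python. It is not melusine's code: the function `check_coherence`, the step tuples and the column name `tmp_clean_body` are hypothetical, and the sketch only illustrates the idea that each step may use the initial columns plus whatever earlier steps produced.

```python
def check_coherence(steps, input_columns):
    # Raise if a step needs a column that nothing upstream provides
    available = set(input_columns)
    for name, needed, produced in steps:
        missing = set(needed) - available
        if missing:
            raise ValueError(f"Step '{name}' is missing columns: {sorted(missing)}")
        available |= set(produced)
    return available


# (step name, input columns, output columns): hypothetical declarations
steps = [
    ("cleaner", ["body"], ["tmp_clean_body"]),
    ("normalizer", ["tmp_clean_body"], ["normalized_body"]),
    ("emergency_detector", ["header", "normalized_body"], ["emergency_result"]),
]

columns = check_coherence(steps, ["body", "header"])
print(sorted(columns))
```

Failing at construction time rather than midway through a run is the main benefit of such a check.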
8+
9+
## Code
10+
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
# MelusineTransformers

docs/tutorials/03_MelusineRegex.md

+1
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
# MelusineRegex

docs/tutorials/04_UsingModels.md

+1
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
# Using AI models
+120
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
# Melusine Detectors
2+
3+
The `MelusineDetector` component aims at standardizing how detection
4+
is performed in a `MelusinePipeline`.
5+
6+
!!! tip
7+
Projects running over several years (such as email automation)
8+
may accumulate technical debt over time. Standardizing code practices
9+
can limit the technical debt and ease the onboarding of new developers.
10+
11+
The `MelusineDetector` class splits detection into three steps:
12+
13+
- `pre_detect`: Select/combine the inputs needed for detection.
14+
Ex: Select the text parts tagged as `BODY` and combine them with the text
15+
in the email header.
16+
- `detect`: Use regular expressions, ML models or heuristics to run detection
17+
on the input text.
18+
- `post_detect`: Run detection rules such as thresholding or combine results from multiple models.
19+
20+
The `transform` method is defined by the base class `MelusineDetector` and calls
21+
the `pre_detect`/`detect`/`post_detect` methods in turn (Template Method pattern).
22+
23+
```Python
24+
# Instantiate Detector
25+
detector = MyDetector()
26+
27+
# Run pre_detect, detect and post_detect on input data
28+
data_with_detection = detector.transform(data)
29+
```
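
To make the Template Method pattern concrete, here is a minimal sketch in plain Python. `SketchDetector` is a toy stand-in and not melusine's `MelusineDetector`; rows are plain dicts, and the `LongTextDetector` subclass and its threshold are made up for illustration.

```python
class SketchDetector:
    """Toy base class (an assumption, not melusine's MelusineDetector)."""

    def transform(self, rows):
        # Template method: the call order is fixed here, subclasses fill in the steps
        for step in (self.pre_detect, self.detect, self.post_detect):
            rows = [step(dict(row)) for row in rows]
        return rows

    def pre_detect(self, row):
        return row

    def detect(self, row):
        return row

    def post_detect(self, row):
        return row


class LongTextDetector(SketchDetector):
    def pre_detect(self, row):
        # Combine the inputs needed for detection
        row["text"] = f"{row['header']}\n{row['body']}"
        return row

    def detect(self, row):
        # Compute a raw signal
        row["n_chars"] = len(row["text"])
        return row

    def post_detect(self, row):
        # Apply a decision rule (thresholding)
        row["long_result"] = row["n_chars"] > 20
        return row


data = [
    {"header": "Hi", "body": "Short"},
    {"header": "Report", "body": "A much longer message body"},
]
data_with_detection = LongTextDetector().transform(data)
print([row["long_result"] for row in data_with_detection])
```

The caller only ever uses `transform`, so every detector exposes the same interface regardless of what its three steps do.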
30+
31+
Here is the full code of a MelusineDetector to detect emails related to viruses.
32+
The next sections break down the different parts of the code.
33+
34+
```Python
35+
--8<--
36+
docs_src/MelusineDetectors/tutorial001.py:detector
37+
--8<--
38+
```
39+
40+
The detector is run on a simple dataframe:
41+
```Python
42+
--8<--
43+
docs_src/MelusineDetectors/tutorial001.py:run
44+
--8<--
45+
```
46+
47+
The output is a dataframe with a new `virus_result` column.
48+
49+
| | body | header | virus_result |
50+
|---:|:--------------------------|:----------------------|:---------------|
51+
| 0 | This is a dangerous virus | test | True |
52+
| 1 | test | test | False |
53+
| 2 | test | viruses are dangerous | True |
54+
| 3 | corona virus is annoying | test | False |
55+
56+
!!! tip
57+
Columns that are not declared in the `output_columns` are dropped automatically.
58+
59+
60+
## Detector init
61+
In the init method, you should call the superclass init and provide:
62+
63+
- A name for the detector
64+
- Input columns
65+
- Output columns
66+
67+
```Python
68+
--8<--
69+
docs_src/MelusineDetectors/tutorial001.py:detector_init
70+
--8<--
71+
```
72+
73+
!!! tip
74+
If the init method of the super class is enough (parameters `name`, `input_columns` and `output_columns`)
75+
you may skip the init method entirely when defining your `MelusineDetector`.
76+
77+
78+
## Detector pre_detect
79+
The `pre_detect` method simply combines the header text and the body text
80+
(separated by a line break).
81+
```Python
82+
--8<--
83+
docs_src/MelusineDetectors/tutorial001.py:pre_detect
84+
--8<--
85+
```
86+
87+
## Detector detect
88+
The `detect` method applies two regexes to the selected text:
89+
- A positive regex to catch mentions of viruses
90+
- A negative regex to avoid false positive detections
91+
```Python
92+
--8<--
93+
docs_src/MelusineDetectors/tutorial001.py:detect
94+
--8<--
95+
```
96+
97+
## Detector post_detect
98+
The `post_detect` method combines the regex detection results to determine the final result.
99+
```Python
100+
--8<--
101+
docs_src/MelusineDetectors/tutorial001.py:post_detect
102+
--8<--
103+
```
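
Since the included source files are not shown here, the logic of the three steps can be sketched with plain functions working on dict rows. This is an approximation under stated assumptions, not the tutorial's actual detector: the patterns `virus` and `corona virus` and the temporary keys `text`, `positive` and `negative` are guesses that reproduce the `virus_result` table above.

```python
import re

# Hypothetical patterns, chosen to reproduce the virus_result table
POSITIVE = re.compile(r"virus")
NEGATIVE = re.compile(r"corona virus")


def pre_detect(row):
    # Combine the header and the body, separated by a line break
    row["text"] = f"{row['header']}\n{row['body']}"
    return row


def detect(row):
    # Apply the positive and the negative regex to the combined text
    row["positive"] = bool(POSITIVE.search(row["text"]))
    row["negative"] = bool(NEGATIVE.search(row["text"]))
    return row


def post_detect(row):
    # Final result: a positive match that is not cancelled by a negative match
    row["virus_result"] = row["positive"] and not row["negative"]
    return row


rows = [
    {"body": "This is a dangerous virus", "header": "test"},
    {"body": "test", "header": "test"},
    {"body": "test", "header": "viruses are dangerous"},
    {"body": "corona virus is annoying", "header": "test"},
]
results = [post_detect(detect(pre_detect(row)))["virus_result"] for row in rows]
print(results)
```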
104+
105+
## Are MelusineDetectors mandatory for melusine?
106+
No.
107+
108+
You can use any scikit-learn compatible component in your `MelusinePipeline`.
109+
However, we recommend using the `MelusineDetector` (and `MelusineTransformer`)
110+
classes to benefit from:
111+
112+
- Code standardization
113+
- Input columns validation
114+
- Configurable dataframe backend
115+
Today, dict and pandas backends are supported, but more backends may be added (e.g. polars)
116+
- Debug mode
117+
- Multiprocessing
118+
119+
Check out the [next tutorial](05a_MelusineDetectors.md){target=_blank}
120+
to discover advanced features of the `MelusineDetector` class.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
# Advanced Melusine Detectors
2+
3+
This tutorial presents the advanced features of the `MelusineDetector` class:
4+
5+
- Debug mode
6+
- Row wise methods vs DataFrame wise methods
7+
- Custom transform methods
8+
9+
## Debug mode
10+
`MelusineDetector` objects are designed to be easily debugged. For that purpose, the
11+
`pre_detect`/`detect`/`post_detect` methods all have a `debug_mode` argument.
12+
The debug mode is activated by setting the `debug` attribute of a dataframe to `True`.
13+
14+
```Python hl_lines="3"
15+
import pandas as pd
16+
df = pd.DataFrame({"bla": [1, 2, 3]})
17+
df.debug = True
18+
```
19+
20+
!!! warning
21+
Debug mode activation is backend dependent. With a DictBackend, you should use `my_dict["debug"] = True`
22+
23+
When debug mode is activated, a column named "debug_DETECTOR_NAME" (e.g. `debug_virus`) containing an empty
24+
dictionary is automatically created.
25+
Populating this debug dict with debug info is then left to the user's responsibility.
26+
27+
Example of a detector with debug data:
28+
```Python hl_lines="21 22 37-53"
29+
--8<--
30+
docs_src/MelusineDetectors/tutorial003.py:detector
31+
--8<--
32+
```
33+
34+
In the end, an extra column is created containing debug data:
35+
36+
| | virus_result | debug_virus |
37+
|---:|:---------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------|
38+
| 0 | True | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}} |
39+
| 1 | False | {'detection_input': '...', 'positive_match_data': {'result': False, 'match_text': None}, 'negative_match_data': {'result': False, 'match_text': None}} |
40+
| 2 | True | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}} |
41+
| 3 | False | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': True, 'match_text': 'corona virus'}} |
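
The debug entries in the table can be reproduced with a small standalone sketch. It assumes a single row-wise function with a `debug_mode` flag and dict rows; the regex patterns are guesses, and the code does not use melusine itself.

```python
import re


def detect(row, debug_mode=False):
    # Hypothetical row-wise detection on a plain dict row
    text = f"{row['header']}\n{row['body']}"
    positive = re.search(r"virus", text)
    negative = re.search(r"corona virus", text)
    row["virus_result"] = bool(positive) and not negative

    if debug_mode:
        # Populating the debug dict is left to the detector author
        row["debug_virus"] = {
            "detection_input": text,
            "positive_match_data": {
                "result": bool(positive),
                "match_text": positive.group() if positive else None,
            },
            "negative_match_data": {
                "result": bool(negative),
                "match_text": negative.group() if negative else None,
            },
        }
    return row


row = detect({"body": "corona virus is annoying", "header": "test"}, debug_mode=True)
print(row["virus_result"])
print(row["debug_virus"])
```

Without `debug_mode=True`, the function returns the same `virus_result` and adds no debug key.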
42+
43+
44+
## Row methods vs dataframe methods
45+
There are two ways to use the pre-detect/detect/post-detect methods:
46+
47+
- Row wise: The method works on a single row of a DataFrame.
48+
In that case, a map-like method is used to apply it on an entire dataframe
49+
(typically pandas.DataFrame.apply is used with the PandasBackend)
50+
- Dataframe wise: The method works directly on the entire DataFrame.
51+
52+
!!! tip
53+
Using row wise methods makes your code backend independent. You may
54+
switch from a `PandasBackend` to a `DictBackend` at any time.
55+
The `PandasBackend` also supports multiprocessing for row wise methods.
56+
57+
To use row wise methods, you just need to name the first parameter of the method `row`.
58+
Otherwise, dataframe wise transformations are used.
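
How a framework could dispatch on the name of the first parameter is sketched below with the standard `inspect` module. This is an illustration of the convention only; whether melusine implements it this way is an assumption, and `apply_method`, `row_wise` and `table_wise` are hypothetical names. The "dataframe" is a plain list of dicts to keep the example dependency-free.

```python
import inspect


def apply_method(method, rows):
    # Look at the name of the first parameter to choose the calling convention
    first_param = next(iter(inspect.signature(method).parameters), None)
    if first_param == "row":
        # Row wise: map the method over every row
        return [method(dict(row)) for row in rows]
    # Dataframe wise: hand over the whole table at once
    return method([dict(row) for row in rows])


def row_wise(row):
    row["n_chars"] = len(row["body"])
    return row


def table_wise(df):
    # Needs the whole table: flags the longest body
    longest = max(len(r["body"]) for r in df)
    for r in df:
        r["is_longest"] = len(r["body"]) == longest
    return df


rows = [{"body": "hi"}, {"body": "hello"}]
by_row = apply_method(row_wise, rows)
by_table = apply_method(table_wise, rows)
print(by_row)
print(by_table)
```

The second function shows when a dataframe wise method is needed: its result depends on more than one row.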
59+
60+
Example of a detector with dataframe wise methods (works with a PandasBackend only):
61+
```Python hl_lines="22 28 39"
62+
--8<--
63+
docs_src/MelusineDetectors/tutorial002.py:detector
64+
--8<--
65+
```
66+
67+
## Custom transform methods
68+
If you are not happy with the `pre_detect`/`detect`/`post_detect` transform methods, you can:
69+
70+
- Use custom template methods
71+
- Use regular pipeline steps (not inheriting from the `MelusineDetector` class)
72+
73+
In this example, the `prepare`/`run` custom transform methods are used
74+
instead of the default `pre_detect`/`detect`/`post_detect`.
75+
76+
```Python
77+
--8<--
78+
docs_src/MelusineDetectors/tutorial004.py:detector
79+
--8<--
80+
```
81+
82+
To configure custom transform methods you need to:
83+
84+
- Inherit from the `melusine.base.BaseMelusineDetector` class
85+
- Define the `transform_methods` property
86+
87+
The `transform` method will now call `prepare` and `run`.
88+
89+
```Python
90+
--8<--
91+
docs_src/MelusineDetectors/tutorial004.py:run
92+
--8<--
93+
```
94+
95+
We can check that the `run` method was indeed called.
96+
97+
| | input_col | output_col |
98+
|---:|:------------|-------------:|
99+
| 0 | test1 | 12345 |
100+
| 1 | test2 | 12345 |
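
The mechanism can be sketched without melusine as follows. `SketchBaseDetector` is a toy class written for this illustration, not `melusine.base.BaseMelusineDetector`; only the names `transform_methods`, `prepare`, `run`, `input_col` and `output_col` come from the tutorial, and the body of each method is an assumption.

```python
class SketchBaseDetector:
    """Toy base class (an assumption, not melusine.base.BaseMelusineDetector)."""

    @property
    def transform_methods(self):
        # Subclasses declare which methods transform() should chain
        raise NotImplementedError

    def transform(self, rows):
        for method in self.transform_methods:
            rows = [method(dict(row)) for row in rows]
        return rows


class MyCustomDetector(SketchBaseDetector):
    @property
    def transform_methods(self):
        return [self.prepare, self.run]

    def prepare(self, row):
        # Nothing to prepare in this toy example
        return row

    def run(self, row):
        row["output_col"] = 12345
        return row


rows = [{"input_col": "test1"}, {"input_col": "test2"}]
result = MyCustomDetector().transform(rows)
print(result)
```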
