📝 Add tutorials in the documentation

HugoPerrier · HugoPerrier · commit 4c1378fc041d · 2023-12-12T16:44:49.000+01:00
diff --git a/docs/tutorials/00_GettingStarted.md b/docs/tutorials/00_GettingStarted.md
@@ -0,0 +1,134 @@
+# Getting started with Melusine
+
+Let's run **emergency detection** with melusine :
+
+* Load a fake email dataset
+* Load a demonstration pipeline
+* Run the pipeline  
+    * Apply email cleaning transformations  
+    * Apply emergency detection
+
+## Input data
+
+Email datasets typically contain information about:
+
+- Email sender
+- Email recipients
+- Email subject/header
+- Email body
+- Attachments data
+
+The present tutorial only makes use of the **body** and **header** data.
+
+|    | body                             | header      |
+|:---|:---------------------------------|:------------|
+| 0  | This is an ëmèrgénçy             | Help        |
+| 1  | How is life ?                    | Hey !       |
+| 2  | Urgent update about Mr. Annoying | Latest news |
+| 3  | Please call me now               | URGENT      |
+
+## Code
+
+Typical code for a melusine-based application looks like this :
+
+```Python
+--8<--
+docs_src/GettingStarted/tutorial001.py:simple_pipeline
+--8<--
+```
+
+1. This tutorial uses one of the default pipeline configurations, `demo_pipeline`. Melusine users will typically define their own pipeline configuration.
+   See more in the [Configurations tutorial](06_Configurations.md){target=_blank}
+
+## Output data
+
+The pipeline created extra columns in the dataset.
+Some columns are temporary variables required by detectors (ex: `normalized_body`)
+and some are detection results with direct business value (ex: `emergency_result`).
+
+|    | body                             | header      | normalized_body                  | emergency_result   |
+|:---|:---------------------------------|:------------|:---------------------------------|:-------------------|
+| 0  | This is an ëmèrgénçy             | Help        | This is an emergency             | True               |
+| 1  | How is life ?                    | Hey !       | How is life ?                    | False              |
+| 2  | Urgent update about Mr. Annoying | Latest news | Urgent update about Mr. Annoying | False              |
+| 3  | Please call me now               | URGENT      | Please call me now               | True               |
+
+## Pipeline steps
+
+Illustration of the pipeline used in the present tutorial :
+
+``` mermaid
+---
+title: Demonstration pipeline
+---
+flowchart LR
+    Input[[Email]] --> A(Cleaner)
+    A(Cleaner) --> C(Normalizer)
+    C --> F(Emergency\nDetector)
+    F --> Output[[Qualified Email]]
+```
+
+* `Cleaner` : Cleaning transformations such as normalizing line breaks (`\r\n` -> `\n`)
+* `Normalizer` : Text normalisation to delete/replace non-ASCII characters (`éöà` -> `eoa`)
+* `EmergencyDetector` : Detection of urgent emails
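+
+The first two steps can be illustrated with a minimal sketch that only uses the Python standard library. The helpers `clean_line_breaks` and `strip_accents` are hypothetical names chosen for this illustration; they are not melusine's `Cleaner` and `Normalizer`, whose actual behaviour is driven by the pipeline configuration.
+
+```python
+import unicodedata
+
+
+def clean_line_breaks(text: str) -> str:
+    """Replace Windows-style line breaks with "\n" (hypothetical helper)."""
+    return text.replace("\r\n", "\n")
+
+
+def strip_accents(text: str) -> str:
+    """Replace accented characters with their base letter (hypothetical helper)."""
+    # NFKD decomposition splits "é" into "e" followed by a combining accent
+    decomposed = unicodedata.normalize("NFKD", text)
+    # Drop the combining marks and keep everything else
+    return "".join(char for char in decomposed if not unicodedata.combining(char))
+
+
+print(strip_accents(clean_line_breaks("This is an ëmèrgénçy\r\n")))
+```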
+
+
+!!! info
+    This demonstration pipeline is kept minimal but typical pipelines include more complex preprocessing and a variety of detectors.
+    For example, pipelines may contain:
+
+    - Email Segmentation : Split email conversation into unitary messages
+    - ContentTagging : Associate tags (SIGNATURE, FOOTER, BODY) to parts of messages
+    - Appointment detection : For example, detect "construction work will take place on 01/01/2024" as an appointment email.
+    - More on preprocessing in the [MelusineTransformers tutorial](02_MelusineTransformers.md){target=_blank}
+    - More on detectors in the [MelusineDetector tutorial](05a_MelusineDetectors.md){target=_blank}
+
+
+## Debug mode
+
+End users typically want to know what led melusine to a specific detection result. The debug mode generates additional explainability info.
+
+```Python
+--8<--
+docs_src/GettingStarted/tutorial002.py:debug_pipeline
+--8<--
+```
+
+
+A new column `debug_emergency` is created.
+
+|    | ... | emergency_result   | debug_emergency   |
+|:---|:----|:-------------------|:------------------|
+| 0  | ... | True               | [details_below]   |
+| 1  | ... | False              | [details_below]   |
+| 2  | ... | False              | [details_below]   |
+| 3  | ... | True               | [details_below]   |
+
+Inspecting the debug data gives a lot of info:
+
+- `text` : Effective text considered for detection.
+- `EmergencyRegex` : melusine used an `EmergencyRegex` object to run detection.
+- `match_result` : The `EmergencyRegex` did not match the text
+- `positive_match_data` : The `EmergencyRegex` matched **positively** the text pattern "Urgent" (Required condition)
+- `negative_match_data` : The `EmergencyRegex` matched **negatively** the text pattern "Mr. Annoying" (Forbidden condition)
+- `BLACKLIST` : Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.
+
+
+```Python
+# print(df.iloc[2]["debug_emergency"])
+{
+  'text': 'Latest news\nUrgent update about Mr. Annoying',
+  'EmergencyRegex': {
+    'match_result': False,
+    'negative_match_data': {
+      'BLACKLIST': [
+        {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
+      ]},
+    'neutral_match_data': {},
+    'positive_match_data': {
+      'DEFAULT': [
+        {'match_text': 'Urgent', 'start': 12, 'stop': 18}
+      ]
+    }
+  }
+}
+```
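+
+The `start` and `stop` values appear to be character offsets into `text`, which the short check below confirms for the data shown above. The `debug_data` variable is a hand-written copy of the printed dictionary, not the result of a melusine call.
+
+```python
+# Copy of the debug data printed above (not produced by running melusine here)
+debug_data = {
+    "text": "Latest news\nUrgent update about Mr. Annoying",
+    "EmergencyRegex": {
+        "match_result": False,
+        "negative_match_data": {
+            "BLACKLIST": [{"match_text": "Mr. Annoying", "start": 32, "stop": 44}]
+        },
+        "neutral_match_data": {},
+        "positive_match_data": {
+            "DEFAULT": [{"match_text": "Urgent", "start": 12, "stop": 18}]
+        },
+    },
+}
+
+text = debug_data["text"]
+regex_data = debug_data["EmergencyRegex"]
+
+# Slicing the text with the offsets gives back the matched pattern
+for key in ("positive_match_data", "negative_match_data"):
+    for group, matches in regex_data[key].items():
+        for match in matches:
+            assert text[match["start"]:match["stop"]] == match["match_text"]
+            print(key, group, repr(match["match_text"]))
+```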
diff --git a/docs/tutorials/01_MelusinePipeline.md b/docs/tutorials/01_MelusinePipeline.md
@@ -0,0 +1,10 @@
+# MelusinePipeline  
+
+The `MelusinePipeline` class is at the core of melusine. It inherits from the `sklearn.pipeline.Pipeline` class and adds extra functionalities such as :
+
+- Instantiation from configurations
+- Input/output coherence check
+- Debug mode
+
+## Code
+
diff --git a/docs/tutorials/02_MelusineTransformers.md b/docs/tutorials/02_MelusineTransformers.md
@@ -0,0 +1 @@
+# MelusineTransformers
diff --git a/docs/tutorials/03_MelusineRegex.md b/docs/tutorials/03_MelusineRegex.md
@@ -0,0 +1 @@
+# MelusineRegex
diff --git a/docs/tutorials/04_UsingModels.md b/docs/tutorials/04_UsingModels.md
@@ -0,0 +1 @@
+# Using AI models
diff --git a/docs/tutorials/05a_MelusineDetectors.md b/docs/tutorials/05a_MelusineDetectors.md
@@ -0,0 +1,120 @@
+# Melusine Detectors
+
+The `MelusineDetector` component aims at standardizing how detection 
+is performed in a `MelusinePipeline`. 
+
+!!! tip
+    Projects running over several years (such as email automation) 
+    may accumulate technical debt over time. Standardizing code practices 
+    can limit the technical debt and ease the onboarding of new developers.
+
+The `MelusineDetector` class splits detection into three steps:
+
+- `pre_detect`: Select/combine the inputs needed for detection.
+Ex: Select the text parts tagged as `BODY` and combine them with the text 
+in the email header.
+- `detect`: Use regular expressions, ML models or heuristics to run detection
+on the input text.
+- `post_detect`: Run detection rules such as thresholding or combine results from multiple models.
+
+The `transform` method is defined by the `MelusineDetector` base class and calls
+the `pre_detect`/`detect`/`post_detect` methods in turn (Template Method pattern).
+
+```Python
+# Instantiate Detector
+detector = MyDetector()
+
+# Run pre_detect, detect and post_detect on input data
+data_with_detection = detector.transform(data)
+```
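+
+To make the pattern concrete, here is a minimal, library-free sketch of a template method. `SketchDetector` and `UrgentSketchDetector` are hypothetical classes written for this illustration; they do not reproduce the actual `MelusineDetector` implementation, which also handles backends, column validation and debug mode.
+
+```python
+class SketchDetector:
+    """Hypothetical base class: `transform` fixes the order of the steps."""
+
+    def transform(self, row: dict) -> dict:
+        # The base class calls each step in turn; subclasses only fill in the steps
+        for step in (self.pre_detect, self.detect, self.post_detect):
+            row = step(row)
+        return row
+
+    def pre_detect(self, row: dict) -> dict:
+        raise NotImplementedError
+
+    def detect(self, row: dict) -> dict:
+        raise NotImplementedError
+
+    def post_detect(self, row: dict) -> dict:
+        raise NotImplementedError
+
+
+class UrgentSketchDetector(SketchDetector):
+    """Hypothetical detector that flags rows mentioning the word "urgent"."""
+
+    def pre_detect(self, row: dict) -> dict:
+        # Select/combine the inputs needed for detection
+        row["detection_input"] = row["header"] + "\n" + row["body"]
+        return row
+
+    def detect(self, row: dict) -> dict:
+        # Run the detection itself (a simple keyword lookup here)
+        row["keyword_found"] = "urgent" in row["detection_input"].lower()
+        return row
+
+    def post_detect(self, row: dict) -> dict:
+        # Turn intermediate results into the final result
+        row["urgent_result"] = row["keyword_found"]
+        return row
+
+
+detector = UrgentSketchDetector()
+print(detector.transform({"header": "URGENT", "body": "Please call me now"}))
+```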
+
+Here is the full code of a MelusineDetector to detect emails related to viruses. 
+The next sections break down the different parts of the code.
+
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial001.py:detector
+--8<--
+```
+
+The detector is run on a simple dataframe:
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial001.py:run
+--8<--
+```
+
+The output is a dataframe with a new `virus_result` column.
+
+|    | body                      | header                | virus_result   |
+|---:|:--------------------------|:----------------------|:---------------|
+|  0 | This is a dangerous virus | test                  | True           |
+|  1 | test                      | test                  | False          |
+|  2 | test                      | viruses are dangerous | True           |
+|  3 | corona virus is annoying  | test                  | False          |
+
+!!! tip
+    Columns that are not declared in the `output_columns` are dropped automatically.
+
+
+## Detector init
+In the init method, you should call the superclass init and provide:
+
+- A name for the detector
+- Input columns
+- Output columns
+
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial001.py:detector_init
+--8<--
+```
+
+!!! tip
+    If the init method of the super class is enough (parameters `name`, `input_columns` and `output_columns`)
+    you may skip the init method entirely when defining your `MelusineDetector`.
+
+
+## Detector pre_detect
+The `pre_detect` method simply combines the header text and the body text
+(separated by a line break).
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial001.py:pre_detect
+--8<--
+```
+
+## Detector detect
+The `detect` method applies two regexes to the selected text:
+
+- A positive regex to catch mentions of viruses
+- A negative regex to avoid false positive detections
+
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial001.py:detect
+--8<--
+```
+
+## Detector post_detect
+The `post_detect` method combines the regex detection results to determine the final result.
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial001.py:post_detect
+--8<--
+```
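+
+The snippets above are pulled from `docs_src/MelusineDetectors/tutorial001.py`. As a library-free approximation of the same three steps, the sketch below works on plain dictionaries. The two regex patterns and the intermediate key names are assumptions chosen to reproduce the output table shown earlier; they are not necessarily the ones used in the tutorial source.
+
+```python
+import re
+
+POSITIVE_PATTERN = re.compile(r"virus", re.IGNORECASE)  # assumed pattern
+NEGATIVE_PATTERN = re.compile(r"corona virus", re.IGNORECASE)  # assumed pattern
+
+
+def pre_detect(row: dict) -> dict:
+    # Combine the header and the body, separated by a line break
+    row["detection_input"] = row["header"] + "\n" + row["body"]
+    return row
+
+
+def detect(row: dict) -> dict:
+    # Apply the positive and the negative regex to the combined text
+    row["positive_match"] = bool(POSITIVE_PATTERN.search(row["detection_input"]))
+    row["negative_match"] = bool(NEGATIVE_PATTERN.search(row["detection_input"]))
+    return row
+
+
+def post_detect(row: dict) -> dict:
+    # A positive match counts only if the negative regex did not match
+    row["virus_result"] = row["positive_match"] and not row["negative_match"]
+    return row
+
+
+rows = [
+    {"body": "This is a dangerous virus", "header": "test"},
+    {"body": "test", "header": "test"},
+    {"body": "test", "header": "viruses are dangerous"},
+    {"body": "corona virus is annoying", "header": "test"},
+]
+results = [post_detect(detect(pre_detect(dict(row))))["virus_result"] for row in rows]
+print(results)  # [True, False, True, False]
+```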
+
+## Are MelusineDetectors mandatory for melusine?
+No.  
+
+You can use any scikit-learn compatible component in your `MelusinePipeline`. 
+However, we recommend using the `MelusineDetector` (and `MelusineTransformer`) 
+classes to benefit from:
+
+- Code standardization
+- Input columns validation
+- Dataframe backend abstraction:
+  dict and pandas backends are supported today, and more backends may be added (e.g. polars)
+- Debug mode
+- Multiprocessing
+
+Check out the [next tutorial](05b_MelusineDetectorsAdvanced.md){target=_blank} 
+to discover advanced features of the `MelusineDetector` class.
diff --git a/docs/tutorials/05b_MelusineDetectorsAdvanced.md b/docs/tutorials/05b_MelusineDetectorsAdvanced.md
@@ -0,0 +1,100 @@
+# Advanced Melusine Detectors
+
+This tutorial presents the advanced features of the `MelusineDetector` class:
+
+- Debug mode
+- Row wise methods vs DataFrame wise methods
+- Custom transform methods
+
+## Debug mode
+`MelusineDetector` objects are designed to be easy to debug. For that purpose, the
+`pre_detect`/`detect`/`post_detect` methods all have a `debug_mode` argument.
+Debug mode is activated by setting the `debug` attribute of a dataframe to `True`.
+
+```Python hl_lines="3"
+import pandas as pd
+df = pd.DataFrame({"bla": [1, 2, 3]})
+df.debug = True
+```
+
+!!! warning
+    Debug mode activation is backend dependent. With a DictBackend, you should use `my_dict["debug"] = True`
+
+When debug mode is activated, a column named `debug_DETECTOR_NAME` (e.g. `debug_virus`) containing an empty
+dictionary is automatically created.
+Populating this debug dict with debug info is then the user's responsibility.
+
+Example of a detector with debug data:
+
+```Python hl_lines="21 22 37-53"
+--8<--
+docs_src/MelusineDetectors/tutorial003.py:detector
+--8<--
+```
+
+In the end, an extra column is created containing debug data:
+
+|    | virus_result   | debug_virus                                                                                                                                                       |
+|---:|:---------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|  0 | True           | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}          |
+|  1 | False          | {'detection_input': '...', 'positive_match_data': {'result': False, 'match_text': None}, 'negative_match_data': {'result': False, 'match_text': None}}            |
+|  2 | True           | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}          |
+|  3 | False          | {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': True, 'match_text': 'corona virus'}} |
+
+
+## Row methods vs dataframe methods
+There are two ways to use the pre-detect/detect/post-detect methods:
+
+- Row wise: The method works on a single row of a DataFrame.
+In that case, a map-like method is used to apply it on an entire dataframe
+(typically pandas.DataFrame.apply is used with the PandasBackend)
+- Dataframe wise: The method works directly on the entire DataFrame.
+
+!!! tip
+    Using row wise methods makes your code backend independent. You may 
+    switch from a `PandasBackend` to a `DictBackend` at any time. 
+    The `PandasBackend` also supports multiprocessing for row wise methods.
+
+To use row wise methods, you just need to name the first parameter of the method `row`.
+Otherwise, dataframe wise transformations are used.
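+
+One way such a naming convention can be detected is by inspecting the method signature. The sketch below uses `inspect.signature` from the standard library; the `is_row_wise` helper and the `SketchDetector` class are hypothetical, and melusine's actual dispatch logic may differ.
+
+```python
+import inspect
+
+
+def is_row_wise(method) -> bool:
+    """Return True if the first parameter of `method` is named "row" (hypothetical helper)."""
+    parameters = list(inspect.signature(method).parameters)
+    return bool(parameters) and parameters[0] == "row"
+
+
+class SketchDetector:
+    """Hypothetical detector mixing a row wise and a dataframe wise method."""
+
+    def pre_detect(self, row, debug_mode=False):
+        return row
+
+    def detect(self, df, debug_mode=False):
+        return df
+
+
+detector = SketchDetector()
+# `self` is already bound, so it does not appear in the signature
+print(is_row_wise(detector.pre_detect))
+print(is_row_wise(detector.detect))
+```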
+
+Example of a detector with dataframe wise methods (works with a PandasBackend only):
+
+```Python hl_lines="22 28 39"
+--8<--
+docs_src/MelusineDetectors/tutorial002.py:detector
+--8<--
+```
+
+## Custom transform methods
+If you are not happy with the `pre_detect`/`detect`/`post_detect` transform methods, you can: 
+
+- Use custom template methods
+- Use regular pipeline steps (not inheriting from the `MelusineDetector` class)
+
+In this example, the `prepare`/`run` custom transform methods are used
+instead of the default `pre_detect`/`detect`/`post_detect`.
+
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial004.py:detector
+--8<--
+```
+
+To configure custom transform methods you need to: 
+
+- inherit from the `melusine.base.BaseMelusineDetector` class
+- define the `transform_methods` property
+
+The `transform` method will now call `prepare` and `run`.
+
+```Python
+--8<--
+docs_src/MelusineDetectors/tutorial004.py:run
+--8<--
+```
+
+We can check that the `run` method was indeed called.
+
+|    | input_col   |   output_col |
+|---:|:------------|-------------:|
+|  0 | test1       |        12345 |
+|  1 | test2       |        12345 |
diff --git a/docs/tutorials/06_Configurations.md b/docs/tutorials/06_Configurations.md
diff --git a/docs/tutorials/07_BasicClassification.md b/docs/tutorials/07_BasicClassification.md