| 1 | +# Getting started with Melusine |
| 2 | + |
| 3 | +Let's run **emergency detection** with Melusine:
| 4 | + |
| 5 | +* Load a fake email dataset |
| 6 | +* Load a demonstration pipeline |
| 7 | +* Run the pipeline |
| 8 | + * Apply email cleaning transformations |
| 9 | + * Apply emergency detection |
| 10 | + |
| 11 | +## Input data |
| 12 | + |
| 13 | +Email datasets typically contain information about: |
| 14 | + |
| 15 | +- Email sender |
| 16 | +- Email recipients |
| 17 | +- Email subject/header |
| 18 | +- Email body |
| 19 | +- Attachments data |
| 20 | + |
| 21 | +The present tutorial only makes use of the **body** and **header** data. |
| 22 | + |
| 23 | +| | body | header | |
| 24 | +|:---|:---------------------------------|:------------| |
| 25 | +| 0 | This is an ëmèrgénçy | Help | |
| 26 | +| 1 | How is life ? | Hey ! | |
| 27 | +| 2 | Urgent update about Mr. Annoying | Latest news | |
| 28 | +| 3 | Please call me now | URGENT | |
| 29 | + |
| 30 | +## Code |
| 31 | + |
| 32 | +Typical code for a Melusine-based application looks like this:
| 33 | + |
| 34 | +```Python |
| 35 | +--8<-- |
| 36 | +docs_src/GettingStarted/tutorial001.py:simple_pipeline |
| 37 | +--8<-- |
| 38 | +``` |
| 39 | + |
| 40 | +1. This tutorial uses one of the default pipeline configurations, `demo_pipeline`. Melusine users will typically define their own pipeline configuration.
| 41 | + See more in the [Configurations tutorial](06_Configurations.md){target=_blank} |
| 42 | + |
| 43 | +## Output data |
| 44 | + |
| 45 | +The pipeline adds extra columns to the dataset.
| 46 | +Some columns are temporary variables required by detectors (e.g. `normalized_body`)
| 47 | +and some are detection results with direct business value (e.g. `emergency_result`).
| 48 | + |
| 49 | +| | body | header | normalized_body | emergency_result | |
| 50 | +|:---|:---------------------------------|:------------|:---------------------------------|:-------------------| |
| 51 | +| 0 | This is an ëmèrgénçy | Help | This is an emergency | True | |
| 52 | +| 1 | How is life ? | Hey ! | How is life ? | False | |
| 53 | +| 2 | Urgent update about Mr. Annoying | Latest news | Urgent update about Mr. Annoying | False | |
| 54 | +| 3 | Please call me now | URGENT | Please call me now | True | |
| 55 | + |
| 56 | +## Pipeline steps |
| 57 | + |
| 58 | +The pipeline used in this tutorial is illustrated below:
| 59 | + |
| 60 | +``` mermaid |
| 61 | +--- |
| 62 | +title: Demonstration pipeline |
| 63 | +--- |
| 64 | +flowchart LR |
| 65 | + Input[[Email]] --> A(Cleaner) |
| 66 | + A(Cleaner) --> C(Normalizer) |
| 67 | + C --> F(Emergency\nDetector) |
| 68 | + F --> Output[[Qualified Email]] |
| 69 | +``` |
| 70 | + |
| 71 | +* `Cleaner` : Cleaning transformations such as standardizing line breaks (`\r\n` -> `\n`)
| 72 | +* `Normalizer` : Text normalization to delete/replace non-ASCII characters (`éöà` -> `eoa`)
| 73 | +* `EmergencyDetector` : Detection of urgent emails |
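
To make the first two steps concrete, here is a minimal standard-library sketch of the transformations described above. The function names `clean_line_breaks` and `strip_accents` are hypothetical and chosen for illustration; this is not Melusine's implementation of `Cleaner` or `Normalizer`, only one plausible way to obtain the same results.

```python
import unicodedata


def clean_line_breaks(text: str) -> str:
    # Replace Windows-style line breaks (CR LF) with a single line feed
    return text.replace("\r\n", "\n")


def strip_accents(text: str) -> str:
    # NFKD decomposition splits an accented character into its base
    # character followed by one or more combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("éöà"))  # eoa
print(strip_accents("This is an ëmèrgénçy"))  # This is an emergency
```

Note that this sketch only removes accents; characters with no decomposition (for example `ß`) are left unchanged, so a real normalizer may need extra replacement rules.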
| 74 | + |
| 75 | + |
| 76 | +!!! info |
| 77 | + This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors.
| 78 | + For example, pipelines may contain: |
| 79 | + |
| 80 | + - Email Segmentation : Split email conversation into unitary messages |
| 81 | + - ContentTagging : Associate tags (SIGNATURE, FOOTER, BODY) to parts of messages |
| 82 | + - Appointment detection : For example, detect "construction work will take place on 01/01/2024" as an appointment email.
| 83 | + - More on preprocessing in the [MelusineTransformers tutorial](02_MelusineTransformers.md){target=_blank} |
| 84 | + - More on detectors in the [MelusineDetector tutorial](05a_MelusineDetectors.md){target=_blank} |
| 85 | + |
| 86 | + |
| 87 | +## Debug mode |
| 88 | + |
| 89 | +End users typically want to know what led Melusine to a specific detection result. Debug mode generates additional explainability information.
| 90 | + |
| 91 | +```Python |
| 92 | +--8<-- |
| 93 | +docs_src/GettingStarted/tutorial002.py:debug_pipeline |
| 94 | +--8<-- |
| 95 | +``` |
| 96 | + |
| 97 | + |
| 98 | +A new column `debug_emergency` is created. |
| 99 | + |
| 100 | +| | ... | emergency_result | debug_emergency | |
| 101 | +|:---|:----|:-------------------|:------------------| |
| 102 | +| 0 | ... | True | [details_below] | |
| 103 | +| 1 | ... | False | [details_below] | |
| 104 | +| 2 | ... | False | [details_below] | |
| 105 | +| 3 | ... | True | [details_below] | |
| 106 | + |
| 107 | +Inspecting the debug data for the third email (index 2) shows how the result was obtained:
| 108 | + |
| 109 | +- `text` : Effective text considered for detection. |
| 110 | +- `EmergencyRegex` : Melusine used an `EmergencyRegex` object to run detection.
| 111 | +- `match_result` : The `EmergencyRegex` did not match the text.
| 112 | +- `positive_match_data` : The `EmergencyRegex` **positively** matched the text pattern "Urgent" (required condition).
| 113 | +- `negative_match_data` : The `EmergencyRegex` **negatively** matched the text pattern "Mr. Annoying" (forbidden condition).
| 114 | +- `BLACKLIST` : Detection groups can be defined to easily link a matching pattern to the corresponding regex. `DEFAULT` is used if no detection group is specified.
| 115 | + |
| 116 | + |
| 117 | +```Python |
| 118 | +# print(df.iloc[2]["debug_emergency"])
| 119 | +{
| 120 | +    'text': 'Latest news\nUrgent update about Mr. Annoying',
| 121 | +    'EmergencyRegex': {
| 122 | +        'match_result': False,
| 123 | +        'negative_match_data': {
| 124 | +            'BLACKLIST': [
| 125 | +                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
| 126 | +            ]},
| 127 | +        'neutral_match_data': {},
| 128 | +        'positive_match_data': {
| 129 | +            'DEFAULT': [
| 130 | +                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
| 131 | +            ]}
| 132 | +    }
| 133 | +}
| 134 | +``` |
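
The logic behind this debug output (a required positive match, vetoed by any negative match) can be sketched with the standard `re` module. The patterns, the `find_matches` and `detect` helpers, and the group dictionaries below are assumptions made for illustration; they are not Melusine's `EmergencyRegex` patterns or API.

```python
import re

# Hypothetical patterns, grouped by detection group name
POSITIVE = {"DEFAULT": r"urgent|emergency"}
NEGATIVE = {"BLACKLIST": r"Mr\. Annoying"}


def find_matches(groups: dict, text: str) -> dict:
    # Collect the matches of each group's pattern, keyed by group name
    data = {}
    for group, pattern in groups.items():
        found = [
            {"match_text": m.group(), "start": m.start(), "stop": m.end()}
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)
        ]
        if found:
            data[group] = found
    return data


def detect(text: str) -> dict:
    positive = find_matches(POSITIVE, text)
    negative = find_matches(NEGATIVE, text)
    return {
        # A positive match is required and any negative match is forbidden
        "match_result": bool(positive) and not negative,
        "negative_match_data": negative,
        "positive_match_data": positive,
    }


result = detect("Latest news\nUrgent update about Mr. Annoying")
print(result["match_result"])  # False
```

With these assumed patterns, the text `"URGENT\nPlease call me now"` yields `match_result` equal to `True`, since it contains a positive match and no blacklisted pattern.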