[docs] improved documentation for transforms
- modified docstrings that were unclear
- added README.md section
- misc: updated CONTRIBUTING.md
mathysgrapotte committed Nov 25, 2024
1 parent 683d23e commit 7ffd173
Showing 3 changed files with 69 additions and 86 deletions.
85 changes: 1 addition & 84 deletions CONTRIBUTING.md
@@ -60,89 +60,6 @@ As usual:
1. if you updated the documentation or the project dependencies:
1. run `make docs`
1. go to http://localhost:8000 and check that everything looks good
1. follow our [commit message convention](#commit-message-convention)
If you are unsure about how to fix or ignore a warning,
just let the continuous integration fail,
and we will help you during review.
Don't bother updating the changelog; we will take care of it.
## Commit message convention
Commit messages must follow our convention based on the
[Angular style](https://gist.github.com/stephenparish/9941e89d80e2bc58a153#format-of-the-commit-message)
or the [Karma convention](https://karma-runner.github.io/4.0/dev/git-commit-msg.html):
```
<type>[(scope)]: Subject
[Body]
```
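Put together, a hypothetical commit message following this convention (the scope and issue URL are illustrative placeholders, reusing the `namespace/project` placeholder from the trailer example below) could read:

```
docs(transforms): Improve transform docstrings

Clarify how `seed` must be handled in `transform` and `transform_all`.

Issue #10: https://github.com/namespace/project/issues/10
```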
**Subject and body must be valid Markdown.**
Subject must have proper casing (uppercase for first letter
if it makes sense), but no dot at the end, and no punctuation
in general.
Scope and body are optional. Type can be:
- `build`: About packaging, building wheels, etc.
- `chore`: About packaging or repo/files management.
- `ci`: About Continuous Integration.
- `deps`: Dependencies update.
- `docs`: About documentation.
- `feat`: New feature.
- `fix`: Bug fix.
- `perf`: About performance.
- `refactor`: Changes that are not features or bug fixes.
- `style`: A change in code style/format.
- `tests`: About tests.
If you write a body, please add trailers at the end
(for example issues and PR references, or co-authors),
without relying on GitHub's flavored Markdown:
```
Body.
Issue #10: https://github.com/namespace/project/issues/10
Related to PR namespace/other-project#15: https://github.com/namespace/other-project/pull/15
```
These "trailers" must appear at the end of the body,
without any blank lines between them. The trailer title
can contain any character except colons `:`.
We expect a full URI for each trailer, not just GitHub autolinks
(for example, full GitHub URLs for commits and issues,
not the hash or the #issue-number).
We do not enforce a line length on the commit message summary and body,
but please avoid very long summaries and very long lines in the body,
unless they are part of code blocks that must not be wrapped.
## Pull requests guidelines
Link to any related issue in the Pull Request message.
During the review, we recommend using fixups:
```bash
# SHA is the SHA of the commit you want to fix
git commit --fixup=SHA
```
Once all the changes are approved, you can squash your commits:
```bash
git rebase -i --autosquash main
```
And force-push:
```bash
git push -f
```
If this seems all too complicated, you can push or force-push each new commit,
and we will squash them ourselves if needed, before merging.
Then you can open a pull request and we will review it. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) hosted on nf-core to talk and build with us!
60 changes: 59 additions & 1 deletion README.md
@@ -151,7 +151,6 @@ class DnaToFloat(AbstractExperiment):
self.float = {
"encoder": encoders.FloatEncoder(),
}
self.split = {"RandomSplitter": splitters.RandomSplitter()}
```

Here we define the `data_type` for the dna and float types. Note that those `data_type` values are the same as the ones defined in the samplesheet dataset above. For example, a dataset on which this experiment could run might look like this:
@@ -198,6 +197,65 @@ csv.py contains two important classes, `CsvLoader` and `CsvProcessing`
>
>Great, now you know how stimulus transparently loads your data into your PyTorch model! While this may seem complicated, the only thing you really have to do is format your data correctly in a csv samplesheet and define your experiment class with the proper encoders (either by using the provided encoders or by writing your own).
### Data transformation

Measuring the impact of data transformations (noising, down/upsampling, augmentation...) on models at training time is a major feature of stimulus.

Data transformations materialize as `DataTransformer` classes and should inherit from the `AbstractDataTransformer` class (see the [docs](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder)).

> NOTE:
> Writing your own `DataTransformer` class works the same way as writing your own `Encoder` class: override the `transform` and `transform_all` methods.

> WARNING:
> Every `DataTransformer` class must accept a `seed` parameter in its `transform` and `transform_all` methods, and `np.random.seed(seed)` should be called inside those methods.

> WARNING:
> Every `DataTransformer` class should set an `add_row` attribute to either `True` or `False`, depending on whether it augments the data (adds rows) or not.
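To make these rules concrete, here is a minimal, self-contained sketch of a transformer that respects them. It is written standalone for illustration only: in real code it would inherit from `AbstractDataTransformer`, and the class name and its `probability` parameter are invented for this example.

```python
import numpy as np

class UpperCaseMasker:
    """Illustrative transformer: takes a `seed` in both methods and declares
    `add_row` (False here, since it edits values in place rather than
    augmenting the data with new rows)."""

    def __init__(self, probability: float = 0.1) -> None:
        self.add_row = False
        self.probability = probability

    def transform(self, data: str, seed: int = None) -> str:
        # Seed numpy's RNG so results are reproducible across workers.
        np.random.seed(seed)
        mask = np.random.rand(len(data)) < self.probability
        return "".join(c.upper() if m else c for c, m in zip(data, mask))

    def transform_all(self, data: list, seed: int = None) -> list:
        # `transform_all` must always return a list.
        return [self.transform(item, seed=seed) for item in data]
```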
### Connecting transformations and dataset

Data transformations are defined in the `Experiment` class alongside the encoders. Let's upgrade our minimal `DnaToFloat` class defined above to reflect this.

```python
class DnaToFloat(AbstractExperiment):
def __init__(self) -> None:
super().__init__()
self.dna = {
"encoder": encoders.TextOneHotEncoder(alphabet="acgt"),
"data_transformation_generators": {
"UniformTextMasker": data_transformation_generators.UniformTextMasker(mask="N"),
"ReverseComplement": data_transformation_generators.ReverseComplement(),
"GaussianChunk": data_transformation_generators.GaussianChunk(),
},
}
self.float = {
"encoder": encoders.FloatEncoder(),
"data_transformation_generators": {"GaussianNoise": data_transformation_generators.GaussianNoise()},
}
```

As you can see, our `data_type` entries get another field, `"data_transformation_generators"`, where we initialize the `DataTransformer` classes with their relevant parameters.

In the `csv` module, the `CsvProcessing` class will call the `transform_all` methods from the classes contained in `"data_transformation_generators"` based on the column type and a list of transformations.

For example, if we pass the list `["ReverseComplement", "GaussianChunk"]` to the `transform` method of `CsvProcessing`, the data in the `mouse_dna:input:dna` column of our minimal example above will first be reverse complemented and then chunked.

> TIP:
> Recap:
> To transform your dataset,
>
> - define your own `DataTransformer` class or use one we provide
>
> - add it to your experiment class
>
> - load your data through `CsvProcessing`
>
> - set a list of transforms
>
> - call `CsvProcessing.transform(transform_list)`
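The dispatch described above can be modeled with a short standalone sketch. Note that the toy classes below stand in for the real stimulus generators, and `apply_transforms` is an invented helper illustrating the flow, not the actual `CsvProcessing` implementation:

```python
class ReverseComplement:
    """Toy stand-in: reverse complement of a lowercase DNA string."""
    add_row = False

    def transform_all(self, data: list, seed: int = None) -> list:
        comp = str.maketrans("acgt", "tgca")
        return [s.translate(comp)[::-1] for s in data]


class UpperCase:
    """Toy stand-in for a second transformation in the chain."""
    add_row = False

    def transform_all(self, data: list, seed: int = None) -> list:
        return [s.upper() for s in data]


def apply_transforms(values: list, transform_names: list,
                     generators: dict, seed: int = None) -> list:
    """Apply the named transformations in order, mimicking how
    `CsvProcessing.transform` walks the transform list and calls each
    generator's `transform_all` on a column's values."""
    for name in transform_names:
        values = generators[name].transform_all(values, seed=seed)
    return values


generators = {"ReverseComplement": ReverseComplement(), "UpperCase": UpperCase()}
# "aacg" -> reverse complement "cgtt" -> upper-cased "CGTT"
result = apply_transforms(["aacg"], ["ReverseComplement", "UpperCase"], generators)
```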
## Installation

10 changes: 9 additions & 1 deletion src/stimulus/data/transform/data_transformation_generators.py
@@ -10,7 +10,15 @@
class AbstractDataTransformer(ABC):
"""Abstract class for data transformers.
All data transformers should take a seed; multiprocessing could otherwise unset it.
Data transformers implement either in-place or augmentation transformations.
Which of the two is specified by the `add_row` attribute (`True` or `False`, set in the child class constructor).
Child classes should override the `transform` and `transform_all` methods.
`transform_all` should always return a list.
Both methods should take an optional `seed` argument, set to `None` by default, to comply with stimulus' core principle of reproducibility.
The seed should be applied via `np.random.seed(seed)` in the method implementation.
Attributes:
add_row (bool): whether the transformer adds rows to the data
