[docs] improved documentation for transforms
- modified docstrings that were unclear
- added README.md section
- misc: updated CONTRIBUTING.md
mathysgrapotte committed Nov 25, 2024
1 parent 683d23e commit 7ffd173
Showing 3 changed files with 69 additions and 86 deletions.
85 changes: 1 addition & 84 deletions CONTRIBUTING.md
@@ -60,89 +60,6 @@ As usual:
1. if you updated the documentation or the project dependencies:
1. run `make docs`
1. go to http://localhost:8000 and check that everything looks good
1. follow our [commit message convention](#commit-message-convention)
If you are unsure about how to fix or ignore a warning,
just let the continuous integration fail,
and we will help you during review.
Don't bother updating the changelog; we will take care of it.
## Commit message convention
Commit messages must follow our convention based on the
[Angular style](https://gist.github.com/stephenparish/9941e89d80e2bc58a153#format-of-the-commit-message)
or the [Karma convention](https://karma-runner.github.io/4.0/dev/git-commit-msg.html):
```
<type>[(scope)]: Subject
[Body]
```
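Put together, a hypothetical commit message following this convention (the scope and issue URL are illustrative placeholders, reusing the `namespace/project` placeholder from the trailer example below) could read:

```
docs(transforms): Improve transform docstrings

Clarify how `seed` must be handled in `transform` and `transform_all`.

Issue #10: https://github.com/namespace/project/issues/10
```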
**Subject and body must be valid Markdown.**
Subject must have proper casing (uppercase for first letter
if it makes sense), but no dot at the end, and no punctuation
in general.
Scope and body are optional. Type can be:
- `build`: About packaging, building wheels, etc.
- `chore`: About packaging or repo/files management.
- `ci`: About Continuous Integration.
- `deps`: Dependencies update.
- `docs`: About documentation.
- `feat`: New feature.
- `fix`: Bug fix.
- `perf`: About performance.
- `refactor`: Changes that are not features or bug fixes.
- `style`: A change in code style/format.
- `tests`: About tests.
If you write a body, please add trailers at the end
(for example issues and PR references, or co-authors),
without relying on GitHub's flavored Markdown:
```
Body.
Issue #10: https://github.com/namespace/project/issues/10
Related to PR namespace/other-project#15: https://github.com/namespace/other-project/pull/15
```
These "trailers" must appear at the end of the body,
without any blank lines between them. The trailer title
can contain any character except colons `:`.
We expect a full URI for each trailer, not just GitHub autolinks
(for example, full GitHub URLs for commits and issues,
not the hash or the #issue-number).
We do not enforce a line length on the commit message summary and body,
but please avoid very long summaries and very long lines in the body,
unless they are part of code blocks that must not be wrapped.
## Pull requests guidelines
Link to any related issue in the Pull Request message.
During the review, we recommend using fixups:
```bash
# SHA is the SHA of the commit you want to fix
git commit --fixup=SHA
```
Once all the changes are approved, you can squash your commits:
```bash
git rebase -i --autosquash main
```
And force-push:
```bash
git push -f
```
If this seems all too complicated, you can push or force-push each new commit,
and we will squash them ourselves if needed, before merging.
Then you can open a pull request and we will review it. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) hosted on nf-core to talk and build with us!
60 changes: 59 additions & 1 deletion README.md
@@ -151,7 +151,6 @@ class DnaToFloat(AbstractExperiment):
self.float = {
"encoder": encoders.FloatEncoder(),
}
self.split = {"RandomSplitter": splitters.RandomSplitter()}
```

Here we define the `data_type` for the dna and float types. Note that those `data_type` values are the same as the ones defined in the samplesheet dataset above. For example, a dataset on which this experiment could run might look like this:
@@ -198,6 +197,65 @@ csv.py contains two important classes, `CsvLoader` and `CsvProcessing`
>
>Great, now you know how stimulus transparently loads your data into your PyTorch model! While this may seem complicated, the only thing you really have to do is format your data correctly in a csv samplesheet and define your experiment class with the proper encoders (either by using the provided encoders or by writing your own).
### Data transformation

Measuring the impact of data transformations (noising, down/upsampling, augmentation...) on models at training time is a major feature of stimulus.

Data transformations materialize as `DataTransformer` classes and should inherit from the `AbstractDataTransformer` class (see the [docs](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder)).

> NOTE:
> Writing your own `DataTransformer` class works the same way as writing your own `Encoder` class: override the `transform` and `transform_all` methods.

> WARNING:
> Every `DataTransformer` class must accept a `seed` parameter in its `transform` and `transform_all` methods, and `np.random.seed(seed)` should be called inside those methods.

> WARNING:
> Every `DataTransformer` class should set an `add_row` attribute to either `True` or `False`, depending on whether it augments the data (adds rows) or not.
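To make these rules concrete, here is a minimal, self-contained sketch of a transformer that respects them. It is written standalone for illustration only: in real code it would inherit from `AbstractDataTransformer`, and the class name and its `probability` parameter are invented for this example.

```python
import numpy as np

class UpperCaseMasker:
    """Illustrative transformer: takes a `seed` in both methods and declares
    `add_row` (False here, since it edits values in place rather than
    augmenting the data with new rows)."""

    def __init__(self, probability: float = 0.1) -> None:
        self.add_row = False
        self.probability = probability

    def transform(self, data: str, seed: int = None) -> str:
        # Seed numpy's RNG so results are reproducible across workers.
        np.random.seed(seed)
        mask = np.random.rand(len(data)) < self.probability
        return "".join(c.upper() if m else c for c, m in zip(data, mask))

    def transform_all(self, data: list, seed: int = None) -> list:
        # `transform_all` must always return a list.
        return [self.transform(item, seed=seed) for item in data]
```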
### Connecting transformations and dataset

Data transformations are defined in the `Experiment` class alongside the encoders. Let's upgrade our minimal `DnaToFloat` class defined above to reflect this.

```python
class DnaToFloat(AbstractExperiment):
def __init__(self) -> None:
super().__init__()
self.dna = {
"encoder": encoders.TextOneHotEncoder(alphabet="acgt"),
"data_transformation_generators": {
"UniformTextMasker": data_transformation_generators.UniformTextMasker(mask="N"),
"ReverseComplement": data_transformation_generators.ReverseComplement(),
"GaussianChunk": data_transformation_generators.GaussianChunk(),
},
}
self.float = {
"encoder": encoders.FloatEncoder(),
"data_transformation_generators": {"GaussianNoise": data_transformation_generators.GaussianNoise()},
}
```

As you can see, our `data_type` entries get another field, `"data_transformation_generators"`, where we initialize the `DataTransformer` classes with their relevant parameters.

In the `csv` module, the `CsvProcessing` class will call the `transform_all` methods from the classes contained in `"data_transformation_generators"` based on the column type and a list of transformations.

For example, if we pass the list `["ReverseComplement", "GaussianChunk"]` to the `transform` method of `CsvProcessing`, the data in the `mouse_dna:input:dna` column of our minimal example above will first be reverse complemented and then chunked.

> TIP:
> Recap:
> To transform your dataset,
>
> - define your own `DataTransformer` class or use one we provide
>
> - add it to your experiment class
>
> - load your data through `CsvProcessing`
>
> - set a list of transforms
>
> - call `CsvProcessing.transform(transform_list)`
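The dispatch described above can be modeled with a short standalone sketch. Note that the toy classes below stand in for the real stimulus generators, and `apply_transforms` is an invented helper illustrating the flow, not the actual `CsvProcessing` implementation:

```python
class ReverseComplement:
    """Toy stand-in: reverse complement of a lowercase DNA string."""
    add_row = False

    def transform_all(self, data: list, seed: int = None) -> list:
        comp = str.maketrans("acgt", "tgca")
        return [s.translate(comp)[::-1] for s in data]


class UpperCase:
    """Toy stand-in for a second transformation in the chain."""
    add_row = False

    def transform_all(self, data: list, seed: int = None) -> list:
        return [s.upper() for s in data]


def apply_transforms(values: list, transform_names: list,
                     generators: dict, seed: int = None) -> list:
    """Apply the named transformations in order, mimicking how
    `CsvProcessing.transform` walks the transform list and calls each
    generator's `transform_all` on a column's values."""
    for name in transform_names:
        values = generators[name].transform_all(values, seed=seed)
    return values


generators = {"ReverseComplement": ReverseComplement(), "UpperCase": UpperCase()}
# "aacg" -> reverse complement "cgtt" -> upper-cased "CGTT"
result = apply_transforms(["aacg"], ["ReverseComplement", "UpperCase"], generators)
```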
## Installation

10 changes: 9 additions & 1 deletion src/stimulus/data/transform/data_transformation_generators.py
@@ -10,7 +10,15 @@
class AbstractDataTransformer(ABC):
"""Abstract class for data transformers.
All data transformers should take a seed; multiprocessing could otherwise unset it.
Data transformers implement either in-place or augmentation transformations.
Which of the two is specified by the `add_row` attribute (`True` or `False`, set in the child class constructor).
Child classes should override the `transform` and `transform_all` methods.
`transform_all` should always return a list.
Both methods should take an optional `seed` argument, set to `None` by default, to comply with stimulus' core principle of reproducibility.
The seed should be applied via `np.random.seed(seed)` in the method implementation.
Attributes:
add_row (bool): whether the transformer adds rows to the data
