From ad9d0c9e799585d465bec4e4488c19e3650039c7 Mon Sep 17 00:00:00 2001
From: mathysgrapotte
Date: Fri, 22 Nov 2024 13:05:36 +0100
Subject: [PATCH 1/7] formatting README.md

---
 README.md | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 55e354d4..35e4b622 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,6 @@
 
 [![ci](https://github.com/mathysgrapotte/stimulus-py/workflows/ci/badge.svg)](https://github.com/mathysgrapotte/stimulus-py/actions?query=workflow%3Aci)
 [![documentation](https://img.shields.io/badge/docs-mkdocs-708FCC.svg?style=flat)](https://mathysgrapotte.github.io/stimulus-py/)
-[![gitter](https://badges.gitter.im/join%20chat.svg)](https://app.gitter.im/#/room/#stimulus-py:gitter.im)
 [![Build with us on slack!](http://img.shields.io/badge/slack-nf--core%20%23deepmodeloptim-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/deepmodeloptim)
 
@@ -10,10 +9,58 @@
 ## Introduction
 
 Most (if not all) quality software is thoroughly tested. Deep neural networks seem to have escaped this paradigm.
+In the age of large-scale deep learning, it is critical that early-stage DL models (prototypes) are tested to ensure that costly bugs do not happen at scale.
 
 Here, we attempt to solve the testing problem by proposing an extensive library to test deep neural networks beyond test-set performance.
 
+Stimulus provides the following functionalities:
+
+* Modify training data to test a model's robustness to data perturbations (and uncover which pre-processing steps increase performance)
+* Perform hyperparameter tuning on the model architecture with user-defined search spaces using Ray[tune], to make sure model performance is comparable across data transformations
+* Build an all-against-all model report to guide data pre-processing decisions
+
+Stimulus aims to provide the following functionalities in the near future:
+
+* Perform routine checks on the model architecture and training process (type-checking, that the model actually runs, that weights change during training, etc.)
+* Perform routine checks on the model post-training (checking for overfitting, out-of-distribution performance, etc.)
+* Perform "informed" hyperparameter tuning (see [Google's deep learning tuning playbook](https://github.com/google-research/tuning_playbook) [^1])
+* Build a scaling-law report to understand how prototypes scale
+
+
+### Repository Organization
+
+```
+src/stimulus/ 🧪
+├── analysis/ 📊
+│   └── analysis_default.py
+├── cli/ 🖥️
+│   ├── analysis_default.py
+│   ├── check_model.py
+│   ├── interpret_json.py
+│   ├── predict.py
+│   ├── shuffle_csv.py
+│   ├── split_csv.py
+│   ├── split_yaml.py
+│   ├── transform_csv.py
+│   └── tuning.py
+├── data/ 📁
+│   ├── csv.py
+│   ├── experiments.py
+│   ├── handlertorch.py
+│   ├── encoding/ 🔐
+│   │   └── encoders.py
+│   ├── splitters/ ✂️
+│   │   └── splitters.py
+│   └── transform/ 🔄
+│       └── data_transformation_generators.py
+├── learner/ 🧠
+│   ├── predict.py
+│   ├── raytune_learner.py
+│   └── raytune_parser.py
+└── utils/ 🛠️
+    ├── json_schema.py
+    ├── launch_utils.py
+    ├── performance.py
+    └── yaml_model_schema.py
+```
 
 ## Installation
 
@@ -24,3 +71,6 @@ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ stimulus-py==0.0.10
 ```
 
+### Citations
+
+[^1]: Godbole, V., Dahl, G. E., Gilmer, J., Shallue, C. J., & Nado, Z. (2023). Deep Learning Tuning Playbook (Version 1.0) [Computer software]. http://github.com/google-research/tuning_playbook
\ No newline at end of file

From 7253eba0dcc468f79cf9f4761aeaf28dae56e6d0 Mon Sep 17 00:00:00 2001
From: mathysgrapotte
Date: Fri, 22 Nov 2024 13:09:43 +0100
Subject: [PATCH 2/7] changed format of introduction bullet lists

---
 README.md | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 35e4b622..7e99cafb 100644
--- a/README.md
+++ b/README.md
@@ -13,16 +13,30 @@ In the age of large-scale deep learning, it is critical that early-stage DL mode
 
 Here, we attempt to solve the testing problem by proposing an extensive library to test deep neural networks beyond test-set performance.
 
 Stimulus provides the following functionalities:
 
-* Modify training data to test a model's robustness to data perturbations (and uncover which pre-processing steps increase performance)
-* Perform hyperparameter tuning on the model architecture with user-defined search spaces using Ray[tune], to make sure model performance is comparable across data transformations
-* Build an all-against-all model report to guide data pre-processing decisions
+1. **Data Perturbation Testing**:
+   Modify training data to test a model's robustness to perturbations and uncover which pre-processing steps increase performance
+
+2. **Hyperparameter Optimization**:
+   Perform tuning on the model architecture with user-defined search spaces using Ray[tune] to ensure comparable performance across data transformations
+
+3. **Comprehensive Analysis**:
+   Generate an all-against-all model report to guide data pre-processing decisions
 
 Stimulus aims to provide the following functionalities in the near future:
 
-* Perform routine checks on the model architecture and training process (type-checking, that the model actually runs, that weights change during training, etc.)
-* Perform routine checks on the model post-training (checking for overfitting, out-of-distribution performance, etc.)
-* Perform "informed" hyperparameter tuning (see [Google's deep learning tuning playbook](https://github.com/google-research/tuning_playbook) [^1])
-* Build a scaling-law report to understand how prototypes scale
+4. **Model Architecture Testing**:
+   Run routine checks on the model architecture and training process, including type-checking, model execution, and weight updates
+
+5. **Post-Training Validation**:
+   Perform comprehensive model validation, including overfitting detection and out-of-distribution performance testing
+
+6. **Informed Hyperparameter Tuning**:
+   Implement systematic tuning strategies following [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook) [^1]
+
+7. **Scaling Analysis**:
+   Generate scaling law reports to understand prototype model behavior at different scales
 
 
 ### Repository Organization

From 5d628d1104ecd1bb116207589bca1eb4b9c105e9 Mon Sep 17 00:00:00 2001
From: mathysgrapotte
Date: Fri, 22 Nov 2024 14:26:17 +0100
Subject: [PATCH 3/7] added development warning section

---
 README.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 7e99cafb..1571f7a4 100644
--- a/README.md
+++ b/README.md
@@ -6,9 +6,28 @@
 
+## ⚠️ Development Warning
+
+> **Warning**
+> This package is in active development and breaking changes may occur. The API is not yet stable and features might be added, modified, or removed without notice. Use in production environments is not recommended at this stage.
+
+We encourage you to:
+
+- 📝 Report bugs and issues on our [GitHub Issues](https://github.com/mathysgrapotte/stimulus-py/issues) page
+
+- 💡 Suggest features and improvements through [GitHub Discussions](https://github.com/mathysgrapotte/stimulus-py/discussions)
+
+- 🤝 Contribute by submitting pull requests
+
+We are actively working towards release 1.0.0 (see the [milestone](https://github.com/mathysgrapotte/stimulus-py/milestone/1)); check the slack channel by clicking on the badge above, where we are actively discussing development. Build with us every Wednesday from 14:00 CET until 18:00 CET on the nf-core gathertown (see slack for calendar updates, i.e. some weeks open dev hours are not possible)
+
+
+
 ## Introduction
 
 Most (if not all) quality software is thoroughly tested. Deep neural networks seem to have escaped this paradigm.
+
 In the age of large-scale deep learning, it is critical that early-stage DL models (prototypes) are tested to ensure that costly bugs do not happen at scale.
 
 Here, we attempt to solve the testing problem by proposing an extensive library to test deep neural networks beyond test-set performance.
@@ -43,7 +62,9 @@ Stimulus aims to provide the following functionalities in the near future:
 
 6. **Informed Hyperparameter Tuning**:
-   Implement systematic tuning strategies following [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook) [^1]
+   Encourage tuning strategies that follow [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook) [^1]
 
 7. **Scaling Analysis**:
    Generate scaling law reports to understand prototype model behavior at different scales
 
+For large scale experiments, we recommend our [nf-core](https://nf-co.re) [deepmodeloptim](https://github.com/nf-core/deepmodeloptim) pipeline, which is still under development and will be released alongside stimulus v1.0.0.
+
 ### Repository Organization

From 7c5636a7b5f0c8d4bfedc856d1a017427ad11418 Mon Sep 17 00:00:00 2001
From: mathysgrapotte
Date: Fri, 22 Nov 2024 15:19:45 +0100
Subject: [PATCH 4/7] adding User Guide structure, starting with data
 configuration

---
 README.md | 58 +++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 46 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 1571f7a4..34b9ac1a 100644
--- a/README.md
+++ b/README.md
@@ -6,21 +6,19 @@
 
-## ⚠️ Development Warning
-
-> **Warning**
-> This package is in active development and breaking changes may occur. The API is not yet stable and features might be added, modified, or removed without notice. Use in production environments is not recommended at this stage.
-
-We encourage you to:
-
-- 📝 Report bugs and issues on our [GitHub Issues](https://github.com/mathysgrapotte/stimulus-py/issues) page
-
-- 💡 Suggest features and improvements through [GitHub Discussions](https://github.com/mathysgrapotte/stimulus-py/discussions)
-
-- 🤝 Contribute by submitting pull requests
-
-We are actively working towards release 1.0.0 (see the [milestone](https://github.com/mathysgrapotte/stimulus-py/milestone/1)); check the slack channel by clicking on the badge above, where we are actively discussing development. Build with us every Wednesday from 14:00 CET until 18:00 CET on the nf-core gathertown (see slack for calendar updates, i.e. some weeks open dev hours are not possible)
+!!! warning
+
+    > This package is in active development and breaking changes may occur. The API is not yet stable and features might be added, modified, or removed without notice. Use in production environments is not recommended at this stage.
+
+    We encourage you to:
+
+    - 📝 Report bugs and issues on our [GitHub Issues](https://github.com/mathysgrapotte/stimulus-py/issues) page
+
+    - 💡 Suggest features and improvements through [GitHub Discussions](https://github.com/mathysgrapotte/stimulus-py/discussions)
+
+    - 🤝 Contribute by submitting pull requests
+
+    We are actively working towards release 1.0.0 (see the [milestone](https://github.com/mathysgrapotte/stimulus-py/milestone/1)); check the slack channel by clicking on the badge above, where we are actively discussing development. Build with us every Wednesday from 14:00 CET until 18:00 CET on the nf-core gathertown (see slack for calendar updates, i.e. some weeks open dev hours are not possible)
 
@@ -29,13 +27,19 @@ Stimulus provides the following functionalities:
 
 3. **Comprehensive Analysis**:
    Generate an all-against-all model report to guide data pre-processing decisions
 
+For large scale experiments, we recommend our [nf-core](https://nf-co.re) [deepmodeloptim](https://github.com/nf-core/deepmodeloptim) pipeline, which is still under development and will be released alongside stimulus v1.0.0.
+
+📹 Stimulus was featured at the Nextflow Summit 2024 in Barcelona, which is a nice introduction to the current package capabilities; you can watch the talk [here](https://www.youtube.com/watch?v=dC5p_tXQpEs)
+
-Stimulus aims to provide the following functionalities in the near future:
+Stimulus aims to provide the following functionalities in the near future (stay tuned for updates!):
 
 4. **Model Architecture Testing**:
    Run routine checks on the model architecture and training process, including type-checking, model execution, and weight updates
@@ -56,13 +60,45 @@ Stimulus aims to provide the following functionalities in the near future (stay
 
 7. **Scaling Analysis**:
    Generate scaling law reports to understand prototype model behavior at different scales
 
-For large scale experiments, we recommend our [nf-core](https://nf-co.re) [deepmodeloptim](https://github.com/nf-core/deepmodeloptim) pipeline, which is still under development and will be released alongside stimulus v1.0.0.
-
-### Repository Organization
+
+## User guide
+
+### Repository organization
+
+Stimulus is organized as follows; we will refer to this structure in the following sections:
 
 ```
 src/stimulus/ 🧪
 ├── analysis/ 📊
 │   └── analysis_default.py
 ├── cli/ 🖥️
 │   ├── analysis_default.py
 │   ├── check_model.py
 │   ├── interpret_json.py
 │   ├── predict.py
 │   ├── shuffle_csv.py
 │   ├── split_csv.py
 │   ├── split_yaml.py
 │   ├── transform_csv.py
 │   └── tuning.py
 ├── data/ 📁
 │   ├── csv.py
 │   ├── experiments.py
 │   ├── handlertorch.py
 │   ├── encoding/ 🔐
 │   │   └── encoders.py
 │   ├── splitters/ ✂️
 │   │   └── splitters.py
 │   └── transform/ 🔄
 │       └── data_transformation_generators.py
 ├── learner/ 🧠
 │   ├── predict.py
 │   ├── raytune_learner.py
 │   └── raytune_parser.py
 └── utils/ 🛠️
     ├── json_schema.py
     ├── launch_utils.py
     ├── performance.py
     └── yaml_model_schema.py
 ```
 
+### Expected data format
+
+Data is expected to be presented in a csv samplesheet file with the following format:
+
+| input1:input:input_type | input2:input:input_type | meta1:meta:meta_type | label1:label:label_type | label2:label:label_type |
+| ----------------------- | ----------------------- | -------------------- | ----------------------- | ----------------------- |
+| sample1 input1          | sample1 input2          | sample1 meta1        | sample1 label1          | sample1 label2          |
+| sample2 input1          | sample2 input2          | sample2 meta1        | sample2 label1          | sample2 label2          |
+| sample3 input1          | sample3 input2          | sample3 meta1        | sample3 label1          | sample3 label2          |
+
+!!! note "future improvements"
+    This rigid data format is expected to change once we move to release v1.0.0: data types and information will be defined in a yaml config, and only column names will be required in the data; see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24)
+
+### Data loading
+
+Data in stimulus can take many forms (files, text, images, networks...). In order to support this diversity, stimulus relies on the [encoding module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder){:target="_blank"}. A list of available encoders can be found [here](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders).
+
+If the provided encoders do not support the type of data you are working with, you can write your own encoder by inheriting from the `AbstractEncoder` class and implementing the `encode`, `decode` and `encode_all` methods.
+
+- `encode` is currently optional; it can raise a `NotImplementedError` if the encoder does not support encoding a single data point
+- `decode` is currently optional; it can raise a `NotImplementedError` if the encoder does not support decoding
+- `encode_all` is called by other stimulus functions and is expected to return an [`np.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html){:target="_blank"}.
 
 ## Installation
The API is not yet stable and features might be added, modified, or removed without notice. Use in production environments is not recommended at this stage. - - We encourage you to: - - - πŸ“ Report bugs and issues on our [GitHub Issues](https://github.com/mathysgrapotte/stimulus-py/issues) page - - - πŸ’‘ Suggest features and improvements through [GitHub Discussions](https://github.com/mathysgrapotte/stimulus-py/discussions) - - - 🀝 Contribute by submitting pull requests - - We are actively working towards release 1.0.0 (see [milestone](https://github.com/mathysgrapotte/stimulus-py/milestone/1)), check the slack channel by clicking on the badge above where we are actively discussing. Build with us every wednesday at 14:00 CET until 18:00 CET on the nf-core gathertown (see slack for calendar updates i.e. some weeks open dev hours are not possible) +> WARNING: +> This package is in active development and breaking changes may occur. The API is not yet stable and features might be added, modified, or removed without notice. Use in production environments is not recommended at this stage. +> +> We encourage you to: +> +> - πŸ“ Report bugs and issues on our [GitHub Issues](https://github.com/mathysgrapotte/stimulus-py/issues) page +> +> - πŸ’‘ Suggest features and improvements through [GitHub Discussions](https://github.com/mathysgrapotte/stimulus-py/discussions) +> +> - 🀝 Contribute by submitting pull requests +> +> We are actively working towards release 1.0.0 (see [milestone](https://github.com/mathysgrapotte/stimulus-py/milestone/1)), check the slack channel by clicking on the badge above where we are actively discussing. Build with us every wednesday at 14:00 CET until 18:00 CET on the nf-core gathertown (see slack for calendar updates i.e. some weeks open dev hours are not possible) @@ -41,9 +41,9 @@ Stimulus provides those functionalities 3. **Comprehensive Analysis**: Generate all-against-all model report to guide data pre-processing decisions -For large scale experiments, we recommend our [nf-core](https://nf-co.re) [deepmodeloptim](https://github.com/nf-core/deepmodeloptim) pipeline which is still under development and will be released alongside stimulus v1.0.0. +For large scale experiments, we recommend our [nf-core](https://nf-co.re){:target="_blank"} [deepmodeloptim](https://github.com/nf-core/deepmodeloptim){:target="_blank"} pipeline which is still under development and will be released alongside stimulus v1.0.0. -πŸ“Ή Stimulus was featured at the nextflow summit 2024 in Barcelona, which is a nice intoduction to current package capabilities, you can watch the talk [here](https://www.youtube.com/watch?v=dC5p_tXQpEs) +πŸ“Ή Stimulus was featured at the nextflow summit 2024 in Barcelona, which is a nice intoduction to current package capabilities, you can watch the talk [here](https://www.youtube.com/watch?v=dC5p_tXQpEs){:target="_blank"} @@ -56,7 +56,7 @@ Stimulus aims at providing those functionalities in a near future, stay tuned fo Perform comprehensive model validation including overfitting detection and out-of-distribution performance testing 6. **Informed Hyperparameter Tuning**: - Encourage tuning strategies that follow [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook) [^1] + Encourage tuning strategies that follow [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook){:target="_blank"} [^1] 7. 
**Scaling Analysis**: Generate scaling law reports to understand prototype model behavior at different scales @@ -103,6 +103,17 @@ src/stimulus/ πŸ§ͺ └── yaml_model_schema.py ``` + +### Data encoding + +Data in stimulus can take many forms (files, text, images, networks...) in order to support this diversity, stimulus relies on the [encoding module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder){:target="_blank"}. List of available encoders can be found [here](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders){:target="_blank"}. + +If the provided encoders do not support the type of data you are working with, you can write your own encoder by inheriting from the `AbstractEncoder` class and implementing the `encode`, `decode` and `encode_all` methods. + +- `encode` is currently optional, can return a `NotImplementedError` if the encoder does not support encoding a single data point +- `decode` is currently optional, can return a `NotImplementedError` if the encoder does not support decoding +- `encode_all` is called by other stimulus functions, and is expected to return a [`np.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html){:target="_blank"} . + ### Expected data format Data is expected to be presented in a csv samplesheet file with the following format: @@ -113,21 +124,79 @@ Data is expected to be presented in a csv samplesheet file with the following fo | sample2 input1 | sample2 input2 | sample2 meta1 | sample2 label1 | sample2 label2 | | sample3 input1 | sample3 input2 | sample3 meta1 | sample3 label1 | sample3 label2 | +Columns are expected to follow this name convention : `name:type:data_type` +- name corresponds to the column name, this should be the same as input names in model batch definition (see model section for more details) -!!! note "future improvements" - This rigid data format is expected to change once we move to release v1.0.0, data types and information will be defined in a yaml config and only column names will be required in the data, see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24) +- type is either input, meta or label, typically models predict the labels from the input, and meta is used to perform downstream analysis +- data_type is the column data type. -### Data loading +> NOTE: +> This rigid data format is expected to change once we move to release v1.0.0, data types and information will be defined in a yaml config and only column names will be required in the data, see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24){:target="_blank"} -Data in stimulus can take many forms (files, text, images, networks...) in order to support this diversity, stimulus relies on the [encoding module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder){:target="_blank"}. List of available encoders can be found [here](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders). +### Connecting encoders and datasets -If the provided encoders do not support the type of data you are working with, you can write your own encoder by inheriting from the `AbstractEncoder` class and implementing the `encode`, `decode` and `encode_all` methods. 
+Once we have our data formated and our encoders ready, we need to explicitly state which encoder is used for which data type. This is done through an experiment class. -- `encode` is currently optional, can return a `NotImplementedError` if the encoder does not support encoding a single data point -- `decode` is currently optional, can return a `NotImplementedError` if the encoder does not support decoding -- `encode_all` is called by other stimulus functions, and is expected to return a [`np.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html){:target="_blank"} . +To understand how experiment classes are used to connect data types and encoders, let's have a look at a minimal DnaToFloat example : + +```python +class DnaToFloat(AbstractExperiment): + def __init__(self) -> None: + super().__init__() + self.dna = { + "encoder": encoders.TextOneHotEncoder(alphabet="acgt"), + } + self.float = { + "encoder": encoders.FloatEncoder(), + } + self.split = {"RandomSplitter": splitters.RandomSplitter()} +``` + +Here we define the `data_type` for the dna and float types, note that those `data_type` are the same as the ones defined in the samplesheet dataset above, for example, a dataset on which this experiment would run could look like this: + +| mouse_dna:input:dna | mouse_rnaseq\:label:float | +| ------------------- | ------------------------ | +| ACTAGGCATGCTAGTCG | 0.53 | +| ACTGGGGCTAGTCGAA | 0.23 | +| GATGTTCTGATGCT | 0.98 | + +Note how the `data_type` for the mouse_dna and mouse_rnaseq columns match exactly the attribute names defined in the `DnaToFloat` minimal class above. + +stimulus-py ships with a few basic experiment classes, if you need to write your own experiment class, simply inherit from the base `AbstractExperiment` class and overwrite the class `__init__` method like shown above. + +> NOTE: +> This has the drawback of requiring a build of the experiment class each time a new task is defined (for instance, let's say we want to use dna and protein sequences to predict rna). +> +> Once we move to release v1.0.0, `type` (i.e. input, meta, label) and `data_type` will be defined in the data yaml config, and the relevant experiment class will be automatically built. + + +### Loading the data + +Finally, once we have defined our encoders, the experiment class and the samplesheet, stimulus will transparently load the data using the [csv.py module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/csv/#stimulus.data.csv){:target="_blank"} + +csv.py contains two important classes, `CsvLoader` and `CsvProcessing` + +`CsvLoader` is responsible for naΓ―vely loading the data (without changing anything), it works by performing a couple of checks on the dataset to ensure it is correctly formated, and then uses the experiment class in conjunction with the column names to call the proper encoders and output inputs, labels, and meta dictionary objects. + +`CsvLoader` is used by the `handlertorch` module to load data into pytorch tensors. + +> TIP: +> So, to recap, +> when you load a dataset into a torch tensor, +> +> 1. `handlertorch` will call `CsvLoader` with the csv samplesheet and the experiment class +> +> 2. `CsvLoader` will use the experiment class to fetch the proper encoder `encode_all` method for each data column +> +> 3. `CsvLoader` will use the `encode_all` method to encode the data and output dictionary objects for inputs, labels and meta data +> +> 4. `handlertorch` will convert the contents to torch tensors +> +> 5. 
`handlertorch` will feed the `input` torch tensor to the model, use the `label` torch tensor for loss computation and will store the `meta` tensor for downstream analysis +> +>Great, now you know how stimulus transparently loads your data into your pytorch model! While this seems complicated, the only thing you really have to do, is to format your data correctly in a csv samplesheet and define your experiment class with the proper encoders (either by using the provided encoders or by writing your own). ## Installation @@ -139,6 +208,6 @@ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https:// ``` -### citations +## citations [^1]: Godbole, V., Dahl, G. E., Gilmer, J., Shallue, C. J., & Nado, Z. (2023). Deep Learning Tuning Playbook (Version 1.0) [Computer software]. http://github.com/google-research/tuning_playbook \ No newline at end of file From 683d23eb81f3ed2167f76d93c7ba605cecc8b646 Mon Sep 17 00:00:00 2001 From: mathysgrapotte Date: Fri, 22 Nov 2024 18:12:01 +0100 Subject: [PATCH 6/7] removed target_blank to links as this was displaying in README on github --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index ae73924d..247dbb2f 100644 --- a/README.md +++ b/README.md @@ -41,9 +41,9 @@ Stimulus provides those functionalities 3. **Comprehensive Analysis**: Generate all-against-all model report to guide data pre-processing decisions -For large scale experiments, we recommend our [nf-core](https://nf-co.re){:target="_blank"} [deepmodeloptim](https://github.com/nf-core/deepmodeloptim){:target="_blank"} pipeline which is still under development and will be released alongside stimulus v1.0.0. +For large scale experiments, we recommend our [nf-core](https://nf-co.re) [deepmodeloptim](https://github.com/nf-core/deepmodeloptim) pipeline which is still under development and will be released alongside stimulus v1.0.0. -πŸ“Ή Stimulus was featured at the nextflow summit 2024 in Barcelona, which is a nice intoduction to current package capabilities, you can watch the talk [here](https://www.youtube.com/watch?v=dC5p_tXQpEs){:target="_blank"} +πŸ“Ή Stimulus was featured at the nextflow summit 2024 in Barcelona, which is a nice intoduction to current package capabilities, you can watch the talk [here](https://www.youtube.com/watch?v=dC5p_tXQpEs) @@ -56,7 +56,7 @@ Stimulus aims at providing those functionalities in a near future, stay tuned fo Perform comprehensive model validation including overfitting detection and out-of-distribution performance testing 6. **Informed Hyperparameter Tuning**: - Encourage tuning strategies that follow [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook){:target="_blank"} [^1] + Encourage tuning strategies that follow [Google's Deep Learning Tuning Playbook](https://github.com/google-research/tuning_playbook) [^1] 7. **Scaling Analysis**: Generate scaling law reports to understand prototype model behavior at different scales @@ -106,13 +106,13 @@ src/stimulus/ πŸ§ͺ ### Data encoding -Data in stimulus can take many forms (files, text, images, networks...) in order to support this diversity, stimulus relies on the [encoding module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder){:target="_blank"}. 
List of available encoders can be found [here](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders){:target="_blank"}. +Data in stimulus can take many forms (files, text, images, networks...) in order to support this diversity, stimulus relies on the [encoding module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder). List of available encoders can be found [here](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders). If the provided encoders do not support the type of data you are working with, you can write your own encoder by inheriting from the `AbstractEncoder` class and implementing the `encode`, `decode` and `encode_all` methods. - `encode` is currently optional, can return a `NotImplementedError` if the encoder does not support encoding a single data point - `decode` is currently optional, can return a `NotImplementedError` if the encoder does not support decoding -- `encode_all` is called by other stimulus functions, and is expected to return a [`np.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html){:target="_blank"} . +- `encode_all` is called by other stimulus functions, and is expected to return a [`np.array`](https://numpy.org/doc/stable/reference/generated/numpy.array.html) . ### Expected data format @@ -133,7 +133,7 @@ Columns are expected to follow this name convention : `name:type:data_type` - data_type is the column data type. > NOTE: -> This rigid data format is expected to change once we move to release v1.0.0, data types and information will be defined in a yaml config and only column names will be required in the data, see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24){:target="_blank"} +> This rigid data format is expected to change once we move to release v1.0.0, data types and information will be defined in a yaml config and only column names will be required in the data, see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24) ### Connecting encoders and datasets @@ -174,7 +174,7 @@ stimulus-py ships with a few basic experiment classes, if you need to write your ### Loading the data -Finally, once we have defined our encoders, the experiment class and the samplesheet, stimulus will transparently load the data using the [csv.py module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/csv/#stimulus.data.csv){:target="_blank"} +Finally, once we have defined our encoders, the experiment class and the samplesheet, stimulus will transparently load the data using the [csv.py module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/csv/#stimulus.data.csv) csv.py contains two important classes, `CsvLoader` and `CsvProcessing` From 7ffd173a930bb73250734cd3bb4daf9a9e88142a Mon Sep 17 00:00:00 2001 From: mgrapotte Date: Mon, 25 Nov 2024 15:35:36 +0100 Subject: [PATCH 7/7] [docs] improved documentation for transforms - modified docstrings that were unclear - added README.md section - misc: updated CONTRIBUTING.md --- CONTRIBUTING.md | 85 +------------------ README.md | 60 ++++++++++++- .../data_transformation_generators.py | 10 ++- 3 files changed, 69 insertions(+), 86 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a07fb1aa..0c94baa5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -60,89 +60,6 @@ As usual: 1. 
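+
+To make the inheritance pattern concrete, here is a minimal sketch of a custom encoder. This example is not part of the package: the class name and encoding logic are invented, and the import path and method signatures are assumptions based on the repository layout and the description above.
+
+```python
+import numpy as np
+
+from stimulus.data.encoding.encoders import AbstractEncoder
+
+
+class CharOrdinalEncoder(AbstractEncoder):
+    """Toy encoder mapping each character of a string to its ordinal value."""
+
+    def encode(self, data: str) -> np.ndarray:
+        # encoding a single data point is supported here, but raising
+        # NotImplementedError would also be acceptable (see above)
+        return np.array([ord(char) for char in data])
+
+    def decode(self, data: np.ndarray) -> str:
+        # decoding is optional; this encoder happens to support it
+        return "".join(chr(value) for value in data)
+
+    def encode_all(self, data: list) -> np.ndarray:
+        # called by other stimulus functions; must return a numpy array
+        # (equal-length inputs are assumed in this toy example)
+        return np.array([self.encode(item) for item in data])
+```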
@@ -133,7 +133,7 @@ Columns are expected to follow this naming convention: `name:type:data_type`
 
 > NOTE:
-> This rigid data format is expected to change once we move to release v1.0.0: data types and information will be defined in a yaml config, and only column names will be required in the data; see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24){:target="_blank"}
+> This rigid data format is expected to change once we move to release v1.0.0: data types and information will be defined in a yaml config, and only column names will be required in the data; see [this github issue](https://github.com/mathysgrapotte/stimulus-py/issues/24)
 
@@ -174,7 +174,7 @@ stimulus-py ships with a few basic experiment classes. If you need to write your
 
 ### Loading the data
 
-Finally, once we have defined our encoders, the experiment class and the samplesheet, stimulus will transparently load the data using the [csv.py module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/csv/#stimulus.data.csv){:target="_blank"}.
+Finally, once we have defined our encoders, the experiment class and the samplesheet, stimulus will transparently load the data using the [csv.py module](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/csv/#stimulus.data.csv).
 
 csv.py contains two important classes: `CsvLoader` and `CsvProcessing`.

From 7ffd173a930bb73250734cd3bb4daf9a9e88142a Mon Sep 17 00:00:00 2001
From: mgrapotte
Date: Mon, 25 Nov 2024 15:35:36 +0100
Subject: [PATCH 7/7] [docs] improved documentation for transforms

- modified docstrings that were unclear
- added README.md section
- misc: updated CONTRIBUTING.md

---
 CONTRIBUTING.md                                        | 85 +-------------------
 README.md                                              | 60 ++++++++++++--
 .../data/transform/data_transformation_generators.py   | 10 ++-
 3 files changed, 69 insertions(+), 86 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index a07fb1aa..0c94baa5 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -60,89 +60,6 @@ As usual:
 1. if you updated the documentation or the project dependencies:
    1. run `make docs`
    1. go to http://localhost:8000 and check that everything looks good
-1. follow our [commit message convention](#commit-message-convention)
 
-If you are unsure about how to fix or ignore a warning,
-just let the continuous integration fail,
-and we will help you during review.
-
-Don't bother updating the changelog, we will take care of this.
-
-## Commit message convention
-
-Commit messages must follow our convention based on the
-[Angular style](https://gist.github.com/stephenparish/9941e89d80e2bc58a153#format-of-the-commit-message)
-or the [Karma convention](https://karma-runner.github.io/4.0/dev/git-commit-msg.html):
-
-```
-<type>[(scope)]: Subject
-
-[Body]
-```
-
-**Subject and body must be valid Markdown.**
-Subject must have proper casing (uppercase for first letter
-if it makes sense), but no dot at the end, and no punctuation
-in general.
-
-Scope and body are optional. Type can be:
-
-- `build`: About packaging, building wheels, etc.
-- `chore`: About packaging or repo/files management.
-- `ci`: About Continuous Integration.
-- `deps`: Dependencies update.
-- `docs`: About documentation.
-- `feat`: New feature.
-- `fix`: Bug fix.
-- `perf`: About performance.
-- `refactor`: Changes that are not features or bug fixes.
-- `style`: A change in code style/format.
-- `tests`: About tests.
-
-If you write a body, please add trailers at the end
-(for example issues and PR references, or co-authors),
-without relying on GitHub's flavored Markdown:
-
-```
-Body.
-
-Issue #10: https://github.com/namespace/project/issues/10
-Related to PR namespace/other-project#15: https://github.com/namespace/other-project/pull/15
-```
-
-These "trailers" must appear at the end of the body,
-without any blank lines between them. The trailer title
-can contain any character except colons `:`.
-We expect a full URI for each trailer, not just GitHub autolinks
-(for example, full GitHub URLs for commits and issues,
-not the hash or the #issue-number).
-
-We do not enforce a line length on commit messages summary and body,
-but please avoid very long summaries, and very long lines in the body,
-unless they are part of code blocks that must not be wrapped.
-
-## Pull requests guidelines
-
-Link to any related issue in the Pull Request message.
-
-During the review, we recommend using fixups:
-
-```bash
-# SHA is the SHA of the commit you want to fix
-git commit --fixup=SHA
-```
-
-Once all the changes are approved, you can squash your commits:
-
-```bash
-git rebase -i --autosquash main
-```
-
-And force-push:
-
-```bash
-git push -f
-```
-
-If this seems all too complicated, you can push or force-push each new commit,
-and we will squash them ourselves if needed, before merging.
+Then you can open a pull request and we will review it. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) channel hosted on nf-core to talk and build with us!
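+
+In short, the documentation loop described above boils down to the following (using the `make docs` target and the local address already named in this guide):
+
+```bash
+make docs    # rebuild and serve the documentation locally
+# then open http://localhost:8000 and check that everything looks good
+```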
diff --git a/README.md b/README.md
index 247dbb2f..9c38351a 100644
--- a/README.md
+++ b/README.md
@@ -151,7 +151,6 @@ class DnaToFloat(AbstractExperiment):
         self.float = {
             "encoder": encoders.FloatEncoder(),
         }
-        self.split = {"RandomSplitter": splitters.RandomSplitter()}
 ```
 
 Here we define the `data_type` for the dna and float types. Note that these `data_type` are the same as the ones defined in the samplesheet dataset above; for example, a dataset on which this experiment would run could look like this:
@@ -198,6 +197,65 @@ csv.py contains two important classes: `CsvLoader` and `CsvProcessing`.
 
+### Data transformation
+
+Measuring the impact of data transformations (noising, down/upsampling, augmentation...) on models at training time is a major feature of stimulus.
+
+Data transformations materialize as `DataTransformer` classes, which should inherit from the `AbstractDataTransformer` class (see the [docs](https://mathysgrapotte.github.io/stimulus-py/reference/stimulus/data/encoding/encoders/#stimulus.data.encoding.encoders.AbstractEncoder))
+
+> NOTE:
+> Writing your own `DataTransformer` class is the same as writing your own `Encoder` class: you should overwrite the `transform` and `transform_all` methods
+
+> WARNING:
+> Every `DataTransformer` class has to take a `seed` parameter in its `transform` and `transform_all` methods, and `np.random.seed(seed)` should be called in those methods.
+
+> WARNING:
+> Every `DataTransformer` class should have an `add_row` attribute set to either `True` or `False`, depending on whether it is augmenting the data (adding rows) or not.
+
+### Connecting transformations and datasets
+
+Just like encoders, data transformations are defined in the `Experiment` class. Let's upgrade our minimal `DnaToFloat` class defined above to reflect this.
+
+```python
+class DnaToFloat(AbstractExperiment):
+    def __init__(self) -> None:
+        super().__init__()
+        self.dna = {
+            "encoder": encoders.TextOneHotEncoder(alphabet="acgt"),
+            "data_transformation_generators": {
+                "UniformTextMasker": data_transformation_generators.UniformTextMasker(mask="N"),
+                "ReverseComplement": data_transformation_generators.ReverseComplement(),
+                "GaussianChunk": data_transformation_generators.GaussianChunk(),
+            },
+        }
+        self.float = {
+            "encoder": encoders.FloatEncoder(),
+            "data_transformation_generators": {"GaussianNoise": data_transformation_generators.GaussianNoise()},
+        }
+```
+
+As you can see, each `data_type` argument gets another field, `"data_transformation_generators"`, where we can initialize the `DataTransformer` classes with their relevant parameters.
+
+In the `csv` module, the `CsvProcessing` class will call the `transform_all` methods of the classes contained in `"data_transformation_generators"`, based on the column type and a list of transformations.
+
+For example, if we give the `["ReverseComplement","GaussianChunk"]` list to the `CsvProcessing` class's `transform` method, the data contained in the `mouse_dna:input:dna` column in our minimal example above will first be reverse complemented and then chunked.
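+
+Putting it together, the transformation flow sketched in this section could look like the following. This is an illustration only: the `CsvProcessing` constructor arguments and the import paths are hypothetical, while the experiment class, the transform list, and the `transform` call come from the description above.
+
+```python
+from stimulus.data.csv import CsvProcessing
+from stimulus.data.experiments import DnaToFloat  # hypothetical location of the experiment class
+
+experiment = DnaToFloat()
+
+# hypothetical constructor: the samplesheet and experiment wiring may differ
+csv_processing = CsvProcessing(experiment, "samplesheet.csv")
+
+# apply the transformations in order: reverse complement, then chunking
+csv_processing.transform(["ReverseComplement", "GaussianChunk"])
+```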
+
+> TIP:
+> Recap: to transform your dataset,
+>
+> - define your own `DataTransformer` class or use one we provide
+>
+> - add it to your experiment class
+>
+> - load your data through `CsvProcessing`
+>
+> - set a list of transforms
+>
+> - call `CsvProcessing.transform(transform_list)`
 
 ## Installation
 
 stimulus is still under development; you can install it from test-pypi by running the following command:
 
 ```bash
 pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ stimulus-py==0.0.10
 ```

diff --git a/src/stimulus/data/transform/data_transformation_generators.py b/src/stimulus/data/transform/data_transformation_generators.py
index a5b98693..cf30764f 100644
--- a/src/stimulus/data/transform/data_transformation_generators.py
+++ b/src/stimulus/data/transform/data_transformation_generators.py
@@ -10,7 +10,15 @@
 class AbstractDataTransformer(ABC):
     """Abstract class for data transformers.
 
-    All data transformers should have the seed in it. This is because the multiprocessing of them could unset the seed.
+    Data transformers implement in-place or augmentation transformations.
+    Whether a transformer is in-place or augmenting is specified by the "add_row" attribute (which should be True or False and is set in the child class constructor).
+
+    Child classes should override the `transform` and `transform_all` methods.
+
+    `transform_all` should always return a list.
+
+    Both methods should take an optional `seed` argument set to `None` by default to be compliant with stimulus' core principle of reproducibility.
+    Seed should be initialized through `np.random.seed(seed)` in the method implementation.
 
     Attributes:
         add_row (bool): whether the transformer adds rows to the data
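
To make the documented contract concrete, a minimal child class could look like the sketch below. The class name and the noise logic are invented for illustration; only the `seed` handling, the `add_row` attribute, and the list-returning `transform_all` follow the docstring above.

```python
import numpy as np

from stimulus.data.transform.data_transformation_generators import AbstractDataTransformer


class GaussianJitter(AbstractDataTransformer):
    """Toy in-place transformer that adds Gaussian noise to float values."""

    def __init__(self, std: float = 0.1) -> None:
        self.std = std
        self.add_row = False  # in-place transformation: no rows are added

    def transform(self, data: float, seed: int = None) -> float:
        np.random.seed(seed)  # seed is set inside the method, as required
        return data + np.random.normal(0, self.std)

    def transform_all(self, data: list, seed: int = None) -> list:
        np.random.seed(seed)
        # always return a list, per the contract above
        return [value + np.random.normal(0, self.std) for value in data]
```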