diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 0c94baa5..ca9d0c66 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,7 +1,6 @@
 # Contributing
 
-Contributions are welcome, and they are greatly appreciated!
-Every little bit helps, and credit will always be given.
+Contributions are most welcome! Follow the steps below to get started.
 
 ## Environment setup
 
@@ -33,7 +32,122 @@
 You now have the dependencies installed.
 
 Run `make help` to see all the available actions!
 
-## Tasks
+## Understanding the codebase
+
+Unfortunately, at this stage, we have no tutorials ready or in-depth README, so you will have to dive into the codebase by yourself.
+Thankfully, every function or class has a docstring that should give you a good idea of what it does, and the docs can be built locally (run `make docs` to access the dev branch docs, which should be more up to date than the main branch docs).
+
+After running `make setup`, you should have a `.venvs` directory in the root of the repository.
+
+To activate it, run:
+
+```bash
+source .venvs/<version>/bin/activate
+```
+
+stimulus should now be installed as a library!
+
+Then, you can install extra dependencies (for instance jupyter) using `uv pip install <package>`.
+
+We recommend you install jupyter, spin up a notebook, and take the Titanic dataset (formatted to the stimulus format) in `tests/test_data/titanic/titanic_stimulus.csv` as well as the model in `tests/test_model/titanic_model.py`.
+
+The configuration files (stimulus uses those to generate different versions of the model and datasets) are in `tests/test_data/titanic/titanic.yaml` and `tests/test_model/titanic_model_{gpu/cpu}.yaml`.
+
+From there, you should first try to load the dataset in a data_handler class (see `stimulus.data.data_handler.py`). For this, you will need loader classes (see `stimulus.data.loaders.py`). Currently, loaders use a config system; the split config in `tests/test_data/titanic/titanic_sub_config.yaml` should be enough to get you started.
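+
+To make this first step more concrete, here is a minimal sketch of what it could look like. The class and method names below (`EncoderLoader`, `DatasetHandler`, `initialize_column_encoders_from_config`) are assumptions for illustration, not the confirmed API - check the docstrings in `stimulus.data.loaders` and `stimulus.data.data_handler` for the real signatures.
+
+```python
+# Hypothetical sketch -- class and method names are assumptions, see the
+# docstrings in stimulus.data.loaders and stimulus.data.data_handler.
+from stimulus.data import data_handler, loaders
+
+# Build column encoders from the split sub-config (assumed loader API).
+encoder_loader = loaders.EncoderLoader()
+encoder_loader.initialize_column_encoders_from_config(
+    "tests/test_data/titanic/titanic_sub_config.yaml",
+)
+
+# Hand the raw CSV and the loaders to the data handler (assumed class name).
+dataset = data_handler.DatasetHandler(
+    csv_path="tests/test_data/titanic/titanic_stimulus.csv",
+    encoder_loader=encoder_loader,
+)
+```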
+
+Then, you can try to load the dataset in the PyTorch dataset utility (see `stimulus.data.handlertorch.py`) and run the model on a few examples to see what is going on.
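+
+Again as a rough sketch - the `TorchDataset` name and its arguments are assumptions, and so is the model class name, so check `stimulus.data.handlertorch` and `tests/test_model/titanic_model.py` for the real signatures:
+
+```python
+# Hypothetical sketch -- names and arguments are assumptions.
+from torch.utils.data import DataLoader
+
+from stimulus.data import handlertorch
+from titanic_model import ModelTitanic  # assumed class name in tests/test_model/titanic_model.py
+
+dataset = handlertorch.TorchDataset(  # assumed class name and arguments
+    config_path="tests/test_data/titanic/titanic_sub_config.yaml",
+    csv_path="tests/test_data/titanic/titanic_stimulus.csv",
+)
+loader = DataLoader(dataset, batch_size=4)
+
+model = ModelTitanic()  # see the model file for the real constructor arguments
+inputs, labels, meta = next(iter(loader))  # assumed batch structure
+print(model(**inputs))  # forward pass on a few examples
+```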
+
+Finally, you can call the tuning scripts (see `stimulus.learner.raytune_learner.py`) - make sure you import ray and call `ray.init()` first - and run a hyperparameter tuning run on the model using the model config `tests/test_model/titanic_model_{gpu/cpu}.yaml`.
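+
+For example (only `ray.init()` and `ray.shutdown()` are guaranteed Ray API here; the `TuneWrapper` name and its arguments are assumptions, see `stimulus.learner.raytune_learner` for the real interface):
+
+```python
+# Hypothetical sketch -- the learner class name and its arguments are assumptions.
+import ray
+
+from stimulus.learner import raytune_learner
+
+ray.init()  # must be called before launching any tuning run
+
+tuner = raytune_learner.TuneWrapper(  # assumed class name and arguments
+    model_path="tests/test_model/titanic_model.py",
+    model_config="tests/test_model/titanic_model_cpu.yaml",
+    data_config="tests/test_data/titanic/titanic_sub_config.yaml",
+    data_path="tests/test_data/titanic/titanic_stimulus_split.csv",
+)
+results = tuner.tune()  # assumed entry point
+
+ray.shutdown()
+```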
+
+Since this is not yet well documented, you may encounter issues, bugs, or things that seem weird or unintuitive. I will be here to answer questions, either on the nf-core Slack or on Discord. Those questions are extremely valuable for getting the documentation up to speed.
+
+## Contributing
+
+At this stage, you will mostly be understanding the codebase and forming your own ideas on how to improve it. If so, please open an issue and discuss it on Discord/Slack. Otherwise, you can pick one of the many issues already opened and work from there.
+
+### Things that are always welcome (especially interesting for newcomers):
+
+- Improving documentation:
+
+This is the most impactful thing you can do as a newcomer (since you get to write the documentation you wish you had when you started), and it will help you understand the codebase better. PRs aiming to improve documentation are also very easy to review and accept, so they will be prioritized.
+
+- Building tutorials:
+
+Now that you understand the codebase a bit better, helping others understand it as well will always be extremely valuable and, as with documentation PRs, these will be prioritized.
+
+- Adding Encoders/Transforms/Splitters:
+
+This library lives and dies by the number of encoders/transforms/splitters offered, so adding those will always improve the software.
+
+- Quality of life improvements:
+
+As users, you have a much better understanding of the pain points than we do. If you have ideas on how to improve the software, please share them!
+
+- Bug fixes
+
+- Performance improvements
+
+### Things that are always worth discussing but not necessarily easy for newcomers:
+
+- Refactoring:
+
+Refactoring is often needed to improve the codebase and make it more readable or flexible (and refactoring that makes the codebase more readable will always be highly valued).
+
+- Porting stimulus to non-bio fields:
+
+This will sometimes require extensive refactoring and a good understanding of the codebase to make sure nothing breaks.
+
+- Adding extra functionality (specifically downstream analysis, interpretation methods, etc.):
+
+Stimulus would be a lot more useful if it could perform downstream model analysis, interpretability methods, overfitting analysis, etc. All of those things are on the roadmap, but the codebase needs to be well understood first (however, raising issues to discuss how to do this is always welcome).
+
+## How to contribute code
+
+### First thing you should do
+
+Fork the repository, then:
+
+1. create a new branch: `git switch -c feature-or-bugfix-name`
+1. edit the code and/or the documentation
+
+### Commit guidelines
+
+Please write atomic commits, meaning that each commit should be a single change. This is good practice since it allows everybody to review much faster!
+
+When we push a new release, we use `make changelog` to generate the changelog. This has one caveat: it only works if commit messages follow the Angular commit guidelines. Therefore, if you want your contributions to be seen, you need to write your commit messages following [their contributing guide](https://github.com/angular/angular/blob/master/CONTRIBUTING.md#commit-message-format).
+
+Below is the relevant part of the Angular contribution guide for commit messages:
+
+```
+<type>(<scope>): <short summary>
+  │       │             │
+  │       │             └─⫸ Summary in present tense. Not capitalized. No period at the end.
+  │       │
+  │       └─⫸ Commit Scope: animations|bazel|benchpress|common|compiler|compiler-cli|core|
+  │                          elements|forms|http|language-service|localize|platform-browser|
+  │                          platform-browser-dynamic|platform-server|router|service-worker|
+  │                          upgrade|zone.js|packaging|changelog|docs-infra|migrations|
+  │                          devtools
+  │
+  └─⫸ Commit Type: build|ci|docs|feat|fix|perf|refactor|test
+```
+
+The `<type>` and `<summary>` fields are mandatory, the `<scope>` field is optional.
+The `<type>` must be one of the following:
+
+- `build`: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
+- `ci`: Changes to our CI configuration files and scripts (examples: GitHub Actions, SauceLabs)
+- `docs`: Documentation only changes
+- `feat`: A new feature
+- `fix`: A bug fix
+- `perf`: A code change that improves performance
+- `refactor`: A code change that neither fixes a bug nor adds a feature
+- `test`: Adding missing tests or correcting existing tests
+
+The scope, while optional, is recommended (typically the name of the file you are working on). For example, a commit fixing a typo in this guide could be titled `docs(CONTRIBUTING): fix typo in commit guidelines`.
+
+**Before pushing:**
 
 The entry-point to run commands and tasks is the `make` Python script, located in the `scripts` directory.
 Try running `make` to show the available commands and tasks.
@@ -45,15 +159,6 @@
 If you work in VSCode, we provide [an action to configure VSCode](https://pawamoy.github.io/copier-uv/work/#vscode-setup)
 for the project.
 
-## Development
-
-As usual:
-
-1. create a new branch: `git switch -c feature-or-bugfix-name`
-1. edit the code and/or the documentation
-
-**Before committing:**
-
 1. run `make format` to auto-format the code
 1. run `make check` to check everything (fix any warning)
 1. run `make test` to run the tests (fix any issue)
@@ -61,5 +166,8 @@
 1. run `make docs`
 1. go to http://localhost:8000 and check that everything looks good
 
+Then you can open a pull request and we will review it. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) hosted on nf-core or find us on Discord to talk and build with us!
+
+Once you have your first PR merged, you will be added to the repository as a contributor and your contributions will be acknowledged!
+
-Then you can pull request and we will review. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) hosted on nf-core to talk and build with us!
diff --git a/src/stimulus/utils/yaml_model_schema.py b/src/stimulus/utils/yaml_model_schema.py
index fb48355f..7fb25ebb 100644
--- a/src/stimulus/utils/yaml_model_schema.py
+++ b/src/stimulus/utils/yaml_model_schema.py
@@ -196,12 +196,8 @@ def convert_config_to_ray(self, model: Model) -> RayTuneModel:
             Ray Tune compatible model configuration
         """
         return RayTuneModel(
-            network_params={
-                k: self.convert_raytune(v) for k, v in model.network_params.items()
-            },
-            optimizer_params={
-                k: self.convert_raytune(v) for k, v in model.optimizer_params.items()
-            },
+            network_params={k: self.convert_raytune(v) for k, v in model.network_params.items()},
+            optimizer_params={k: self.convert_raytune(v) for k, v in model.optimizer_params.items()},
             loss_params={k: self.convert_raytune(v) for k, v in model.loss_params},
             data_params={k: self.convert_raytune(v) for k, v in model.data_params},
             tune=model.tune,
diff --git a/tests/cli/test_tuning.py b/tests/cli/test_tuning.py
index 5621c936..e05d6213 100644
--- a/tests/cli/test_tuning.py
+++ b/tests/cli/test_tuning.py
@@ -1,28 +1,25 @@
 """Test the tuning CLI."""
 
+import operator
 import os
-import yaml
 import shutil
-import operator
 import warnings
-from pathlib import Path
 from functools import reduce
+from pathlib import Path
+from typing import Any
 
 import pytest
 import ray
+import yaml
 
 from stimulus.cli import tuning
-from typing import Any
 
 
 @pytest.fixture
 def data_path() -> str:
     """Get path to test data CSV file."""
     return str(
-        Path(__file__).parent.parent
-        / "test_data"
-        / "titanic"
-        / "titanic_stimulus_split.csv"
+        Path(__file__).parent.parent / "test_data" / "titanic" / "titanic_stimulus_split.csv",
     )
@@ -30,10 +27,7 @@ def data_path() -> str:
 def data_config() -> str:
     """Get path to test data config YAML."""
     return str(
-        Path(__file__).parent.parent
-        / "test_data"
-        / "titanic"
-        / "titanic_sub_config.yaml"
+        Path(__file__).parent.parent / "test_data" / "titanic" / "titanic_sub_config.yaml",
     )
@@ -50,25 +44,25 @@ def model_config() -> str:
 def _get_number_of_generated_files(save_dir_path: str) -> int:
-    """Each run generates a file in the result dir"""
+    """Each run generates a file in the result dir."""
     # Get the number of generated run files
     number_of_files: int = 0
     for file in os.listdir(save_dir_path):
         if "TuneModel" in file:
             number_of_files = len(
-                [f for f in os.listdir(save_dir_path + "/" + file) if "TuneModel" in f]
+                [f for f in os.listdir(save_dir_path + "/" + file) if "TuneModel" in f],
             )
     return number_of_files
 
 
 def _get_number_of_theoritical_runs(params_path: str) -> int:
-    """
-    The number of runs is defined as follows:
-    G: number of grid_search
-    n_i: number of options for the ith grid_search
-    S: value of num_samples
+    """The number of runs is defined as follows.
+
+    G: number of grid_search
+    n_i: number of options for the ith grid_search
+    S: value of num_samples
 
-    R = S * ∏(i=1 to G) n_i
+    R = S * ∏(i=1 to G) n_i
     """
     # Get the theoritical number of runs
     with open(params_path) as file:
@@ -76,24 +70,30 @@ def _get_number_of_theoritical_runs(params_path: str) -> int:
     grid_searches_len: list[int] = []
     num_samples: int = 0
 
-    for header, sections in params_dict.items():
+    for _header, sections in params_dict.items():
         if isinstance(sections, dict):
             for section in sections.values():
                 if isinstance(section, dict):
-                    # Lookup for any grid search in the yaml
-                    has_grid_search: bool = section.get("mode") == "grid_search"
-                    has_num_samples: bool = section.get("num_samples") is not None
-                    if has_grid_search:
-                        grid_searches_len.append(len(section.get("space")))
-                    elif has_num_samples:
-                        num_samples = section.get("num_samples")
-    # Apply the described function and return the value
-    result = num_samples * reduce(operator.mul, grid_searches_len)
-    return result
+                    # Lookup for grid search or num_samples in the yaml
+                    mode_value = section.get("mode")
+                    ns_value = section.get("num_samples")
+                    if mode_value == "grid_search":
+                        space_value = section.get("space")
+                        if space_value is not None:
+                            grid_searches_len.append(len(space_value))
+                        else:
+                            grid_searches_len.append(0)
+                    elif ns_value is not None:
+                        num_samples = ns_value if isinstance(ns_value, int) else 0
+    # Apply the described function and return the value
+    return num_samples * reduce(operator.mul, grid_searches_len)
 
 
 def test_tuning_main(
-    data_path: str, data_config: str, model_path: str, model_config: str
+    data_path: str,
+    data_config: str,
+    model_path: str,
+    model_config: str,
 ) -> None:
     """Test that tuning.main runs without errors.
@@ -147,7 +147,7 @@
     # Clean up any ray files/directories that may have been created
     ray_results_dir = os.path.expanduser(
-        "tests/test_data/titanic/test_results/"
+        "tests/test_data/titanic/test_results/",
     )
     # Check that the theoritical numbers of run corresponds to the real number of runs
     n_files: int = _get_number_of_generated_files(ray_results_dir)