Documentation #81

Merged · 3 commits · Feb 12, 2025
134 changes: 121 additions & 13 deletions CONTRIBUTING.md
@@ -1,7 +1,6 @@
# Contributing

Contributions are welcome, and they are greatly appreciated!
Every little bit helps, and credit will always be given.
Contributions are most welcome! Follow the steps below to get started.

## Environment setup

@@ -33,7 +32,122 @@ You now have the dependencies installed.

Run `make help` to see all the available actions!

## Tasks
## Understanding the codebase

Unfortunately, at this stage, we have no tutorials or in-depth README ready, so you will have to dive into the codebase by yourself.
Thankfully, every function or class has a docstring that should give you a good idea of what it does, and the docs build locally: run `make docs` to browse the dev-branch docs, which should be more up to date than the main-branch docs.

After running `make setup`, you should have a `.venvs` directory in the root of the repository.

To activate it, run:

```bash
source .venvs/<python_version>/bin/activate
```

stimulus should now be installed as a library!

Then, you can install extra dependencies (for instance Jupyter) using `uv pip install <package>`.

We recommend you install Jupyter, spin up a notebook, and use the Titanic dataset formatted to the stimulus format in `tests/test_data/titanic/titanic_stimulus.csv` as well as the model in `tests/test_model/titanic_model.py`.
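
For a first look at the data, here is a minimal sketch (assuming pandas is available in your environment; otherwise `uv pip install pandas`):

```python
import pandas as pd

# Peek at the Titanic dataset already formatted for stimulus
df = pd.read_csv("tests/test_data/titanic/titanic_stimulus.csv")
print(df.head())
print(df.columns.tolist())
```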

The configuration files (stimulus uses those to generate different versions of the model and datasets) are in `tests/test_data/titanic/titanic.yaml` and `tests/test_model/titanic_model_{gpu/cpu}.yaml`.
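
To see what such a config contains, you can load it like any YAML file (a quick sketch; `pyyaml` is assumed to be installed, as the test suite already uses it):

```python
import yaml

# Inspect the top-level sections of the data/model generation config
with open("tests/test_data/titanic/titanic.yaml") as fh:
    config = yaml.safe_load(fh)
print(list(config.keys()))
```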

From there, you should first try to load the dataset in a data_handler class (see `stimulus.data.data_handler.py`). For this, you will need loader classes (see `stimulus.data.loaders.py`). Currently, loaders use a config system; the split config in `tests/test_data/titanic/titanic_sub_config.yaml` should be enough to get you started.
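
Since there is no API reference yet, the docstrings are the best map. A small exploration sketch (it only assumes the two modules import under the names above):

```python
import inspect

from stimulus.data import data_handler, loaders

# List the classes each module exposes, then read their docstrings
print([name for name, _ in inspect.getmembers(data_handler, inspect.isclass)])
print([name for name, _ in inspect.getmembers(loaders, inspect.isclass)])
print(inspect.getdoc(data_handler))
```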

Then, you can try to load the dataset in the PyTorch dataset utility (see `stimulus.data.handlertorch.py`) and run the model on a few examples to see what is going on.
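
The dataset utility in `handlertorch` should slot into the standard `torch.utils.data.DataLoader` pattern; below is a self-contained sketch of that pattern with a toy stand-in dataset (check the docstrings in `handlertorch` for the real class and its constructor arguments):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Toy stand-in for the stimulus dataset class in `handlertorch`."""

    def __len__(self) -> int:
        return 8

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Four random features and a dummy binary label per example
        return torch.randn(4), torch.tensor(0.0)


loader = DataLoader(ToyDataset(), batch_size=4)
features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([4, 4]) torch.Size([4])
```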

Finally, you can call the tuning scripts (see `stimulus.learner.raytune_learner.py`) - make sure you import ray and call `ray.init()` first - and run a hyperparameter tuning run on the model using the model config `tests/test_model/titanic_model_{gpu/cpu}.yaml`.
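
The `ray.init()` call is the step that is easy to forget; a minimal sanity check that Ray is up before you launch anything:

```python
import ray

ray.init()  # must be called before launching any tuning run
print(ray.cluster_resources())  # shows the CPUs/GPUs Ray can use
ray.shutdown()
```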

Since this is not so well documented yet, you may encounter issues, bugs, or things that seem weird or unintuitive to you. I will be here to answer questions, either on the nf-core Slack or on Discord. Those questions are extremely valuable for getting the documentation up to speed.

## Contributing

At this stage, you will mostly be getting to know the codebase and forming your own ideas on how to improve it. If so, please open an issue and discuss it on Discord/Slack. Otherwise, you can pick one of the many issues already open and work from there.

### Things that are always welcome (especially interesting for newcomers):

- Improving documentation:

This is the most impactful thing you can do as a newcomer (since you get to write the documentation you wish you had when you started), and it will also help you understand the codebase better. PRs aiming to improve documentation are very easy to review and accept, so they will be prioritized.

- Building tutorials:

Now that you understand the codebase a bit better, helping others understand it as well will always be extremely valuable, and as with documentation PRs, those will be prioritized.

- Adding Encoders/Transforms/Splitters:

This library lives and dies by the number of encoders/transforms/splitters offered, so adding those will always improve the software.

- Quality of life improvements:

As users, you have a much better understanding of the pain points than we do. If you have ideas on how to improve the software, please share them!

- Bug fixes

- Performance improvements

### Things that are always welcome to discuss, but not necessarily easy for newcomers:

- Refactoring

Refactoring is often needed to improve the codebase and make it more readable or flexible (and refactoring that makes the codebase more readable will always be highly valued).

- Porting stimulus to non-bio fields

This will sometimes require extensive refactoring and a good understanding of the codebase to make sure nothing breaks.

- Adding extra functionality (specifically downstream analysis, interpretation methods etc...)

Stimulus would be a lot more useful if it could perform downstream model analysis, interpretability methods, overfitting analysis, etc. All of those things are on the roadmap, but the codebase needs to be well understood first (however, raising issues to discuss how to do this is always welcome).

## How to contribute code

### First thing you should do

Fork the repository, then:

1. create a new branch: `git switch -c feature-or-bugfix-name`
1. edit the code and/or the documentation

### Commit guidelines

Please write atomic commits, meaning that each commit should contain a single change. This is good practice since it lets everybody review much faster!

When we push a new release, we use `make changelog` to generate the changelog. This has one caveat: it only works if commit messages follow the Angular commit guidelines. Therefore, if you want your contributions to be seen, you need to write your commit messages following [their contributing guide](https://github.com/angular/angular/blob/master/CONTRIBUTING.md#commit-message-format).

Below is the relevant part of the Angular contribution guide for commit messages:

```
<type>(<scope>): <short summary>
│ │ │
│ │ └─⫸ Summary in present tense. Not capitalized. No period at the end.
│ │
│ └─⫸ Commit Scope: animations|bazel|benchpress|common|compiler|compiler-cli|core|
│ elements|forms|http|language-service|localize|platform-browser|
│ platform-browser-dynamic|platform-server|router|service-worker|
│ upgrade|zone.js|packaging|changelog|docs-infra|migrations|
│ devtools
└─⫸ Commit Type: build|ci|docs|feat|fix|perf|refactor|test
```

The `<type>` and `<summary>` fields are mandatory, the `<scope>` field is optional.

The type must be one of the following:

- `build`: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
- `ci`: Changes to our CI configuration files and scripts (examples: GitHub Actions, SauceLabs)
- `docs`: Documentation only changes
- `feat`: A new feature
- `fix`: A bug fix
- `perf`: A code change that improves performance
- `refactor`: A code change that neither fixes a bug nor adds a feature
- `test`: Adding missing tests or correcting existing tests

The scope, while optional, is recommended (typically the name of the file you are working on).
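
For example, a commit that fixes a typo in this guide could be described as (a hypothetical message):

```
docs(contributing): fix typo in the codebase walkthrough
```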

**Before pushing:**

The entry point to run commands and tasks is the `make` Python script,
located in the `scripts` directory. Try running `make` to show the available commands and tasks.
@@ -45,21 +159,15 @@ If you work in VSCode, we provide
[an action to configure VSCode](https://pawamoy.github.io/copier-uv/work/#vscode-setup)
for the project.

## Development

As usual:

1. create a new branch: `git switch -c feature-or-bugfix-name`
1. edit the code and/or the documentation

**Before committing:**

1. run `make format` to auto-format the code
1. run `make check` to check everything (fix any warning)
1. run `make test` to run the tests (fix any issue)
1. if you updated the documentation or the project dependencies:
1. run `make docs`
1. go to http://localhost:8000 and check that everything looks good

Then you can open a pull request and we will review it. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) hosted on nf-core or find us on discord to talk and build with us!

Once you have your first PR merged, you will be added to the repository as a contributor and your contributions will be acknowledged!


Then you can pull request and we will review. Make sure you join our [slack](https://nfcore.slack.com/channels/deepmodeloptim) hosted on nf-core to talk and build with us!
8 changes: 2 additions & 6 deletions src/stimulus/utils/yaml_model_schema.py
@@ -196,12 +196,8 @@ def convert_config_to_ray(self, model: Model) -> RayTuneModel:
Ray Tune compatible model configuration
"""
return RayTuneModel(
network_params={
k: self.convert_raytune(v) for k, v in model.network_params.items()
},
optimizer_params={
k: self.convert_raytune(v) for k, v in model.optimizer_params.items()
},
network_params={k: self.convert_raytune(v) for k, v in model.network_params.items()},
optimizer_params={k: self.convert_raytune(v) for k, v in model.optimizer_params.items()},
loss_params={k: self.convert_raytune(v) for k, v in model.loss_params},
data_params={k: self.convert_raytune(v) for k, v in model.data_params},
tune=model.tune,
66 changes: 33 additions & 33 deletions tests/cli/test_tuning.py
@@ -1,39 +1,33 @@
"""Test the tuning CLI."""

import operator
import os
import yaml
import shutil
import operator
import warnings
from pathlib import Path
from functools import reduce
from pathlib import Path
from typing import Any

import pytest
import ray
import yaml

from stimulus.cli import tuning
from typing import Any


@pytest.fixture
def data_path() -> str:
"""Get path to test data CSV file."""
return str(
Path(__file__).parent.parent
/ "test_data"
/ "titanic"
/ "titanic_stimulus_split.csv"
Path(__file__).parent.parent / "test_data" / "titanic" / "titanic_stimulus_split.csv",
)


@pytest.fixture
def data_config() -> str:
"""Get path to test data config YAML."""
return str(
Path(__file__).parent.parent
/ "test_data"
/ "titanic"
/ "titanic_sub_config.yaml"
Path(__file__).parent.parent / "test_data" / "titanic" / "titanic_sub_config.yaml",
)


@@ -50,50 +44,56 @@ def model_config() -> str:


def _get_number_of_generated_files(save_dir_path: str) -> int:
"""Each run generates a file in the result dir"""
"""Each run generates a file in the result dir."""
# Get the number of generated run files
number_of_files: int = 0
for file in os.listdir(save_dir_path):
if "TuneModel" in file:
number_of_files = len(
[f for f in os.listdir(save_dir_path + "/" + file) if "TuneModel" in f]
[f for f in os.listdir(save_dir_path + "/" + file) if "TuneModel" in f],
)
return number_of_files


def _get_number_of_theoritical_runs(params_path: str) -> int:
"""
The number of run is defined as follows:
G: number of grid_search
n_i: number of options for the ith grid_search
S: value of num_samples
"""The number of run is defined as follows.

G: number of grid_search
n_i: number of options for the ith grid_search
S: value of num_samples

R = S * ∏(i=1 to G) n_i
R = S * ∏(i=1 to G) n_i
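
Example: S = 2 with two grid searches of 3 and 4 options gives
R = 2 * 3 * 4 = 24 runs.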
"""
# Get the theoretical number of runs
with open(params_path) as file:
params_dict: dict[str, Any] = yaml.safe_load(file)

grid_searches_len: list[int] = []
num_samples: int = 0
for header, sections in params_dict.items():
for _header, sections in params_dict.items():
if isinstance(sections, dict):
for section in sections.values():
if isinstance(section, dict):
# Lookup for any grid search in the yaml
has_grid_search: bool = section.get("mode") == "grid_search"
has_num_samples: bool = section.get("num_samples") is not None
if has_grid_search:
grid_searches_len.append(len(section.get("space")))
elif has_num_samples:
num_samples = section.get("num_samples")
# Apply the described function and return the value
result = num_samples * reduce(operator.mul, grid_searches_len)
return result
# Look for grid_search or num_samples entries in the yaml
mode_value = section.get("mode")
ns_value = section.get("num_samples")
if mode_value == "grid_search":
space_value = section.get("space")
if space_value is not None:
grid_searches_len.append(len(space_value))
else:
grid_searches_len.append(0)
elif ns_value is not None:
num_samples = ns_value if isinstance(ns_value, int) else 0
# Apply the described function and return the value
# (initial value 1 keeps the product defined when no grid search is present)
return num_samples * reduce(operator.mul, grid_searches_len, 1)


def test_tuning_main(
data_path: str, data_config: str, model_path: str, model_config: str
data_path: str,
data_config: str,
model_path: str,
model_config: str,
) -> None:
"""Test that tuning.main runs without errors.

@@ -147,7 +147,7 @@

# Clean up any ray files/directories that may have been created
ray_results_dir = os.path.expanduser(
"tests/test_data/titanic/test_results/"
"tests/test_data/titanic/test_results/",
)
# Check that the theoretical number of runs corresponds to the real number of runs
n_files: int = _get_number_of_generated_files(ray_results_dir)