Merged
Changes from 9 commits
34 commits
a2ae8d3
chore: add hydra to project dependencies
JemmaLDaniel Nov 26, 2025
d51a264
feat: use hydra to configure winnow runs
JemmaLDaniel Nov 26, 2025
8d3a02a
test: update tests to use extra init arguments
JemmaLDaniel Nov 26, 2025
5e730c7
feat: add winnow config command to view resolved configuration
JemmaLDaniel Nov 26, 2025
20ee8b3
docs: document hydra config usage with winnow cli
JemmaLDaniel Nov 26, 2025
2529582
docs: make docs titles sentence case and fix bullet list formatting
JemmaLDaniel Nov 26, 2025
d7e713c
perf: optimise CLI startup time with lazy imports
JemmaLDaniel Nov 26, 2025
07bfc18
chore: merge branch 'main' into feat-hydra-config
JemmaLDaniel Nov 26, 2025
980a793
chore: update gitignore to ignore extra supported files and images
JemmaLDaniel Nov 26, 2025
b18bd54
Merge branch 'main' into feat-hydra-config
JemmaLDaniel Dec 4, 2025
bb25d28
fix: convert predictions_path to a Path before file loading
JemmaLDaniel Dec 4, 2025
a883fcd
docs: add instructions on conversion from mgf to parquet file
JemmaLDaniel Dec 4, 2025
e9126d9
docs: remove references to old Typer CLI arguments
JemmaLDaniel Dec 4, 2025
864095e
feat: create toy data for CLI quickstart
JemmaLDaniel Dec 4, 2025
d614fcf
docs: add documentation for quickstarting with the toy data
JemmaLDaniel Dec 4, 2025
999e42f
fix: allow for location, overriding and composition of configs when i…
JemmaLDaniel Dec 4, 2025
ad8b1f5
chore: update example notebook with new object instantiation argument…
JemmaLDaniel Dec 4, 2025
00a006b
ci: migrate coverage badge to Gist-based dynamic system
JemmaLDaniel Dec 4, 2025
571b3b3
chore: track new config position
JemmaLDaniel Dec 8, 2025
cb9edfe
chore: merge branch 'main' into feat-hydra-config
JemmaLDaniel Jan 6, 2026
36c50a2
chore: update README.md
JemmaLDaniel Jan 16, 2026
29e182e
fix: remove np.float64 artifacts from example CSV
JemmaLDaniel Jan 16, 2026
0e12735
chore: remove unused workspace config
JemmaLDaniel Jan 16, 2026
4ea04a5
chore: update requirements
JemmaLDaniel Jan 16, 2026
c12ae19
fix: added instanovo version compatibility layer
JemmaLDaniel Jan 16, 2026
46c66e1
chore: bump instanovo package version
JemmaLDaniel Jan 16, 2026
bf5c90d
fix: update pyproject.toml to include compatibility layer
JemmaLDaniel Jan 20, 2026
35d1a36
fix: do not filter sample data outputs on FDR 0.05
JemmaLDaniel Jan 20, 2026
c0ad9b2
docs: update old CLI references
JemmaLDaniel Jan 20, 2026
e7986f9
fix: reference MZTabDatasetLoader residue_remapping correctly
JemmaLDaniel Jan 20, 2026
5794913
chore: update pyproject.toml with winnow sub-directories
JemmaLDaniel Jan 21, 2026
5418e6b
docs: recommend use of Make commands in quickstart
JemmaLDaniel Jan 21, 2026
431123b
chore: use uv to run quickstart commands
JemmaLDaniel Jan 21, 2026
82b9666
chore: remove unused global variable
JemmaLDaniel Jan 21, 2026
9 changes: 9 additions & 0 deletions .gitignore
@@ -12,12 +12,21 @@ docs_public
*.csv
*.parquet
*.ipc
*.mztab
*.fasta
*.mgf
*.pkl
*.json
*.yaml
*.pdf
*.png

*.ipynb

examples/winnow-general-model
examples/winnow-ms-datasets
examples/output

build/

.cursorrules
39 changes: 25 additions & 14 deletions README.md
@@ -35,9 +35,9 @@
<a href="https://instadeepai.github.io/winnow/"><strong>Explore the docs »</strong></a>
<br />
<br />
<a href="https://github.com/instadeepai/winnow/issues/new?labels=bug&template=bug_report.md">Report Bug</a>
<a href="https://github.com/instadeepai/winnow/issues/new?labels=bug&template=bug_report.md">Report bug</a>
&middot;
<a href="https://github.com/instadeepai/winnow/issues/new?labels=enhancement&template=feature_request.md">Request Feature</a>
<a href="https://github.com/instadeepai/winnow/issues/new?labels=enhancement&template=feature_request.md">Request feature</a>
</p>
</div>

@@ -48,16 +48,13 @@
<summary>Table of Contents</summary>
<ol>
<li>
<a href="#about-the-project">About The Project</a>
<a href="#about-the-project">About the project</a>
</li>
<li>
<a href="#installation">Installation</a>
</li>
<li><a href="#usage">Usage</a>
<ul>
<li><a href="#CLI">CLI</a></li>
<li><a href="#Package">Package</a></li>
</ul>
<li>
<a href="#usage">Usage</a>
</li>
<li><a href="#contributing">Contributing</a></li>
</ol>
@@ -70,7 +67,7 @@
</div>

<!-- ABOUT THE PROJECT -->
## About The Project
## About the project

<!-- [![Product Name Screen Shot][product-screenshot]](https://example.com) -->
In bottom-up proteomics workflows, peptide sequencing—matching an MS2 spectrum to a peptide—is just the first step. The resulting peptide-spectrum matches (PSMs) often contain many incorrect identifications, which can negatively impact downstream tasks like protein assembly.
@@ -80,7 +77,7 @@ To mitigate this, intermediate steps are introduced to:
1. Assign confidence scores to PSMs that better correlate with correctness.
2. Estimate and control the false discovery rate (FDR) by filtering identifications based on confidence scores.

For database search-based peptide sequencing, PSM rescoring and target-decoy competition (TDC) are standard approaches, supported by an extensive ecosystem of tools. However, *de novo* peptide sequencing lacks standardized methods for these tasks.
For database search-based peptide sequencing, PSM rescoring and target-decoy competition (TDC) are standard approaches, supported by an extensive ecosystem of tools. However, *de novo* peptide sequencing lacks standardised methods for these tasks.

`winnow` aims to fill this gap by implementing the calibrate-estimate framework for FDR estimation. Unlike TDC, this approach is directly applicable to *de novo* sequencing models. Additionally, its calibration step naturally incorporates common confidence rescoring workflows as part of FDR estimation.

@@ -121,11 +118,24 @@ Installing `winnow` provides the `winnow` command with two sub-commands:

By default, `winnow predict` uses a pretrained general model (`InstaDeepAI/winnow-general-model`) hosted on HuggingFace Hub, allowing you to get started immediately without training. You can also specify custom HuggingFace models or use locally trained models.

Refer to the documentation for details on command-line arguments and usage examples.
Winnow uses [Hydra](https://hydra.cc/) for flexible, hierarchical configuration management. All parameters can be configured via YAML files or overridden on the command line:

```bash
# Quick start with defaults
winnow predict

# Override specific parameters
winnow predict fdr_control.fdr_threshold=0.01

# Specify different data source and dataset paths
winnow predict data_loader=mztab dataset.spectrum_path_or_directory=data/spectra.parquet dataset.predictions_path=data/preds.mztab
```
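
To make the override syntax concrete, here is a minimal sketch of how a `dotted.key=value` argument maps onto a nested configuration. It is an illustration only, not Hydra's implementation: the function name `apply_overrides` is hypothetical, values are parsed with `ast.literal_eval` as a simplifying assumption, and config-group selections such as `data_loader=mztab` (which swap in a whole YAML file) are not modelled.

```python
import ast
from copy import deepcopy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of `config` with `dotted.key=value` overrides applied."""
    result = deepcopy(config)
    for override in overrides:
        dotted_key, _, raw_value = override.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk (or create) the nested dictionaries named by the dotted path.
            node = node.setdefault(name, {})
        try:
            # Numbers, lists and similar literals are parsed into Python values.
            node[leaf] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything else (for example a file path) is kept as a plain string.
            node[leaf] = raw_value
    return result


defaults = {
    "fdr_control": {"fdr_threshold": 0.05, "confidence_column": "calibrated_confidence"},
    "dataset": {"predictions_path": "data/predictions.csv"},
}
resolved = apply_overrides(
    defaults,
    ["fdr_control.fdr_threshold=0.01", "dataset.predictions_path=data/preds.mztab"],
)
```

Values that are not overridden keep their defaults from the YAML files, and the defaults themselves are left untouched.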

Refer to the [CLI Guide](cli.md) and [Configuration Guide](configuration.md) for details on usage and configuration options.

### Package

The `winnow` package is organized into three sub-modules:
The `winnow` package is organised into three sub-modules:

1. `winnow.datasets` – Handles data loading and saving, including the `CalibrationDataset` class for mapping peptide sequencing output formats.
2. `winnow.calibration` – Implements confidence calibration. Key components include:
@@ -145,10 +155,11 @@ For an example, check out the [example notebook](https://github.com/instadeepai/
Contributions are what make the open-source community such an amazing place to learn, inspire and create, and we welcome your support! Any contributions you make are **greatly appreciated**.

If you have ideas for enhancements, you can:

- Fork the repository and submit a pull request.
- Open an issue and tag it with "enhancement".

### Contribution Process
### Contribution process

1. Fork the repository.
2. Create a feature branch (`git checkout -b feature/AmazingFeature`).
@@ -159,7 +170,7 @@ Don't forget to give the project a star! Thanks again! :star:

<p align="right">(<a href="#readme-top">back to top</a>)</p>

### BibTeX entry and citation info
### BibTeX entry and citation information

If you use `winnow` in your research, please cite the following preprint:

48 changes: 48 additions & 0 deletions config/calibrator.yaml
@@ -0,0 +1,48 @@
# --- Calibrator configuration ---

calibrator:
  _target_: winnow.calibration.calibrator.ProbabilityCalibrator

  seed: 42
  hidden_layer_sizes: [50, 50] # The number of neurons in each hidden layer of the MLP classifier.
  learning_rate_init: 0.001 # The initial learning rate for the MLP classifier.
  alpha: 0.0001 # L2 regularisation parameter for the MLP classifier.
  max_iter: 1000 # Maximum number of training iterations for the MLP classifier.
  early_stopping: true # Whether to use early stopping to terminate training.
  validation_fraction: 0.1 # Proportion of training data to use for early stopping validation.

  features:
    mass_error:
      _target_: winnow.calibration.calibration_features.MassErrorFeature
      residue_masses: ${residue_masses} # The residue masses to use for the mass error feature.

    prosit_features:
      _target_: winnow.calibration.calibration_features.PrositFeatures
      mz_tolerance: 0.02
      learn_from_missing: true # Whether to learn from missing Prosit features. If False, errors will be raised when invalid spectra are encountered.
      invalid_prosit_tokens: ${invalid_prosit_tokens} # The tokens to consider as invalid for Prosit features.
      prosit_intensity_model_name: Prosit_2020_intensity_HCD # The name of the Prosit intensity model to use.

    retention_time_feature:
      _target_: winnow.calibration.calibration_features.RetentionTimeFeature
      hidden_dim: 10 # The hidden dimension size for the MLP regressor used to predict iRT from observed retention times.
      train_fraction: 0.1 # The fraction of the data to use for training the iRT predictor.
      learn_from_missing: true # Whether to learn from missing retention time features. If False, errors will be raised when invalid spectra are encountered.
      seed: 42 # Random seed for the MLP regressor.
      learning_rate_init: 0.001 # The initial learning rate for the MLP regressor.
      alpha: 0.0001 # L2 regularisation parameter for the MLP regressor.
      max_iter: 200 # Maximum number of training iterations for the MLP regressor.
      early_stopping: false # Whether to use early stopping for the MLP regressor.
      validation_fraction: 0.1 # Proportion of training data to use for early stopping validation.
      invalid_prosit_tokens: ${invalid_prosit_tokens} # The tokens to consider as invalid for Prosit iRT features.
      prosit_irt_model_name: Prosit_2019_irt # The name of the Prosit iRT model to use.

    chimeric_features:
      _target_: winnow.calibration.calibration_features.ChimericFeatures
      mz_tolerance: 0.02
      learn_from_missing: true # Whether to learn from missing chimeric features. If False, errors will be raised when invalid spectra are encountered.
      invalid_prosit_tokens: ${invalid_prosit_tokens} # The tokens to consider as invalid for Prosit chimeric intensity features.
      prosit_intensity_model_name: Prosit_2020_intensity_HCD # The name of the Prosit intensity model to use.

    beam_features:
      _target_: winnow.calibration.calibration_features.BeamFeatures
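
The `_target_` keys in this file name the class that each block should be turned into, with the sibling keys passed as constructor arguments. Hydra provides `hydra.utils.instantiate` for this. The sketch below is a simplified stand-in written with the standard library only, not Hydra's implementation, and it uses `types.SimpleNamespace` as a placeholder target (an assumption, so that the example runs without `winnow` installed).

```python
import importlib
from typing import Any


def instantiate(node: Any) -> Any:
    """Recursively build objects from dictionaries that carry a `_target_` key."""
    if isinstance(node, dict):
        # Build nested objects first, so they can be passed to the parent constructor.
        kwargs = {key: instantiate(value) for key, value in node.items() if key != "_target_"}
        if "_target_" not in node:
            return kwargs
        module_path, _, attribute = node["_target_"].rpartition(".")
        factory = getattr(importlib.import_module(module_path), attribute)
        return factory(**kwargs)
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    return node


# Placeholder targets: the real config points at winnow classes instead.
config = {
    "_target_": "types.SimpleNamespace",
    "seed": 42,
    "features": {
        "prosit_features": {"_target_": "types.SimpleNamespace", "mz_tolerance": 0.02},
    },
}
calibrator = instantiate(config)
```

Because nested blocks are built before their parent, a single call on the top-level block yields a fully constructed object with its feature objects attached.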
23 changes: 23 additions & 0 deletions config/data_loader/instanovo.yaml
@@ -0,0 +1,23 @@
# --- InstaNovo data loading configuration ---

_target_: winnow.datasets.data_loaders.InstaNovoDatasetLoader

residue_masses: ${residue_masses}
residue_remapping: # Used to map InstaNovo legacy notations to UNIMOD tokens.
  "M(ox)": "M[UNIMOD:35]" # Oxidation
  "M(+15.99)": "M[UNIMOD:35]" # Oxidation
  "S(p)": "S[UNIMOD:21]" # Phosphorylation
  "T(p)": "T[UNIMOD:21]" # Phosphorylation
  "Y(p)": "Y[UNIMOD:21]" # Phosphorylation
  "S(+79.97)": "S[UNIMOD:21]" # Phosphorylation
  "T(+79.97)": "T[UNIMOD:21]" # Phosphorylation
  "Y(+79.97)": "Y[UNIMOD:21]" # Phosphorylation
  "Q(+0.98)": "Q[UNIMOD:7]" # Deamidation
  "N(+0.98)": "N[UNIMOD:7]" # Deamidation
  "Q(+.98)": "Q[UNIMOD:7]" # Deamidation
  "N(+.98)": "N[UNIMOD:7]" # Deamidation
  "C(+57.02)": "C[UNIMOD:4]" # Carbamidomethylation
  # N-terminal modifications.
  "(+42.01)": "[UNIMOD:1]" # Acetylation
  "(+43.01)": "[UNIMOD:5]" # Carbamylation
  "(-17.03)": "[UNIMOD:385]" # Ammonia loss
20 changes: 20 additions & 0 deletions config/data_loader/mztab.yaml
@@ -0,0 +1,20 @@
# --- MZTab data loading configuration ---
_target_: winnow.datasets.data_loaders.MZTabDatasetLoader

residue_masses: ${residue_masses}
residue_remapping: # Used to map Casanovo-specific notations to UNIMOD tokens.
  "M+15.995": "M[UNIMOD:35]" # Oxidation
  "Q+0.984": "Q[UNIMOD:7]" # Deamidation
  "N+0.984": "N[UNIMOD:7]" # Deamidation
  "+42.011": "[UNIMOD:1]" # Acetylation
  "+43.006": "[UNIMOD:5]" # Carbamylation
  "-17.027": "[UNIMOD:385]" # Ammonia loss
  "C+57.021": "C[UNIMOD:4]" # Carbamidomethylation
  "C[Carbamidomethyl]": "C[UNIMOD:4]" # Carbamidomethylation
  "M[Oxidation]": "M[UNIMOD:35]" # Oxidation
  "N[Deamidated]": "N[UNIMOD:7]" # Deamidation
  "Q[Deamidated]": "Q[UNIMOD:7]" # Deamidation
  # N-terminal modifications.
  "[Acetyl]-": "[UNIMOD:1]" # Acetylation
  "[Carbamyl]-": "[UNIMOD:5]" # Carbamylation
  "[Ammonia-loss]-": "[UNIMOD:385]" # Ammonia loss
5 changes: 5 additions & 0 deletions config/data_loader/pointnovo.yaml
@@ -0,0 +1,5 @@
# --- PointNovo data loading configuration ---

_target_: winnow.datasets.data_loaders.PointNovoDatasetLoader

residue_masses: ${residue_masses}
7 changes: 7 additions & 0 deletions config/data_loader/winnow.yaml
@@ -0,0 +1,7 @@
# --- Winnow data loading configuration ---

_target_: winnow.datasets.data_loaders.WinnowDatasetLoader

residue_masses: ${residue_masses}
# The internal Winnow dataset loader does not need a residue remapping
# since it uses the UNIMOD tokens directly.
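
The `residue_remapping` tables in the InstaNovo and MZTab loader configs translate tool-specific modification notations into UNIMOD tokens. A minimal sketch of applying such a table to a peptide string follows. It assumes a single-pass, longest-key-first substitution, which may differ from what the loaders actually do; the function name `remap_residues` is hypothetical, and only a subset of the InstaNovo table is reproduced.

```python
import re

# A subset of the mapping in config/data_loader/instanovo.yaml.
RESIDUE_REMAPPING = {
    "M(ox)": "M[UNIMOD:35]",
    "M(+15.99)": "M[UNIMOD:35]",
    "S(p)": "S[UNIMOD:21]",
    "Q(+0.98)": "Q[UNIMOD:7]",
    "C(+57.02)": "C[UNIMOD:4]",
    "(+42.01)": "[UNIMOD:1]",
}


def remap_residues(peptide: str, remapping: dict) -> str:
    """Replace every legacy notation in `peptide` with its UNIMOD token."""
    # Longest keys first, so a residue-specific notation wins over a shorter one.
    keys = sorted(remapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass means already-replaced text is never matched again.
    return pattern.sub(lambda match: remapping[match.group(0)], peptide)


remapped = remap_residues("(+42.01)AM(ox)C(+57.02)K", RESIDUE_REMAPPING)
```

A sequence already written with UNIMOD tokens passes through unchanged, which is consistent with the Winnow loader needing no remapping.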
8 changes: 8 additions & 0 deletions config/fdr_method/database_grounded.yaml
@@ -0,0 +1,8 @@
# --- Database-grounded FDR control configuration ---

_target_: winnow.fdr.database_grounded.DatabaseGroundedFDRControl

confidence_feature: ${fdr_control.confidence_column} # Name of the column with confidence scores to use for FDR estimation.
residue_masses: ${residue_masses} # The residue masses from global `residues` config
isotope_error_range: [0, 1] # The isotope error range for matching peptides
drop: 10 # The number of top predictions to drop for stability
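
Values such as `${residue_masses}` and `${fdr_control.confidence_column}` are interpolations: references to other keys in the composed configuration, looked up from the root. The sketch below shows the idea with plain dictionaries. It is not OmegaConf's implementation (which resolves lazily and supports more syntax); it resolves eagerly and handles only values that consist of exactly one `${...}` reference.

```python
import re

# Matches a value that is exactly one `${dotted.key}` reference.
INTERPOLATION = re.compile(r"^\$\{([^}]+)\}$")


def lookup(root: dict, dotted_key: str):
    """Follow a dotted key such as `fdr_control.confidence_column` from the root."""
    node = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(node, root: dict):
    """Return a copy of `node` with every interpolation replaced by its target."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, root) for item in node]
    if isinstance(node, str):
        match = INTERPOLATION.match(node)
        if match:
            # The target may itself contain interpolations, so resolve it too.
            return resolve(lookup(root, match.group(1)), root)
    return node


composed = {
    "residue_masses": {"G": 57.021464, "A": 71.037114},
    "fdr_control": {"confidence_column": "calibrated_confidence"},
    "fdr_method": {
        "confidence_feature": "${fdr_control.confidence_column}",
        "residue_masses": "${residue_masses}",
    },
}
resolved_config = resolve(composed, composed)
```

This is why the residue masses are defined once in `config/residues.yaml` and referenced everywhere else.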
3 changes: 3 additions & 0 deletions config/fdr_method/nonparametric.yaml
@@ -0,0 +1,3 @@
# --- Non-parametric FDR control configuration ---

_target_: winnow.fdr.nonparametric.NonParametricFDRControl
38 changes: 38 additions & 0 deletions config/predict.yaml
@@ -0,0 +1,38 @@
# --- Predicting scores and applying FDR control ---
defaults:
  - _self_
  - residues
  - data_loader: instanovo # Options: instanovo, mztab, pointnovo, winnow
  - fdr_method: nonparametric # Options: nonparametric, database_grounded

# --- Pipeline Execution Configuration ---

dataset:
  # Dataset paths:
  # Path to the spectrum data file, or to the folder containing a saved internal Winnow dataset.
  spectrum_path_or_directory: data/spectra.ipc
  # Path to the beam predictions file.
  # Set to `null` if the data source is `winnow`, or loading will fail.
  predictions_path: data/predictions.csv
  # NOTE: Make sure that the data loader type matches the data source type in this dataset section.

calibrator:
  # Model loading:
  # Path to the local calibrator directory or the HuggingFace model identifier.
  # A local directory path is used directly; a HuggingFace repository identifier is downloaded from HuggingFace.
  pretrained_model_name_or_path: InstaDeepAI/winnow-general-model
  # Directory to cache the HuggingFace model.
  cache_dir: null # Leave as `null` when using a local model, or to use the default HuggingFace cache directory.

fdr_control:
  # FDR settings:
  # Target FDR threshold (e.g. 0.01 for 1%, 0.05 for 5%).
  fdr_threshold: 0.05
  # Name of the column with confidence scores to use for FDR estimation.
  confidence_column: calibrated_confidence

# Folder path to write the outputs to.
# This will create two CSV files in the output folder:
# - metadata.csv: Contains all metadata and feature columns from the input dataset.
# - preds_and_fdr_metrics.csv: Contains predictions and FDR metrics.
output_folder: results/predictions
64 changes: 64 additions & 0 deletions config/residues.yaml
@@ -0,0 +1,64 @@
# --- Residues configuration ---

# This is Winnow's internal residue representation.
# We use this to calculate the mass error feature and during database-grounded FDR control.
# We also use this to initialise the residue set for the Metrics class.
residue_masses:
  "G": 57.021464
  "A": 71.037114
  "S": 87.032028
  "P": 97.052764
  "V": 99.068414
  "T": 101.047670
  "C": 103.009185
  "L": 113.084064
  "I": 113.084064
  "N": 114.042927
  "D": 115.026943
  "Q": 128.058578
  "K": 128.094963
  "E": 129.042593
  "M": 131.040485
  "H": 137.058912
  "F": 147.068414
  "R": 156.101111
  "Y": 163.063329
  "W": 186.079313
  # Modifications
  "M[UNIMOD:35]": 147.035400 # Oxidation
  "C[UNIMOD:4]": 160.030649 # Carboxyamidomethylation
  "N[UNIMOD:7]": 115.026943 # Deamidation
  "Q[UNIMOD:7]": 129.042594 # Deamidation
  "R[UNIMOD:7]": 157.085127 # Arginine citrullination
  "P[UNIMOD:35]": 113.047679 # Proline hydroxylation
  "S[UNIMOD:21]": 166.998028 # Phosphorylation + 79.966
  "T[UNIMOD:21]": 181.01367 # Phosphorylation + 79.966
  "Y[UNIMOD:21]": 243.029329 # Phosphorylation + 79.966
  "C[UNIMOD:312]": 222.013284 # Cysteinylation
  "E[UNIMOD:27]": 111.032028 # Glu -> pyro-Glu
  "Q[UNIMOD:28]": 111.032029 # Gln -> pyro-Glu
  # Terminal modifications
  "[UNIMOD:1]": 42.010565 # Acetylation
  "[UNIMOD:5]": 43.005814 # Carbamylation
  "[UNIMOD:385]": -17.026549 # NH3 loss
  "(+25.98)": 25.980265 # Carbamylation & NH3 loss (legacy notation)

# The tokens to consider as invalid for Prosit features.
# We also filter out non-carboxyamidomethylated Cysteine in a separate step.
invalid_prosit_tokens:
  # InstaNovo
  - "[UNIMOD:7]"
  - "[UNIMOD:21]"
  - "[UNIMOD:1]"
  - "[UNIMOD:5]"
  - "[UNIMOD:385]"
  - "(+25.98)" # (legacy notation)
  # Casanovo
  - "+0.984"
  - "+42.011"
  - "+43.006"
  - "-17.027"
  - "[Ammonia-loss]-"
  - "[Carbamyl]-"
  - "[Acetyl]-"
  - "[Deamidated]"
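
As a worked example of how `residue_masses` feeds the mass error feature, the sketch below sums residue masses for a UNIMOD-tokenised peptide and compares the result with the observed precursor mass. Several details are assumptions rather than taken from this configuration: the water and proton mass constants, the tokenisation regex, the sign convention (observed minus theoretical), and the function names, which are hypothetical.

```python
import re

# A subset of `residue_masses` from config/residues.yaml.
RESIDUE_MASSES = {
    "G": 57.021464,
    "A": 71.037114,
    "K": 128.094963,
    "M": 131.040485,
    "M[UNIMOD:35]": 147.035400,
    "[UNIMOD:1]": 42.010565,
}
WATER_MASS = 18.010565  # Monoisotopic mass of H2O (assumed; not part of the config).
PROTON_MASS = 1.007276  # Mass of a proton (assumed; not part of the config).

# A residue letter with an optional UNIMOD tag, or a bare terminal UNIMOD tag.
TOKEN_PATTERN = re.compile(r"[A-Z](?:\[UNIMOD:\d+\])?|\[UNIMOD:\d+\]")


def peptide_mass(peptide: str) -> float:
    """Theoretical neutral mass: the sum of residue masses plus one water."""
    tokens = TOKEN_PATTERN.findall(peptide)
    if "".join(tokens) != peptide:
        raise ValueError(f"Could not tokenise peptide: {peptide!r}")
    return sum(RESIDUE_MASSES[token] for token in tokens) + WATER_MASS


def mass_error(peptide: str, precursor_mz: float, charge: int) -> float:
    """Observed neutral precursor mass minus the theoretical peptide mass."""
    observed_mass = (precursor_mz - PROTON_MASS) * charge
    return observed_mass - peptide_mass(peptide)


theoretical = peptide_mass("[UNIMOD:1]AM[UNIMOD:35]K")
```

A residue token missing from `RESIDUE_MASSES` raises a `KeyError` here, which is one reason every token produced by a data loader's remapping needs an entry in `residue_masses`.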
21 changes: 21 additions & 0 deletions config/train.yaml
@@ -0,0 +1,21 @@
# --- Training a calibrator ---
defaults:
  - _self_
  - residues
  - calibrator
  - data_loader: instanovo # Options: instanovo, mztab, pointnovo, winnow

# --- Pipeline Execution Configuration ---

dataset:
  # Dataset paths:
  # Path to the spectrum data file, or to the folder containing a saved internal Winnow dataset.
  spectrum_path_or_directory: data/spectra.ipc
  # Path to the beam predictions file.
  # Set to `null` if the data source is `winnow`, or loading will fail.
  predictions_path: data/predictions.csv
  # NOTE: Make sure that the data loader type matches the data source type in this dataset section.

# Output paths:
model_output_dir: models/new_model
dataset_output_path: results/calibrated_dataset.csv