instadeepai · JemmaLDaniel · Jan 23, 2026 · Nov 26, 2025
diff --git a/.gitignore b/.gitignore
@@ -12,12 +12,21 @@ docs_public
 *.csv
 *.parquet
 *.ipc
+*.mztab
+*.fasta
+*.mgf
 *.pkl
 *.json
 *.yaml
+*.pdf
+*.png
+
+*.ipynb
 
 examples/winnow-general-model
 examples/winnow-ms-datasets
 examples/output
 
 build/
+
+.cursorrules
diff --git a/README.md b/README.md
@@ -35,9 +35,9 @@
     <a href="https://instadeepai.github.io/winnow/"><strong>Explore the docs »</strong></a>
     <br />
     <br />
-    <a href="https://github.com/instadeepai/winnow/issues/new?labels=bug&template=bug_report.md">Report Bug</a>
+    <a href="https://github.com/instadeepai/winnow/issues/new?labels=bug&template=bug_report.md">Report bug</a>
     &middot;
-    <a href="https://github.com/instadeepai/winnow/issues/new?labels=enhancement&template=feature_request.md">Request Feature</a>
+    <a href="https://github.com/instadeepai/winnow/issues/new?labels=enhancement&template=feature_request.md">Request feature</a>
   </p>
 </div>
 
@@ -48,16 +48,13 @@
   <summary>Table of Contents</summary>
   <ol>
     <li>
-      <a href="#about-the-project">About The Project</a>
+      <a href="#about-the-project">About the project</a>
     </li>
     <li>
       <a href="#installation">Installation</a>
     </li>
-    <li><a href="#usage">Usage</a>
-      <ul>
-        <li><a href="#CLI">CLI</a></li>
-        <li><a href="#Package">Package</a></li>
-      </ul>
+    <li>
+      <a href="#usage">Usage</a>
     </li>
     <li><a href="#contributing">Contributing</a></li>
   </ol>
@@ -70,7 +67,7 @@
 </div>
 
 <!-- ABOUT THE PROJECT -->
-## About The Project
+## About the project
 
 <!-- [![Product Name Screen Shot][product-screenshot]](https://example.com) -->
 In bottom-up proteomics workflows, peptide sequencing—matching an MS2 spectrum to a peptide—is just the first step. The resulting peptide-spectrum matches (PSMs) often contain many incorrect identifications, which can negatively impact downstream tasks like protein assembly.
@@ -80,7 +77,7 @@ To mitigate this, intermediate steps are introduced to:
 1. Assign confidence scores to PSMs that better correlate with correctness.
 2. Estimate and control the false discovery rate (FDR) by filtering identifications based on confidence scores.
 
-For database search-based peptide sequencing, PSM rescoring and target-decoy competition (TDC) are standard approaches, supported by an extensive ecosystem of tools. However, *de novo* peptide sequencing lacks standardized methods for these tasks.
+For database search-based peptide sequencing, PSM rescoring and target-decoy competition (TDC) are standard approaches, supported by an extensive ecosystem of tools. However, *de novo* peptide sequencing lacks standardised methods for these tasks.
 
 `winnow` aims to fill this gap by implementing the calibrate-estimate framework for FDR estimation. Unlike TDC, this approach is directly applicable to *de novo* sequencing models. Additionally, its calibration step naturally incorporates common confidence rescoring workflows as part of FDR estimation.
 
@@ -121,11 +118,24 @@ Installing `winnow` provides the `winnow` command with two sub-commands:
 
 By default, `winnow predict` uses a pretrained general model (`InstaDeepAI/winnow-general-model`) hosted on HuggingFace Hub, allowing you to get started immediately without training. You can also specify custom HuggingFace models or use locally trained models.
 
-Refer to the documentation for details on command-line arguments and usage examples.
+Winnow uses [Hydra](https://hydra.cc/) for flexible, hierarchical configuration management. All parameters can be configured via YAML files or overridden on the command line:
+
+```bash
+# Quick start with defaults
+winnow predict
+
+# Override specific parameters
+winnow predict fdr_control.fdr_threshold=0.01
+
+# Specify different data source and dataset paths
+winnow predict data_loader=mztab dataset.spectrum_path_or_directory=data/spectra.parquet dataset.predictions_path=data/preds.mztab
+```
+
+Refer to the [CLI Guide](cli.md) and [Configuration Guide](configuration.md) for details on usage and configuration options.
 
 ### Package
 
-The `winnow` package is organized into three sub-modules:
+The `winnow` package is organised into three sub-modules:
 
 1. `winnow.datasets` – Handles data loading and saving, including the `CalibrationDataset` class for mapping peptide sequencing output formats.
 2. `winnow.calibration` – Implements confidence calibration. Key components include:
@@ -145,10 +155,11 @@ For an example, check out the [example notebook](https://github.com/instadeepai/
 Contributions are what make the open-source community such an amazing place to learn, inspire and create, and we welcome your support! Any contributions you make are **greatly appreciated**.
 
 If you have ideas for enhancements, you can:
+
 - Fork the repository and submit a pull request.
 - Open an issue and tag it with "enhancement".
 
-### Contribution Process
+### Contribution process
 
 1. Fork the repository.
 2. Create a feature branch (`git checkout -b feature/AmazingFeature`).
@@ -159,7 +170,7 @@ Don't forget to give the project a star! Thanks again! :star:
 
 <p align="right">(<a href="#readme-top">back to top</a>)</p>
 
-### BibTeX entry and citation info
+### BibTeX entry and citation information
 
 If you use `winnow` in your research, please cite the following preprint:
 

diff --git a/config/calibrator.yaml b/config/calibrator.yaml
@@ -0,0 +1,48 @@
+# --- Calibrator configuration ---
+
+calibrator:
+  _target_: winnow.calibration.calibrator.ProbabilityCalibrator
+
+  seed: 42
+  hidden_layer_sizes: [50, 50]  # The number of neurons in each hidden layer of the MLP classifier.
+  learning_rate_init: 0.001  # The initial learning rate for the MLP classifier.
+  alpha: 0.0001  # L2 regularisation parameter for the MLP classifier.
+  max_iter: 1000  # Maximum number of training iterations for the MLP classifier.
+  early_stopping: true  # Whether to use early stopping to terminate training.
+  validation_fraction: 0.1  # Proportion of training data to use for early stopping validation.
+
+  features:
+    mass_error:
+      _target_: winnow.calibration.calibration_features.MassErrorFeature
+      residue_masses: ${residue_masses}  # The residue masses to use for the mass error feature.
+
+    prosit_features:
+      _target_: winnow.calibration.calibration_features.PrositFeatures
+      mz_tolerance: 0.02
+      learn_from_missing: true  # Whether to learn from missing Prosit features. If False, errors will be raised when invalid spectra are encountered.
+      invalid_prosit_tokens: ${invalid_prosit_tokens}  # The tokens to consider as invalid for Prosit features.
+      prosit_intensity_model_name: Prosit_2020_intensity_HCD  # The name of the Prosit intensity model to use.
+
+    retention_time_feature:
+      _target_: winnow.calibration.calibration_features.RetentionTimeFeature
+      hidden_dim: 10  # The hidden dimension size for the MLP regressor used to predict iRT from observed retention times.
+      train_fraction: 0.1  # The fraction of the data to use for training the iRT predictor.
+      learn_from_missing: true  # Whether to learn from missing retention time features. If False, errors will be raised when invalid spectra are encountered.
+      seed: 42  # Random seed for the MLP regressor.
+      learning_rate_init: 0.001  # The initial learning rate for the MLP regressor.
+      alpha: 0.0001  # L2 regularisation parameter for the MLP regressor.
+      max_iter: 200  # Maximum number of training iterations for the MLP regressor.
+      early_stopping: false  # Whether to use early stopping for the MLP regressor.
+      validation_fraction: 0.1  # Proportion of training data to use for early stopping validation.
+      invalid_prosit_tokens: ${invalid_prosit_tokens}  # The tokens to consider as invalid for Prosit iRT features.
+      prosit_irt_model_name: Prosit_2019_irt  # The name of the Prosit iRT model to use.
+
+    chimeric_features:
+      _target_: winnow.calibration.calibration_features.ChimericFeatures
+      mz_tolerance: 0.02
+      learn_from_missing: true  # Whether to learn from missing chimeric features. If False, errors will be raised when invalid spectra are encountered.
+      invalid_prosit_tokens: ${invalid_prosit_tokens}  # The tokens to consider as invalid for Prosit chimeric intensity features.
+      prosit_intensity_model_name: Prosit_2020_intensity_HCD  # The name of the Prosit intensity model to use.
+
+    beam_features:
+      _target_: winnow.calibration.calibration_features.BeamFeatures
diff --git a/config/data_loader/instanovo.yaml b/config/data_loader/instanovo.yaml
@@ -0,0 +1,23 @@
+# --- InstaNovo data loading configuration ---
+
+_target_: winnow.datasets.data_loaders.InstaNovoDatasetLoader
+
+residue_masses: ${residue_masses}
+residue_remapping:  # Used to map InstaNovo legacy notations to UNIMOD tokens.
+  "M(ox)": "M[UNIMOD:35]"  # Oxidation
+  "M(+15.99)": "M[UNIMOD:35]"  # Oxidation
+  "S(p)": "S[UNIMOD:21]"  # Phosphorylation
+  "T(p)": "T[UNIMOD:21]"  # Phosphorylation
+  "Y(p)": "Y[UNIMOD:21]"  # Phosphorylation
+  "S(+79.97)": "S[UNIMOD:21]"  # Phosphorylation
+  "T(+79.97)": "T[UNIMOD:21]"  # Phosphorylation
+  "Y(+79.97)": "Y[UNIMOD:21]"  # Phosphorylation
+  "Q(+0.98)": "Q[UNIMOD:7]"  # Deamidation
+  "N(+0.98)": "N[UNIMOD:7]"  # Deamidation
+  "Q(+.98)": "Q[UNIMOD:7]"  # Deamidation
+  "N(+.98)": "N[UNIMOD:7]"  # Deamidation
+  "C(+57.02)": "C[UNIMOD:4]"  # Carbamidomethylation
+  # N-terminal modifications.
+  "(+42.01)": "[UNIMOD:1]"  # Acetylation
+  "(+43.01)": "[UNIMOD:5]"  # Carbamylation
+  "(-17.03)": "[UNIMOD:385]"  # Ammonia loss
diff --git a/config/data_loader/mztab.yaml b/config/data_loader/mztab.yaml
@@ -0,0 +1,20 @@
+# --- MZTab data loading configuration ---
+_target_: winnow.datasets.data_loaders.MZTabDatasetLoader
+
+residue_masses: ${residue_masses}
+residue_remapping:  # Used to map Casanovo-specific notations to UNIMOD tokens.
+  "M+15.995": "M[UNIMOD:35]"  # Oxidation
+  "Q+0.984": "Q[UNIMOD:7]"  # Deamidation
+  "N+0.984": "N[UNIMOD:7]"  # Deamidation
+  "+42.011": "[UNIMOD:1]"  # Acetylation
+  "+43.006": "[UNIMOD:5]"  # Carbamylation
+  "-17.027": "[UNIMOD:385]"  # Ammonia loss
+  "C+57.021": "C[UNIMOD:4]"  # Carbamidomethylation
+  "C[Carbamidomethyl]": "C[UNIMOD:4]"  # Carbamidomethylation
+  "M[Oxidation]": "M[UNIMOD:35]"  # Oxidation
+  "N[Deamidated]": "N[UNIMOD:7]"  # Deamidation
+  "Q[Deamidated]": "Q[UNIMOD:7]"  # Deamidation
+  # N-terminal modifications.
+  "[Acetyl]-": "[UNIMOD:1]"  # Acetylation
+  "[Carbamyl]-": "[UNIMOD:5]"  # Carbamylation
+  "[Ammonia-loss]-": "[UNIMOD:385]"  # Ammonia loss
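
As a rough illustration of how a `residue_remapping` table like the one above might be applied, the sketch below rewrites Casanovo-style tokens into UNIMOD tokens with plain string replacement. The `remap_residues` helper and the longest-key-first ordering are assumptions made for this example; the actual `MZTabDatasetLoader` may tokenise and remap sequences differently.

```python
# Hypothetical helper for illustration only; not part of Winnow's API.
# A subset of the Casanovo remapping shown in config/data_loader/mztab.yaml.
RESIDUE_REMAPPING = {
    "M+15.995": "M[UNIMOD:35]",  # Oxidation
    "Q+0.984": "Q[UNIMOD:7]",    # Deamidation
    "N+0.984": "N[UNIMOD:7]",    # Deamidation
    "C+57.021": "C[UNIMOD:4]",   # Carbamidomethylation
    "+42.011": "[UNIMOD:1]",     # N-terminal acetylation
}


def remap_residues(peptide: str, remapping: dict[str, str]) -> str:
    """Rewrite every source token found in `peptide` to its UNIMOD form."""
    # Replace longer keys first so that a short key can never consume
    # part of a longer, more specific one.
    for source in sorted(remapping, key=len, reverse=True):
        peptide = peptide.replace(source, remapping[source])
    return peptide


print(remap_residues("+42.011PEPM+15.995IDEC+57.021K", RESIDUE_REMAPPING))
```

Plain substring replacement is enough here because none of the UNIMOD targets contains a source token, so already-converted residues are never rewritten a second time.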
diff --git a/config/data_loader/pointnovo.yaml b/config/data_loader/pointnovo.yaml
@@ -0,0 +1,5 @@
+# --- PointNovo data loading configuration ---
+
+_target_: winnow.datasets.data_loaders.PointNovoDatasetLoader
+
+residue_masses: ${residue_masses}
diff --git a/config/data_loader/winnow.yaml b/config/data_loader/winnow.yaml
@@ -0,0 +1,7 @@
+# --- Winnow data loading configuration ---
+
+_target_: winnow.datasets.data_loaders.WinnowDatasetLoader
+
+residue_masses: ${residue_masses}
+# The internal Winnow dataset loader does not need a residue remapping
+# since it uses the UNIMOD tokens directly.
diff --git a/config/fdr_method/database_grounded.yaml b/config/fdr_method/database_grounded.yaml
@@ -0,0 +1,8 @@
+# --- Database-grounded FDR control configuration ---
+
+_target_: winnow.fdr.database_grounded.DatabaseGroundedFDRControl
+
+confidence_feature: ${fdr_control.confidence_column}  # Name of the column with confidence scores to use for FDR estimation.
+residue_masses: ${residue_masses}  # The residue masses from global `residues` config
+isotope_error_range: [0, 1]  # The isotope error range for matching peptides
+drop: 10  # The number of top predictions to drop for stability
diff --git a/config/fdr_method/nonparametric.yaml b/config/fdr_method/nonparametric.yaml
@@ -0,0 +1,3 @@
+# --- Non-parametric FDR control configuration ---
+
+_target_: winnow.fdr.nonparametric.NonParametricFDRControl
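
The non-parametric method takes no parameters beyond its `_target_`. One way such an estimator can work, assuming the confidence scores are well-calibrated probabilities of correctness, is to treat `1 - confidence` as the expected error of each PSM and average it over everything accepted so far. The sketch below follows that assumption; `estimate_fdr` and `confidence_cutoff` are hypothetical names, and this is not necessarily how `NonParametricFDRControl` is implemented.

```python
def estimate_fdr(confidences):
    """Return (confidence, estimated FDR) pairs, most confident first.

    Assumes each confidence is a calibrated probability that the PSM is correct.
    """
    ranked = sorted(confidences, reverse=True)
    results = []
    expected_errors = 0.0
    for rank, confidence in enumerate(ranked, start=1):
        expected_errors += 1.0 - confidence  # expected number of wrong PSMs so far
        results.append((confidence, expected_errors / rank))
    return results


def confidence_cutoff(confidences, fdr_threshold):
    """Lowest confidence whose estimated FDR stays within the threshold, else None."""
    cutoff = None
    for confidence, fdr in estimate_fdr(confidences):
        if fdr <= fdr_threshold:
            cutoff = confidence
    return cutoff


scores = [0.99, 0.95, 0.90, 0.60, 0.30]
print(confidence_cutoff(scores, fdr_threshold=0.05))
```

Because the scores are processed in descending order, the running mean of `1 - confidence` never decreases, so the last score that satisfies the threshold is a well-defined cutoff.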
diff --git a/config/predict.yaml b/config/predict.yaml
@@ -0,0 +1,38 @@
+# --- Predicting scores and applying FDR control ---
+defaults:
+  - _self_
+  - residues
+  - data_loader: instanovo  # Options: instanovo, mztab, pointnovo, winnow
+  - fdr_method: nonparametric  # Options: nonparametric, database_grounded
+
+# --- Pipeline Execution Configuration ---
+
+dataset:
+  # Dataset paths:
+  # Path to the spectrum data file, or to the folder containing a saved internal Winnow dataset.
+  spectrum_path_or_directory: data/spectra.ipc
+  # Path to the beam predictions file.
+  # Leave as `null` if data source is `winnow`, or loading will fail.
+  predictions_path: data/predictions.csv
+  # NOTE: Make sure that the data loader type matches the data source type in this dataset section.
+
+calibrator:
+  # Model loading:
+  # Path to the local calibrator directory or the HuggingFace model identifier.
+  # If the path is a local directory path, it will be used directly. If it is a HuggingFace repository identifier, it will be downloaded from HuggingFace.
+  pretrained_model_name_or_path: InstaDeepAI/winnow-general-model
+  # Directory to cache the HuggingFace model.
+  cache_dir: null  # Leave as `null` when using a local model, or to use HuggingFace's default cache directory.
+
+fdr_control:
+  # FDR settings:
+  # Target FDR threshold (e.g. 0.01 for 1%, 0.05 for 5% etc.).
+  fdr_threshold: 0.05
+  # Name of the column with confidence scores to use for FDR estimation.
+  confidence_column: calibrated_confidence
+
+# Folder path to write the outputs to.
+# This will create two CSV files in the output folder:
+# - metadata.csv: Contains all metadata and feature columns from the input dataset.
+# - preds_and_fdr_metrics.csv: Contains predictions and FDR metrics.
+output_folder: results/predictions
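
To make the command-line overrides concrete, the sketch below shows how a dotted override such as `fdr_control.fdr_threshold=0.01` maps onto the nested structure of `predict.yaml`. It is a deliberately simplified stand-in written with the standard library only: `apply_override` is a hypothetical helper, its value parsing is naive, and it does not cover config-group selections such as `data_loader=mztab`, which Hydra resolves by loading a different file.

```python
import copy

# A small excerpt of the values defined in config/predict.yaml.
PREDICT_CONFIG = {
    "fdr_control": {
        "fdr_threshold": 0.05,
        "confidence_column": "calibrated_confidence",
    },
    "output_folder": "results/predictions",
}


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    node = updated
    for name in parents:
        node = node[name]  # raises KeyError if the path does not exist
    # Naive parsing for this sketch: numbers become floats, anything else stays a string.
    try:
        value = float(raw_value)
    except ValueError:
        value = raw_value
    node[leaf] = value
    return updated


overridden = apply_override(PREDICT_CONFIG, "fdr_control.fdr_threshold=0.01")
print(overridden["fdr_control"]["fdr_threshold"])
```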
diff --git a/config/residues.yaml b/config/residues.yaml
@@ -0,0 +1,64 @@
+# --- Residues configuration ---
+
+# This is Winnow's internal residue representation.
+# We use this to calculate the mass error feature and during database-grounded FDR control.
+# We also use this to initialise the residue set for the Metrics class.
+residue_masses:
+  "G": 57.021464
+  "A": 71.037114
+  "S": 87.032028
+  "P": 97.052764
+  "V": 99.068414
+  "T": 101.047670
+  "C": 103.009185
+  "L": 113.084064
+  "I": 113.084064
+  "N": 114.042927
+  "D": 115.026943
+  "Q": 128.058578
+  "K": 128.094963
+  "E": 129.042593
+  "M": 131.040485
+  "H": 137.058912
+  "F": 147.068414
+  "R": 156.101111
+  "Y": 163.063329
+  "W": 186.079313
+  # Modifications
+  "M[UNIMOD:35]": 147.035400 # Oxidation
+  "C[UNIMOD:4]": 160.030649 # Carbamidomethylation
+  "N[UNIMOD:7]": 115.026943 # Deamidation
+  "Q[UNIMOD:7]": 129.042594 # Deamidation
+  "R[UNIMOD:7]": 157.085127 # Arginine citrullination
+  "P[UNIMOD:35]": 113.047679 # Proline hydroxylation
+  "S[UNIMOD:21]": 166.998028 # Phosphorylation + 79.966
+  "T[UNIMOD:21]": 181.01367 # Phosphorylation + 79.966
+  "Y[UNIMOD:21]": 243.029329 # Phosphorylation + 79.966
+  "C[UNIMOD:312]": 222.013284  # Cysteinylation
+  "E[UNIMOD:27]": 111.032028  # Glu -> pyro-Glu
+  "Q[UNIMOD:28]": 111.032029  # Gln -> pyro-Glu
+  # Terminal modifications
+  "[UNIMOD:1]": 42.010565 # Acetylation
+  "[UNIMOD:5]": 43.005814 # Carbamylation
+  "[UNIMOD:385]": -17.026549 # NH3 loss
+  "(+25.98)": 25.980265  # Carbamylation & NH3 loss (legacy notation)
+
+# The tokens to consider as invalid for Prosit features.
+# We also filter out non-carbamidomethylated cysteine in a separate step.
+invalid_prosit_tokens:
+  # InstaNovo
+  - "[UNIMOD:7]"
+  - "[UNIMOD:21]"
+  - "[UNIMOD:1]"
+  - "[UNIMOD:5]"
+  - "[UNIMOD:385]"
+  - "(+25.98)"  # (legacy notation)
+  # Casanovo
+  - "+0.984"
+  - "+42.011"
+  - "+43.006"
+  - "-17.027"
+  - "[Ammonia-loss]-"
+  - "[Carbamyl]-"
+  - "[Acetyl]-"
+  - "[Deamidated]"
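
The comments in `residues.yaml` say the residue masses feed the mass error feature. A minimal sketch of that kind of calculation follows, assuming the theoretical neutral mass is the sum of residue masses plus one water and the observed neutral mass is derived from precursor m/z and charge. The tokenising regex, the constants `WATER` and `PROTON`, and the choice to report the error in daltons are all assumptions for this example, not Winnow's implementation.

```python
import re

# A subset of the values from config/residues.yaml.
RESIDUE_MASSES = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "E": 129.042593, "M": 131.040485,
    "M[UNIMOD:35]": 147.035400,  # Oxidation
    "[UNIMOD:1]": 42.010565,     # N-terminal acetylation
}
WATER = 18.010565   # monoisotopic mass of H2O (assumed constant)
PROTON = 1.007276   # mass of a proton (assumed constant)

# Matches a residue with an optional UNIMOD tag, or a bare terminal UNIMOD tag.
TOKEN = re.compile(r"[A-Z]?\[UNIMOD:\d+\]|[A-Z]")


def peptide_mass(peptide: str) -> float:
    """Theoretical neutral mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASSES[token] for token in TOKEN.findall(peptide)) + WATER


def mass_error(peptide: str, precursor_mz: float, charge: int) -> float:
    """Observed minus theoretical neutral mass, in daltons."""
    observed = precursor_mz * charge - charge * PROTON
    return observed - peptide_mass(peptide)


print(round(peptide_mass("GASP"), 6))
print(round(peptide_mass("[UNIMOD:1]M[UNIMOD:35]E"), 6))
```

Keying the lookup on whole tokens such as `M[UNIMOD:35]` mirrors the layout of `residue_masses`, where each modified residue carries its own total mass instead of a separate delta.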
diff --git a/config/train.yaml b/config/train.yaml
@@ -0,0 +1,21 @@
+# --- Training a calibrator ---
+defaults:
+  - _self_
+  - residues
+  - calibrator
+  - data_loader: instanovo  # Options: instanovo, mztab, pointnovo, winnow
+
+# --- Pipeline Execution Configuration ---
+
+dataset:
+  # Dataset paths:
+  # Path to the spectrum data file, or to the folder containing a saved internal Winnow dataset.
+  spectrum_path_or_directory: data/spectra.ipc
+  # Path to the beam predictions file.
+  # Leave as `null` if data source is `winnow`, or loading will fail.
+  predictions_path: data/predictions.csv
+  # NOTE: Make sure that the data loader type matches the data source type in this dataset section.
+
+# Output paths:
+model_output_dir: models/new_model
+dataset_output_path: results/calibrated_dataset.csv