@JemmaLDaniel
Collaborator

Summary

This PR implements the Typer + Hydra hybrid architecture proposed in #146, refactoring Winnow's configuration management from flat CLI signatures to a flexible, hierarchical system that enables scalable configuration of complex nested components and automatic object instantiation.

Implementation Details

1. Typer + Hydra Hybrid Architecture

Typer now acts as a thin command dispatcher, passing all configuration to Hydra:

import typer

def train(ctx: typer.Context) -> None:
    """Passes control directly to the Hydra training pipeline."""
    # Forward any extra CLI tokens (Hydra overrides) unchanged; None when there are none.
    overrides = ctx.args if ctx.args else None
    train_entry_point(overrides)

Pipeline logic has moved to the `train_entry_point()` and `predict_entry_point()` functions, which handle Hydra initialization, configuration composition and pipeline execution.
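
To illustrate the "thin dispatcher" idea without depending on Typer or Hydra, here is a standard-library analogue built on argparse. It is not the PR's implementation: the `dispatch` helper is hypothetical, and the two entry points are stand-in stubs that only echo what they receive, whereas the real functions compose a Hydra configuration and run the pipeline.

```python
from __future__ import annotations

import argparse
from typing import Optional, Sequence


def train_entry_point(overrides: Optional[list[str]]) -> dict:
    """Stand-in stub (assumption): the real function runs the training pipeline."""
    return {"command": "train", "overrides": overrides or []}


def predict_entry_point(overrides: Optional[list[str]]) -> dict:
    """Stand-in stub (assumption): the real function runs the prediction pipeline."""
    return {"command": "predict", "overrides": overrides or []}


def dispatch(argv: Sequence[str]) -> dict:
    """Pick a command, then forward every remaining token untouched."""
    parser = argparse.ArgumentParser(prog="winnow")
    parser.add_argument("command", choices=["train", "predict"])
    # parse_known_args returns unrecognised tokens instead of raising an error,
    # so the dispatcher needs no knowledge of individual configuration keys.
    args, extra = parser.parse_known_args(list(argv))
    overrides = extra if extra else None
    entry_points = {"train": train_entry_point, "predict": predict_entry_point}
    return entry_points[args.command](overrides)


if __name__ == "__main__":
    print(dispatch(["train", "calibrator.seed=42"]))
```

The point of the design is that adding a new configuration key never requires touching the CLI signature, because the dispatcher does not interpret the forwarded tokens.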

2. Structured Configuration with Composition

Created a modular configuration structure in config/:

  • train.yaml / predict.yaml - Main pipeline configurations
  • calibrator.yaml - Model architecture and features
  • residues.yaml - Amino acid masses and modifications (shared via composition)
  • data_loader/ - Pluggable dataset format loaders (InstaNovo, MZTab, PointNovo, Winnow)
  • fdr_method/ - Pluggable FDR methods (nonparametric, database-grounded)

Configuration files use Hydra's defaults mechanism to compose shared components.
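
As a rough mental model of composition, the sketch below deep-merges several configuration fragments in order, with later fragments overriding earlier ones. It is a simplification written with plain dictionaries, not Hydra's defaults-list algorithm, and the `deep_merge` and `compose` helpers and the key layout under `residues` are assumptions made for illustration.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where `update` is merged into `base`, key by key."""
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the old value.
            merged[key] = deepcopy(value)
    return merged


def compose(*fragments: dict[str, Any]) -> dict[str, Any]:
    """Merge fragments left to right; later fragments take precedence."""
    result: dict[str, Any] = {}
    for fragment in fragments:
        result = deep_merge(result, fragment)
    return result


# Hypothetical fragments standing in for residues.yaml, calibrator.yaml and train.yaml.
residues = {"residues": {"masses": {"G": 57.021464, "A": 71.037114}}}
calibrator = {"calibrator": {"seed": 0}}
train = {"calibrator": {"seed": 42}}

if __name__ == "__main__":
    print(compose(residues, calibrator, train))
```

Because the shared fragment is merged into both pipelines, a value such as a residue mass is defined in one place only.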

3. Hydra-Based Object Instantiation

Used Hydra's _target_ field for automatic instantiation:

  • Data loaders instantiated from configuration without manual if/elif logic
  • FDR methods selected and configured via YAML
  • Users can inject custom implementations by creating YAML configs with _target_ pointing to their classes
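
The mechanism behind `_target_` can be sketched in a few lines of standard-library Python: resolve the dotted path to a callable, then call it with the remaining keys as keyword arguments. This is a minimal illustration, not Hydra's `hydra.utils.instantiate`, which also handles nested targets, positional arguments and partial instantiation. The `instantiate` helper below is hypothetical, and `fractions.Fraction` is only a stand-in for a data loader or FDR method class.

```python
import importlib
from fractions import Fraction
from typing import Any, Mapping


def instantiate(config: Mapping[str, Any]) -> Any:
    """Build an object from a mapping that names its class under `_target_`."""
    params = dict(config)
    target = params.pop("_target_")
    # Split "package.module.ClassName" into its module path and attribute name.
    module_name, _, attribute = target.rpartition(".")
    factory = getattr(importlib.import_module(module_name), attribute)
    return factory(**params)


# Stand-in configuration (assumption): any importable class works the same way.
example_cfg = {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 20}

if __name__ == "__main__":
    print(instantiate(example_cfg))
```

Since the class is looked up by name at runtime, swapping implementations means editing a string in YAML, with no if/elif branch to extend.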

4. Configuration Inspection Commands

Added a winnow config command group:

  • winnow config train - Display resolved training configuration
  • winnow config predict - Display resolved prediction configuration

Implemented a custom ConfigFormatter class that colour-codes each line according to its YAML nesting depth, for improved terminal readability.
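
Depth-based colouring could look roughly like the sketch below, which infers nesting depth from leading spaces and wraps each line in an ANSI colour code. It does not reproduce the PR's ConfigFormatter: the function name, the palette, and the assumption of two-space indentation are all illustrative choices.

```python
# Illustrative palette (assumption): cyan, green, yellow, magenta, then repeat.
ANSI_COLOURS = ["\033[36m", "\033[32m", "\033[33m", "\033[35m"]
RESET = "\033[0m"


def colourise_by_depth(yaml_text: str, indent: int = 2) -> str:
    """Colour each non-empty line according to its indentation depth."""
    coloured_lines = []
    for line in yaml_text.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            # Leave blank lines untouched.
            coloured_lines.append(line)
            continue
        depth = (len(line) - len(stripped)) // indent
        colour = ANSI_COLOURS[depth % len(ANSI_COLOURS)]
        coloured_lines.append(f"{colour}{line}{RESET}")
    return "\n".join(coloured_lines)


if __name__ == "__main__":
    print(colourise_by_depth("calibrator:\n  seed: 42"))
```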

5. Lazy Imports for CLI Performance

Implemented a lazy import pattern using TYPE_CHECKING to defer heavy dependencies (PyTorch, InstaNovo, etc.) until command execution. Because --help and the config commands no longer import those dependencies, they respond quickly, whilst pipeline commands still have access to everything they require.
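
The general shape of this pattern is shown below, with the standard-library `decimal` module standing in for a heavy dependency such as PyTorch. The function name is hypothetical and the sketch is not taken from main.py.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime,
    # so importing this module stays cheap.
    import decimal


def run_pipeline(raw_value: str) -> decimal.Decimal:
    """Hypothetical command body that needs the 'heavy' dependency."""
    # The real import happens here, on first use, not at module import time.
    import decimal

    return decimal.Decimal(raw_value)


if __name__ == "__main__":
    print(run_pipeline("1.5"))
```

The `from __future__ import annotations` line keeps the return annotation as an unevaluated string, which is what allows it to name a module that has not been imported at runtime.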

Added a module-level docstring in main.py explaining the rationale.

6. Documentation Updates

Minor improvements to CLI help text and documentation to reflect the new Hydra-based configuration system with examples of dot-notation overrides.

Migration Notes

Existing users will need to:

  • Use configuration files in config/ instead of passing all parameters via CLI flags
  • Override parameters using dot notation: winnow train calibrator.seed=42
  • Consult winnow config <pipeline> to inspect resolved configurations
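
For readers new to dot notation, the sketch below shows how an override such as `calibrator.seed=42` maps onto a nested configuration. It is a simplified stand-in for Hydra's override grammar: the helper names are hypothetical, value parsing covers only booleans, integers and floats, and unlike Hydra's default behaviour it silently creates keys that do not yet exist.

```python
from copy import deepcopy
from typing import Any


def parse_value(raw: str) -> Any:
    """Convert a raw string to bool, int or float where possible."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Return a copy of `config` with each 'a.b.c=value' override applied."""
    result = deepcopy(config)
    for item in overrides:
        path, separator, raw = item.partition("=")
        if not separator:
            raise ValueError(f"Override must look like key=value, got: {item!r}")
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            # Walk down the hierarchy, creating levels as needed (a simplification).
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw)
    return result


if __name__ == "__main__":
    print(apply_overrides({"calibrator": {"seed": 0}}, ["calibrator.seed=42"]))
```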

@JemmaLDaniel
Collaborator Author

JemmaLDaniel commented Nov 26, 2025

Commits 20ee8b3 and 2529582 also address #143 and #140

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek November 26, 2025 18:18
@JemmaLDaniel JemmaLDaniel self-assigned this Nov 26, 2025
@JemmaLDaniel JemmaLDaniel added the enhancement and documentation labels Nov 26, 2025
@JemmaLDaniel JemmaLDaniel force-pushed the feat-hydra-config branch 4 times, most recently from f0bafe0 to b459d60 December 4, 2025 18:59
…nstalled as a package

chore: fix pre-commit on main script

chore: remove testing Make commands

fix: correct the path for config_path_utils

fix: correct the path for config_path_utils

chore: pre-commit formatting fixes for test_config_paths
Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>

chore: update README.md

Co-authored-by: Jeroen Van Goey <j.vangoey@instadeep.com>

chore: remove trailing whitespace
@JemmaLDaniel JemmaLDaniel requested a review from BioGeek January 16, 2026 14:40
Contributor

@BioGeek BioGeek left a comment

When I run

make sample-data
make train-sample
make predict-sample

it generates CSV files in results/predictions that contain only headers.

  • preds_and_fdr_metrics.csv:
,calibrated_confidence,prediction,psm_fdr,psm_q_value,sequence,psm_pep,spectrum_id
  • metadata.csv:
,spectrum_id,prediction_untokenised,confidence,sequence_untokenised,token_log_probabilities_beam_0,token_log_probabilities_beam_1,token_log_probabilities_beam_2,precursor_mz,precursor_charge,precursor_mass,retention_time,mz_array,intensity_array,valid_peptide,valid_prediction,num_matches,correct,Mass Error,is_missing_prosit_features,prosit_mz,prosit_intensity,ion_matches,ion_match_intensity,is_missing_irt_error,iRT,predicted iRT,iRT error,is_missing_chimeric_features,runner_up_prosit_mz,runner_up_prosit_intensity,chimeric_ion_matches,chimeric_ion_match_intensity,margin,median_margin,entropy,z-score

@JemmaLDaniel
Collaborator Author

make predict-sample saved no predictions because we filter to an FDR of <= 0.05 by default, and none of the sample data passed this threshold. I've changed the make command to disable filtering, so it now returns all of the input sample data rows.
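
A small sketch of why this produces header-only files: the header is written unconditionally, while rows are written only if they pass the threshold. The rows and FDR values below are made up for illustration, the `write_filtered` helper is hypothetical, and using `None` to mean "no filtering" is an assumption rather than how the Make command actually disables it.

```python
from __future__ import annotations

import csv
import io
from typing import Optional

# Illustrative rows only (assumption): both exceed the default 0.05 threshold.
sample_rows = [
    {"spectrum_id": "s1", "psm_fdr": 0.20},
    {"spectrum_id": "s2", "psm_fdr": 0.35},
]
FIELDS = ["spectrum_id", "psm_fdr"]


def write_filtered(rows: list[dict], fdr_threshold: Optional[float] = 0.05) -> str:
    """Write rows at or below the FDR threshold; None disables filtering."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()  # Always written, even when every row is filtered out.
    for row in rows:
        if fdr_threshold is None or row["psm_fdr"] <= fdr_threshold:
            writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(write_filtered(sample_rows))        # header only
    print(write_filtered(sample_rows, None))  # header plus both rows
```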

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek January 21, 2026 16:52
Contributor

@BioGeek BioGeek left a comment

Thanks for persevering and getting this over the finish line!

@JemmaLDaniel JemmaLDaniel merged commit 767a464 into main Jan 23, 2026
4 checks passed