Implement Typer + Hydra Configuration Architecture #147
Merged
Changes from 9 commits
34 commits
a2ae8d3 chore: add hydra to project dependencies (JemmaLDaniel)
d51a264 feat: use hydra to configure winnow runs (JemmaLDaniel)
8d3a02a test: update tests to use extra init arguments (JemmaLDaniel)
5e730c7 feat: add winnow config command to view resolved configuration (JemmaLDaniel)
20ee8b3 docs: document hydra config usage with winnow cli (JemmaLDaniel)
2529582 docs: make docs titles sentence case and fix bullet list formatting (JemmaLDaniel)
d7e713c perf: optimise CLI startup time with lazy imports (JemmaLDaniel)
07bfc18 chore: merge branch 'main' into feat-hydra-config (JemmaLDaniel)
980a793 chore: update gitignore to ignore extra supported files and images (JemmaLDaniel)
b18bd54 Merge branch 'main' into feat-hydra-config (JemmaLDaniel)
bb25d28 fix: convert predictions_path to a Path before file loading (JemmaLDaniel)
a883fcd docs: add instructions on conversion from mgf to parquet file (JemmaLDaniel)
e9126d9 docs: remove references to old Typer CLI arguments (JemmaLDaniel)
864095e feat: create toy data for CLI quickstart (JemmaLDaniel)
d614fcf docs: add documentation for quickstarting with the toy data (JemmaLDaniel)
999e42f fix: allow for location, overriding and composition of configs when i… (JemmaLDaniel)
ad8b1f5 chore: update example notebook with new object instantiation argument… (JemmaLDaniel)
00a006b ci: migrate coverage badge to Gist-based dynamic system (JemmaLDaniel)
571b3b3 chore: track new config position (JemmaLDaniel)
cb9edfe chore: merge branch 'main' into feat-hydra-config (JemmaLDaniel)
36c50a2 chore: update README.md (JemmaLDaniel)
29e182e fix: remove np.float64 artifacts from example CSV (JemmaLDaniel)
0e12735 chore: remove unused workspace config (JemmaLDaniel)
4ea04a5 chore: update requirements (JemmaLDaniel)
c12ae19 fix: added instanovo version compatibility layer (JemmaLDaniel)
46c66e1 chore: bump instanovo package version (JemmaLDaniel)
bf5c90d fix: update pyproject.toml to include compatibility layer (JemmaLDaniel)
35d1a36 fix: do not filter sample data outputs on FDR 0.05 (JemmaLDaniel)
c0ad9b2 docs: update old CLI references (JemmaLDaniel)
e7986f9 fix: reference MZTabDatasetLoader residue_remapping correctly (JemmaLDaniel)
5794913 chore: update pyproject.toml with winnow sub-directories (JemmaLDaniel)
5418e6b docs: recommend use of Make commands in quickstart (JemmaLDaniel)
431123b chore: use uv to run quickstart commands (JemmaLDaniel)
82b9666 chore: remove unused global variable (JemmaLDaniel)

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| # --- Calibrator configuration --- | ||
| | ||
| calibrator: | ||
| _target_: winnow.calibration.calibrator.ProbabilityCalibrator | ||
| | ||
| seed: 42 | ||
| hidden_layer_sizes: [50, 50] # The number of neurons in each hidden layer of the MLP classifier. | ||
| learning_rate_init: 0.001 # The initial learning rate for the MLP classifier. | ||
| alpha: 0.0001 # L2 regularisation parameter for the MLP classifier. | ||
| max_iter: 1000 # Maximum number of training iterations for the MLP classifier. | ||
| early_stopping: true # Whether to use early stopping to terminate training. | ||
| validation_fraction: 0.1 # Proportion of training data to use for early stopping validation. | ||
| | ||
| features: | ||
| mass_error: | ||
| _target_: winnow.calibration.calibration_features.MassErrorFeature | ||
| residue_masses: ${residue_masses} # The residue masses to use for the mass error feature. | ||
| | ||
| prosit_features: | ||
| _target_: winnow.calibration.calibration_features.PrositFeatures | ||
| mz_tolerance: 0.02 | ||
| learn_from_missing: true # Whether to learn from missing Prosit features. If False, errors will be raised when invalid spectra are encountered. | ||
| invalid_prosit_tokens: ${invalid_prosit_tokens} # The tokens to consider as invalid for Prosit features. | ||
| prosit_intensity_model_name: Prosit_2020_intensity_HCD # The name of the Prosit intensity model to use. | ||
| | ||
| retention_time_feature: | ||
| _target_: winnow.calibration.calibration_features.RetentionTimeFeature | ||
| hidden_dim: 10 # The hidden dimension size for the MLP regressor used to predict iRT from observed retention times. | ||
| train_fraction: 0.1 # The fraction of the data to use for training the iRT predictor. | ||
| learn_from_missing: true # Whether to learn from missing retention time features. If False, errors will be raised when invalid spectra are encountered. | ||
| seed: 42 # Random seed for the MLP regressor. | ||
| learning_rate_init: 0.001 # The initial learning rate for the MLP regressor. | ||
| alpha: 0.0001 # L2 regularisation parameter for the MLP regressor. | ||
| max_iter: 200 # Maximum number of training iterations for the MLP regressor. | ||
| early_stopping: false # Whether to use early stopping for the MLP regressor. | ||
| validation_fraction: 0.1 # Proportion of training data to use for early stopping validation. | ||
| invalid_prosit_tokens: ${invalid_prosit_tokens} # The tokens to consider as invalid for Prosit iRT features. | ||
| prosit_irt_model_name: Prosit_2019_irt # The name of the Prosit iRT model to use. | ||
| | ||
| chimeric_features: | ||
| _target_: winnow.calibration.calibration_features.ChimericFeatures | ||
| mz_tolerance: 0.02 | ||
| learn_from_missing: true # Whether to learn from missing chimeric features. If False, errors will be raised when invalid spectra are encountered. | ||
| invalid_prosit_tokens: ${invalid_prosit_tokens} # The tokens to consider as invalid for Prosit chimeric intensity features. | ||
| prosit_intensity_model_name: Prosit_2020_intensity_HCD # The name of the Prosit intensity model to use. | ||
| | ||
| beam_features: | ||
| _target_: winnow.calibration.calibration_features.BeamFeatures |
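
A note on the `_target_` keys used throughout these configs: Hydra treats `_target_` as a dotted import path to a callable and passes the sibling keys to it as keyword arguments. The sketch below imitates that idea with the standard library only. It is a simplified illustration, not Hydra's implementation (which also handles nested targets, partials and interpolation). The helper name `instantiate_from_config` is hypothetical, and `fractions.Fraction` is a stand-in target chosen so the snippet runs without Winnow installed.

```python
import importlib
from typing import Any, Mapping


def instantiate_from_config(config: Mapping[str, Any]) -> Any:
    """Build an object from a mapping that carries a `_target_` dotted path.

    Hypothetical helper: a minimal imitation of the `_target_` convention.
    """
    # Split "package.module.ClassName" into the module path and the attribute name.
    module_path, _, attr_name = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    # Every key other than `_target_` becomes a keyword argument.
    kwargs = {key: value for key, value in config.items() if key != "_target_"}
    return target(**kwargs)


# Stand-in for an entry such as `mass_error` above; a stdlib class is the target.
example_config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_from_config(example_config)
print(obj)
```

In the real configs the keyword arguments would be the feature settings shown above, for example `mz_tolerance` or `learn_from_missing`.
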
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # --- InstaNovo data loading configuration --- | ||
| | ||
| _target_: winnow.datasets.data_loaders.InstaNovoDatasetLoader | ||
| | ||
| residue_masses: ${residue_masses} | ||
| residue_remapping: # Used to map InstaNovo legacy notations to UNIMOD tokens. | ||
| "M(ox)": "M[UNIMOD:35]" # Oxidation | ||
| "M(+15.99)": "M[UNIMOD:35]" # Oxidation | ||
| "S(p)": "S[UNIMOD:21]" # Phosphorylation | ||
| "T(p)": "T[UNIMOD:21]" # Phosphorylation | ||
| "Y(p)": "Y[UNIMOD:21]" # Phosphorylation | ||
| "S(+79.97)": "S[UNIMOD:21]" # Phosphorylation | ||
| "T(+79.97)": "T[UNIMOD:21]" # Phosphorylation | ||
| "Y(+79.97)": "Y[UNIMOD:21]" # Phosphorylation | ||
| "Q(+0.98)": "Q[UNIMOD:7]" # Deamidation | ||
| "N(+0.98)": "N[UNIMOD:7]" # Deamidation | ||
| "Q(+.98)": "Q[UNIMOD:7]" # Deamidation | ||
| "N(+.98)": "N[UNIMOD:7]" # Deamidation | ||
| "C(+57.02)": "C[UNIMOD:4]" # Carbamidomethylation | ||
| # N-terminal modifications. | ||
| "(+42.01)": "[UNIMOD:1]" # Acetylation | ||
| "(+43.01)": "[UNIMOD:5]" # Carbamylation | ||
| "(-17.03)": "[UNIMOD:385]" # Ammonia loss |
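
To illustrate how a `residue_remapping` table like the one above could be applied to a predicted peptide string, here is a minimal sketch. It assumes a plain single-pass token replacement; how `InstaNovoDatasetLoader` actually applies the mapping is not shown in this diff, and `remap_peptide` is a hypothetical helper name.

```python
import re
from typing import Mapping

# A subset of the mapping shown above.
RESIDUE_REMAPPING = {
    "M(ox)": "M[UNIMOD:35]",
    "S(p)": "S[UNIMOD:21]",
    "C(+57.02)": "C[UNIMOD:4]",
    "(+42.01)": "[UNIMOD:1]",
}


def remap_peptide(peptide: str, remapping: Mapping[str, str]) -> str:
    """Replace legacy tokens in a single pass, trying longer keys first."""
    if not remapping:
        return peptide
    # Longest keys first, so a residue-specific key wins over a shorter overlapping one.
    keys = sorted(remapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single regex pass never re-scans text that has already been replaced.
    return pattern.sub(lambda match: remapping[match.group(0)], peptide)


remapped = remap_peptide("(+42.01)M(ox)PEPS(p)C(+57.02)K", RESIDUE_REMAPPING)
print(remapped)
```

A single pass is used instead of chained `str.replace` calls so that the output of one replacement cannot be matched again by a later key.
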
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # --- MZTab data loading configuration --- | ||
| _target_: winnow.datasets.data_loaders.MZTabDatasetLoader | ||
| | ||
| residue_masses: ${residue_masses} | ||
| residue_remapping: # Used to map Casanovo-specific notations to UNIMOD tokens. | ||
| "M+15.995": "M[UNIMOD:35]" # Oxidation | ||
| "Q+0.984": "Q[UNIMOD:7]" # Deamidation | ||
| "N+0.984": "N[UNIMOD:7]" # Deamidation | ||
| "+42.011": "[UNIMOD:1]" # Acetylation | ||
| "+43.006": "[UNIMOD:5]" # Carbamylation | ||
| "-17.027": "[UNIMOD:385]" # Ammonia loss | ||
| "C+57.021": "C[UNIMOD:4]" # Carbamidomethylation | ||
| "C[Carbamidomethyl]": "C[UNIMOD:4]" # Carbamidomethylation | ||
| "M[Oxidation]": "M[UNIMOD:35]" # Oxidation | ||
| "N[Deamidated]": "N[UNIMOD:7]" # Deamidation | ||
| "Q[Deamidated]": "Q[UNIMOD:7]" # Deamidation | ||
| # N-terminal modifications. | ||
| "[Acetyl]-": "[UNIMOD:1]" # Acetylation | ||
| "[Carbamyl]-": "[UNIMOD:5]" # Carbamylation | ||
| "[Ammonia-loss]-": "[UNIMOD:385]" # Ammonia loss |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # --- PointNovo data loading configuration --- | ||
| | ||
| _target_: winnow.datasets.data_loaders.PointNovoDatasetLoader | ||
| | ||
| residue_masses: ${residue_masses} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| # --- Winnow data loading configuration --- | ||
| | ||
| _target_: winnow.datasets.data_loaders.WinnowDatasetLoader | ||
| | ||
| residue_masses: ${residue_masses} | ||
| # The internal Winnow dataset loader does not need a residue remapping | ||
| # since it uses the UNIMOD tokens directly. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| # --- Database-grounded FDR control configuration --- | ||
| | ||
| _target_: winnow.fdr.database_grounded.DatabaseGroundedFDRControl | ||
| | ||
| confidence_feature: ${fdr_control.confidence_column} # Name of the column with confidence scores to use for FDR estimation. | ||
| residue_masses: ${residue_masses} # The residue masses from global `residues` config | ||
| isotope_error_range: [0, 1] # The isotope error range for matching peptides | ||
| drop: 10 # The number of top predictions to drop for stability |
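
The `${fdr_control.confidence_column}` and `${residue_masses}` values above are interpolations: references to other keys in the composed configuration that are resolved on access. The sketch below mimics whole-value references with the standard library. It is an illustration only, not OmegaConf's resolver (which also supports references embedded in longer strings and custom resolvers). The `resolve` and `lookup` helpers are hypothetical, and the placement of this file under an `fdr_method` key is an assumption based on the config group name.

```python
import re
from typing import Any

# Matches a value that consists of exactly one `${dotted.key}` reference.
_REFERENCE = re.compile(r"^\$\{([^}]+)\}$")


def lookup(root: dict, dotted_key: str) -> Any:
    """Follow a dotted key such as `fdr_control.confidence_column` from the root."""
    node: Any = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(node: Any, root: dict) -> Any:
    """Return a copy of `node` with whole-value references replaced by their targets."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, root) for value in node]
    if isinstance(node, str):
        match = _REFERENCE.match(node)
        if match:
            # Resolve the target too, in case it is itself a reference.
            return resolve(lookup(root, match.group(1)), root)
    return node


config = {
    "residue_masses": {"G": 57.021464, "A": 71.037114},
    "fdr_control": {"confidence_column": "calibrated_confidence"},
    "fdr_method": {  # assumed location of the database-grounded config
        "confidence_feature": "${fdr_control.confidence_column}",
        "residue_masses": "${residue_masses}",
        "drop": 10,
    },
}
resolved = resolve(config, config)
print(resolved["fdr_method"]["confidence_feature"])
```
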
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| # --- Non-parametric FDR control configuration --- | ||
| | ||
| _target_: winnow.fdr.nonparametric.NonParametricFDRControl |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| # --- Predicting scores and applying FDR control --- | ||
| defaults: | ||
| - _self_ | ||
| - residues | ||
| - data_loader: instanovo # Options: instanovo, mztab, pointnovo, winnow | ||
| - fdr_method: nonparametric # Options: nonparametric, database_grounded | ||
| | ||
| # --- Pipeline Execution Configuration --- | ||
| | ||
| dataset: | ||
| # Dataset paths: | ||
| # Path to the spectrum data file, or to the folder containing a saved internal Winnow dataset. | ||
| spectrum_path_or_directory: data/spectra.ipc | ||
| # Path to the beam predictions file. | ||
| # Set to `null` if the data source is `winnow`; otherwise loading will fail. | ||
| predictions_path: data/predictions.csv | ||
| # NOTE: Make sure that the data loader type matches the data source type in this dataset section. | ||
| | ||
| calibrator: | ||
| # Model loading: | ||
| # Path to the local calibrator directory or the HuggingFace model identifier. | ||
| # If the path is a local directory path, it will be used directly. If it is a HuggingFace repository identifier, it will be downloaded from HuggingFace. | ||
| pretrained_model_name_or_path: InstaDeepAI/winnow-general-model | ||
| # Directory to cache the HuggingFace model. | ||
| cache_dir: null # Leave as `null` when using a local model, or to use HuggingFace's default cache directory. | ||
| | ||
| fdr_control: | ||
| # FDR settings: | ||
| # Target FDR threshold (e.g. 0.01 for 1%, 0.05 for 5% etc.). | ||
| fdr_threshold: 0.05 | ||
| # Name of the column with confidence scores to use for FDR estimation. | ||
| confidence_column: calibrated_confidence | ||
| | ||
| # Folder path to write the outputs to. | ||
| # This will create two CSV files in the output folder: | ||
| # - metadata.csv: Contains all metadata and feature columns from the input dataset. | ||
| # - preds_and_fdr_metrics.csv: Contains predictions and FDR metrics. | ||
| output_folder: results/predictions |
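
To make the roles of `fdr_threshold` and `confidence_column` concrete, the sketch below shows one generic way to turn calibrated confidences into a score cutoff. It assumes each confidence is the probability that the prediction is correct, so the expected number of errors among the accepted predictions is the sum of `1 - confidence`. This is a textbook-style estimate written for illustration. It is not taken from `NonParametricFDRControl`, whose implementation is not part of this diff, and `confidence_cutoff` is a hypothetical name.

```python
from typing import Optional, Sequence


def confidence_cutoff(confidences: Sequence[float], fdr_threshold: float) -> Optional[float]:
    """Return the lowest confidence that keeps the estimated FDR within the threshold.

    Returns None when no prediction can be accepted at the requested threshold.
    """
    cutoff = None
    expected_errors = 0.0
    # Walk from the most to the least confident prediction.
    for rank, confidence in enumerate(sorted(confidences, reverse=True), start=1):
        expected_errors += 1.0 - confidence
        # Estimated FDR among the top `rank` predictions.
        if expected_errors / rank <= fdr_threshold:
            cutoff = confidence
    return cutoff


scores = [0.99, 0.98, 0.90, 0.50]
cutoff = confidence_cutoff(scores, fdr_threshold=0.05)
print(cutoff)
```

With these four scores the estimated FDR is 0.01, 0.015, about 0.043 and about 0.158 as predictions are added, so the first three are accepted at a 5% threshold.
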
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| # --- Residues configuration --- | ||
| | ||
| # This is Winnow's internal residue representation. | ||
| # We use this to calculate the mass error feature and during database-grounded FDR control. | ||
| # We also use this to initialise the residue set for the Metrics class. | ||
| residue_masses: | ||
| "G": 57.021464 | ||
| "A": 71.037114 | ||
| "S": 87.032028 | ||
| "P": 97.052764 | ||
| "V": 99.068414 | ||
| "T": 101.047670 | ||
| "C": 103.009185 | ||
| "L": 113.084064 | ||
| "I": 113.084064 | ||
| "N": 114.042927 | ||
| "D": 115.026943 | ||
| "Q": 128.058578 | ||
| "K": 128.094963 | ||
| "E": 129.042593 | ||
| "M": 131.040485 | ||
| "H": 137.058912 | ||
| "F": 147.068414 | ||
| "R": 156.101111 | ||
| "Y": 163.063329 | ||
| "W": 186.079313 | ||
| # Modifications | ||
| "M[UNIMOD:35]": 147.035400 # Oxidation | ||
| "C[UNIMOD:4]": 160.030649 # Carbamidomethylation | ||
| "N[UNIMOD:7]": 115.026943 # Deamidation | ||
| "Q[UNIMOD:7]": 129.042594 # Deamidation | ||
| "R[UNIMOD:7]": 157.085127 # Arginine citrullination | ||
| "P[UNIMOD:35]": 113.047679 # Proline hydroxylation | ||
| "S[UNIMOD:21]": 166.998028 # Phosphorylation + 79.966 | ||
| "T[UNIMOD:21]": 181.01367 # Phosphorylation + 79.966 | ||
| "Y[UNIMOD:21]": 243.029329 # Phosphorylation + 79.966 | ||
| "C[UNIMOD:312]": 222.013284 # Cysteinylation | ||
| "E[UNIMOD:27]": 111.032028 # Glu -> pyro-Glu | ||
| "Q[UNIMOD:28]": 111.032029 # Gln -> pyro-Glu | ||
| # Terminal modifications | ||
| "[UNIMOD:1]": 42.010565 # Acetylation | ||
| "[UNIMOD:5]": 43.005814 # Carbamylation | ||
| "[UNIMOD:385]": -17.026549 # NH3 loss | ||
| "(+25.98)": 25.980265 # Carbamylation & NH3 loss (legacy notation) | ||
| | ||
| # The tokens to consider as invalid for Prosit features. | ||
| # We also filter out non-carbamidomethylated cysteine in a separate step. | ||
| invalid_prosit_tokens: | ||
| # InstaNovo | ||
| - "[UNIMOD:7]" | ||
| - "[UNIMOD:21]" | ||
| - "[UNIMOD:1]" | ||
| - "[UNIMOD:5]" | ||
| - "[UNIMOD:385]" | ||
| - "(+25.98)" # (legacy notation) | ||
| # Casanovo | ||
| - "+0.984" | ||
| - "+42.011" | ||
| - "+43.006" | ||
| - "-17.027" | ||
| - "[Ammonia-loss]-" | ||
| - "[Carbamyl]-" | ||
| - "[Acetyl]-" | ||
| - "[Deamidated]" |
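
As an illustration of how `residue_masses` and `invalid_prosit_tokens` might be consumed, the sketch below tokenises a UNIMOD-annotated peptide, sums the residue masses plus one water, and checks for invalid tokens by substring. The tokeniser, the helper names and the substring check are assumptions made for this example, not Winnow's code. Legacy tokens such as `(+25.98)` are deliberately not handled and raise an error.

```python
import re
from typing import List, Mapping, Sequence

WATER_MASS = 18.010565  # monoisotopic mass of H2O, added once per peptide

# A subset of the values shown above.
RESIDUE_MASSES = {
    "G": 57.021464,
    "A": 71.037114,
    "M": 131.040485,
    "M[UNIMOD:35]": 147.035400,
    "[UNIMOD:1]": 42.010565,
}
INVALID_PROSIT_TOKENS = ["[UNIMOD:7]", "[UNIMOD:21]", "[UNIMOD:1]"]

# An optional residue letter followed by a bracketed modification, or a bare letter.
_TOKEN = re.compile(r"[A-Z]?\[[^\]]+\]|[A-Z]")


def tokenize(peptide: str) -> List[str]:
    """Split a peptide such as `[UNIMOD:1]GAM[UNIMOD:35]` into residue tokens."""
    tokens = _TOKEN.findall(peptide)
    if "".join(tokens) != peptide:
        raise ValueError(f"Unrecognised characters in peptide: {peptide!r}")
    return tokens


def peptide_mass(peptide: str, residue_masses: Mapping[str, float]) -> float:
    """Neutral monoisotopic mass: the sum of the residue masses plus one water."""
    return sum(residue_masses[token] for token in tokenize(peptide)) + WATER_MASS


def has_invalid_prosit_token(peptide: str, invalid_tokens: Sequence[str]) -> bool:
    """True if any invalid token occurs anywhere in the peptide string."""
    return any(token in peptide for token in invalid_tokens)


print(round(peptide_mass("GA", RESIDUE_MASSES), 6))
print(has_invalid_prosit_token("[UNIMOD:1]GA", INVALID_PROSIT_TOKENS))
```
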
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| # --- Training a calibrator --- | ||
| defaults: | ||
| - _self_ | ||
| - residues | ||
| - calibrator | ||
| - data_loader: instanovo # Options: instanovo, mztab, pointnovo, winnow | ||
| | ||
| # --- Pipeline Execution Configuration --- | ||
| | ||
| dataset: | ||
| # Dataset paths: | ||
| # Path to the spectrum data file, or to the folder containing a saved internal Winnow dataset. | ||
| spectrum_path_or_directory: data/spectra.ipc | ||
| # Path to the beam predictions file. | ||
| # Set to `null` if the data source is `winnow`; otherwise loading will fail. | ||
| predictions_path: data/predictions.csv | ||
| # NOTE: Make sure that the data loader type matches the data source type in this dataset section. | ||
| | ||
| # Output paths: | ||
| model_output_dir: models/new_model | ||
| dataset_output_path: results/calibrated_dataset.csv |
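
The `defaults` list above controls composition order: Hydra merges the listed configs in sequence and later entries take precedence, so placing `_self_` first lets the group configs override values in this file. The sketch below imitates that layering with a plain recursive dictionary merge. It illustrates the ordering rule only and is not Hydra's composition code. The `deep_merge` and `compose` helpers are hypothetical, the layer contents are abbreviated, and the last layer is invented to show an override.

```python
from typing import Any, Dict, Iterable, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which `override` wins and nested mappings merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def compose(layers: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Merge the layers in order; later layers take precedence over earlier ones."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


layers = [
    # `_self_`: the primary training config (abbreviated).
    {"dataset": {"predictions_path": "data/predictions.csv"}, "model_output_dir": "models/new_model"},
    # `residues` (abbreviated).
    {"residue_masses": {"G": 57.021464}},
    # `data_loader: instanovo`, placed under its group name.
    {"data_loader": {"_target_": "winnow.datasets.data_loaders.InstaNovoDatasetLoader"}},
    # Invented later layer, included only to show that later entries win.
    {"dataset": {"predictions_path": "data/other_predictions.csv"}},
]
composed = compose(layers)
print(composed["dataset"]["predictions_path"])
```
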