NovoRank

NovoRank is a post-processing tool designed to enhance the accuracy of de novo peptide sequencing, which directly infers peptide sequences from tandem mass spectra without relying on databases. Existing de novo sequencing tools often lead to incorrect peptide identifications due to their dependence on imperfect scoring functions. To address this issue, NovoRank utilizes spectral clustering and machine learning techniques to assign more accurate peptide sequences to spectra. NovoRank improves the output of various de novo sequencing tools and enhances recall and precision at the PSM level.

Rights and Permissions

- NovoRank © 2024 is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
  This license requires that reusers give credit to the creator. It allows reusers to distribute, 
  remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only. 
  If others modify or adapt the material, they must license the modified material under identical terms.

NovoRank

NovoRank is a machine learning/deep learning-based algorithm that post-processes de novo sequencing results.

NovoRank is implemented and tested with

Python==3.9
requirements.txt

and

DeepLC ( https://github.com/compomics/DeepLC )
MS-Cluster ( http://proteomics.ucsd.edu/software-tools/ms-clusterarchives )
CometX.exe ( in-house software, based on the Comet software, modified to calculate XCorr only )

All data used in the experiment can be downloaded from the 'sample' folder in the NovoRank GitHub repository and https://drive.google.com/drive/folders/13jir3-QLcAVGtUe84Tfp5utctnHRI-5Q?usp=share_link.

Quick start for potential reviewers

A user can download a pre-trained model and test sample data at https://drive.google.com/drive/folders/1ICmLBRBhJGdImi4aPwlQVS6pfKT-Z8sI?usp=share_link and run the commands below for a quick test:

Create and activate an Anaconda virtual environment.

conda create -n [NAME] python==3.9
conda activate [NAME]

Install all packages in the requirements.txt file with the following command.

pip install -r requirements.txt

Run run_novorank.py.

python run_novorank.py .\test\config_for_reviewer.txt

How to use NovoRank

To use NovoRank with your own datasets, you HAVE TO train your own model on those datasets.

Step 1. Preparation datasets

As an initial step, a user MUST convert their datasets to the NovoRank input format.

De novo search result

Once you have performed a de novo search using a tool such as PEAKS, pNovo3, or DeepNovo, you MUST convert the result to the form below:

Source File	Scan number	Peptide	Score
Hela_1.mgf	10	HKPSVK	85

Note that the columns are separated by commas (comma-separated values (CSV) format).
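The conversion can be sketched as follows. This is a minimal example, assuming the de novo results have already been parsed into (source file, scan number, peptide, score) tuples. The function name write_denovo_csv and the input tuples are hypothetical; only the column names come from the table above.

```python
import csv
import io

# Column names taken from the table above.
HEADER = ["Source File", "Scan number", "Peptide", "Score"]


def write_denovo_csv(rows, handle):
    """Write (source_file, scan_number, peptide, score) tuples as CSV.

    `rows` is a hypothetical, already-parsed view of a de novo tool's output.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for source_file, scan_number, peptide, score in rows:
        writer.writerow([source_file, int(scan_number), peptide, score])


# Example with the single row shown in the table above.
buffer = io.StringIO()
write_denovo_csv([("Hela_1.mgf", 10, "HKPSVK", 85)], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") as the handle.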

Database search result

NovoRank generates positive and negative labels based on the database search result for the same MS/MS spectra used in the de novo search. This file is therefore needed only for training; if a user uses a pre-trained model, it is not needed in the subsequent steps. After conducting the database search, prepare only the reliable PSMs in the format below:

Source File	Scan number	GT
Hela_1.mgf	3	KPVGAAK

Note that the columns are separated by commas (comma-separated values (CSV) format).

Note for post-translational modification notation

NovoRank assumes that all cysteines (C) carry carbamidomethylation as a fixed modification. As a variable modification, it allows only oxidation of methionine, written as a lowercase "m". For example, a user must convert the sequence AM+15.99EENGR to AmEENGR.
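A minimal sketch of this conversion is shown below. It assumes the de novo tool writes oxidized methionine as "M+15.99" (optionally with more decimals), as in the example above; other tools may use a different notation, so the pattern would need adjusting. The function name to_novorank_notation is hypothetical.

```python
import re

# Assumption: oxidized methionine appears as "M+15.99" (extra decimals allowed).
OXIDIZED_MET = re.compile(r"M\+15\.99\d*")


def to_novorank_notation(peptide: str) -> str:
    """Replace oxidized methionine with lowercase 'm'.

    Raises ValueError if any other modification markup is left over, since
    only methionine oxidation is allowed as a variable modification.
    """
    converted = OXIDIZED_MET.sub("m", peptide)
    if not converted.isalpha():
        raise ValueError(f"unsupported modification notation in {peptide!r}")
    return converted


print(to_novorank_notation("AM+15.99EENGR"))
```

Sequences that still contain non-letter characters after the substitution are rejected rather than silently passed on.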

Step 2. Initial clustering using MS-Cluster

The MS-Cluster software and user's manual are available at http://proteomics.ucsd.edu/software-tools/ms-clusterarchives/. Create a text file named list.txt that lists the full paths to the input files.
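One way to build list.txt is sketched below. It assumes the input files are MGF files in a single folder and that the list holds one absolute path per line; check the MS-Cluster manual for the exact list format. The function name write_mgf_list and the folder name in the usage note are hypothetical.

```python
from pathlib import Path


def write_mgf_list(mgf_dir, list_path="list.txt"):
    """Write the absolute paths of all *.mgf files in mgf_dir, one per line."""
    paths = sorted(p.resolve() for p in Path(mgf_dir).glob("*.mgf"))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, write_mgf_list("mgf_folder") writes list.txt in the current working directory.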

< Run MS-Cluster using the following command line. >

MSCluster.exe --list list.txt --output-name CLUSTERS --mixture-prob 0.01 --fragment-tolerance 0.02 --assign-charges

Step 3. Generation of deep learning input

Based on the results of both the de novo search and MS-Cluster, NovoRank generates the top two candidates for each spectrum. These top two candidates are the starting point for training the deep learning model.

A user can set the parameters in the 'config_for_gen_top2.txt' file.

Parameter	Value	Explanation	Mandatory
mgf_path	String	Path of a folder containing MS/MS spectra (MGF format).	Y
denovo_result_csv	String	Path of the de novo search result CSV file (see Step 1. Preparation datasets).	Y
db_result_csv	String	Path of the database search result CSV file (see Step 1. Preparation datasets).	N
cluster_result_path	String	Path of the clustering result from MS-Cluster.	Y
mgf_xcorr	String	Path of a folder containing MS/MS spectra for XCorr calculation (MGF format).	Y
mgf_remove	String	Path of a folder containing MS/MS spectra to find internal fragment ions (MGF format).	Y
precursor_search_ppm	Float	Precursor PPM tolerance.	Y
elution_time	Integer	A total elution time in the mass spectrometry assay (minutes).	Y
training	Boolean	If a user wants to train a model, set it True. Otherwise, set False (test only).	Y
features_csv	String	Path of a result feature file as output.	Y

Note that when "training" is set to "False", NovoRank ignores "db_result_csv".

python gen_feature_top2_candidates.py config_for_gen_top2.txt
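Before running the command above, it can help to check that no mandatory parameter is missing. The sketch below is a hypothetical pre-flight check, not part of NovoRank. It assumes the parameters have already been loaded into a dict (the on-disk syntax of config_for_gen_top2.txt is not described here) and that db_result_csv is required whenever training is True.

```python
# Parameter names copied from the table above.
MANDATORY = (
    "mgf_path",
    "denovo_result_csv",
    "cluster_result_path",
    "mgf_xcorr",
    "mgf_remove",
    "precursor_search_ppm",
    "elution_time",
    "training",
    "features_csv",
)


def missing_parameters(params):
    """Return the names of required parameters that are absent or empty.

    `params` is a hypothetical dict of parameter name -> value.
    """
    required = list(MANDATORY)
    # Assumption: the database search result is needed only for training.
    if str(params.get("training", "")).strip().lower() == "true":
        required.append("db_result_csv")
    return [name for name in required if params.get(name) in (None, "")]


# Example: a config that only sets the training flag.
print(missing_parameters({"training": "True"}))
```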

Step 4. XCorr calculation

NovoRank uses the XCorr value, calculated with the external CometX tool, as an additional feature.

< Calculate XCorr using the following CometX command line. >

CometX.exe -X -Pcomet.params .\mgf_XCorr\*.mgf

Step 5. The last step for training/test of NovoRank

Lastly, NovoRank takes three inputs: feature.csv and XCorr values obtained from Steps 3 and 4, respectively, as well as the MGF files.

A user can set the parameters in the 'config_run_novorank.txt' file.

Parameter	Value	Explanation	Training or Test
training	Boolean	If a user wants to train a model, set it True. Otherwise, set False (test only).	Both
mgf_path	String	Path of a folder containing MS/MS spectra (MGF format).	Both
mgf_xcorr	String	Path of the XCorr calculation TSV file.	Both
features_csv	String	Path of the output of gen_feature_top2_candidates.py.	Both
batch_size	Integer	Batch size.	Both
val_size	Float	The validation dataset ratio.	Training
epoch	Integer	Number of training epochs.	Training
model_save_name	String	Save path and h5 file name for the trained model.	Training
pre_trained_model	String	Path of the pre-trained model h5 file.	Test
result_name	String	Save path and CSV file name for the test result.	Test

"Both" means that the parameter is used in both training and test.

python run_novorank.py config_run_novorank.txt

Deep learning model for re-ranking.

The deep learning model only handles peptides with a maximum mass of 5000 Da and a length of 40 or less.
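To see which candidate peptides fall within these limits, a filter along the following lines can be used. This is a sketch, not NovoRank's own code. It assumes the limit refers to the neutral monoisotopic peptide mass, uses standard monoisotopic residue masses, and follows the Step 1 notation ("C" carbamidomethylated, "m" oxidized methionine). The function names are hypothetical.

```python
# Standard monoisotopic residue masses in Da.
# "C" includes carbamidomethylation (+57.02146); "m" is oxidized methionine.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 160.03065, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "m": 147.03540,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333,
    "W": 186.07931,
}
WATER = 18.01056     # mass of H2O added for the peptide termini
MAX_MASS = 5000.0    # Da, limit stated above
MAX_LENGTH = 40      # residues, limit stated above


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass; raises KeyError on unknown residues."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def within_model_limits(peptide: str) -> bool:
    """True if the peptide satisfies both the length and the mass limit."""
    return len(peptide) <= MAX_LENGTH and peptide_mass(peptide) <= MAX_MASS


print(within_model_limits("HKPSVK"))
```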

Testing
Using a pre-trained model, NovoRank performs testing and outputs a single assigned peptide for each spectrum.
Training
The deep learning model is trained using the hyper-parameters set in config_run_novorank.txt.
The trained model is saved in the .h5 format as the output.

