pXg: proteomics X genomics

About pXg
Usage
Tutorial
TODO
- GTF Export
Citation

About pXg

pXg (proteomics X genomics) is a software tool that enables the reliable identification of both canonical and noncanonical MHC-I-associated peptides (MAPs) from de novo peptide sequencing by utilizing RNA-Seq data.

Usage

pXg can be used with the output of any search engine, such as PEAKS and pNovo3. It was developed for the reliable identification of noncanonical MAPs from de novo peptide sequencing; however, it can also be used to count the number of reads mapped to each peptide sequence.

Input format

Input	Description	Format	Mandatory
Search result	A list of PSMs identified by a search engine (e.g. PEAKS, pNovo3, Casanovo)	TSV or CSV	Yes
Gene annotation	Must be the same file used in the read alignment (e.g. Gencode, Ensembl)	GTF	Yes
RNA-Seq reads	Mapped and unmapped RNA-Seq reads. The file must be sorted by coordinate. Multiple SAM/BAM files must be separated by commas (,)	SAM/BAM	Yes
Protein sequences	Canonical and contaminant protein sequences (e.g. UniProt)	FASTA	No

*pXg cannot read the flat-formatted output of pNovo3. The flat format must first be converted to CSV or TSV.
*Since version 2.3.0, pXg supports multiple SAM/BAM files. The "Reads" column holds the sum of reads over all SAM/BAM files, and the read count for each SAM/BAM file is appended as the last columns.
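
As a minimal sketch of the multi-file behaviour described in the note above, the following checks that the "Reads" value of one result row equals the sum of its per-file counts. It assumes the per-file columns are named after the SAM/BAM files; the file names, peptide and counts are made up for illustration.

```python
def per_file_reads(row, sam_files):
    """Return {file name: read count} taken from the per-file columns of one row."""
    return {name: int(row[name]) for name in sam_files}


def reads_are_consistent(row, sam_files):
    """True when the "Reads" column equals the sum of the per-file counts."""
    return int(row["Reads"]) == sum(per_file_reads(row, sam_files).values())


# Hypothetical row; values are strings, as they would be when read from a TSV file.
sam_files = ["a.sorted.bam", "b.sorted.bam"]
row = {
    "InferredPeptide": "AAAAAAAAK",
    "Reads": "12",
    "a.sorted.bam": "5",
    "b.sorted.bam": "7",
}
print(reads_are_consistent(row, sam_files))
```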

Output format

Output	Description	Format	Mandatory
pXg result	Main output file; contains the list of identifications in TSV format	TSV	Yes
pXg result for Percolator	Main output file; contains the list of identifications in PIN format	PIN	Yes
Unknown sequences	A list of soft-clipped and unmapped reads matching peptides	Flat	Yes
Matched reads*	Reads matching peptides that pass all filters	SAM	No

*Although the pXg result contains PSM information with corresponding RNA-Seq counts, it is not suitable for visualization.
Two output files (matched reads and peptides) are available for direct use in IGV, making visualization easier.

pXg Result

Field	Description	Value
SpecID	Identifier of a spectrum	String
GenomicID	Identifier of genomic sequence	Integer
Label	Target (1) and decoy (-1) labels	1 or -1
DeltaScore	Difference between main scores of current rank and top-rank peptides	Float
Rank	Rank of candidate peptides	Integer
GenomicLociCount	The number of genomic locations	Integer
InferredPeptide	Translated nucleotide sequence	String
GenomicLoci	Genomic location of the peptide	String
Strand	Strand of matched sequence	+ or -
ObservedLeftFlankNucleotide	Nucleotide sequence of the left flank of the peptide	String
ObservedNucleotide	Nucleotide sequence of the peptide	String
ObservedRightFlankNucleotide	Nucleotide sequence of the right flank of the peptide	String
ReferenceLeftFlankNucleotide	Reference nucleotide sequence of the left flank of the peptide	String
ReferenceNucleotide	Reference nucleotide sequence of the peptide	String
ReferenceRightFlankNucleotide	Reference nucleotide sequence of the right flank of the peptide	String
Mutations	Genomic information of mutations in the peptide	String
MutationStatus	Indication of alteration caused by the mutations	Altered or Same
TranscriptIDs	Matched transcript IDs	String
GeneIDs	Matched gene IDs	String
GeneIDCount	The number of matched gene IDs	Integer
GeneNames	Matched gene names	String
GeneNameCount	The number of matched gene names	Integer
PercentFullDistance	Proportion of start genomic loci in the longest transcripts (exons + introns)	Float
PercentExonDistance	Proportion of start genomic loci in the longest transcripts (exons)	Float
PercentCDSDistance	Proportion of start genomic loci in the longest transcripts (CDSs)	Float
FromCDSStartSite	Distance from the start site	String
FromCDSStopSite	Distance from the stop site	String
Events	Type of identified feature	String
EventCount	The number of events	Integer
FastaIDs	Matched identifiers in the given FASTA sequences	String
FastaIDCount	The number of FastaIDs	Integer
Reads	Sum of matched reads from all SAM/BAM files	Integer
MeanQScore	Mean of Phred scores	Float
IsCanonical	Canonical (true) or noncanonical (false) status	true or false
SAM/BAM file name	The number of matched reads in each SAM/BAM file	Integer
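
The fields above can be read with any TSV parser. The following is a minimal sketch, assuming the result file is tab-separated with a header row that uses the field names listed in the table; the in-memory example contains only a subset of the columns and its values are made up.

```python
import csv
import io


def load_pxg_result(handle):
    """Read a tab-separated pXg result into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def split_by_canonical(rows):
    """Split rows into (canonical, noncanonical) using the IsCanonical column."""
    canonical = [r for r in rows if r["IsCanonical"].strip().lower() == "true"]
    noncanonical = [r for r in rows if r["IsCanonical"].strip().lower() != "true"]
    return canonical, noncanonical


# Tiny made-up example with a subset of the columns; real files have all fields.
example = io.StringIO(
    "SpecID\tLabel\tInferredPeptide\tReads\tIsCanonical\n"
    "s1\t1\tPEPTIDEAK\t10\ttrue\n"
    "s2\t1\tPEPTLDEAR\t3\tfalse\n"
    "s3\t-1\tKAEDITPEP\t1\tfalse\n"
)
rows = load_pxg_result(example)
canonical, noncanonical = split_by_canonical(rows)
targets = [r for r in rows if r["Label"] == "1"]
print(len(canonical), len(noncanonical), len(targets))
```

For a real result, pass an open file handle (opened with newline="") instead of the in-memory example.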

Unknown sequences

Unknown sequences include sequence information from "unknown" events. The header line begins with ">[PEPTIDE]". Following the header line is the matched read information, which includes the sequence identifier, genomic location (if available), full sequence, and matched sequence.
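
A minimal sketch of reading this file is shown below. It assumes each header line is ">" followed directly by the peptide sequence, and it keeps the read lines unparsed because the field delimiter is not specified here; the example lines are made up.

```python
def parse_unknown_sequences(lines):
    """Group read lines under the peptide named in the preceding '>' header line."""
    records = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            records.setdefault(current, [])
        elif current is not None:
            records[current].append(line)  # matched read information, kept as-is
    return records


# Made-up content; the layout of the read lines is an assumption.
example = [
    ">PEPTIDEAK\n",
    "read_1\tchr1:100-126\tACGTACGTACGT\tACGTACG\n",
    "read_2\t-\tTTGACGTACGAA\tACGTACG\n",
    ">PEPTLDEAR\n",
    "read_3\t-\tGGGACGTTTCCC\tACGTTTC\n",
]
records = parse_unknown_sequences(example)
for peptide, reads in records.items():
    print(peptide, len(reads))
```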

Command-line interface

List of Parameters

Option	Description	Mandatory
gtf_file	GTF file path. We recommend using the same GTF file that was used for the alignment	Yes
sam_file	SAM/BAM file path. The file must be sorted by coordinate. Multiple SAM/BAM files must be separated by commas (,)	Yes
psm_file	PSM file path. The PSM file is expected to come from a proteomics search with a de novo or database search engine	Yes
file_col	File name index in the psm file	Yes
pept_col	Peptide index in the psm file	Yes
charge_col	Charge state index in the psm file	Yes
scan_col	Scan number index in the psm file	Yes
output	Base output name of pXg	Yes
sep	Specify the column separator. Possible values are csv or tsv. Default is csv	No
mode	Specify how nucleotides are translated: 3 for three-frame or 6 for six-frame translation. Default is 3	No
add_feat_cols	Specify the indices of additional feature columns to include in the PIN file. Several features can be given as a comma-separated list, e.g. 5,6,7	No
ileq	Controls whether pXg treats isoleucine (I) and leucine (L) as the same/equivalent with respect to a peptide identification. Default is true	No
lengths	Range of peptide lengths to consider, written as min-max (both inclusive), e.g. 8-13. Default is 8-15	No
fasta_file	Canonical sequence database to report conservative assignment of noncanonical PSMs	No
rank	How many candidates are considered per scan. Default is 100 (in other words, use all ranked candidates)	No
out_sam	Report matched reads in SAM format (true or false). Default is false	No
out_canonical	Report canonical peptides in the out_sam file (true or false). Default is true	No
out_noncanonical	Report noncanonical peptides in the out_sam file (true or false). Default is true	No
penalty_mutation	Penalty per mutation. Default is 1	No
penalty_AS	Penalty for alternative splicing. Default is 10	No
penalty_5UTR	Penalty for 5'-UTR. Default is 20	No
penalty_3UTR	Penalty for 3'-UTR. Default is 20	No
penalty_ncRNA	Penalty for noncoding RNA. Default is 20	No
penalty_FS	Penalty for frame shift. Default is 20	No
penalty_IR	Penalty for intron region. Default is 30	No
penalty_IGR	Penalty for intergenic region. Default is 30	No
penalty_asRNA	Penalty for antisense RNA. Default is 30	No
penalty_softclip	Penalty for softclip reads. Default is 50	No
penalty_unknown	Penalty for unmapped reads. Default is 100	No
gtf_partition_size*	Size of the genomic region processed at once. Default is 5000000	No
sam_partition_size*	Number of reads processed at once. Default is 1000000	No
threads*	The number of threads. Default is 4	No

*These parameters affect memory usage and running time. If your machine does not have enough memory, decrease these values.
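
To make the semantics of two of the options above concrete, here is a minimal sketch of how a "lengths" value (min-max, both inclusive) and the "ileq" flag can be interpreted. It only illustrates the behaviour described in the table and is not taken from the pXg source; the function names are hypothetical.

```python
def parse_lengths(spec):
    """Parse a 'min-max' string such as '8-15' into an inclusive (min, max) pair."""
    low, high = (int(part) for part in spec.split("-"))
    if low > high:
        raise ValueError(f"min is greater than max in {spec!r}")
    return low, high


def length_ok(peptide, spec="8-15"):
    """True when the peptide length lies within the inclusive range."""
    low, high = parse_lengths(spec)
    return low <= len(peptide) <= high


def il_key(peptide, ileq=True):
    """Return a comparison key; with ileq, I and L collapse to the same letter."""
    return peptide.replace("I", "L") if ileq else peptide


print(length_ok("PEPTIDEAK", "8-13"))
print(il_key("PEPTIDEAK") == il_key("PEPTLDEAK"))
```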

Basic command

java -Xmx30G -jar pXg.jar \
--gtf_file [gene annotation file path] \
--sam_file [sorted SAM/BAM file path] \
--psm_file [de novo result file path] \
--fasta_file [protein sequence fasta file path] \
--file_col [index of file name column] \
--charge_col [index of charge state column] \
--pept_col [index of peptide column] \
--score_col [index of search score column] \
--scan_col [index of scan number column] \
--output [base output file name]
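
When pXg is run from a script, the same command can be assembled programmatically. The sketch below builds the argument list from a dictionary of options; the helper name, file names and column indices are hypothetical placeholders, and only option names that appear in the command above are used.

```python
import shlex


def build_pxg_command(jar, options, java_heap="30G"):
    """Return the argument list for a pXg run; options maps option name to value."""
    cmd = ["java", f"-Xmx{java_heap}", "-jar", jar]
    for name, value in options.items():
        if isinstance(value, (list, tuple)):
            # Multiple SAM/BAM files are passed as one comma-separated value.
            value = ",".join(str(v) for v in value)
        cmd += [f"--{name}", str(value)]
    return cmd


# Placeholder inputs; replace them with your own paths and column indices.
options = {
    "gtf_file": "annotation.gtf",
    "sam_file": ["rep1.sorted.bam", "rep2.sorted.bam"],
    "psm_file": "denovo.csv",
    "file_col": 1,
    "charge_col": 2,
    "pept_col": 3,
    "score_col": 4,
    "scan_col": 5,
    "output": "result",
}
cmd = build_pxg_command("pXg.jar", options)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```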

Tutorial

This tutorial shows how to run pXg and estimate FDR from the result. It covers 1) running the STAR2 aligner with the two-pass option, 2) preparing a sorted SAM file from the alignment, 3) running pXg and 4) post-processing steps, including running Percolator, merging the pXg result with the Percolator result and estimating separated FDR. It does not cover how to run de novo peptide sequencing engines such as PEAKS, pNovo3 and Casanovo, or how to create deep-learning-based features.

RNA-Seq alignment

We recommend aligning FASTQ files using STAR2 with The Cancer Genome Atlas (TCGA) two-pass alignment option.

Sorted SAM/BAM preparation

Once you get the aligned BAM or SAM file, you MUST sort the file by chromosomal coordinates.

The SAM file can be sorted with SAMtools as follows:

samtools sort -o in.sorted.sam in.sam -@ 8

The resulting "in.sorted.sam" is used as the pXg input.
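
Before starting a long run, it can be worth confirming that the file really is coordinate-sorted. In the SAM format, the sort order is recorded in the SO tag of the @HD header line. The sketch below reads that tag from the header of a text SAM file; the helper name and the example header are illustrative. For a BAM file, the header can first be printed with "samtools view -H".

```python
def sam_sort_order(lines):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Made-up header lines, as they would appear at the top of a sorted SAM file.
header = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
]
print(sam_sort_order(header) == "coordinate")
```

With a real file, pass the open file handle, e.g. `with open("in.sorted.sam") as fh: sam_sort_order(fh)`.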

Toy example

In this tutorial, toy datasets including 1) de novo results, 2) in.sorted.sam, 3) gene annotation (GTF) and 4) protein sequence fasta file are provided in the tutorial folder so that a user can try to run the pXg pipeline.

Run pXg

Using the toy datasets, you can run the pXg pipeline with the following command:

java -Xmx2G -jar pXg.v2.0.1.jar \
--gtf_file toy.gtf \
--sam_file toy.sorted.sam \
--psm_file toy.psm.csv \
--fasta_file toy.fasta \
--output toy \
--scan_col 5 \
--file_col 2 \
--pept_col 4 \
--score_col 8 \
--charge_col 11 \
--add_feat_cols 15 \
--sep csv \
--mode 3 \
--threads 2

This may take about 2 minutes.

Note that the memory option (-Xmx) should be chosen according to the size of the SAM file. The toy example runs with "-Xmx2G"; in our experience, "-Xmx30G" is enough to handle a ~20 GB file.

Run Percolator using the pXg results

Once you have the pXg result, you can add more features, such as the spectral similarity and delta retention time described in our manuscript. Even without the additional features, it is still possible to run Percolator and estimate FDR from the pXg results.
We recommend using Percolator version >= v3.06.1 because earlier versions have an issue with printing proteinIds.
Post-processing code is also provided in the tutorial folder (post_process.ipynb).
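
To illustrate what "separated FDR" means, the sketch below computes a plain target-decoy FDR estimate independently for canonical and noncanonical PSMs. It is a generic illustration under stated assumptions, not the procedure implemented in post_process.ipynb: the dictionary keys ("score", "label", "is_canonical") are hypothetical stand-ins for a merged Percolator score, the Label column and the IsCanonical column, and the example values are made up.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as (#decoys / #targets) among PSMs scoring at or above threshold."""
    targets = sum(1 for p in psms if p["score"] >= threshold and p["label"] == 1)
    decoys = sum(1 for p in psms if p["score"] >= threshold and p["label"] == -1)
    # With no accepted targets there is nothing to report, so return 0.0.
    return decoys / targets if targets else 0.0


def separated_fdr(psms, threshold):
    """Estimate FDR independently for canonical and noncanonical PSMs."""
    groups = {"canonical": [], "noncanonical": []}
    for p in psms:
        groups["canonical" if p["is_canonical"] else "noncanonical"].append(p)
    return {name: estimate_fdr(group, threshold) for name, group in groups.items()}


# Made-up PSMs: label 1 is a target, label -1 is a decoy.
psms = [
    {"score": 0.9, "label": 1, "is_canonical": True},
    {"score": 0.8, "label": 1, "is_canonical": True},
    {"score": 0.7, "label": 1, "is_canonical": True},
    {"score": 0.2, "label": -1, "is_canonical": True},
    {"score": 0.9, "label": 1, "is_canonical": False},
    {"score": 0.6, "label": 1, "is_canonical": False},
    {"score": 0.7, "label": -1, "is_canonical": False},
    {"score": 0.1, "label": -1, "is_canonical": False},
]
print(separated_fdr(psms, threshold=0.5))
```

Estimating the two classes separately keeps the many confident canonical identifications from masking a higher error rate among the noncanonical ones.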

IGV viewer

When pXg finishes identifying peptides, the resulting GTF and SAM files can be loaded directly into the IGV viewer.

TODO

GTF Export

Export the pXg result in GTF format.

Citation

pXg: Comprehensive Identification of Noncanonical MHC-I–Associated Peptides From De Novo Peptide Sequencing Using RNA-Seq Reads. Seunghyuk Choi and Eunok Paek, Molecular & Cellular Proteomics 2024.
