pXg (proteomics X genomics), a software tool that enables the reliable identification of both canonical and noncanonical MHC-I-associated peptides (MAPs) from de novo peptide sequencing by utilizing RNA-Seq data.
pXg can be integrated with any search engines such as PEAKS and pNovo3. It was developed for the reliable identification of noncanonical MAPs from de novo peptide sequencing; however, it can also be used to capture the number of reads mapped to each peptide sequence.
Input | Description | Format | Mandatory |
---|---|---|---|
Searh result | A list of PSMs identified from a search engine (e.g. PEAKS, pNovo3, Casanovo) | TSV or CSV | Yes |
Gene annotation | It must be the same file used in the read alignment (e.g. Gencode, Ensembl) | GTF | Yes |
RNA-Seq reads | Mapped and unmapped RNA-Seq reads. The file must be sorted by coordinates. Multiple SAM/BAM files should be separated by comma (,) | SAM/BAM | Yes |
Protein sequences | Canonical and contaminant protein sequences (e.g. UniProt) | Fasta | No |
*pXg is not applicable to the flat formatted output in pNovo3. A user must convert the flat format to CSV or TSV.
*Since version 2.3.0, pXg can support multiple SAM/BAM files. "Reads" column indicates sum of reads from multiple SAM/BAM files. Reads in each SAM/BAM file is appended to the last columns.
Output | Description | Format | Mandatory |
---|---|---|---|
pXg result | This is a main output file and contains a list of identification as TSV format | TSV | Yes |
pXg result for Percolator | This is a main output file and contains a list of identification as PIN format | PIN | Yes |
Unknown sequences | A list of softclip and unmapped reads matching to peptides | Flat | Yes |
Matched reads* | Matched reads to peptides passing all filters | SAM | No |
*Although the pXg result contains PSM information with corresponding RNA-Seq counts, it is not suitable for visualization.
Two output files (matched reads and peptides) are available for direct use in IGV, making visualization easier.
Field | Description | Value |
---|---|---|
SpecID | Identifier of a spectrum | String |
GenomicID | Identifier of genomic sequence | Integer |
Label | Target (1) and decoy (-1) labels | 1 or -1 |
DeltaScore | Difference between main scores of current rank and top-rank peptides | Float |
Rank | Rank of candidate peptides | Integer |
GenomicLociCount | The number of genomic locations | Integer |
InferredPeptide | Translated nucleotide sequence | String |
GenomicLoci | Genomic location of the peptide | String |
Strand | Strand of matched sequence | + or - |
ObservedLeftFlankNucleotide | Nucleotide sequence of the left flank of the peptide | String |
ObservedNucleotide | Nucleotide sequence of the peptide | String |
ObservedRightFlankNucleotide | Nucleotide sequence of the right flank of the peptide | String |
ReferenceLeftFlankNucleotide | Reference nucleotide sequence of the left flank of the peptide | String |
ReferenceNucleotide | Reference nucleotide sequence of the peptide | String |
ReferenceRightFlankNucleotide | Reference nucleotide sequence of the right flank of the peptide | String |
Mutations | Genomic information of mutations in the peptide | String |
MutationStatus | Indication of alteration caused by the mutations | Altered or Same |
TranscriptIDs | Matched transcript IDs | String |
GeneIDs | Matched gene IDs | String |
GeneIDCount | The number of matched gene IDs | Integer |
GeneNames | Matched gene names | String |
GeneNameCount | The number of matched gene names | Integer |
PercentFullDistance | Proportion of start genomic loci in the longest transcripts (exons + introns) | Float |
PercentExonDistance | Proportion of start genomic loci in the longest transcripts (exons) | Float |
PercentCDSDistance | Proportion of start genomic loci in the longest transcripts (CDSs) | Float |
FromCDSStartSite | Distance from the start site | String |
FromCDSStopSite | Distance from the stop site | String |
Events | Type of identified feature | String |
EventCount | The number of events | Integer |
FastaIDs | Matched identifiers in a given fasta sequences | String |
FastaIDCount | The number of FastaIDs | Integer |
Reads | Sum of matched reads from all SAM/BAM files | Integer |
MeanQScore | Mean of Phred scores | Float |
IsCanonical | Canonical (true) or nocanonical (false) status | true or false |
SAM/BAM file name | The number of matched reads in each SAM/BAM file | Integer |
Unknown sequences include sequence information from "unknown" events. The header line begins with ">[PEPTIDE]". Following the header line is the matched read information, which includes the sequence identifier, genomic location (if available), full sequence, and matched sequence.
Option | Description | Mandatory |
---|---|---|
gtf_file | GTF file path. We recommand to use the same gtf corresponding to alignment | Yes |
sam_file | SAM/BAM file path. The file must be sorted by coordinate. Multiple SAM/BAM files should be separated by comma (,) | Yes |
psm_file | PSM file path. It is expected that the psm file is derived from proteomics search by de novo or database search engine | Yes |
file_col | File name index in the psm file | Yes |
pept_col | Peptide index in the psm file | Yes |
charge_col | Charge state index in the psm file | Yes |
scan_col | Scan number index in the psm file | Yes |
output | Base output name of pXg | Yes |
sep | Specify the column separator. Possible values are csv or tsv. Default is csv | No |
mode | Specify the method of translation nucleotides. 3 for three-frame and 6 for six-frame. Default is 3 | No |
add_feat_cols | Specify the indices for additional features to generate PIN file. Several features can be added by comma separator. ex> 5,6,7 | No |
ileq | Controls whether pXg treats isoleucine (I) and leucine (L) as the same/equivalent with respect to a peptide identification. Default is true | No |
lengths | Range of peptide length to consider. Default is 8-15. You can write in this way (min-max, both inclusive) : 8-13 | No |
fasta_file | Canonical sequence database to report conservative assignment of noncanonical PSMs | No |
rank | How many candidates will be considered per a scan. Default is 100 (in other words, use all ranked candidates) | No |
out_sam | Report matched reads as SAM format (true or false). Default is false | No |
out_canonical | Report caonical peptides in the out_sam file (true or false). Default is true | No |
out_noncanonical | Report noncaonical peptides in the out_sam file (true or false). Default is true | No |
penalty_mutation | Penalty per a mutation. Default is 1 | No |
penalty_AS | Penalty for alternative splicing. Default is 10 | No |
penalty_5UTR | Penalty for 5`-UTR. Default is 20 | No |
penalty_3UTR | Penalty for 3`-UTR. Default is 20 | No |
penalty_ncRNA | Penalty for noncoding RNA. Default is 20 | No |
penalty_FS | Penalty for frame shift. Default is 20 | No |
penalty_IR | Penalty for intron region. Default is 30 | No |
penalty_IGR | Penalty for intergenic region. Default is 30 | No |
penalty_asRNA | Penalty for antisense RNA. Default is 30 | No |
penalty_softclip | Penalty for softclip reads. Default is 50 | No |
penalty_unknown | Penalty for unmapped reads. Default is 100 | No |
gtf_partition_size* | The size of treating genomic region at once. Default is 5000000 | No |
sam_partition_size* | The size of treating number of reads at once. Default is 1000000 | No |
threads* | The number of threads. Default is 4 | No |
*size parameters can effect memory usage and time. If your machine does not have enough memory, then decrease those values.
java -Xmx30G -jar pXg.jar \
--gtf_file [gene annotation file path] \
--sam_file [sorted SAM/BAM file path] \
--psm_file [de novo result file path] \
--fasta_file [protein sequence fasta file paht] \
--file_col [index of file name column] \
--charge_col [index of chage state column] \
--pept_col [index of peptide column] \
--score_col [index of search score column] \
--scan_col [index of scan number column] \
--output [base output file name]
This tutorial aims to understand how to run pXg and estimate FDR from the result. It contains 1) running STAR2 aligner with 2-pass parameter, 2) preparing SAM file from the alignment, 3) running pXg and 4) several post-processing including Percolator, merging pXg result with the result of Percolator and estimating separated FDR. Note that it neither contains how to run de novo peptide sequencing engines such as PEAKS, pNovo3 and Casanovo AND how to create deep learning based features.
We recommand to align fastq files using STAR2 with The Cancer Genome Atlas (TCGA) two-pass alignment option.
Once you get the aligned BAM or SAM file, you MUST sort the file by chromosomal coordinates.
We provide a code for preprocessing SAM file using SAMtools below:
samtools sort -o in.sorted.sam in.sam -@ 8
The "in.sorted.sam" is used for pXg input.
In this tutorial, toy datasets including 1) de novo results, 2) in.sorted.sam, 3) gene annotation (GTF) and 4) protein sequence fasta file are provided in the tutorial folder so that a user can try to run the pXg pipeline.
Using the toy datasets, you can run the pXg pipline using following command:
java -Xmx2G -jar pXg.v2.0.1.jar \
--gtf_file toy.gtf \
--sam_file toy.sorted.sam \
--psm_file toy.psm.csv \
--fasta_file toy.fasta \
--output toy \
--scan_col 5 \
--file_col 2 \
--pept_col 4 \
--score_col 8 \
--charge_col 11 \
--add_feat_cols 15 \
--sep csv \
--mode 3 \
--threads 2
This may take about 2 mins.
Note that the memory option "-Xmx50G" depends on the size of SAM file. In our experience, "-Xmx30G" is enough to deal with ~20G file.
Once you get the pXg result, you can add more features such as spectral similarity and delta retention time described in our manuscript. Without the additional features, still it is possible to run Percolator and estimate FDR from the pXg results.
We recommand to use Percolator version >= v3.06.1 because former versions have an issue to print proteinIds.
Post processing codes are also provided in the tutorial folder (post_process.ipynb).
- Export GTF format from pXg result.