Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with merging MNVs within the same codon #8

Open
looxon93 opened this issue Oct 30, 2022 · 7 comments
Open

Issue with merging MNVs within the same codon #8

looxon93 opened this issue Oct 30, 2022 · 7 comments

Comments

@looxon93
Copy link

Hi all,

I've used this merge_mnp.py script previously with great success, but now, I encountered an issue on certain positions, I found this:

grep "chr19:54574902"  codons.chr19.hg38.txt
chr19:54574902  NM_001290271-Codon639
chr19:54574902  NM_006866-Codon639
chr19:54574902  NM_001130917-Codon371
chr19:54574902  NM_001290270-Codon358

grep "chr19:54574903"  codons.chr19.hg38.txt
chr19:54574903  NM_001290271-Codon639
chr19:54574903  NM_006866-Codon639
chr19:54574903  NM_001130917-Codon372
chr19:54574903  NM_001290270-Codon358

In this part of code:

with open(codon_file, 'rt') as fp:
    for line in fp:
        line_contents = line.rstrip().split('\t')
        self.codon_dict[line_contents[0]] = line_contents[1]

This key will be overwritten, since one position contains multiple codons, and for transcript: NM_001130917 they shouldn't be merged, for all other they should be merged.
This is a subset of my VCF:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr19 54574871 . C T 101.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.671;DP=25;ExcessHet=0;FS=5.566;MLEAC=1;MLEAF=0.5;MQ=55.81;MQRankSum=-3.886;QD=4.84;ReadPosRankSum=0.289;SOR=2.303 GT:AD:DP:GQ:PL:PS 0|1:16,5:21:99:109,0,483:54240560
chr19 54574902 . G C 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.434;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574903 . G A 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.535;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574907 . T C 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.568;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560

Thanks a lot for your help!
Luka

@looxon93
Copy link
Author

looxon93 commented Oct 30, 2022

Also, one thing I would add to the script is possibility to read compressed codons files:

import os
import gzip

class merge_by_codon:

    codon_dict = {}
    ignore_INDELs = False

    def __init__(self, codon_file, ignore_INDELs=0):
        self.ignore_INDELs = bool(int(ignore_INDELs))
        # load the codon file and create a dictionary of all loci and the codon they belong to
        # the codon file is a tab separated list of locus\tCodonID
        codons_name = os.path.basename(codon_file)
        if codons_name.endswith('.gz'):
            with gzip.open(codon_file,  'rt') as fp:
                for line in fp:
                    line_contents = line.rstrip().split('\t')
                    self.codon_dict[line_contents[0]] = line_contents[1]
        else:
            with open(codon_file) as fp:
                for line in fp:
                    line_contents = line.rstrip().split('\t')
                    self.codon_dict[line_contents[0]] = line_contents[1]

@DonFreed
Copy link
Contributor

DonFreed commented Nov 1, 2022

Hi Luka,

Glad this has been working well for you!

A difficulty with this request is that merge_by_codon is not setup to handle overlapping codons. What output would you expect to be produced at 54574902-54574903 for your example case?

If we decide to merge by NM_001130917, the variants at 54574902 and 54574903 would not be merged and the tool would output two separate records:

chr19 54574902 . G C 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.434;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574903 . G A 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.535;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560

With any of the other transcripts, the tools outputs an MNP and annotates the original records with a MERGED FILTER field:

chr19 54574902 . GG CA 331.64 PASS ...
chr19 54574902 . G C 331.64 MERGED AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.434;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574903 . G A 331.64 MERGED AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.535;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560

We could potentially output all three records with a FILTER or INFO field describing the codon used for the record. Would that be useful?

-Don

@DonFreed
Copy link
Contributor

DonFreed commented Nov 1, 2022

Happy to consider compressed codon files as well. I'll see if I can test your changes soon.

@looxon93
Copy link
Author

looxon93 commented Nov 1, 2022

Hi @DonFreed , thanks for looking into this. I know that these scripts aren't main tools of Sentieon toolkit, but I found them very useful.

I am also investigating possibility to avoid loading whole codons file to a memory as the latest transcript file crate codons file around 250 GB. I am trying to subset it on the fly to mutations of interest, will let you know if I found something which can improve runtime. I tested it on 10 mutations only and it lasted 1.5h.

@DonFreed
Copy link
Contributor

DonFreed commented Nov 3, 2022

Great. Please let us know if you are able to find any runtime improvements.

Do you have any thoughts on the type of output you would like to see? The current merge_by_codon code cannot handle overlapping transcripts, and it's not clear how to represent both merged and unmerged variants in the same VCF.

@looxon93
Copy link
Author

looxon93 commented Nov 4, 2022

Actually, only idea I have that in these cases we keep filter of merged mutations to: PASS instead of MERGED. That way, user who annotates these VCFs later will have both separate mutations and MNV with their effect. I don't have any other ideas tbh.

@DonFreed
Copy link
Contributor

DonFreed commented Nov 4, 2022

One solution would be to output both merged and unmerged variants as PASS and annotate the corresponding transcript(s) in an INFO field. This is probably something that we can do, but it may take awhile before I get around to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants