Issue with merging MNVs within the same codon #8

looxon93 · 2022-10-30T14:01:10Z

Hi all,

I've used this merge_mnp.py script previously with great success, but now, I encountered an issue on certain positions, I found this:

grep "chr19:54574902"  codons.chr19.hg38.txt
chr19:54574902  NM_001290271-Codon639
chr19:54574902  NM_006866-Codon639
chr19:54574902  NM_001130917-Codon371
chr19:54574902  NM_001290270-Codon358

grep "chr19:54574903"  codons.chr19.hg38.txt
chr19:54574903  NM_001290271-Codon639
chr19:54574903  NM_006866-Codon639
chr19:54574903  NM_001130917-Codon372
chr19:54574903  NM_001290270-Codon358

In this part of code:

with open(codon_file, 'rt') as fp:
    for line in fp:
        line_contents = line.rstrip().split('\t')
        self.codon_dict[line_contents[0]] = line_contents[1]

This key will be overwritten, since one position contains multiple codons, and for transcript: NM_001130917 they shouldn't be merged, for all other they should be merged.
This is a subset of my VCF:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr19 54574871 . C T 101.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.671;DP=25;ExcessHet=0;FS=5.566;MLEAC=1;MLEAF=0.5;MQ=55.81;MQRankSum=-3.886;QD=4.84;ReadPosRankSum=0.289;SOR=2.303 GT:AD:DP:GQ:PL:PS 0|1:16,5:21:99:109,0,483:54240560
chr19 54574902 . G C 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.434;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574903 . G A 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.535;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574907 . T C 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.568;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560

Thanks a lot for your help!
Luka

The text was updated successfully, but these errors were encountered:

looxon93 · 2022-10-30T14:02:19Z

Also, one thing I would add to the script is possibility to read compressed codons files:

import os
import gzip

class merge_by_codon:

    codon_dict = {}
    ignore_INDELs = False

    def __init__(self, codon_file, ignore_INDELs=0):
        self.ignore_INDELs = bool(int(ignore_INDELs))
        # load the codon file and create a dictionary of all loci and the codon they belong to
        # the codon file is a tab separated list of locus\tCodonID
        codons_name = os.path.basename(codon_file)
        if codons_name.endswith('.gz'):
            with gzip.open(codon_file,  'rt') as fp:
                for line in fp:
                    line_contents = line.rstrip().split('\t')
                    self.codon_dict[line_contents[0]] = line_contents[1]
        else:
            with open(codon_file) as fp:
                for line in fp:
                    line_contents = line.rstrip().split('\t')
                    self.codon_dict[line_contents[0]] = line_contents[1]

DonFreed · 2022-11-01T21:56:02Z

Hi Luka,

Glad this has been working well for you!

A difficulty with this request is that merge_by_codon is not setup to handle overlapping codons. What output would you expect to be produced at 54574902-54574903 for your example case?

If we decide to merge by NM_001130917, the variants at 54574902 and 54574903 would not be merged and the tool would output two separate records:

chr19 54574902 . G C 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.434;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574903 . G A 331.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.535;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560

With any of the other transcripts, the tools outputs an MNP and annotates the original records with a MERGED FILTER field:

chr19 54574902 . GG CA 331.64 PASS ...
chr19 54574902 . G C 331.64 MERGED AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.434;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560
chr19 54574903 . G A 331.64 MERGED AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=22;ExcessHet=0;FS=4.2;MLEAC=1;MLEAF=0.5;MQ=55.33;MQRankSum=-3.277;QD=15.07;ReadPosRankSum=0.535;SOR=1.863 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:13,9:22:99:0|1:54574902_G_C:339,0,519:54240560

We could potentially output all three records with a FILTER or INFO field describing the codon used for the record. Would that be useful?

-Don

DonFreed · 2022-11-01T21:57:36Z

Happy to consider compressed codon files as well. I'll see if I can test your changes soon.

looxon93 · 2022-11-01T22:13:02Z

Hi @DonFreed , thanks for looking into this. I know that these scripts aren't main tools of Sentieon toolkit, but I found them very useful.

I am also investigating possibility to avoid loading whole codons file to a memory as the latest transcript file crate codons file around 250 GB. I am trying to subset it on the fly to mutations of interest, will let you know if I found something which can improve runtime. I tested it on 10 mutations only and it lasted 1.5h.

DonFreed · 2022-11-03T19:17:26Z

Great. Please let us know if you are able to find any runtime improvements.

Do you have any thoughts on the type of output you would like to see? The current merge_by_codon code cannot handle overlapping transcripts, and it's not clear how to represent both merged and unmerged variants in the same VCF.

looxon93 · 2022-11-04T16:23:00Z

Actually, only idea I have that in these cases we keep filter of merged mutations to: PASS instead of MERGED. That way, user who annotates these VCFs later will have both separate mutations and MNV with their effect. I don't have any other ideas tbh.

DonFreed · 2022-11-04T16:29:56Z

One solution would be to output both merged and unmerged variants as PASS and annotate the corresponding transcript(s) in an INFO field. This is probably something that we can do, but it may take awhile before I get around to it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with merging MNVs within the same codon #8

Issue with merging MNVs within the same codon #8

looxon93 commented Oct 30, 2022

looxon93 commented Oct 30, 2022 •

edited

Loading

DonFreed commented Nov 1, 2022

DonFreed commented Nov 1, 2022

looxon93 commented Nov 1, 2022

DonFreed commented Nov 3, 2022

looxon93 commented Nov 4, 2022

DonFreed commented Nov 4, 2022

Issue with merging MNVs within the same codon #8

Issue with merging MNVs within the same codon #8

Comments

looxon93 commented Oct 30, 2022

looxon93 commented Oct 30, 2022 • edited Loading

DonFreed commented Nov 1, 2022

DonFreed commented Nov 1, 2022

looxon93 commented Nov 1, 2022

DonFreed commented Nov 3, 2022

looxon93 commented Nov 4, 2022

DonFreed commented Nov 4, 2022

looxon93 commented Oct 30, 2022 •

edited

Loading