Stop codons in the final assembly #106
Comments
Hi Junchen, Yes, it's also my understanding that Exonerate can't detect incomplete introns at the beginning or end of target contigs, as it's searching for 5' and 3' splice sites (I'm not sure if the intron model is more sophisticated than that, and searches for other intron motifs as well). Can you upload the target file query sequence for Cheers, Chris
Hi Chris, Thanks for the reply! Here are the files you asked for. Cheers,
Hi Junchen, Thanks for those files. The With the first example you provided above ( I downloaded the corresponding region of the annotated genome (https://www.ncbi.nlm.nih.gov/nuccore/NC_052509.1?report=genbank&from=14296457&to=14358256&strand=true), and aligned it with the 'good' region of the Exonerate hit for As mentioned, HybPiper 2 runs Exonerate with the flag
I'll have to have a think about the best way to deal with this issue. An uncomplicated approach might be to add a few flags to I'll wait to receive your other Cheers, Chris
Hi Chris, Thanks for all the investigations! As you can see, we are working with planthoppers (Hemiptera: Fulgoromorpha). Nilaparvata lugens is one of the references we used. In the beginning, we tried to extract single-copy BUSCO genes directly from the metagenomic assembly by running BUSCO with the database Therefore, we turned to HybPiper. This is how we built our target file: 1) we downloaded the complete genomes of a few close references of planthoppers and ran BUSCO on them with the database 'hemiptera_odb10' to extract single-copy genes. Here are some of the references we used:
Then, 2) we combined them with It's very interesting to see that It's also interesting to see that the putative intron region was trimmed after removing I would really like to hear more of your thoughts. We are working on phylogenomics. Although a few untrimmed introns in our final alignment may not affect the final phylogeny, they still create erroneous alignments for a few samples in these regions of many genes. It would be nice if we could find a solution to this so that we have a clean alignment to work with at the end. Sincerely,
Hi Junchen, I realised I made a mistake when testing the Exonerate So, I'm in the process of adding an additional filter to HybPiper that processes the Exonerate alignments and trims hit boundaries as necessary. Hopefully it'll deal with your issue. It should be ready later this week, but I'll let you know when I've pushed the update. Cheers, Chris
Hi Chris, Thanks very much! Cheers,
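For readers following the thread, the idea of trimming hit boundaries can be illustrated with a minimal sketch. This is not HybPiper's filter: the function names `column_matches` and `trim_poor_ends`, the window size, and the match threshold are all assumptions made for illustration. It takes a pairwise alignment, marks each column as a match or not, and keeps only the span between the first and last windows that are mostly matches.

```python
def column_matches(query_aln, hit_aln):
    """Per-column match flags for two equal-length aligned strings ('-' = gap)."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned strings must have the same length")
    return [q == h and q != "-" for q, h in zip(query_aln, hit_aln)]


def trim_poor_ends(matches, window=5, min_matches=4):
    """Return (start, end) slice bounds that drop poorly aligned ends.

    A window is 'good' when at least `min_matches` of its `window`
    columns match. Both values are illustrative assumptions.
    """
    n = len(matches)
    if n < window:
        return (0, n)  # too short to judge; keep everything
    good = [i for i in range(n - window + 1)
            if sum(matches[i:i + window]) >= min_matches]
    if not good:
        return (0, 0)  # nothing aligns well; drop the whole hit
    first, last = good[0], good[-1]
    # Tighten the bounds to the first and last matching columns.
    start = first + matches[first:first + window].index(True)
    last_window = matches[last:last + window]
    end = last + window - last_window[::-1].index(True)
    return (start, end)


flags = column_matches("XXXXMKTAYIAKQRZZZ", "----MKTAYIAKQR---")
start, end = trim_poor_ends(flags)
print(start, end)
```

A real filter would more likely score columns with a substitution matrix rather than exact identity, since amino-acid targets from related species rarely match exactly.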
Hi Junchen, I've just pushed HybPiper version 2.1.2 and updated the conda packages - see the changelog here. It includes a new filter for Exonerate alignments. The filter doesn't aim to remove stop codons explicitly, but rather it should trim any poorly aligned 5' and 3' ends from an Exonerate alignment before the corresponding SPAdes contig hit sequence is potentially incorporated into an output In addition, HybPiper version 2.1.2 now runs a check for any 'internal' (i.e., non-terminal) stop codons in all final output I suspect this is something that'll need to be further developed for future versions of HybPiper, but hopefully the current fix deals with some of the issues you saw. Let me know how you go, and if you have any ideas about how this might be done better, please don't hesitate to let me know! Cheers, Chris
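A check for internal stop codons of the kind described above can be sketched independently of HybPiper. The snippet below is only an illustration, not HybPiper's code: the function name `internal_stop_codons` is hypothetical, and it assumes the standard genetic code and a sequence that is already in the given reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def internal_stop_codons(seq, frame=0):
    """Return 0-based nucleotide offsets of non-terminal stop codons.

    Only complete codons are considered, and the final complete codon
    is skipped because a stop there is terminal rather than internal.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [frame + 3 * idx
            for idx, codon in enumerate(codons[:-1])
            if codon in STOP_CODONS]


print(internal_stop_codons("ATGTAAGGGTGA"))
```

Running this on each sequence in an output FASTA file and flagging any non-empty result would give a per-gene list of sequences worth inspecting by eye.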
Hi Chris, Thanks for the updates! I did some investigations on one sample and here are a few things that I am curious about:
Here is an example from the sample OECSP2:
The problematic contig is the following and was taken by the previous version as the final output: But the correct contig also exists and was taken by the new version after trimming the potential intron region:
Here is an example from the same sample: I am not sure how these internal incomplete introns could be detected. One way could be for the script to detect regions with a reduced alignment score and then decide whether each region is an incomplete intron based on its size or the sequences at its two ends (to my knowledge, introns are often characterized by 'GT---------AG'). Thanks again for the quick updates. I am looking forward to hearing more thoughts from you! If you want to look at this sample (OECSP2), here is the contig file. Cheers,
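The suggestion above can be made concrete with a small sketch. It is not part of HybPiper or Exonerate: the function name `candidate_introns` and the `min_len` default are hypothetical, and the low-scoring region is assumed to have been located already. Within that region it lists every span that starts with the canonical GT donor dinucleotide and ends with the AG acceptor dinucleotide.

```python
def candidate_introns(seq, region_start, region_end, min_len=20):
    """List (start, end) spans inside seq[region_start:region_end] that
    begin with 'GT' and end with 'AG' and are at least `min_len` long.

    Coordinates are 0-based and end-exclusive, relative to `seq`.
    """
    region = seq.upper()[region_start:region_end]
    donors = [i for i in range(len(region) - 1) if region[i:i + 2] == "GT"]
    # Store the acceptor as the position just past the 'AG'.
    acceptors = [i + 2 for i in range(len(region) - 1)
                 if region[i:i + 2] == "AG"]
    return [(region_start + d, region_start + a)
            for d in donors for a in acceptors
            if a - d >= min_len]


seq = "CCC" + "GT" + "A" * 20 + "AG" + "CCC"
print(candidate_introns(seq, 0, len(seq)))
```

GT and AG dinucleotides are very common by chance, so this produces many false candidates on real data, and an intron truncated at a contig end carries only one of the two motifs. Any practical use would need to combine it with the alignment-score and size criteria mentioned in the comment.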
Hello,
Recently we have been using HybPiper to assemble thousands of BUSCO single-copy genes from metagenomic reads (2x150bp). The target file contains amino-acid sequences, and we set the coverage cutoff for SPAdes to 4 since the sequencing depth is low.
We noticed that there are stop codons in many assembled markers, which is not ideal considering that we used amino-acid targets. We then looked for stop codons in the output file exonerate_results.fasta. Here are some examples:


It seems that many of them appear at the beginning or the end of an alignment, although some of them follow right after an annotated intron. The wiki (https://github.com/mossmatters/HybPiper/wiki/Introns) says:
My understanding is that exonerate cannot detect incomplete introns at the beginning or the end of a contig. Am I correct? Do you think this could be the reason why we get stop codons in the final assembly?
If this is the issue, do you have any suggestions on how to remove these intron leftovers? Thanks!
We are going to use these markers for phylogenomics. But the region around the stop codon often does not align well with the rest of the sequences.
Sincerely,
Junchen