Taxonomic assignment discrepancy #81

Open
ashwinssudarshan opened this issue May 28, 2021 · 1 comment

Comments

@ashwinssudarshan

Hello! I was using this package to generate OTU assignments and to understand the communities in a set of metagenomic raw reads that I have. While converting the table to a wide format suitable for phyloseq objects, I noticed that sequences within the same OTU had different levels of taxonomic assignment (some up to order level and others up to genus level, for example).

This conversion to a wide format was done through a custom R script and caused some problems, especially while building the taxonomy table, since there were varying taxonomic assignments for the same OTU. This issue didn't occur when I used the function in the package to get the wide format OTU table.

My questions are:

  1. Why is this discrepancy occurring in the first place?

  2. When this is resolved through the functions in the package, does the code retain the taxonomic classification only up to the point where all the assignments agree? Is that how the package resolves this issue?

Thanks in advance!

@wwood
Owner

wwood commented May 28, 2021

Hello.

I think you have guessed right: the taxonomic assignment of each OTU is a summary of the taxonomic assignments of all reads that go into that OTU. So if an OTU has 2 sequences, one from the phylum Actinobacteria and the other from Acidobacteria, then the OTU gets assigned to Bacteria only. But if you then ran another sample in which the OTU contained only 1 sequence, assigned to Actinobacteria, then the OTU from that second sample would be assigned Bacteria; Actinobacteria.
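To make the example above concrete, here is a minimal sketch of that kind of summary, assuming it behaves like a longest shared prefix over the read lineages. This is not SingleM's actual code: the function name `consensus_taxonomy` and the `"; "` separator are assumptions for illustration only.

```python
from typing import List


def consensus_taxonomy(lineages: List[str], sep: str = "; ") -> str:
    """Return the longest prefix of ranks shared by every lineage string.

    Hypothetical helper, not part of SingleM.
    """
    if not lineages:
        return ""
    split = [lineage.split(sep) for lineage in lineages]
    shared = []
    # zip stops at the shortest lineage, so a read classified only to a
    # shallow rank limits the depth of the consensus.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return sep.join(shared)


# First sample: two reads in the OTU that disagree at phylum level.
sample_1 = ["Bacteria; Actinobacteria", "Bacteria; Acidobacteria"]
# Second sample: a single read in the same OTU.
sample_2 = ["Bacteria; Actinobacteria"]

print(consensus_taxonomy(sample_1))  # Bacteria
print(consensus_taxonomy(sample_2))  # Bacteria; Actinobacteria
```

The same OTU sequence therefore ends up with a different depth of assignment in each sample, which is the discrepancy described in the question.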

There is some pesky information loss when you summarise a pre-existing OTU table: the taxonomic classifications of the reads that went into each OTU are not known (they are not in the OTU table), so the summary may make a different decision than it would if the full information were available.
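For a custom long-to-wide conversion, one way to get a single taxonomy per OTU is to collapse the per-sample assignments to their shared prefix while pivoting. The sketch below is only an illustration: the row keys (`sample`, `sequence`, `num_hits`, `taxonomy`), the `"; "` separator, and the function names are assumptions rather than a confirmed description of the package's output or its wide-format function. Because it sees only the per-sample assignments already in the table, it is subject to the information loss described above.

```python
from collections import defaultdict


def common_prefix(lineages, sep="; "):
    """Longest prefix of ranks shared by all lineage strings (hypothetical helper)."""
    shared = []
    for ranks in zip(*(lineage.split(sep) for lineage in lineages)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return sep.join(shared)


def to_wide(rows):
    """Pivot long-format OTU rows into per-sample counts plus one taxonomy per OTU.

    `rows` is a list of dicts with the assumed keys 'sample', 'sequence',
    'num_hits' and 'taxonomy'.
    """
    samples = sorted({row["sample"] for row in rows})
    counts = defaultdict(lambda: dict.fromkeys(samples, 0))
    lineages = defaultdict(list)
    for row in rows:
        counts[row["sequence"]][row["sample"]] += row["num_hits"]
        lineages[row["sequence"]].append(row["taxonomy"])
    # One taxonomy string per OTU sequence: the part all samples agree on.
    taxonomy = {seq: common_prefix(tax) for seq, tax in lineages.items()}
    return dict(counts), taxonomy


# Made-up rows for illustration; "ACGT" stands in for a real OTU sequence.
rows = [
    {"sample": "s1", "sequence": "ACGT", "num_hits": 2, "taxonomy": "Bacteria"},
    {"sample": "s2", "sequence": "ACGT", "num_hits": 1,
     "taxonomy": "Bacteria; Actinobacteria"},
]
counts, taxonomy = to_wide(rows)
print(counts["ACGT"])    # {'s1': 2, 's2': 1}
print(taxonomy["ACGT"])  # Bacteria
```

Taking the shared prefix is the conservative choice; keeping the most specific assignment instead would give deeper taxonomy at the risk of over-claiming.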

In the bigger picture there are 2 things here:

  • The SingleM taxonomy is out of date. You can get a package set based on GTDB R95, as discussed in "A version update to align with the latest GTDB taxonomy? #40 (comment)". This package set will be officially included in the next full release.
  • Taxonomic assignment is based on the protein sequence, whereas the nucleotide sequence might provide better resolution. In the future we will explore that, e.g. by trusting close OTU matches to GTDB genomes' sequences.

HTH,
ben
