The LCA tool uses bitscore, identity percentage and the LCA script itself in the following order:
Step 1, determine the top percentage based on bitscore:
Of all the blast hits a sub-selection will be made of the top hits based on a percentage of the bitscore. If the top percentage is set to 8% the top will be calculated like 0.92*bitscore best hit=top threshold. In this example, all hits with a bitscore above 294.4 will continue to the next step.
Query | Subject | Identity percentage | Coverage | bitscore |
---|---|---|---|---|
Otu1 | hit1 | 93 | 95 | 320 |
Otu1 | hit2 | 93 | 95 | 320 |
Otu1 | hit3 | 91 | 94 | 310 |
Otu1 | hit4 | 90 | 95 | 301 |
Otu1 | hit5 | 89 | 94 | 300 |
Otu1 | hit6 | 85 | 95 | 290 |
Otu1 | hit7 | 84 | 95 | 290 |
Otu1 | hit8 | 81 | 90 | 281 |
Otu1 | hit9 | 80 | 89 | 275 |
Otu1 | hit10 | 79 | 88 | 274 |
Step 2, filter on threshold:
From the top hits there will be filtered on identity, coverage and bitscore. Let's choose a threshold of 90 identity, 90 coverage and 250 bitscore. All hits with scores above the thresholds will go to the next step.
Query | Subject | Identity percentage | Coverage | bitscore |
---|---|---|---|---|
Otu1 | hit1 | 93 | 95 | 320 |
Otu1 | hit2 | 93 | 95 | 320 |
Otu1 | hit3 | 91 | 94 | 310 |
Otu1 | hit4 | 90 | 95 | 301 |
Otu1 | hit5 | 89 | 94 | 300 |
Step 3, determine the lowest common ancestor:
Of all the remaining hits the lowest common ancestor is determined. The script starts at species level, it checks if all the hits are coming from the same species. If not it checks at genus level and so on.
Query | Subject | taxonomy |
---|---|---|
Otu1 | hit1 | Animalia / Arthropoda / Insecta / Diptera / Asilidae / Scarbroughia / Scarbroughia delicatula |
Otu1 | hit2 | Animalia / Arthropoda / Insecta / Diptera / Asilidae / Scarbroughia / Scarbroughia delicatula |
Otu1 | hit3 | Animalia / Arthropoda / Insecta / Diptera / Asilidae / Schildia / Schildia fragilis |
Otu1 | hit4 | Animalia / Arthropoda / Insecta / Diptera / Asilidae / Scytomedes / Scytomedes haemorrhoidalis |
With this example data and thresholds the identification of otu1 will be Asilidae. This is a basic explanation of the LCA approach, more advanced filter settings and output options are shown at the example commands.
There is an example input file included in the example folder, this file will be used to execute the commands. The file consist of 11 columns, has a specific header and is tab separated.
Column explanation input file:
Column name | Description |
---|---|
Query ID | Id of the input sequence (qseqid) |
Subject | Means Subject Title (stitle) |
Subject accession | Subject accession (sacc) |
Subject Taxonomy ID | Unique Subject Taxonomy ID (staxid), this value can be any value and is not used to find the lca |
Identity percentage | Identity percentage of hit () |
Coverage | Means Query Coverage Per Subject (qcovs) |
Evalue | Expected value of the hit (evalue) |
Bitscore | Bit score of hit (bitscore) |
Source | The source of where the taxonomy comes from, this value can be any value and is not used to find the lca |
Taxonomy | This column contains the taxonomy of the hit in the order Kingdom / phylum / class / order / family / genus /species |
The script itself has multiple parameter options.
Parameters:
Parameter | Description |
---|---|
-i | Input file |
-o | Output file |
-b | Bitscore top percentage threshold |
-id | Minimum identity threshold |
-cov | Minimum coverage threshold |
-t | Check the top hit first or perform an lca analysis on all hits. Options:['only_lca', 'best_hit', "best_hits_range"] |
-tid | Identity threshold for the tophit, only used when -t best_hit or best_hits_range |
-tcov | Coverage threshold for the tophit, only used when -t best_hit or best_hits_range |
-fh | Filter hits, filter out lines that contain a certain string |
-flh | Filter lca hits, during the determination of the lca ignore this string |
-minbit | Minimum bitscore threshold |
Example 1:
This command performs an lca analysis on all the hits per input sequence. The top percentage threshold is 8%, minimal idenity and coverage is set to 80%. If you perform an lca analysis on all the hits the identification will never be on species level.
python lca.py -i example/example.tabular -o output1_example.tabular -b 8 -id 80 -cov 80 -t only_lca
Example 1 output explanation:
Otu1 has no identification and there is no lca analysis performed. You can see this in the last column. The hits of this sequence did not passed the thresholds.
#query | #lca rank | #lca taxon | #kingdom | #phylum | #class | #order | #family | #genus | #species | #method |
---|---|---|---|---|---|---|---|---|---|---|
Otu1 | no identification | no identification | no identification | no identification | no identification | no identification | no identification | no identification | no identification | no lca |
For Otu3 and Otu 6 an lca analysis is performed. The identification of otu 3 is at genus level and is identified as Chaetopterus.
#query | #lca rank | #lca taxon | #kingdom | #phylum | #class | #order | #family | #genus | #species | #method |
---|---|---|---|---|---|---|---|---|---|---|
Otu3 | genus | Chaetopterus | Eukaryota | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus | no identification | lca |
Otu6 | class | Polychaeta | Eukaryota | Annelida | Polychaeta | no identification | no identification | no identification | no identification | lca |
Otu16 is identified as "unknown genus". This is because that is how the information is stored in the reference database. In this case the hit is coming from BOLD http://www.boldsystems.org/index.php/Public_RecordView?processid=POLNB1246-14
#query | #lca rank | #lca taxon | #kingdom | #phylum | #class | #order | #family | #genus | #species | #method |
---|---|---|---|---|---|---|---|---|---|---|
Otu16 | genus | unknown genus | Animalia | Annelida | Polychaeta | Polychaeta incertae sedis | unknown family | unknown genus | no identification | lca |
Example 2 taxonomy filter:
This command performs an lca analysis on all the hits per sequence like example 1, but now during the lca analysis if the text "unknown" exist in a rank it will be ignored. In many databases like Genbank or BOLD not all taxonomy is filled in by the uploader.
python lca.py -i example/example.tabular -o output2_example.tabular -b 8 -id 80 -cov 80 -t only_lca -flh unknown
Example 2 taxonomy filter output explanation:
Otu6 is now now identified as Thelepus at genus level. If you look at the example.tabular file you can see that this makes sense, in example 1 the rank was pushed up to class level because of the "unknowns" in the first hit.
#query | #lca rank | #lca taxon | #kingdom | #phylum | #class | #order | #family | #genus | #species | #method |
---|---|---|---|---|---|---|---|---|---|---|
Otu6 | genus | Thelepus | Eukaryota | Annelida | Polychaeta | Terebellida | Terebellidae | Thelepus | no identification | lca |
Otu16 | order | Polychaeta incertae sedis | Animalia | Annelida | Polychaeta | Polychaeta incertae sedis | no identification | no identification | no identification | lca |
Example 3 best hit species level identification:
This command performs an lca analysis if the top hit falls below the "top hit" thresholds. If the top blast hit (sorted on evalue) exceeds the thresholds the top hit is chosen and the input sequence gets a species level identification. Notice the change in parameters -t best_hit -tid 98 -tcov 100
. If the top hit has an identity above 98% and a coverage above or equal to 100% this hit will be the taxonomic identification and will be at species level.
python lca.py -i example/example.tabular -o output3_example.tabular -b 8 -id 80 -cov 80 -t best_hit -tid 98 -tcov 100 -flh unknown
Example 3 best hit species level identification output explanation:
Now we have an otu that is identified on species level. Also the last column of Otu9 is different than in the previous examples. The value "best hit" means that there is no lca analysis performed, but the species of the top hit is chosen for identification. If you look at the example.tabular file you can see that the top hit of otu9 has an identity above 98 and coverage equal to 100.
#query | #lca rank | #lca taxon | #kingdom | #phylum | #class | #order | #family | #genus | #species | #method |
---|---|---|---|---|---|---|---|---|---|---|
Otu6 | genus | Thelepus | Eukaryota | Annelida | Polychaeta | Terebellida | Terebellidae | Thelepus | no identification | lca |
Otu9 | species | Myxine glutinosa | Eukaryota | Chordata | unknown class | Myxiniformes | Myxinidae | Myxine | Myxine glutinosa | best hit |
Example 4 best hit range:
In some cases the sequences in the reference database are wrongly morphological identified or contaminated by human or bacterial DNA. It can also happen that a certain marker (16S, CO1, ITS) is not distinctive enough. Many people choose the top hit as the identification for the input sequence without looking at the second hit while that it can even be a better choice. In these cases the following command can help to solve this. Notice the changed parameter -t best_hits_range
python lca.py -i example/example.tabular -o output4_example.tabular -b 8 -id 80 -cov 80 -t best_hits_range -tid 98 -tcov 100 -flh unknown
Example 4 best hit range output explanation:
After executing the command otu6 stays the same. Otu6 does not have hits above -tid 98 -tcov 100
so the parameter -t best_hits_range
has no effect here. If we look at otu9 you see that it occurs twice in the output with two extra columns. Otu9 had multiple hits above -tid 98 -tcov 100
on two different species. In the blast output of otu9 there were 6 hits on the species Myxine glutinosa with an identity between 98.7 and 100%.
#query | #lca rank | #lca taxon | #kingdom | #phylum | #class | #order | #family | #genus | #species | #method | #identity | #coverage |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Otu6 | genus | Thelepus | Eukaryota | Annelida | Polychaeta | Terebellida | Terebellidae | Thelepus | no identification | lca | ||
Otu9 | species | Myxine glutinosa | Eukaryota | Chordata | unknown class | Myxiniformes | Myxinidae | Myxine | Myxine glutinosa | top hit (6) | 98.7-100.0 | 100.0-100.0 |
Otu9 | species | Myxine limosa | Eukaryota | Chordata | unknown class | Myxiniformes | Myxinidae | Myxine | Myxine limosa | top hit (4) | 98.1-98.7 | 100.0-100.0 |