# Tests for `microbetag` basic modules

- Infer co-occurrence network with FlashWeave (`test_flashweave` | IO)
- Literature-based node annotation with FAPROTAX (`test_faprotax.py` | IO)
- Genome-based node annotations using `phenotrex` predictions (`test_phenotrex.py` | IO)
- Gene prediction using Prodigal (`test_prodigal.py` | IO)
- KEGG ORTHOLOGY annotation (`test_kegg_annotation.py` | IO)
- Pathway complementarity extraction (`test_path_compl.py` | IO)
- Genome-scale metabolic reconstruction with ModelSEED (`test_modelseed.py` | IO)
- Genome-scale metabolic reconstruction with CarveMe (`test_carve.py` | IO)
- Seed complementarity extraction (`test_seed_compl.py` | IO)
- Network clustering using `manta` (`test_manta.py` | IO)
- Building a CX2 `microbetag`-annotated network (`test_build_cx2.py` | IO)
- Complete `microbetag` run (`test_microbetag` | IO)
Under the tests folder, there is a series of test files for the different microbetag features/modules.
Under `test_data`, you can find the input/output files of each test.
In this README file, we explain what every test is supposed to test, what kind of input it expects and what it returns.
Some tests have a pseudo-configuration file, a part of the one used when running the whole pipeline.
Yet, not all tests have such a YAML file in their input data;
in some cases we simply build a partial `Config` class on the spot, i.e., in the test script.
**Attention!**
Configuration YAML files in these tests only partially follow the `microbetag` versioning scheme; changes from version to version will be applied to them only if needed. The `microbetag` version-specific YAML files can be found under the `config_files` folder.
Also, `unittest` runs test functions alphabetically, not in the order they appear in the script.
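The alphabetical ordering can be verified directly with `unittest`'s loader; here is a minimal sketch (the class and method names are made up for the demonstration, not part of microbetag's tests):

```python
import unittest

class DemoOrdering(unittest.TestCase):
    # Methods are defined deliberately out of alphabetical order.
    def test_b_runs_second(self):
        pass

    def test_a_runs_first(self):
        pass

# TestLoader sorts test method names alphabetically by default,
# regardless of their definition order in the class body.
names = unittest.TestLoader().getTestCaseNames(DemoOrdering)
print(names)  # ['test_a_runs_first', 'test_b_runs_second']
```

If a fixed execution order is ever needed, chaining the steps inside a single test method is the usual workaround.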
## Infer co-occurrence network with FlashWeave (`test_flashweave` | IO)

Runs FlashWeave with and without a metadata file.
Check the input files of this test to make sure you give `microbetag` your data in the proper format.
## Literature-based node annotation with FAPROTAX (`test_faprotax.py` | IO)

Builds the merged `functional_otu_table.tsv` and the `sub_tables` folder with one `.tsv` file per trait found,
as FAPROTAX does.
## Genome-based node annotations using `phenotrex` predictions (`test_phenotrex.py` | IO)

Runs both `phenotrex` programs used in `microbetag`:
- `genotype`
This builds a single file called `train.genotype`; a tab-separated file with a header line and a single line
for each genome/bin to be analyzed, with its name in the first column
and the Clusters of Orthologous Groups of proteins (COGs) assigned to it in the second one:
```
feature_type:eggNOG5-tax-2
bin_101.fa COG1432 COG0716 COG1381 COG2239 COG0010 COG0639 COG0738 COG1404 COG4430 COG2217
```
- `prediction`
This program returns a file for each of the [classes available](https://microbetag.readthedocs.io/en/v1.0.3/modules/phen-traits.html);
each file has a header line mentioning the trait under study, followed by a 3-column tab-separated table:
```
# Trait: ac
Identifier Trait present Confidence
bin_101.fa YES 0.6504
```
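Such a prediction file can be consumed downstream with plain Python; here is a minimal sketch (the file content is inlined for the example, and the parsing logic is an illustration, not microbetag's own code):

```python
# Parse a phenotrex prediction file: skip the '# Trait:' comment and the
# column-name line, keep (trait present, confidence) per genome/bin.
from io import StringIO

prediction = StringIO(
    "# Trait: ac\n"
    "Identifier\tTrait present\tConfidence\n"
    "bin_101.fa\tYES\t0.6504\n"
    "bin_151.fa\tNO\t0.9120\n"
)

calls = {}
for line in prediction:
    if line.startswith("#") or line.startswith("Identifier"):
        continue
    identifier, present, confidence = line.rstrip("\n").split("\t")
    calls[identifier] = (present == "YES", float(confidence))

print(calls["bin_101.fa"])  # (True, 0.6504)
```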
## Gene prediction using Prodigal (`test_prodigal.py` | IO)

Runs the Prodigal gene predictor on a single bin (`.fa`) to build 3 files:
- `.faa`
- `.ffn`
- `.gbk`
## KEGG ORTHOLOGY annotation (`test_kegg_annotation.py` | IO)

Takes as input a list of `.faa` files, as returned by the gene prediction step,
and also a very small part of the `ko_list` file of the `kofam_database` (`ko_list_tests`)
that looks like:

```
(microbetag) u0156635@gbw-l-l0074:kofam_database$ head ko_list_tests
knum	threshold	score_type	profile_type	F-measure	nseq	nseq_used	alen	mlen	eff_nseq	re/pos	definition
K00001	363.33	domain	all	0.312178	2773	2310	2080	519	14.06	0.590	alcohol dehydrogenase [EC:1.1.1.1]
K00002	438.50	domain	all	0.418181	2762	2642	6996	484	7.63	0.590	alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
```

as well as the actual Hidden Markov Models (HMMs) of the kofam database, called `profiles`.
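Since `ko_list` is just a tab-separated table, the per-KO score thresholds can be read with a few lines of Python; a sketch assuming the column layout shown above (the two inlined rows are copied from the example):

```python
from io import StringIO

# A two-row ko_list-like file: tab-separated, header line first.
ko_list = StringIO(
    "knum\tthreshold\tscore_type\tprofile_type\tF-measure\tnseq\tnseq_used"
    "\talen\tmlen\teff_nseq\tre/pos\tdefinition\n"
    "K00001\t363.33\tdomain\tall\t0.312178\t2773\t2310\t2080\t519\t14.06"
    "\t0.590\talcohol dehydrogenase [EC:1.1.1.1]\n"
)

header = ko_list.readline().rstrip("\n").split("\t")
thresholds = {}
for line in ko_list:
    fields = dict(zip(header, line.rstrip("\n").split("\t")))
    # Keep the adaptive score threshold and how it should be applied.
    thresholds[fields["knum"]] = (float(fields["threshold"]), fields["score_type"])

print(thresholds["K00001"])  # (363.33, 'domain')
```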
## Pathway complementarity extraction (`test_path_compl.py` | IO)

The only input for this test is a ko_merged-like file, based on the 7 bins we have been using across all the tests,
namely: bin_101, bin_151, bin_19, bin_38, bin_41, bin_45, bin_48.
This file is the output of the `merge_ko()` function, which you may check in `utils.py`,
and it looks like:

```
(microbetag) u0156635@gbw-l-l0074:input_files$ head ko_merged_7bins.txt
bin_id	contig_id	ko_term
bin_41	SCN18_26_2_15_R1_F_scaffold_115_57	K07586
```
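A ko_merged-like file can be collapsed into a per-bin set of KO terms, which is the shape the complementarity logic works with; a minimal sketch with made-up rows (this is an illustration, not `merge_ko()` itself):

```python
from collections import defaultdict
from io import StringIO

# Minimal ko_merged-like content: bin_id, contig_id, ko_term (tab-separated).
ko_merged = StringIO(
    "bin_id\tcontig_id\tko_term\n"
    "bin_41\tscaffold_115_57\tK07586\n"
    "bin_41\tscaffold_115_58\tK00826\n"
    "bin_101\tscaffold_9_3\tK01652\n"
)

ko_merged.readline()  # skip the header line
kos_per_bin = defaultdict(set)
for line in ko_merged:
    bin_id, _contig_id, ko_term = line.rstrip("\n").split("\t")
    kos_per_bin[bin_id].add(ko_term)

print(sorted(kos_per_bin["bin_41"]))  # ['K00826', 'K07586']
```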
The test then actually runs 3 tests, and finally builds 2 files:

- `alternatives.json`: all the different ways to complete a KEGG MODULE
- `complementarities.json`: potential complementarities between a beneficiary and a donor genome

For example:

```
{"bin_101":
  {"bin_151": [
    ["md:M00019", ["K00826"], ["K01652", "K01653", "K00053", "K01687", "K00826"], "https://www.kegg.jp/kegg-bin/show_pathway?map00290/K01652%09%23EAD1DC/K01653%09%23EAD1DC/K00053%09%23EAD1DC/K01687%09%23EAD1DC/K00826%09%2300A898/"], ..
  ], .. # more complements from donor bin_151
  }, .. # more donors
  # more beneficiaries
}
```

where bin_101 is the beneficiary, bin_151 the donor, and the entry shown is the first of their list of potential complementarities:

- `md:M00019`: the KEGG MODULE under study
- `["K00826"]`: the list of KOs provided by the donor
- `["K01652", "K01653", "K00053", "K01687", "K00826"]`: the beneficiary's alternative being completed
- URL: a link to the KEGG MAP the module under study is part of, with the terms coming from the beneficiary and the complements shown as colored nodes
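The nested beneficiary → donor → entry-list layout of `complementarities.json` can be traversed like this (a hedged sketch; the inlined structure is a trimmed, made-up version of the example above):

```python
import json

# A trimmed complementarities.json-like structure:
# beneficiary -> donor -> list of [module, donated KOs, completed alternative, URL]
complementarities = json.loads("""
{"bin_101": {"bin_151": [
    ["md:M00019", ["K00826"],
     ["K01652", "K01653", "K00053", "K01687", "K00826"],
     "https://www.kegg.jp/kegg-bin/show_pathway?map00290"]
]}}
""")

pairs = {}
for beneficiary, donors in complementarities.items():
    for donor, entries in donors.items():
        # Collect which KEGG MODULES this donor could help complete.
        pairs[(beneficiary, donor)] = [module for module, *_ in entries]

print(pairs[("bin_101", "bin_151")])  # ['md:M00019']
```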
## Genome-scale metabolic reconstruction with ModelSEED (`test_modelseed.py` | IO)

In this test case, the `microbetag.genres.TestGEMSReconstruction` class takes as input a `.fasta` file, to build the following files:

- `bin_6.gto`: intermediate ModelSEED file standing for the first step of the RAST annotation of the bin
- `bin_6.gto_2`: like the previous one, for the next step of the RAST annotation
- `bin_6.faa`: RAST-annotated genome
- `GENRES/bin_6.xml`: draft ModelSEED reconstruction (the final outcome)

`TestGEMSReconstruction` needs the `seed_complementarity` argument in the configuration file set to `true` to initiate the necessary variables for the class. This results in an empty directory called `seeds_complementarity` in the output directory.
## Genome-scale metabolic reconstruction with CarveMe (`test_carve.py` | IO)

Similar to the previous test, this time we provide 2 `.faa` files and build GENREs using CarveMe.
The test results in 2 `.xml` files in the `reconstructions/GENRES` folder.
As in the previous case, an empty folder called `seeds_complementarity` is part of the output directory because of the necessary `seed_complementarity` argument in the configuration file.
## Seed complementarity extraction (`test_seed_compl.py` | IO)

Here we give a set of GEMs as input files, and we build a pseudo `Config`-like class in the test.
Using this pseudo `Config` class, we build 3 instances to run 3 tests:
- Case 1: calculating seed and non-seed sets and extracting complements
- Case 2: using previously calculated seed, non-seed sets (from Case 1) to infer complements
- Case 3: calculating seed and non-seed sets only, keeping the BiGG namespace that our GEMs use
❗ By default, in the case of GEMs using the BiGG namespace, `microbetag` attempts to map seed and non-seed compounds to ModelSEED identifiers
in order to retrieve the KEGG MODULES they participate in.

❗ Notice that the pseudo `Config` class sets, among other fields, the `genre_reconstruction_with` argument to `carveme`.
This is not to reconstruct GENREs here, but to let `microbetag` know that the namespace in use is BiGG.
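The BiGG-to-ModelSEED mapping mentioned above amounts to a lookup-table filter over each seed set; here is a hedged sketch with a hypothetical two-entry mapping (the real mapping `microbetag` uses is far larger and comes from a database, and `map_seeds` is an illustrative name, not microbetag's API):

```python
# Hypothetical BiGG -> ModelSEED lookup table (illustrative entries only).
BIGG_TO_MODELSEED = {
    "glc__D": "cpd00027",  # D-glucose
    "h2o": "cpd00001",     # water
}

def map_seeds(seed_set):
    """Replace BiGG ids with ModelSEED ids where a mapping exists,
    keeping unmapped compounds in their original namespace."""
    return [BIGG_TO_MODELSEED.get(compound, compound) for compound in seed_set]

print(map_seeds(["glc__D", "lmn2"]))  # ['cpd00027', 'lmn2']
```

This keep-if-unmapped behavior is exactly why the output files below mix `cpd`-prefixed and BiGG-style compound ids.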
The test script will then return a set of files:

- `SeedSetDic.json`: a JSON file with each model's basename (supposed to be the same as the sequence identifier in a complete pipeline run) as key and the list of its seeds as value. Example:

  ```
  {"GPB:bin_000083": ["glucan4", "cpd00709", "lmn2", "2agpe181", ..], }
  ```

  ❗ You may notice that some compounds have the `cpd` prefix, standing for ModelSEED compounds, while others do not, i.e. they remain in the BiGG namespace. This is because our input models were built with CarveMe, thus using the BiGG namespace, and because `microbetag` applied a filter to map them to ModelSEED.

- `nonSeedSetDic.json`: a JSON file exactly like `SeedSetDic.json`, but with the non-seed sets of each model.

- `confidenceDic.json`: a JSON file where keys are again the sequence identifiers and values are also dictionaries, where a seed compound is the key and its value is the confidence level ($C$), ranging from 0 to 1. A confidence level of 0 would correspond to a non-seed node, while a 1 would correspond to a seed that cannot be activated by another node. For more, you may check here.

- `kegg_module_related_seeds.pckl`: a pickle file with all the seeds of the models under study found to be related to at least one KEGG MODULE.

  ```
  >>> import pickle
  >>> with open("kegg_module_related_seeds.pckl", "rb") as f:
  ...     kegg_seeds = pickle.load(f)
  >>> kegg_seeds
  0    GPB:bin_000083    [cpd00220, cpd00156, cpd01695, cpd00322, cpd00...
  ```

- `kegg_module_related_nonseeds.pckl`: a similar pickle file, but with the non-seed sets of the GEMs under study.

- `seed_complements.pckl`: a pickle file with all pairwise seed complementarities among the input GEMs.

  ```
  # After loading the pickle file as in the kegg_module_related_seeds.pckl case above
  >>> seed_compls
                  GPB:bin_000083                    bin_101
  bin_101         [cpd00065, cpd00051, cpd01695...  NaN
  GPB:bin_000083  NaN                               [cpd00724, cpd00134, ...]
  ```

- `phylomint_scores.tsv`: a 4-column tab-separated file (`.tsv`) with $species_A$, $species_B$, the Competition and the Cooperation scores.

  ```
  (microbetag) myuser@localhost:output_files$ head phylomint_scores.tsv
  GPB:bin_000083	bin_101	0.58	0.28
  bin_101	GPB:bin_000083	0.5	0.22
  ```
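Conceptually, PhyloMint-style competition and cooperation scores are set overlaps between seed and non-seed sets. The sketch below shows the idea only: toy compound ids, and a plain fractional overlap rather than PhyloMint's exact, confidence-weighted formula:

```python
def competition(seeds_a, seeds_b):
    """Fraction of A's seeds that B also needs (shared demands)."""
    return len(seeds_a & seeds_b) / len(seeds_a)

def cooperation(seeds_a, nonseeds_b):
    """Fraction of A's seeds that B can produce itself (B's non-seeds)."""
    return len(seeds_a & nonseeds_b) / len(seeds_a)

# Toy seed / non-seed sets for two models:
seeds_a = {"cpd00065", "cpd00051", "cpd01695", "cpd00027"}
seeds_b = {"cpd00065", "cpd00001"}
nonseeds_b = {"cpd00051", "cpd01695"}

print(competition(seeds_a, seeds_b))     # 0.25
print(cooperation(seeds_a, nonseeds_b))  # 0.5
```

Note the asymmetry: cooperation of A given B generally differs from cooperation of B given A, which is why `phylomint_scores.tsv` lists each pair in both directions.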
## Network clustering using `manta` (`test_manta.py` | IO)

We run two tests, one using an abundance table as input data and a second one using a network.
In both cases, we get two output files:

- `basenet.cyjs`: the initial network, before applying any clustering, in CYJSON format (here a toy example). `microbetag` uses this as an intermediate file to fire clustering with `manta`.
- `manta_annotated.cyjs`: the `manta`-clustered network. You can load it into Cytoscape directly; `microbetag` loads it when building the annotated network to assign clusters as node attributes.
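Reading cluster assignments back out of a `manta_annotated.cyjs`-style file is plain JSON handling; a sketch over a made-up two-node fragment (the `cluster` attribute name and node fields here are assumptions for illustration, not manta's documented schema):

```python
import json

# A minimal CYJSON-like fragment with a hypothetical 'cluster' node attribute.
cyjs = json.loads("""
{"elements": {"nodes": [
    {"data": {"id": "n1", "name": "bin_101", "cluster": 1}},
    {"data": {"id": "n2", "name": "bin_151", "cluster": 2}}
]}}
""")

# Map each node name to its assigned cluster.
clusters = {node["data"]["name"]: node["data"]["cluster"]
            for node in cyjs["elements"]["nodes"]}

print(clusters)  # {'bin_101': 1, 'bin_151': 2}
```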
## Building a CX2 microbetag-annotated network (`test_build_cx2.py` | IO)
This test expects a complete annotation set from microbetag, i.e.
- faprotax
- phenotrex
- pathway and seed complementarities
- manta clusters
and parses them to build and save a CX2 network using the ndex2 library.
**ATTENTION**
This network will have all the annotations, but not the NCBI taxonomy level column required by
`MGG` to recognize sequence IDs that have at least one genome mapped.
## Complete `microbetag` run (`test_microbetag` | IO)

This test runs the complete `microbetag` pipeline with a set of 7 custom genomes -- not on-the-fly!
It takes hours when run completely from scratch, since it will try to predict genes over the genomes, perform KEGG and COG annotation on the genes, build GENREs, and more.
For this reason, this test is not supposed to be part of any GitHub workflow for continuous integration.
However, contributors are strongly encouraged to run it before opening a PR.
For users, it is a fair example of how to run `microbetag` with their own data,
and it also gives a thorough overview of the total `microbetag` data products!