Skip to content

MetaFX tutorial

ivartb edited this page May 19, 2023 · 7 revisions

Here is presented a MetaFX tutorial based on simulated bacterial metagenomes. It provides a step-by-step analysis of metagenomics data from raw reads to samples categories prediction.

Table of contents

Input data

System requirements

It is highly recommended to run this tutorial on computational cluster due to high resource requirements.

Disk storage: 30Gb of free disk space is required. ~12Gb for input data and ~16Gb for intermediate computational steps.

RAM: 8Gb should be enough.

Threads: with 6 threads this tutorial took 1 hour to run. More threads will speed up the computations, especially k-mers extraction step.

Data

We will use in-silico metagenomic data generated with InSilicoSeq. For comparative analysis at least two groups (categories) of metagenomes are required – A and B. In each group there are two types of samples – train and test. Train data will be used for feature extraction and classification model training. Test data is required to validate extracted features and check the accuracy of predictive model.

There are 40 samples in total. Each sample contains 10 random bacteria from a predefined pool of 20 bacteria. Additionally, all samples contain E.coli strain, different between A and B groups. These strains are quite similar, with mash distance D=0.022 (i.e. Average Nucleotide Identity ANI≈0.978). Dataset overview is presented in the table below.

group A group B
train test train test
10 samples 10 samples 10 samples 10 samples
10 random bacteria from 20
E.Coli str. K-12 E.coli O157:H7 str. Sakai

Raw data can be download by that link. Alternatively, you can run the following command:

wget data.zip

Further we need to extract data and delete the archive:

unzip data.zip
rm data.zip

data directory contains several files and folders:

  • data/reads/ contains 80 files with paired-end reads for 40 samples.
  • data/species_distribution.tsv contains the relative abundance of bacterial species in samples.
  • data/train_dataset.txt contains path to train samples and their categories.
  • data/labels.txt contains information about all samples and their categories.

We can examine bacterial content with:

cat data/species_distribution.tsv
tutorial_bacteria_distr

Feature extraction from train data

K-mers extraction

Unique features extraction

Train data analysis

Data visualisation

Prediction model training

Test data processing

K-mers extraction for test data

Test features calculation

Test data analysis

Test labels prediction

All data visualisation