In this practical, we are going to analyze some HTS data sets (MiSeq) for soil fungi in samples taken at different locations along a mountain slope. The following are the steps we take:
- Create a directory where to store your data, e.g.
mkdir w1p1
- Go into the directory and download the data, e.g.
wget -O C1P1.fastq.gz "https://github.com/naturalis/mebioda/blob/master/doc/week1/w1p1/C1P1.fastq.gz?raw=true"
(the -O flag saves the file under its proper name, so the ?raw=true query string does not end up in the file name)
- Do this for all the *.fastq.gz files
- Extract the data, e.g.
gunzip *.gz
Alternatively, the following script downloads all the files in one go:
#!/bin/bash
url="https://github.com/naturalis/mebioda/blob/master/doc/week1/w1p1/"
CNUMS="1 3 5 7"
PNUMS="1 2 3"
# fetch every C${C}P${P}.fastq.gz combination
for C in $CNUMS; do
  for P in $PNUMS; do
    curl -Lo "C${C}P${P}.fastq.gz" "${url}C${C}P${P}.fastq.gz?raw=true"
  done
done
Our sequencing reads may contain base calls with low support. These typically occur at the ends of the reads, and we are going to remove them. In addition, there are still primer sequences on both ends (i.e. 3' and 5') that have to be removed. For all of these steps we use a tool called cutadapt.
- Install cutadapt:
pip install --user --upgrade cutadapt
(if pip is not found: sudo apt install python-pip)
- Trim the bad quality ends:
mkdir out_trimmed
for fq in *.fastq; do
  cutadapt -q 20,20 -o out_trimmed/"${fq%.fastq}_trimmed_ends.fastq" "$fq"
done
- Trim the 3' primer:
for fq in out_trimmed/*.fastq; do
  cutadapt -a TCCTCCGCTTATTGATAGC -o "${fq/_ends/_primers}" "$fq"
done
- Trim the 5' primer:
for fq in out_trimmed/*_trimmed_primers.fastq; do
  cutadapt -g GTGARTCATCGAATCTTTG -o "${fq/_primers/_primers2}" "$fq"
done
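Because cutadapt trims reads rather than discarding them, a quick sanity check is to compare the mean read length before and after trimming (a minimal sketch, assuming the file names produced by the loops above; in a FASTQ file, every second line of each four-line record is the sequence):
# mean length of the raw reads
awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' C1P1.fastq
# mean length after end trimming and primer removal
awk 'NR % 4 == 2 { total += length($0); n++ } END { print total / n }' out_trimmed/C1P1_trimmed_primers2.fastq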
We are going to use the tool vsearch, first described in a paper by
Rognes et al. (2016). Part of the
analysis will be a continuation of data pre-processing steps, but in ways that
cutadapt
can't do. There is additional quality filtering, and deduplication of
the reads. Then, the read IDs are given additional labels to identify the
different soil samples. This is because we are going to merge the data and
cluster the sequences across all samples - but we still want to be able to
recognize where the sequences came from originally, and that information would
otherwise be lost.
Subsequently, the sequences are going to be clustered: the ones that are at least 97% similar will all end up in the same putative taxon (OTU). Once the clustering is done, we can also figure out which of the sequences are chimeras: artificial sequences stitched together from fragments of different templates during amplification. These are artefacts of the sequencing process, and we will remove them. We then compare the remaining, plausible OTUs with a reference database to see whether we can recognize and classify them. This step will produce the biologically relevant information that we need (though we will process the output a bit further so that we can easily interpret it in a spreadsheet program).
- Install vsearch from its source code. The steps to take on the command line are here.
- Do the additional quality filtering:
for fq in out_trimmed/*_trimmed_primers2.fastq; do
  parameters="--fastq_qmax 46 --fastq_maxee_rate 1 --fastq_trunclen 200"
  vsearch $parameters --fastq_filter "$fq" --fastaout "${fq%.fastq}.fa"
done
The parameters mean the following:
- --fastq_qmax 46: specify the maximum quality score accepted when reading FASTQ files.
- --fastq_maxee_rate 1: specify the maximum expected error rate per base (here: 1).
- --fastq_trunclen 200: all sequences will be truncated at 200 bp.
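To see how many sequences survived the quality filtering in each sample, you can count the FASTA headers (every sequence starts with a > line):
grep -c '^>' out_trimmed/*_trimmed_primers2.fa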
- Deduplicate the reads at sample level and relabel with the sample number:
cd out_trimmed
for fq in *_trimmed_primers2.fa; do
  vsearch --derep_fulllength "$fq" --output "${fq/trimmed_primers2/uniques}" --relabel "${fq/trimmed_primers2/seq}" --sizeout --minsize 2
done
The parameters mean the following:
- --derep_fulllength: merge strictly identical sequences contained in filename.
- --sizeout: the number of occurrences (i.e. abundance) of each sequence is indicated at the end of its fasta header.
- --relabel: sequence renaming is done to add sample prefixes to the sequence identifiers, so that we can later merge the sequences and still tell the samples apart.
- --minsize 2: sequences seen only once in a sample (singletons) are discarded.
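To verify the relabeling and the abundance annotations, you can peek at the headers of one of the dereplicated files (assuming the file names produced by the loop above); the --sizeout option appends a size=N field to each label:
grep '^>' C1P1_uniques.fa | head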
- Merge the data across all samples:
rm *_ends.fastq *_primers.fastq *_primers2.fastq *_primers2.fa
cat *uniques.fa* > CPuniques.fasta
- Cluster the reads at 97% similarity:
vsearch \
--cluster_size CPuniques.fasta \
--id 0.97 \
--strand plus \
--sizein \
--sizeout \
--fasta_width 0 \
--uc CP_otus.uc \
--relabel OTU_ \
--centroids CP.otus.fasta \
--otutabout CP.otutab.txt
The parameters mean the following:
- --cluster_size: sorts sequences by decreasing abundance before clustering.
- --id 0.97: reject the sequence match if the pairwise identity is lower than 0.97.
- --fasta_width 0: fasta files produced by vsearch are wrapped (sequences are written on lines of a fixed number of nucleotides, 80 by default); setting the value to zero eliminates the wrapping.
- --centroids: the centroid is the sequence that seeded the cluster (i.e. the first sequence of the cluster).
- --otutabout: output an OTU table in the classic tab-separated plain text format, as a matrix containing the abundances of the OTUs in the different samples.
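Once the clustering has finished, you can check how many OTUs were produced and what the OTU table looks like:
# number of cluster centroids, i.e. OTUs
grep -c '^>' CP.otus.fasta
# first rows of the OTU table: one row per OTU, one column per sample
head CP.otutab.txt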
- Detect likely chimeras de novo:
vsearch \
--uchime_denovo CP.otus.fasta \
--sizein \
--sizeout \
--fasta_width 0 \
--nonchimeras CP.otus.nonchim.fasta
The parameters mean the following:
- --uchime_denovo: detect chimeras present in the fasta-formatted filename, without external references (i.e. de novo).
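To see how many OTUs were flagged as chimeric, compare the sequence counts before and after the filtering:
grep -c '^>' CP.otus.fasta CP.otus.nonchim.fasta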
- Compare clusters with the reference database:
vsearch \
--usearch_global CP.otus.nonchim.fasta \
--db ../utax_reference_dataset_10.10.2017.fasta \
--id 0.7 \
--blast6out CPotus.m8 \
--strand both \
--maxaccepts 1 \
--maxrejects 256
The parameters mean the following:
- --usearch_global: compare target sequences (--db) to the fasta-formatted query sequences contained in filename, using global pairwise alignment.
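The blast6 (m8) output is a tab-separated table; its first three columns are the query (our OTU), the matching database sequence, and the percentage of identity. You can inspect them like this:
cut -f 1-3 CPotus.m8 | head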
We now have all the output files that we need. However, the CPotus.m8 file needs to be
processed in the following way:
- The fourth column contains the alignment length. We only want matches longer than 150 bp.
- The first column contains both the OTU id and the cluster size, like this: OTU_1;size=9159;. We just want to keep the OTU_1 part, i.e. everything before the first semicolon.
- For each distinct OTU id, we want to keep the match with the highest identity percentage (there might be multiple matches for the same OTU).
This script performs these operations. It is executed like this:
perl ../filter.pl CPotus.m8 > CPotus.uniq.tsv
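Purely for illustration, a rough awk equivalent of these three operations might look like the sketch below; filter.pl, which is provided with the course materials, remains the authoritative version:
awk -F'\t' 'BEGIN { OFS = "\t" }
$4 > 150 {                                 # keep only matches longer than 150 bp
  id = $1
  sub(/;.*/, "", id)                       # strip everything from the first ";" onwards
  if (!(id in best) || $3 > best[id]) {    # retain the highest-identity match per OTU
    best[id] = $3
    $1 = id                                # rebuilds the record with the shortened id
    line[id] = $0
  }
}
END { for (id in line) print line[id] }' CPotus.m8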
We now have two tables that each have OTU identifiers in the first column. One table contains the taxonomic identification of these OTUs (i.e. their nearest matches in the Linnean taxonomy), the other table contains the abundances of these OTUs in the different samples. We want to combine these by matching up the OTU identifiers. This is a database operation called a join. The Linux environment contains a standard tool that allows you to do this with tab-separated text files, producing output that consists of the joined records. Run it like this:
join CPotus.uniq.tsv CP.otutab.txt > CPotus.samples.tsv
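Note that join expects both input files to be sorted on the join field (here the OTU identifier). If the command complains about unsorted input, you can sort both files on the fly:
join <(sort CPotus.uniq.tsv) <(sort CP.otutab.txt) > CPotus.samples.tsv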
We now need to get the file CPotus.samples.tsv
onto our own computers. Depending on
the platform (Windows, Mac, Linux) there are different ways to do that. On
Windows you would use pscp, which comes with PuTTY. On Mac and Linux you use the
standard scp command.
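For example (the server name and path are placeholders here; substitute the ones you actually used):
scp username@server.example.org:~/w1p1/out_trimmed/CPotus.samples.tsv .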
Once you have the file, import it into a spreadsheet program
such as Excel, as a tab-separated file. In the spreadsheet, we are going to summarize
our results:
- After the OTUs' abundance values you can calculate the sum, e.g.
=SUM(A1:G1)
- Next to the sum we are going to write an IF condition that will return the value 1 if the abundance value is greater than 0, otherwise it will return 0:
=IF(E1>0,1,0)
- Finally, we want the sum of each presence-absence column, so at the end of each column calculate the sum.