pandaseq.1

.\" Authors: Andre Masella
.TH pandaseq 1 "June 2011" "2.0" "USER COMMANDS"
.SH NAME 
pandaseq \- PAired-eND Assembler for DNA sequences
.SH SYNOPSIS
.B pandaseq
.B \-f
.I forward.fastq
.B \-r 
.I reverse.fastq
[
.B \-6 
] [
.B \-A
.I algorithm
] [
.B \-a 
] [
.B \-B 
] [
.B \-C
.I module
] [
.B \-d
.I flags
] [
.B \-D
.I penalty
] [
.B \-F 
] [
.B \-g
.I log.txt
] [
.B \-G
.I log.txt.bz2
] [
.B \-i
.I index.fastq
] [
.B \-k
.I kmers
] [ 
.B \-l
.I minlen
] [
.B \-L
.I maxlen
] [
.B \-N 
] [
.B \-o 
.I minoverlap
] [
.B \-O 
.I maxoverlap
] [
.B \-p
.I forwardprimer
] [
.B \-q
.I reverseprimer 
] [
.B \-t
.I threshold
] [
.B \-T
.I threads
] [
{
.B \-u
|
.B \-U
}
.I unpaired.txt
] [
.B \-w
.I output.fasta
] [
.B \-W
.I output.fasta.bz2
]
.SH DESCRIPTION
PANDASEQ assembles paired-end Illumina reads into sequences, trying to correct for errors and uncalled bases. The assembler reads two files in FASTQ format with quality information. If amplification primers were used (e.g., to isolate a variable region of the 16S gene, or the constant regions around zinc finger binding residues), they can be removed from the sequence during assembly. The final sequence will correct any uncalled bases in the overlapping region using the complementary strand. When mismatches occur in the overlapping region, the base with the better quality score is chosen.

The algorithm is as follows:
.IP 1.
Find the positions where the forward and reverse primers match best above the threshold and discard the ends of the sequence, including the primer.
.IP 2.
Pick and overlap to maximise the probability of the forward and reverse reads having come from a single piece of DNA.
.IP 3.
Identify the masking of the end of the read with the quality score \fBB\fR or \fB#\fR as done by CASAVA and adjust the probabilities in this region.
.IP 4.
Construct an assembled sequence between the primers and calculate the quality.
.IP 5.
Check for various constraints, including quality, length, uncalled bases, and user-supplied modules.
.SH OPTIONS
.TP
\-6
Input files will have their quality encoded as PHRED + 64 instead of the PHRED + 33. PHRED + 64 was used originally in the CASAVA pipeline from version 1.3 through 1.7. In CASAVA 1.8, the score is encoded as PHRED + 33, the default.
.TP
\-A algorithm
Set the algorithm used for assembly. Currently there are:
.RS
.IP simple_bayesian[:\fIerror_estimation\fR]
Uses the formula described in the original paper (Masella 2012), optionally with an error estimation (\[*e]) provided.
.IP ea_util
Uses the formula by FastqJoin in the ea-utils package (Aronesty 2013). No parameters are tunable.
.IP flash
Uses the formula by FLASH paper (Magoc 2011). No parameters are tunable.
.IP pear[:\fIrandom_base\fR]
Uses the formula described in the PEAR paper (Zhang 2013), optionally with the probability of a random base (q) provided.
.IP rdp_mle
Uses the formula by RDP paper (Cole 2013). No parameters are tunable.
.IP stitch
Uses the formula in the Sitch software, developed by Austin Richardson. No parameters are tunable.
.IP uparse[:\fIerror_estimation\fR]
Uses the scoring algorithm from UPARSE/USEARCH (Edgar 2013), optionally with an error estimation (\[*e]) provided.
.RE
.TP
\-a
Strip the primers after assembly, rather than before. Stripping the primers first saves time, but if the overlap region is very large compared to the read, the read may have sequence from the other primer (i.e., the forward read ends with reverse primer, and/or the reverse read ends with forward primer). If the primers are stripped first, the reads will fail to assemble. This option attempts assembly first, then tries to strip the primers, so the heavily overlapping case will assemble. You should only need this if the region of interest is smaller than the whole read. It is undesirable, unless necessary, as it slows assembly down.
.TP
\-B
Allow input sequences to lack a barcode/tag. Normally, Illumina sequences have barcodes attached to the sequence. This allows the barcode to be missing. The tool
.BR panda-checkid (1)
will determine if an Illumina identifier is undestood by PANDAseq and if the tag is included.
.TP
\-C module
Loads an optional validation module to verify sequences are valid before emitting them. See below for more information. You may repeat this option to use multiple validation modules.
.TP
\-d flags
Set debugging/output flags to provide more details about what PANDAseq is doing. To enable a flag, capitalise it; to disable, uncapitalise it. Provide information about the \fBb\fRuilding of a sequence. Show excruciating detail about \fBr\fReconstruction. Show some optional \fBs\fRtatistics. Show information about building the \fBk\fR-mer table. Provide errors about the \fBf\fRile parsing. Show every \fBm\fRismatch. The default is \fBBrSkFm\fR.
.TP
\-D penalty
Sometimes, with repetitive sequence, the primer aligns further down the sequence. To avoid this, a primer penalty can be applied. For each base further down the sequence, \fIpenalty\fR is subtracted from the proability that the primer aligns to this location. By default, the value is 0, and if used, the value should be rather small; 0.01 seesm to be sufficient in most cases.
.TP
\-f forward.fastq
The location of the forward reads in FASTQ format. The file may be plain FASTQ, or compressed with
.BR gzip (1)
or
.BR bzip2 (1).
File compression is automatically detected.
.TP
\-F
Normally, output will be as a FASTA even though per-base quality information is available. To retain this quality information, this option will output the sequence and the quality information in FASTQ format with quality scores encoded as PHRED + 33 (even if the input scores are PHRED + 64). The meaning of the quality score is conceptually different from the input quality scores for the overlap region, but this may not matter depending on your downstream application. If you intend to use this information for further quality filtering, especially by a program expecting Illumina reads, you are not using this data correctly.
.TP
\-g log.txt
Log all output to a plain text file, \fIlog.txt\fR, instead of standard error.
.TP
\-G log.txt.bz2
Log all output to a
.BR bzip2 (1)
compressed text file, \fIlog.txt.bz2\fR, instead of standard error.
.TP
\-i index.fastq
If the index/barcode reads are in a separate FASTQ file, read them and apply them to the input reads.
.TP
\-j
This option is ignored. It used to indicate that input files specified by
.B -f
and 
.B -r
are compressed by
.BR bzip2 (1).
This is automatically detected now.
.TP
\-k kmers
Sets the number of sequence locations for a particular \fIk\fR-mer. When attempting to align the sequences, the assembler will store the location of every \fIk\fR-mer in a table. If the same \fIk\fR-mer is present multiple times, only the first ones will be stored until the table is full; when this occurs, an \fBFML\fR error is emitted. If the sequences are highly repetitive, lost positions can prevent good alignments; this can be alleviated by increasing this amount. This should be small (no more than 10; the default is 2), or the \fIk\fR-mer table will be extremely large, using a large amount of RAM per thread. Try increasing the value until \fBFML\fR errors go away.
.TP
\-l minlen
Sets the minimum length for a sequence, after primers are removed. By default, all sequences are kept. With this option, sequences shorter than desired can be discarded.
.TP
\-L maxlen 
Sets maximum length for a sequence, after primers are removed.  By default, all sequences are kept. With this option, sequences longer than desired can be discarded.
.TP
\-N
Eliminate all sequences with uncalled nucleotides in the output. Otherwise, during assembly, uncalled bases\ (Ns) from unpaired regions may be emitted.
.TP
\-o minoverlap
Sets the minimum overlap between forward and reverse reads. By default, this is at least one nucleotide of overlap. Raising this number does not generally increase the quality of the output as alignments with small overlaps tend to score poorly and are discarded anyway.
.TP
\-O maxoverlap
Sets the maximum overlap between forward and reverse reads. By default, this is the read length. In highly overlapping sequences (i.e., those where the end of one read precede the start of the other), this parameter should be set to the sum of the input reads, or a value larger than that.
.TP
\-p forwardprimer
Strip out primers from the start of the sequence. If the data contains a forward primer (e.g., a conserved region to amplify a 16S variable region), specifying it here will cause the primer to be located in the read and the primer, and any sequence before it, will be discarded. It is also possible to specify a number and the same number of leading bases will be stripped from the sequence. It may be useful to user a number if the sequence has many uncalled bases in the primer region, preventing a nucleotide primer from matching.
.TP
\-q reverseprimer
Strip out primers from the end of the sequence. The primer is specified as it appears in the reverse read (i.e., it is a reverse complement of what it would be in the alignment).
.TP
\-r reverse.fastq
FASTQ file containing the reverse reads. See
.B -f
for more information.
.TP
\-t threshold
The score, between 0 and 1, that a sequence must meet to be kept in the output. Any alignments lower than this will be discarded as low quality. Increasing this number will not necessarily prevent uncalled bases\ (Ns) from appearing in the final sequence.
It is also used as the threshold to match primers, if primers are supplied. The default value is 0.6.
.TP
\-T threads
The number of threads to spawn. This will only be available if PANDAseq was compiled with 
.BR pthreads (7).
In most cases, PANDAseq is IO-bound, not CPU-bound; therefore, adding more CPU capacity would have no effect. Try monitoring a running copy of PANDAseq with 
.BR top (1);
watch the CPU% for the PANDAseq process and the overall system CPU waiting time (\fI%wa\fR in the banner at the top). If waiting time is low and CPU% is very high, then multi-threading may increase speed. If the CPU waiting time is high, threading will simply not help.

Note that using multiple threads prevents sequences from being output in the same order as the original file. If you a filtering reads downstream, consider using the \fBfilter\fR validation module as matching them up may be difficult.
.TP
\-[U|u] unpaired.txt
Write sequences for which the optimal alignment cannot be computed to a file as concatenated pairs. For downstream processing or to stare at wistfully. If \fB-U\fR is used, the quality scores will be included.
.TP
\-w output.fasta
Write all assembled sequences to a FASTA (or FASTQ) file, \fIoutput.fasta\fR, instead of standard output.
.TP
\-W output.fasta.bz2
Write all assembled sequences to a
.BR bzip2 (1)
compressed FASTA (or FASTQ) file, \fIoutput.fasta\fR, instead of standard output.
.SH OUTPUT STATISTICS
At the end of reconstruction, several statistics are output on lines beginning with \fBSTAT\fR.
.TP
READS
The number of reads in the input files.
.TP
NOALGN
The number of sequences where there exists no overlap with a probability above the threshold.
.TP
BADR
The number of sequences where the reads are unsatisfactory (too short to assemble).
.TP
SLOW
The number of sequences where the fast hashing algorithm could not figure out the optimal overlap, and so every possible overlap had to be considered. Nothing is necessarily wrong with these sequences; they just take longer to assemble. Very repetitive patterns can cause PANDAseq to spend more time investigating overlaps that are likely wrong, resulting the processing time of the file to be quite long if there are many sequences in this category. If they are a significant percentage of the input data, try increasing the size of the \fIk\fR-mer table, using the \fB-k\fR option; this will cause PANDAseq to use more memory, but it may be faster.
.TP
NOFP
The number of sequences where the forward primer could not be aligned. This is only done when \fB-p\fR is supplied and a nucleotide sequence.
.TP
NORP
The number of sequences where the reverse primer could not be aligned. This is only done when \fB-q\fR is supplied and a nucleotide sequence.
.TP
LOWQ
The number of sequences where the quality score of the reconstruction is below the threshold. This says nothing about the quality scores of the individual bases in the forward and reverse reads.
.TP
DEGENERATE
The number of sequences containing uncalled/degenerate/N bases in the final reconstruction (it is immaterial if there are uncalled bases in the reads). This is only done when \fB-N\fR is provided.
.TP
SHORT
The number of sequences where the final reconstructed sequence is too short. This is only done when \fB-l\fR is provided.
.TP
LONG
The number of sequences where the final reconstructed sequence is too long. This is only done when \fB-L\fR is provided.
.TP
OK
The number of sequences output.
.TP
OVERLAPS
The number of sequences assembled for each possible overlapping length. The first number is the number of sequences with only one overlapping base, the second with two overlapping bases, and so on.
.SH LOGGING MESSAGES
During output, the assembler may output any of the following errors.
.TP
ERR BADID
The name of the input read did not follow the known Illumina standard formats. Older versions of CASAVA produce sequences with IDs that look like \fBHWUSI-EAS1661_9323_FC619KG:7:1:1190:15190#ATCACG/1\fR, where the fields are \fIinstrument:lane:tile:x:y#tag/direction\fR. Newer version of CASAVA produce IDs that look like \fBHWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA\fR, where the fields are \fIinstrument:run:flowcell:lane:tile:x:y direction:filtered:flags:tag\fR. If your sequence headers do not look like either of these, either Illumina has created yet-another header format or, more likely, your sequence headers have been manipulated by some upstream processing, possibly at your sequencing centre. PANDAseq needs the original Illumina probabilities; not ones manipulated by other programs. We're very picky about that. Sometimes, for mysterious reasons, the sequences lack the barcoding tag. The \fB-B\fR option will cause the lack of barcode to be ignored. This will obviously invalidate the use of validation modules that depend on the barcode.
.TP
ERR BADNT
An invalid letter was found in a nucleotide read. Likely caused by incorrect or corrupt input files.
.TP
ERR BADSEQ
The an unexpected character or end of the input file was detected. Likely caused by incorrect or corrupt input files.
.TP
ERR EOF
The end of the input file was detected before it was expected. Likely caused by incorrect or corrupt input files.
.TP
ERR KLNG
The \fIk\fR-mer table is too small to hold a read of the size requested. This is a bug or platform-dependent behaviour. Please file a ticket either way.
.TP
ERR LOWQ
The sequence is discarded because the quality is too low given the supplied threshold.
.TP
ERR NEGS
The reconstruction parameters do not produce a valid sequence. Instead, they produce a negative-length sequence. This read pair is discarded.
.TP
ERR NODATA
A FASTQ record has no sequence data. Likely caused by incorrect or corrupt input files.
.TP
ERR NOFILE
The input file was not found or could not be read.
.TP
ERR NOFP
The forward primer could not be matched to the forward read. Either the primer is incorrect or the read is low quality or the sequence provided is not the correct original molecule.
.TP
ERR NOQUAL
Quality information is missing from the FASTQ file. This data is required to reconstruct the sequence.
.TP
ERR NORP
The reverse primer could not be matched to the reverse read. See \fBNOFP\fR.
.TP
ERR NOTPAIRED
Sequences from FASTQ files are not pairing correctly given their sequence names. Likely, the files are mismatched.
.TP
ERR OOM
An out of memory condition has occurred. Given the memory available, assembly of this sequence is not possible. As Illumina sequencing gets longer, the amount of memory needed can be adjusted. Please file a ticket.
.TP
ERR READLEN
The read length is too long for this version of PANDAseq. PANDAseq needs to be recompiled with a longer allowable seqeuence length; this length is kept short to improve performance.
.TP
INFO ARG[\fIn\fR]
The \fIn\fRth command line argument that generated this output, for posterity.
.TP
INFO BESTOLP
The best overlap parameter for a sequence.
.TP
INFO BUILD
The parameters of a reconstructed base.
.TP
INFO MISM
A mismatch has been identified in the reconstruction.
.TP
INFO MOD
Information about a module.
.TP
INFO OLD
An overlap possibility, with probability, as been identified.
.TP
INFO RECR
The proposed reconstruction parameters.
.TP
INFO VER
The version of PANDAseq that generated this output, for posterity.
.TP
STAT
Some information about the assembly process. See above.
.TP
DBG FMER
A \fIk\fR-mer has been identified in the forward read.
.TP
DBG FML
A duplicate \fIk\fR-mer has been identified in the forward read and discarded. This might cause failure to assemble a sequence if repeated too often. See the \fB-k\fR option to correct this.
.TP
DBG RMER
A \fIk\fR-mer has been identified in the reverse read.
.TP
ERR UNKNOWN ERROR
Something truly unexpected has happened. This probably involves a validation module.
.SH EXAMPLES
This will assemble a data from a run in lane 7:

.B pandaseq -f s_7_1.fastq.bz2 -r s_7_2.fastq.bz2 > s_7.fasta

This will assemble data from lane 7, stripping conserved regions around the prokaryotic 16S V3 region and store the results in
.B s_7.fasta.bz2
and store the logging output
.B s_7.log.bz2.

.B pandaseq -f s_7_1.fastq.bz2 -r s_7_2.fastq.bz2 -p CCTACGGGAGGCAGCAG -q ATTACCGCGGCTGCTGG -G s_7.log.bz2 | bzip2 > s_7.fasta.bz2
.SH VALIDATION MODULES
Validation modules are capable of making decisions about whether or not to keep output sequences. For example, one could write a module to check secondary structure of a RNA, or that a coding sequence contains no stop codons. To create a module, please see
.BR pandaxs (1).
Invoking a module can be done using the
.B -C
option on the command line. As many modules as desired may be added. The path to the module may be followed by a colon (on Windows, a semicolon) and arguments. For example, the following will include all sequences after \fBHWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA\fR in the input file:

.B pandaseq -f s_7_1.fastq.bz2 -r s_7_2.fastq.bz2 -C \(dqafter:HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA\(dq > s_7.fasta
.SH INCLUDED MODULES
There are some included modules:
.TP
\(dqafter:\fIidentifer\fR\(dq
Assemble only the sequences after (and including) the sequence specified. This is done in file order.
.TP
\(dqbefore:\fIidentifer\fR\(dq
Assemble only the sequences before (and excluding) the sequence specified. This is done in file order.
.TP
completely_miss_the_point
This can be used to only include sequences with perfect overlap regions. You shouldn't want to do it. The whole point is to fix sequences which are probably good. Moreover, assuming that the sequencer is right in the overlap region and in the non-overlapping regions requires an unsound leap in statistics. My dislike has been appropriately embodied in the name of this validation module.
.TP
empty
Sometimes, an assembled sequence can have zero length. Some downstream applications do not like this, so this filter allows removing any such sequences.
.TP
filter:\fIfile\fR
Output only the sequences whose identifiers match those in the file specified, one per line. If the file is missing, sequences are read from standard input.
.TP
min_phred:\fIvalue\fR
Check the PHRED score of every base in the output sequence and make sure it is at least \fIvalue\fR. The threshold is based on the sequence as a whole, but this is based on the individual base scores, as they would be seen with the \fB-F\fR option.
.TP
min_overlapbits:\fIvalue\fR
Check that the number of “bits saved” (Cole, et al. 2013) is above the provided value.
.TP
other_primer:\fIdirection\fR:\fIprimer\fR
Sometimes, libraries are constructed with a mix of primer sequences. The allows separating the primer mix. To do this given primers, run PANDAseq twice: once with the each primer given to \fB-p\fR or \fB-q\fR and the other primer given as the \fIprimer\fR option to this module. The \fIdirection\fR must be specified as either \fBf\fR or \fBp\fR for the forward primer or \fBr\fR or \fBq\fR for the reverse primer. This module will reject and reads that match the primer, eliminating them from the output.
.TP
overlap_stat
Produces a histogram of the number of overlaps that were examined for each of the sequences that assembled. This does not indicate the number of overlaps that were examined for discarded sequences.
.TP
pear
Perform the false-positive test described in section 2.2 of Zhang 2013.
.TP
validtag:\fItag1\fR:\fItag2\fR:...
Only include sequences in the output with one of the tags specified. This can be used to demultiplex sequences. This will not work well with \fB-B\fR option.
.SH SEE ALSO
.BR pandaseq-checkid (1),
.BR pandaxs (1),
.BR gzip (1),
.BR bzip2 (1).

Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown and Josh D Neufeld.
.I PANDAseq: paired-end assembler for Illumina sequences.
BMC Bioinformatics 2012, 13:31. 
.UR http://www.biomedcentral.com/1471-2105/13/31
.UE

E. Aronesty.
.I Comparison of Sequencing Utility Programs
TOBioiJ (2013); doi:10.2174/1875036201307010001
.UR http://benthamscience.com/open/openaccess.php?tobioij/articles/V007/1TOBIOIJ.htm
.UE

J. Zhang, K. Kobert, T. Flouri, and A. Stamatakis.
.I PEAR: A fast and accurate Illumina Paired-End reAd mergeR
Bioinformatics 2013 : btt593v1-btt593.
.UR http://bioinformatics.oxfordjournals.org/content/early/2013/10/18/bioinformatics.btt593.short
.UE

J. R. Cole, Q. Wang, J. A. Fish, B. Chai, D. M. McGarrell, Y. Sun, C. T. Brown, A. Porras-Alfaro, C. R. Kuske, and J. M. Tiedje.
.I Ribosomal Database Project: data and tools for high throughput rRNA analysis
Nucl. Acids Res. Database issue: first published online 27 Nov 2013; doi: 10.1093/nar/gkt1244
.UR http://nar.oxfordjournals.org/content/early/2013/11/26/nar.gkt1244.full
.UE

T. Magoc, and S. Salzberg.
.I FLASH: Fast length adjustment of short reads to improve genome assemblies.
Bioinformatics 27:21 (2011), 2957-63.a
.UR http://ccb.jhu.edu/software/FLASH/FLASH-reprint.pdf
.UE


R. C. Edgar.
.I personal communication