What are your experiences with:
- Any lab work, e.g. isolating DNA? PCR?
- Any sequencing? Sanger? HTS?
- Any HTS data analysis?
- What have you learned in Genomic Architecture?
- At the time of writing there are multiple technologies, broadly categorized as sequencing-by-ligation (e.g. SOLiD) and sequencing-by-synthesis (Illumina, Ion Torrent, 454). Reads are getting longer on all platforms, but especially on PacBio and MinION.
- Vendors have created numerous platforms that target specific needs and requirements, e.g. data volume, read length, cost, error profile, and runtime
- A fairly current review is Goodwin et al., 2016
- Illumina (below) is currently the most widely used platform
A typical HTS data processing workflow includes steps such as:
- clip any synthetic oligonucleotides (adaptors, primers)
- trim low quality bases, filter short reads
- de novo or mapping assembly
- further annotation
- variant calling
- consensus sequence computation
- After DNA isolation and fragmentation, primer sequences may be ligated to the fragment as part of an amplification procedure (PCR)
- In addition, various sequencing platforms involve ligation of adaptor sequences for various roles, e.g. to label samples and to participate in the chemistry of the platform (attach to the flowcell surface, for example)
Depending on the sequencing platform, insert size, and additional services provided by the sequencing lab, the reads may already be sorted by adaptors (which are then clipped off) or you may have to do this yourself.
Sturm M, C Schroeder & P Bauer, 2016. SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics 17: 208. doi:10.1186/s12859-016-1069-7
- There are many tools available for adaptor clipping. Some are faster than others, and they affect downstream analysis time differently (a naive clipping sketch follows below)
- Under some circumstances, it may not be necessary to do this, depending on the experimental design (for example, if there is no de-multiplexing to do and the adaptors are ignored in a mapping assembly)
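To make the idea concrete, here is a naive sketch of 3' adaptor clipping by exact matching; the adaptor sequence, function name and minimum overlap below are made up for illustration, and real trimmers do error-tolerant matching (and, for paired-end data, use the mate as extra evidence):

def clip_adaptor(read, adaptor, min_overlap=5):
    # full adaptor occurs somewhere in the read: cut from there
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # otherwise look for a prefix of the adaptor hanging off the 3' end
    for length in range(len(adaptor) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:length]):
            return read[:len(read) - length]
    return read

print(clip_adaptor("ACGTACGTAGATCGGAAGAGC", "AGATCGGAAGAGC"))  # ACGTACGT
print(clip_adaptor("ACGTACGTACGTAGATCGG", "AGATCGGAAGAGC"))    # ACGTACGTACGT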
Sidenote about amplicon sequencing
- In amplicon sequencing, the fragment will have been ligated with a primer as well as an adaptor sequence.
- This allows for more samples to be multiplexed, because the number of distinguishable combinations becomes the number of adaptors times the number of primers
- And you probably don't need the amount of coverage on a single marker that a non-multiplexed run would give you anyway
- However, platform vendors cannot de-multiplex automatically (because they know their own adaptors, but not your primers), and with degenerate primers you'd have to do fuzzy matching against their sequences
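As a sketch of what such matching could look like, the degenerate positions of an IUPAC-coded primer can be translated into a regular expression and matched against the start of each read; the primer and read below are only examples, and handling actual mismatches or indels would need more than this:

import re

# IUPAC degeneracy codes expanded to regex character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def primer_pattern(primer):
    # turn a degenerate primer into a regex anchored at the start of the read
    return re.compile("^" + "".join(IUPAC[base] for base in primer))

pattern = primer_pattern("GTGYCAGCMGCCGCGGTAA")  # example 16S-style degenerate primer
read = "GTGTCAGCAGCCGCGGTAATACGGAGG"
match = pattern.match(read)
if match:
    print("primer matched; insert starts at position", match.end())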
A convenient tool for initial quality assessment of HTS read data is FastQC, whose results can indicate numerous pathologies:
- Low Phred scores overall (e.g. PacBio compared to Illumina), at higher base positions on the read (i.e. near the "end"), or along homopolymers, some of which can be addressed using read trimming
- The presence of biases (GC content) and overrepresentations of certain reads (e.g. adaptors) that may still need to be clipped
- A (crude) proxy for a read possibly being chimeric is that it occurs only once, i.e. as a singleton, which you might therefore filter out. This applies more to genome sequencing than to amplicon sequencing (where a chimera may itself be amplified by PCR and therefore occur many times)
- On platforms that have variable length reads, you might want to filter out all reads below a threshold length
- When paired-end sequencing, one of the two 'ends' might have been filtered out, in which case you might filter out the opposite end as well
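A minimal sketch of such filtering, assuming the reads have already been parsed into (id, sequence) tuples and using an arbitrary length threshold (a real pipeline would read FASTQ, e.g. with Biopython, and also filter on quality):

def filter_pairs(pairs, min_length=50):
    # keep a pair only if both the forward and the reverse read are long enough
    kept = []
    for (fwd_id, fwd_seq), (rev_id, rev_seq) in pairs:
        if len(fwd_seq) >= min_length and len(rev_seq) >= min_length:
            kept.append(((fwd_id, fwd_seq), (rev_id, rev_seq)))
    return kept

pairs = [(("r1/1", "A" * 100), ("r1/2", "C" * 100)),  # both pass: kept
         (("r2/1", "A" * 100), ("r2/2", "C" * 30))]   # mate too short: pair dropped
print(len(filter_pairs(pairs)))  # 1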
What do you know about graph theory? Edges? Vertices? Degrees? Directedness?
- In Sanger sequencing assembly, a graph is constructed where every vertex is a read, and edges connect the reads that overlap
- Subsequently, a Hamiltonian path, which visits every vertex (read) exactly once, is searched for, and the consensus sequence along the path is computed
- However, finding such a Hamiltonian path is computationally hard (computer scientists call this NP-complete)
- This makes the overlap-layout-consensus approach too computationally intensive for HTS data volumes
- Another approach to traverse graphs is along the Eulerian path, where every edge (instead of vertex) is visited exactly once
- The Eulerian path can be traversed in time linear in the number of edges (computer scientists notate this as "O(|E|)"), whereas for the Hamiltonian path no polynomial-time algorithm, i.e. one with running time T(n) = O(n^k) for some constant k in Big O notation, is known
- However, for an Eulerian path to exist, either zero or exactly two vertices may have odd degree, so the Seven Bridges of Königsberg, which Euler studied, don't form such a path (all four land masses have odd degree), and an overlap graph might not either
- In a complete De Bruijn graph, all vertices have even degree
- A De Bruijn graph connects symbolic sequence data such that every vertex is a sequence string of length k (a "k-mer"), and two such vertices are connected if their sequences are identical along a substring of length k-1, i.e. one sequence is shifted one step relative to the other, which creates directedness in the graph (a construction sketch follows after the k-mer code below)
- The simplest cases, with binary sequence data, are shown for k = 1..3 (imagine what this would look like for four symbols):
- The De Bruijn graph presupposes that for every k-mer there are neighbours that overlap with it along a substring of length k-1
- HTS read data does not naturally come out meeting that assumption: there are biases causing reads to overlap more irregularly
- However, the reads can be re-processed to a spectrum of substrings of some smaller size k that are shifted relative to one another (note that this collapses the duplicates that are then created):
This re-processing can be achieved naively (there are faster tools than this) in Python thusly:

def find_kmers(string, k):
    # collect every substring of length k, shifting one position at a time
    kmers = []
    n = len(string)
    for i in range(0, n - k + 1):
        kmers.append(string[i:i+k])
    return kmers
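Building on find_kmers() above, here is a minimal sketch of the De Bruijn construction described earlier: vertices are the (collapsed) k-mers, and a directed edge u -> v is drawn when the last k-1 characters of u equal the first k-1 characters of v. The all-against-all comparison is only for illustration; real assemblers index k-mers by their prefixes:

from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    kmers = set()
    for read in reads:
        kmers.update(find_kmers(read, k))  # duplicate k-mers collapse in the set
    for u in kmers:
        for v in kmers:
            if u[1:] == v[:-1]:            # overlap along a substring of length k-1
                graph[u].add(v)
    return graph

reads = ["ACGTAC", "CGTACG"]
for u, targets in sorted(de_bruijn(reads, 4).items()):
    print(u, "->", ", ".join(sorted(targets)))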
Chikhi R & P Medvedev, 2014. Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1): 31–37 doi:10.1093/bioinformatics/btt310
- Smaller k makes for a smaller graph, but at the cost of more duplicate k-mers collapsed, causing information loss and an inability to cross over repeat regions
- Higher k is more memory intensive, and may increase the risk of k-mers having no neighbours, causing short contigs
- Some tools exist that attempt to optimize this value, for example by predicting the value that results in the most distinct k-mers, which correlates with certain measures of assembly quality
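A toy sketch of that idea: count the distinct k-mers in the read set for a range of k values and report the k that maximizes the count. This is a simplification for illustration, not the actual algorithm of any such tool:

def best_k(reads, k_values):
    counts = {}
    for k in k_values:
        distinct = set()
        for read in reads:
            distinct.update(read[i:i+k] for i in range(len(read) - k + 1))
        counts[k] = len(distinct)
    # the k producing the most distinct k-mers
    return max(counts, key=counts.get), counts

reads = ["ACGTACGTGACT", "GTGACTTACGTA", "TTACGTAACGTG"]
k, counts = best_k(reads, range(3, 9))
print(counts)
print("k with the most distinct k-mers:", k)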
- Velvet is often used as benchmark / gold standard for de novo assembly, but it is limited in data set size.
- SOAPdenovo2 is an efficient, commonly used assembler for larger genomes.
- NG50 gives the length of the contig (with contigs sorted by descending length) at which the cumulative length reaches half the length of the reference genome (compare with N50, which is computed the same way but relative to half the total length of the assembly).
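A minimal sketch of these definitions, computing N50 and NG50 from a list of contig lengths (the example lengths are arbitrary):

def n50(contig_lengths, target_length=None):
    if target_length is None:
        target_length = sum(contig_lengths)  # N50; pass a genome size instead for NG50
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target_length / 2:
            return length
    return None  # contigs do not cover half the target (possible for NG50)

contigs = [100, 80, 60, 40, 20]
print("N50:", n50(contigs))                      # 80 (100 + 80 >= 150)
print("NG50:", n50(contigs, target_length=400))  # 60 (100 + 80 + 60 >= 200)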
- De novo assembly results in contigs, stretches of contiguous read data, which then may be scaffolded
- Paired-end and mate-pair read data (with larger insert sizes) can help span gaps between contigs and help orient them and estimate the distances between them
- Other ways to scaffold include using optical mapping or longer reads (e.g. PacBio) as well
- In a mapping assembly, a reference genome is scanned for the best place to map a read
- Hence, the reference genome needs to be indexed in order for this to be efficient; a commonly used index is based on the Burrows-Wheeler transform (BWT), computed as follows:
- All rotations of a given input string T are generated
- These are sorted alphabetically
- The final column is the transformed string, i.e. BWT(T)
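A minimal sketch of these three steps; a terminator character '$' (lexicographically smallest) is appended, as is conventional, so that the transform can later be inverted:

def bwt(text):
    text = text + "$"
    rotations = [text[i:] + text[:i] for i in range(len(text))]  # all rotations
    rotations.sort()                                             # sorted alphabetically
    return "".join(rotation[-1] for rotation in rotations)       # final column = BWT(T)

print(bwt("BANANA"))  # ANNB$AA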
This string has the following properties:
- Almost incomprehensibly, the input string can be recovered from BWT(T) by a reverse function
- But, in the transformed string, identical characters tend to occur in runs, which can be compressed (e.g. by counting the number of repetitions, i.e. run-length encoding)
- By computing an additional index (FM), substring locations can quickly be found, and thus reads be mapped
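A naive sketch of that reverse function, assuming the '$' terminator from the construction above: repeatedly prepend the transformed string as a column to a growing table and re-sort; real implementations instead use the LF-mapping of the FM index rather than this slow reconstruction:

def inverse_bwt(transformed):
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # prepend the BWT column to every row, then re-sort the rows
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    for row in table:
        if row.endswith("$"):      # the row ending in '$' is the original text
            return row.rstrip("$")

print(inverse_bwt("ANNB$AA"))  # BANANA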
- Here is a good tutorial with code samples for all the steps in Python
Commonly used BWT mapping tools are:
- BWA
- Bowtie2
Genomes are annotated using multiple lines of evidence, such as:
- gene prediction, e.g. with SNAP
- EST mappings, e.g. with blastn or exonerate
- protein mappings, e.g. with blastx (translated nucleotide query against a protein database) or exonerate
- repeat masking data
Variants are called with a variety of methods:
- Allele counting
- Probabilistic/Bayesian: build a model of the expected number of variants and use that to quantify support for any given call
- Heuristic techniques: based on filtering / thresholding for read depth, base quality, frequency
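A toy sketch combining allele counting with heuristic thresholds: given the bases observed at one reference position across the mapped reads (a "pileup"), call a variant when the read depth and the non-reference allele frequency exceed some cut-offs. The thresholds are arbitrary illustrations, not taken from any particular caller:

from collections import Counter

def call_variant(ref_base, pileup_bases, min_depth=10, min_frequency=0.2):
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # not enough reads to call anything
    counts = Counter(pileup_bases)
    counts.pop(ref_base, None)           # keep only non-reference alleles
    if not counts:
        return None
    alt_base, alt_count = counts.most_common(1)[0]
    if alt_count / depth >= min_frequency:
        return alt_base, alt_count / depth
    return None

print(call_variant("A", "AAAAATTTTTTA"))  # ('T', 0.5)
print(call_variant("A", "AAAAAAAAAAAT"))  # None: alternative allele frequency too low

A probabilistic caller would replace these fixed cut-offs with genotype likelihoods derived from base qualities and expected error rates.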