-
Notifications
You must be signed in to change notification settings - Fork 214
These tools work on BAM files that contain read-related information (e.g. [read][] DNA sequence, sequencing quality, mapping quality etc.). They are typically generated by [read][] alignment programs such as [bowtie2][].
The following tools will allow you to inspect your BAM files more closely.
This tool is useful to assess the overall similarity of different BAM files. A typical application is to check the correlation between replicates or published data sets.
The tool splits genomes into bins of a given length, which are definded by start and end position within the reference genome that was used for mapping. For each bin, the number of reads found in each BAM file is counted and the correlation of the [read][] coverages is computed for all pairs of BAM files.
bamCorrelate has 2 modes: bins and BED-file
In bins mode, the correlation is computed based on bins of equal length, which consecutively cover the entire genome, virtually (discarding the usually shortened last bin of each chromosome). The number of bins automatically results from the overall genome size divided by the bin length (10k bp by default). It may therefore vary depending on the selected bin size and data set. The bins mode is useful to assess the overall similarity of BAM files.
In BED-file mode, the user has to supply a list of genomic regions in BED format in addition to the BAM files. bamCorrelate subsequently uses this list to compare the [read][] coverages for these regions only. This can be used, for example, to compare ChIP-seq coverages of two different samples for a set of peak regions.
In addition to specifying the regions for which the [read][] numbers should be compared (equally sized consecutive regions in the bins mode, selected regions in the BED-file mode), you can also specify what kind of correlation measure you would like to compute: Pearson or Spearman. In short, Pearson is an appropriate measure for data that follows a normal distribution, while Spearman does not make this assumption and is generally less driven by outliers, but with the caveat of also being less sensitive. By default, a filtering of extreme values above the 99.9 percentile is done, when Pearson correlation is selected. The filtering is available as a user option for Spearman.
For an overview of all the available command line options, go here, for an example command and results, see below.
- diagnostic plot the plot produced by bamCorrelate is a clustered heatmap displaying the values for each pair-wise correlation, see below for an example
- data matrix (optional) in case you want to plot the correlation values using a different program, e.g. R, this matrix can be used
Here is the result of running bamCorrelate. We supplied 4 BAM files that were generated from 2 patients - for each patient, there is an control (called "[input][]") and a ChIP-seq sample (from GSE32222).
You can supply any number of BAM files that you would like to compare. In Galaxy, you just click "Add BAM file", in the command line you simply list all files one after the other (you can give meaningful names via the --label option).
This is the command that was used with the command line version:
$ deepTools-1.5.2/bin/bamCorrelate bins --fragmentLength 200 \
--bamfiles GSM798383_SLX-1201.250.s_4.bwa.homo_sapiens_f.bam \
GSM798384_SLX-1881.334.s_1.bwa.homo_sapiens_f.bam \
GSM798406_SLX-1202.250.s_1.bwa.homo_sapiens_f.bam \
GSM798407_SLX-1880.337.s_8.bwa.homo_sapiens_f.bam \
--labels "ChIP p1" "ChIP p2" "Input p1" "Input p2" \
--plotFile bamCorrelate_result.pdf --corMethod pearson
Here is another example of ChIP samples for two different histone marks (the histone marks are abbreviated H3K27me3 and H3K27ac and have been shown to mark inactive and active chromatin, respectively. For our example, H3K27ac was ChIPed by the same experimentator for different cell populations while H3K27me3 was performed with the same antibody, but at different times. You can see that the correlation between the H3K27ac replicates is much higher than for the H3K27me3 samples, however, for both histone marks, the ChIP-seq experiments are more similar to each other than to the other ChIP or to the input. In fact, the signals of H3K27ac and H3K27me3 are almost not correlated at all which supports the notion that their biological function is also quite opposing.
This tool computes the GC bias using the method proposed by Benjamini and Speed.
The basic assumption of the GC bias diagnosis is that an ideal sample should show a uniform distribution of sequenced reads across the genome, i.e. all regions of the genome should have similar numbers of reads, regardless of their base-pair composition. In reality, the DNA polymerases used for PCR-based amplifications during the library preparation of the sequencing protocols prefer GC-rich regions. This will influence the outcome of the sequencing as there will be more reads for GC-rich regions just because of the DNA polymerase's preference.
computeGCbias will first calculate the expected GC profile by counting the number of DNA fragments of a fixed size per GC fraction (GC fraction is defined as the number of G's or C's in a genome region of a given length) ( * ). This profile is then compared to the observed GC profile by counting the number of sequenced reads per GC fraction.
( * ) The expected GC profile depends on the reference genome as different organisms have very different GC contents. For example, one would expect more fragments with GC fractions between 30% to 60% in mouse samples (GC content of the mouse genome: 45 %) than for genome fragments from Plasmodium falciparum (genome GC content P. falciparum: 20%).
In some cases, it will make sense to exclude certain regions from the calculation of the read distributions to increase the accuracy of the computation. There are several kinds of regions that are either not expected to show a background read distribution or where the uncertainty of the reference genome might be too big. Please consider the following points:
- repetitive regions: if multi-reads (reads that map to more than one genomic position) were excluded from the BAM file, it will help to exclude known repetitive regions. You can get BED files of known repetitive regions from UCSC Table Browser (see the screenshot below for an example of human repetitive elements).
-
regions of low mappability: these are regions where the mapping of the reads notoriously fails and we recommend to exclude known regions with mappability issues from the GC computation. You can download the mappability tracks for different read lengths from UCSC, e.g. for mouse and human. In the github deepTools folder "scripts", you can find a shell script called mappabilityBigWig_to_unmappableBed.sh which will turn the bigWig mappability file from UCSC into a BED file.
-
ChIP-seq peaks: in ChIP-seq samples it is expected that certain regions should show more reads than expected based on the background distribution, therefore it makes absolute sense to exclude those regions from the GC bias calculation. We recommend to run a simple, non-conservative peak calling on the uncorrected BAM file first to obtain a BED file of peak regions that should then subsequently be supplied to computeGCbias.
-
Diagnostic plot
- box plot of absolute [read][] numbers per genomic GC fraction
- x-y plot of observed/expected [read][] ratios per genomic GC fraction (ideally, ratio should always be 1 (log2(1) = 0))
-
Data matrix
- tabular matrix file
- to be used for GC correction with correctGCbias
In an ideal sample without GC bias, the ratio of observed/expected values should be close to 1 for all GC content bins.
However, due to PCR (over)amplifications, the majority of ChIP samples usually shows a significant bias towards reads with high GC content (>50%) and a depletion of reads from GC-poor regions.
Let's start with an ideal case. The following plots were generated with computeGCbias using simulated reads from the Drosophila genome. The command looked like this (for more information on the individual command options, go to our page with All command line options):
$ deepTools-1.5.2/bin/computeGCBias -b simulatedReads.bam \
--effectiveGenomeSize 121400000 --genome dm3.2bit \
--fragmentLength 200 --biasPlot dm3_simulatedReads.png \
--GCbiasFrequenciesFile simulatedReads_frequencies.txt
As you can see, both plots based on simulated reads do not show enrichments or depletions for specific GC content bins, there is an almost flat line at log2ratio of 0 (= ratio of 1). The fluctuations on the ends of the x axis are due to the fact that only very, very few regions in the genome have such extreme GC fractions so that the number of fragments that are picked up in the random sampling can vary.
Now, let's have a look at real-life data from genomic DNA sequencing. Panels A and B can be clearly distinguished and the major change that took place between the experiments underlying the plots was that the samples in panel A were prepared with too many PCR cycles and a standard polymerase whereas the samples of panel B were subjected to very few rounds of amplification using a high fidelity DNA polymerase.
This quality control will most likely be of interest for you if you are dealing with ChIP-seq samples as a pressing question in ChIP-seq experiments is "Did my ChIP work?", i.e. did the antibody-treatment enrich sufficiently so that the ChIP signal can be separated from the background signal? (After all, around 90 % of all DNA fragments in a ChIP experiment will represent the genomic background). We use bamFingerprint routinely to monitor the outcome of ChIP-seq experiments.
This tool is based on a method developed by Diaz et al. and it determines how well the signal in the ChIP-seq sample can be differentiated from the background distribution of reads in the control sample. For factors that will enrich well-defined, rather narrow regions (e.g. transcription factors such as p300), the resulting plot can be used to assess the strength of a ChIP, but the broader the enrichments are to be expected, the less clear the plot will be. Vice versa, if you do not know what kind of signal to expect, the bamFingerprint plot will give you a straight-forward indication of how careful you will have to be during your downstream analyses to separate biological noise from meaningful signal.
The tool first samples indexed BAM files and counts all reads overlapping a window (bin) of specified length. These counts are then sorted according to their rank and the cumulative sum of [read][] counts is plotted.
- Diagnostic plot
- Data matrix of raw counts (optional)
An ideal [input][] with perfect uniform distribution of reads along the genome (i.e. without enrichments in open chromatin etc.) should generate a straight diagonal line. A very specific and strong ChIP enrichment will be indicated by a prominent and steep rise of the cumulative sum towards the highest rank. This means that a big chunk of reads from the ChIP sample is located in few bins which corresponds to high, narrow enrichments seen for transcription factors.
Here you see 3 different fingerprint plots. We chose these examples to show you how the nature of the ChIP signal (narrow and high vs. wide and not extremely high) is reflected in the "fingerprint" plots. Please note that these plots go by the name of "fingerprints" in our facility because we feel that they help us tremendously in judging individual files, but the idea underlying these plots came from Diaz et al.
[read]: https://github.com/fidelram/deepTools/wiki/Glossary#terminology "the DNA piece that was actually sequenced ("read") by the sequencing machine (usually between 30 to 100 bp long, depending on the read-length of the sequencing protocol)" [input]: https://github.com/fidelram/deepTools/wiki/Glossary#terminology "confusing, albeit commonly used name for the 'no-antibody' control sample for ChIP experiments" [bowtie2]: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml "a popular read alignment program"
deepTools is developed by the Bioinformatics Facility at the Max Planck Institute for Immunobiology and Epigenetics, Freiburg. For troubleshooting, see our FAQ and get in touch: [email protected]
Wiki Start Page | Code | deepTools Galaxy | FAQ | Glossary | Gallery |