routine sync

NYU-Molecular-Pathology · May 24, 2017 · 5e918bc
1 parent 8c3dc0b
commit 5e918bc

Showing 36 changed files with 1,333 additions and 91 deletions.
diff --git a/README.md b/README.md
@@ -3,71 +3,81 @@
 ## Usage overview
 
 Navigate to a clean new project directory. This is where all the results will end up.
+
 ```
 cd <project dir>
 ```
 
 Download the code from GitHub, which will create the `sns` sub-directory with all the code.
+
 ```
 git clone --depth 1 https://github.com/igordot/sns
 ```
 
 Scan a directory that contains FASTQ files to be used as input.
 This can be run multiple times if there are FASTQs in different directories.
+
 ```
 sns/gather-fastqs <fastq dir>
 ```
-All found files will be added to the `samples.fastq-raw.csv` file, which can be modified to adjust sample names or remove samples.
-The first column is the sample name.
-The second column is the R1 FASTQ.
-The third column is the R2 FASTQ (if available).
+
+All found files will be added to the `samples.fastq-raw.csv` file.
+It can be modified to change sample names, remove samples, or manually add samples.
+The first column is the sample name, the second column is the R1 FASTQ, and the third column is the R2 FASTQ (if available).
 Each line contains a single FASTQ (or FASTQ pair for paired-end experiments).
 If one sample has multiple FASTQs, each one will be on a different line.
 Multiple FASTQs for the same sample will be merged based on sample name.
 
-Specify a genome (only `hg19/mm10/dm3/dm6` are currently guaranteed to work).
+Specify a reference genome (only `hg19/mm10/dm3/dm6` are currently guaranteed to work).
+
 ```
 sns/generate-settings <genome>
 ```
 
 Run the analysis using a specific route (a set of analysis steps).
+
 ```
 sns/run <route>
 ```
 
 Check for potential problems.
+
 ```
 grep "ERROR:" logs-qsub/*
 ```
-There should be no matches if everything is okay.
-If there are any results, check the specific log files where the errors are found for more info.
+
+There should be no matches.
+If there are any results, there was a problem.
+Check the specific log files where the errors are found for more info.
 
 ## Routes
 
-Routes are different analysis workflows. Generic routes are sample-centric (same analysis is performed for each sample). Available routes:
+Routes are different analysis workflows.
+Generic routes are sample-centric (the same analysis is performed for each sample).
+Available routes:
+
 * `rna-star`: RNA-seq using STAR. Generates BAMs, normalized bigWigs, counts matrix, and various QC metrics.
 * `rna-rsem`: RNA-seq using RSEM. Generates FPKM/TPM/counts matrix and various QC metrics.
-* `rna-snv`: RNA-seq variant detection using STAR and GATK. Generates BAMs, VCFs, and various QC metrics.
+* `rna-snv`: RNA-seq variant detection. Generates BAMs, VCFs, and various QC metrics.
 * `wgbs`: WGBS using Bismark.
 * `rrbs`: RRBS using Bismark.
-* `wes`: Whole genome/exome/targeted variant detection using BWA-MEM and GATK. Generates BAMs, VCFs, and various QC metrics.
-* `atac`: ATAC-seq using Bowtie and MACS. Generates BAMs, bigWigs, peaks, nucleosome positions, and various QC metrics.
-* `species`: Species/metagenomics/contamination analysis using Centrifuge with NCBI BLAST nt/nr nucleotide collection.
+* `wes`: Whole genome/exome/targeted variant detection. Generates BAMs, VCFs, and various QC metrics.
+* `atac`: ATAC-seq. Generates BAMs, bigWigs, peaks, nucleosome positions, and various QC metrics.
+* `species`: Species/metagenomics/contamination analysis.
 
 There are additional routes for comparing groups of samples after individual samples are processed with a generic route.
 They depend on the output of the generic routes and must be run from the same directory.
 Before running, manually add proper group names or pairs to the `samples.groups.csv` or `samples.pairs.csv` files (depending on the comparison type).
 Available comparison routes:
+
 * `rna-star-groups-dge`: Differential gene expression using DESeq2 for the `rna-star` results.
 * `wes-pairs-snv`: Somatic variant detection for the `wes` results.
 
 ## Output
 
 * Directories for different output types (such as BAMs or bigWigs) containing files for each sample.
-* `summary.*.csv`: Most segments will generate a results summary file. This lets you know if the step completed and some relevant statistics.
-* `summary-combined.*.csv`: Combined segment summaries. This table provides a comprehensive overview of the project and should reveal any potential problems.
-* `samples.*.csv`: Most segments that generate large files will also generate a separate segment-specific sample sheet. If the files referenced in the sample sheet are missing, the route will not attempt to generate them. This can be useful if the files were deleted to save space, but you want to add more samples to the same analysis without reprocessing the older samples.
-* `logs-*` directories: Most stdout/stderr output will be placed here. The information can be used for tracking progress and troubleshooting, but is generally useless.
+* `summary-combined.*.csv`: Combined segment summaries table that provides a comprehensive overview of the project.
+* `logs-*` directories: Most stdout/stderr output will be placed here. The information can be used for tracking progress and troubleshooting.
 
 ## About
 
@@ -79,6 +89,7 @@ Each route contains multiple segments (or steps).
 
 If there is a problem with any of the results, delete the broken files and re-run SNS.
 It will generate any missing output.
+Similarly, you can add additional samples and only the new ones will be processed when the route is re-run.
 
 Most output and sample sheets are in a CSV format for macOS Quick Look (spacebar file preview) compatibility.
 

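The sample sheet layout described in this README (sample name, R1 FASTQ, optional R2 FASTQ, one FASTQ or FASTQ pair per line) can be sanity-checked before starting a route. The following is only a sketch: the function name `count_fastqs_per_sample` is hypothetical and not part of sns, and it assumes sample names and paths contain no commas.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of sns): report how many FASTQ lines each sample has.
# Samples with a count above 1 are the ones whose FASTQs will be merged by sample name.
count_fastqs_per_sample() {
	local sheet="$1"
	# column 1 is the sample name; tally the lines per name and sort the result
	awk -F ',' '{ n[$1]++ } END { for (s in n) print s "," n[s] }' "$sheet" | LC_ALL=C sort
}
```

It could be called as `count_fastqs_per_sample samples.fastq-raw.csv` from the project directory after running `sns/gather-fastqs`.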
diff --git a/routes/atac.md b/routes/atac.md
@@ -0,0 +1,12 @@
+# Route: atac
+
+ATAC-seq using Bowtie2 and MACS.
+
+Segments:
+
+* Align to the reference genome (Bowtie2).
+* Remove duplicate reads (Sambamba).
+* Generate genome browser tracks.
+* Call peaks (MACS).
+* Call nucleosomes (NucleoATAC).
+
diff --git a/routes/rna-snv.md b/routes/rna-snv.md
@@ -0,0 +1,4 @@
+# Route: rna-snv
+
+Variant detection in RNA-seq data.
+Can be run following `rna-star`.
diff --git a/routes/rna-snv.sh b/routes/rna-snv.sh
@@ -57,7 +57,12 @@ if [ -z "$fastq_R1" ] ; then
 	fastq_R1=$(grep -m 1 "^${sample}," "${proj_dir}/samples.${segment_fastq_clean}.csv" | cut -d ',' -f 2)
 	fastq_R2=$(grep -m 1 "^${sample}," "${proj_dir}/samples.${segment_fastq_clean}.csv" | cut -d ',' -f 3)
 fi
-[ "$fastq_R1" ] || exit 1
+
+# if FASTQ is not set, there was a problem
+if [ -z "$fastq_R1" ] ; then
+	echo -e "\n $script_name ERROR: $segment_fastq_clean DID NOT FINISH \n" >&2
+	exit 1
+fi
 
 # run STAR
 segment_align="align-star"
@@ -69,6 +74,12 @@ if [ -z "$bam_star" ] ; then
 fi
 [ "$bam_star" ] || exit 1
 
+# if STAR BAM is not set, there was a problem
+if [ -z "$bam_star" ] ; then
+	echo -e "\n $script_name ERROR: $segment_align DID NOT FINISH \n" >&2
+	exit 1
+fi
+
 # remove duplicates
 segment_dedup="bam-dedup-sambamba"
 bam_dd=$(grep -s -m 1 "^${sample}," "${proj_dir}/samples.${segment_dedup}.csv" | cut -d ',' -f 2)
@@ -77,7 +88,12 @@ if [ -z "$bam_dd" ] ; then
 	($bash_cmd)
 	bam_dd=$(grep -m 1 "^${sample}," "${proj_dir}/samples.${segment_dedup}.csv" | cut -d ',' -f 2)
 fi
-[ "$bam_dd" ] || exit 1
+
+# if deduplicated BAM is not set, there was a problem
+if [ -z "$bam_dd" ] ; then
+	echo -e "\n $script_name ERROR: $segment_dedup DID NOT FINISH \n" >&2
+	exit 1
+fi
 
 # add read groups
 segment_rg="bam-rg-picard"
@@ -87,7 +103,12 @@ if [ -z "$bam_rg" ] ; then
 	($bash_cmd)
 	bam_rg=$(grep -m 1 "^${sample}," "${proj_dir}/samples.${segment_rg}.csv" | cut -d ',' -f 2)
 fi
-[ "$bam_rg" ] || exit 1
+
+# if BAM with RGs is not set, there was a problem
+if [ -z "$bam_rg" ] ; then
+	echo -e "\n $script_name ERROR: $segment_rg DID NOT FINISH \n" >&2
+	exit 1
+fi
 
 # split CIGAR strings
 segment_splitncigar="bam-splitncigar-gatk"
@@ -99,6 +120,12 @@ if [ -z "$bam_split" ] ; then
 fi
 [ "$bam_split" ] || exit 1
 
+# if BAM with split CIGAR strings is not set, there was a problem
+if [ -z "$bam_split" ] ; then
+	echo -e "\n $script_name ERROR: $segment_splitncigar DID NOT FINISH \n" >&2
+	exit 1
+fi
+
 # on-target (exons) coverage
 segment_target_cov="qc-target-reads-gatk"
 bash_cmd="bash ${code_dir}/segments/${segment_target_cov}.sh $proj_dir $sample $threads $bam_split"
@@ -112,7 +139,12 @@ if [ -z "$bam_gatk" ] ; then
 	($bash_cmd)
 	bam_gatk=$(grep -m 1 "^${sample}," "${proj_dir}/samples.${segment_gatk}.csv" | cut -d ',' -f 2)
 fi
-[ "$bam_gatk" ] || exit 1
+
+# if GATK BAM is not set, there was a problem
+if [ -z "$bam_gatk" ] ; then
+	echo -e "\n $script_name ERROR: $segment_gatk DID NOT FINISH \n" >&2
+	exit 1
+fi
 
 # final average coverage
 segment_avg_cov="qc-coverage-gatk"
@@ -154,6 +186,19 @@ ${proj_dir}/summary.${segment_avg_cov}.csv \
 #########################
 
 
+# generate pairs sample sheet template
+
+samples_pairs_csv="${proj_dir}/samples.pairs.csv"
+
+if [ ! -s "$samples_pairs_csv" ] ; then
+	echo "#SAMPLE-T,#SAMPLE-N" > $samples_pairs_csv
+	sed 's/\,.*/,NA/g' ${proj_dir}/samples.fastq-raw.csv | LC_ALL=C sort -u >> $samples_pairs_csv
+fi
+
+
+#########################
+
+
 # delete empty qsub .po files
 rm -f ${qsub_dir}/sns.*.po*
 

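The pairs template step added at the end of `routes/rna-snv.sh` keeps only the sample name from each line of `samples.fastq-raw.csv`, fills the second column with `NA`, and removes duplicate names. A sketch of the same logic on throwaway data follows; the wrapper function `make_pairs_template` and the sample names are made up for illustration, and the sed expression uses a plain comma instead of the escaped one in the script.

```bash
#!/usr/bin/env bash
# Sketch only: the function name and the sample data are hypothetical.
make_pairs_template() {
	local proj_dir="$1"
	local samples_pairs_csv="${proj_dir}/samples.pairs.csv"
	# create the template only if the file is missing or empty
	if [ ! -s "$samples_pairs_csv" ] ; then
		echo "#SAMPLE-T,#SAMPLE-N" > "$samples_pairs_csv"
		# replace everything after the first comma with NA, then keep unique sample names
		sed 's/,.*/,NA/' "${proj_dir}/samples.fastq-raw.csv" | LC_ALL=C sort -u >> "$samples_pairs_csv"
	fi
}

# made-up project directory with one sample split across two lanes
demo_dir=$(mktemp -d)
printf 'tumor1,t1_L1_R1.fastq.gz,t1_L1_R2.fastq.gz\n' > "${demo_dir}/samples.fastq-raw.csv"
printf 'tumor1,t1_L2_R1.fastq.gz,t1_L2_R2.fastq.gz\n' >> "${demo_dir}/samples.fastq-raw.csv"
printf 'normal1,n1_R1.fastq.gz,n1_R2.fastq.gz\n' >> "${demo_dir}/samples.fastq-raw.csv"

make_pairs_template "$demo_dir"
cat "${demo_dir}/samples.pairs.csv"
```

With this input the template contains the header line followed by `normal1,NA` and `tumor1,NA`. The `NA` values are placeholders to be replaced manually with the matching normal sample names, and an existing non-empty `samples.pairs.csv` is left untouched on later runs.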
diff --git a/routes/rna-star-groups-dge.md b/routes/rna-star-groups-dge.md
@@ -0,0 +1,37 @@
+# Route: rna-star-groups-dge
+
+Differential gene expression using DESeq2 for the `rna-star` results.
+
+## Usage
+
+After individual samples are processed with the `rna-star` route,
+manually define proper group names in the `samples.groups.csv` sample sheet.
+
+Run the `rna-star-groups-dge` route from the same directory as `rna-star`.
+
+```
+sns/run rna-star-groups-dge
+```
+
+## Output
+
+The `rna-star-groups-dge` route will create a `DGE-DESeq2-*` directory with the results. The directory name will contain the strand (determined automatically) and the number of samples in the sample sheet. The sample sheet can be modified to exclude problematic samples or change groupings for an alternate analysis.
+
+Results:
+
+* `counts.raw.csv`: Matrix of raw counts.
+* `counts.norm.csv`: Matrix of normalized counts that can be used to check the expression levels of specific genes across samples.
+* `counts.norm.xlsx`: Matrix of normalized counts in Excel format to avoid potential auto-conversion of gene names.
+* `counts.vst.csv`: Matrix of counts after variance stabilizing transformation (VST) for clustering samples or other machine learning applications. These are log2-transformed and normalized with respect to library size. The point of VST is to remove the dependence of the variance on the mean.
+* `plot.pca.png`: PCA plot that shows the samples based on their first two principal components. Useful for visualizing the overall effect of experimental covariates and batch effects.
+* `dge.*`: Differential gene expression results between different groups.
+* `plot.heatmap.*`: Heatmaps based on differentially expressed genes using multiple cutoffs.
+
+Additional output:
+
+* `input.groups.csv`: Input sample sheet.
+* `input.counts.txt`: Input gene-sample matrix of raw counts.
+* `deseq2.dds.RData`: DESeq2 object (dds) that can be loaded and modified in R if more complex analysis is needed.
+* `deseq2.vsd.RData`: VST-transformed DESeq2 object (vsd) that can be loaded and modified in R if more complex analysis is needed.
+
+General pipeline info: https://github.com/igordot/sns
diff --git a/routes/rna-star.md b/routes/rna-star.md
@@ -0,0 +1,79 @@
+# Route: rna-star
+
+Alignment and quantification of RNA-seq data.
+
+Segments:
+
+* Align to the reference genome (STAR).
+* Align to other species and common contaminants (fastq_screen).
+* Generate normalized genome browser tracks.
+* Determine the distribution of the bases within the transcripts and 5'/3' biases (Picard).
+* Determine if the library is stranded and the strand orientation.
+* Generate genes-samples counts matrix (featureCounts).
+
+For differential expression analysis, follow by running `rna-star-groups-dge`.
+
+## Usage
+
+Navigate to a clean new project directory.
+
+```
+cd <project dir>
+```
+
+Download the code from GitHub.
+
+```
+git clone --depth 1 https://github.com/igordot/sns
+```
+
+Generate a sample sheet of FASTQ files (`samples.fastq-raw.csv`).
+
+```
+sns/gather-fastqs <fastq dir>
+```
+
+Specify a reference genome, such as `hg19` or `mm10` (stored in `settings.txt`).
+
+```
+sns/generate-settings <genome>
+```
+
+Run the `rna-star` route.
+
+```
+sns/run rna-star
+```
+
+Check for potential problems.
+
+```
+grep "ERROR:" logs-qsub/*
+```
+
+## Output
+
+Results:
+
+* `BAM-STAR`: BAM files. Can be used for visual inspection of individual reads or additional analysis.
+* `BIGWIG`: BigWig files normalized to the total number of reads. Can be used for visual inspection of relative expression levels.
+* `quant.featurecounts.counts.txt`: Matrix of raw counts for all genes and samples.
+
+Run metrics:
+
+* `summary-combined.rna-star.csv`: Summary table that includes the number of reads, unique and multi-mapping alignment rate, number of counts assigned to genes, fraction of coding/UTR/intronic/intergenic bases.
+* `summary.fastqscreen.png`: Alignment rates for common species and contaminants.
+* `summary.qc-picard-rnaseqmetrics.png`: Distribution of the bases within the transcripts to determine potential 5'/3' biases.
+
+Additional output (can usually be deleted or used for troubleshooting):
+
+* `logs-*`: Logs and intermediate files for various segments.
+* `samples.*.csv`: Sample sheets for segments that generate large files. If a file is already listed in a sample sheet, the route will not attempt to regenerate it, even if the file itself is missing. This way, older files can be deleted to save space and additional samples can still be added to the same analysis without reprocessing the older samples.
+* `summary`: Summary files for individual samples and segments.
+* `summary.*.csv`: Combined summary files for each segment.
+* `QC-*`: Results of QC steps for individual samples.
+* `FASTQ-CLEAN`: Merged FASTQs (one per sample).
+* `genes.featurecounts.txt`: Table of genes based on the reference GTF.
+* `quant-*`: Raw counts for all genes for individual samples.
+
+General pipeline info: https://github.com/igordot/sns
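The "Check for potential problems" step in this route documentation greps the qsub logs for `ERROR:`. If only the names of the affected log files are wanted, with a usable exit status for scripting, something along these lines might work. The function name `check_logs` is hypothetical and not part of sns; the `logs-qsub` default comes from the documented command.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of sns): list log files that contain "ERROR:".
check_logs() {
	local log_dir="${1:-logs-qsub}"
	local hits
	# -l prints only the names of matching files; -s silences unreadable or missing files
	hits=$(grep -l -s "ERROR:" "$log_dir"/* || true)
	if [ -n "$hits" ] ; then
		# print the affected files and signal failure to the caller
		echo "$hits"
		return 1
	fi
	echo "no errors found in ${log_dir}"
}
```

Running `check_logs logs-qsub` from the project directory prints the log files to inspect and returns a non-zero status when any are found.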
diff --git a/routes/rna-star.sh b/routes/rna-star.sh
@@ -71,7 +71,12 @@ if [ -z "$bam_star" ] ; then
 	($bash_cmd)
 	bam_star=$(grep -m 1 "^${sample}," "${proj_dir}/samples.${segment_align}.csv" | cut -d ',' -f 2)
 fi
-[ "$bam_star" ] || exit 1
+
+# if STAR BAM is not set, there was a problem
+if [ -z "$bam_star" ] ; then
+	echo -e "\n $script_name ERROR: $segment_align DID NOT FINISH \n" >&2
+	exit 1
+fi
 
 # generate BigWig (deeptools)
 segment_bigwig_deeptools="bigwig-deeptools"
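The look-up-then-fail check is repeated for every segment touched by this commit. One possible way to factor it out is sketched below. Both function names (`lookup_output`, `require_output`) are hypothetical and do not exist in sns; the sketch assumes the sample sheet keeps the output path in the second column, as the route scripts do.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of sns).

# print column 2 of the first sample sheet line for the given sample (empty if none)
lookup_output() {
	local sheet="$1" sample="$2"
	grep -s -m 1 "^${sample}," "$sheet" | cut -d ',' -f 2
}

# print an error in the same style as the route scripts if the value is empty
require_output() {
	local value="$1" segment="$2"
	if [ -z "$value" ] ; then
		echo -e "\n ${script_name:-route} ERROR: $segment DID NOT FINISH \n" >&2
		return 1
	fi
}
```

A route could then use `bam_star=$(lookup_output "${proj_dir}/samples.${segment_align}.csv" "$sample")` followed by `require_output "$bam_star" "$segment_align" || exit 1`, which keeps the error message visible to the `grep "ERROR:"` check.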