Version 2.0.0 release

Release v2.0.0 - Bring the MAGeCK

* CRISPR pipeline now with [MAGeCK](https://sourceforge.net/projects/mageck/) support.
* CRISPR pipeline casTLE now supports more than a single comparison. Still limited to a maximum of 2 replicates.
* A few fixes and some cleanup done on other pipelines.
* Now with better documentation!
emc2cube · Nov 1, 2019
commit 07aa495 · 1 parent aaad41b

Showing 17 changed files with 1,889 additions and 2,044 deletions.
diff --git a/.gitignore b/.gitignore
@@ -7,6 +7,7 @@ configs
 *.back
 *.bak
 *.old
+*2.sh
 Backup
 
 # Compiled source #

diff --git a/README.md b/README.md
@@ -1,103 +1,168 @@
-Bioinformatics
-==============
+# Bioinformatics pipelines, [SLURM](https://slurm.schedmd.com/overview.html) friendly
 
-High throughput sequencing scripts: bowtie2, GATK, etc...
+![GitHub package.json version](https://img.shields.io/github/package-json/v/emc2cube/Bioinformatics)
+![GitHub top language](https://img.shields.io/github/languages/top/emc2cube/Bioinformatics?color=green)
+![GitHub](https://img.shields.io/github/license/emc2cube/Bioinformatics?color=yellow)
+[![Runs on Sherlock](https://img.shields.io/badge/Runs_on-Sherlock-red)](https://www.sherlock.stanford.edu)
 
+> Set of high throughput sequencing analysis scripts to quickly generate and queue jobs on [SLURM](https://slurm.schedmd.com/overview.html)-based HPC clusters, such as [Stanford's Sherlock](https://www.sherlock.stanford.edu)🕵🏻‍♂️️
+>
+> Most scripts include some sort of failsafe: if a job fails it will be requeued once. This is useful in case of unexpected node failure.
+>
+> Currently available pipelines:
+> * Whole Exome Sequencing
+> * RNA Sequencing
+> * CRISPR screens
 
-Workflow scripts:
------------------
 
+## sh_WES.sh
 
-## sh_WES.sh (SLURM compatible)
+This script will process fastq(.gz) files and align them to a reference genome using bowtie2.
+It will then use Picard and GATK following the June 2016 best practices workflow.
+SNPs will then be annotated using ANNOVAR.
 
-Usage: sh_WES.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
+See the [WES.ini](https://github.com/emc2cube/Bioinformatics/blob/master/config_WES.ini) configuration file for all available options and settings.
 
-# Description
+Options:
+* --help : Display help message.
+* --version : Display version number.
 
-This script will process fastq(.gz) files and align them to a reference genome using bowtie2.
-It will then use Picard and GATK following GATK according to June 2016 best practices workflow.
-SNPs will then be annotated with ANNOVAR.
-Include a failsafe, if a job fails, it will be requeued once in case of a hardware failure.
+### Usage
 
-# Options
+```sh
+sh_WES.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
+```
 
-Can call trimmomatic, FastQC and compute coverage.
-Settings can be modified by using a customized config_WES.ini file.
 
-## sh_RNAseq.sh (SLURM compatible)
+## sh_RNAseq.sh
 
-Usage: sh_RNAseq.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
+This script will process fastq(.gz) files and align them to a reference genome using either STAR (recommended), hisat2 or tophat2.
+If STAR is used then RSEM will also be used and differential expression will be analyzed using DESeq2.
+Differential expression can also be computed using cufflinks (cufflinks is pretty much deprecated, should be avoided unless trying to reproduce old results).
 
-# Description
+See the [RNAseq.ini](https://github.com/emc2cube/Bioinformatics/blob/master/config_RNAseq.ini) configuration file for all available options and settings.
 
-This script will process fastq(.gz) files and align them to a reference genome using either STAR (recommended), hishat2 or tophat2.
-Differential expression will then be computed using cufflinks.
-If STAR is used then RSEM will also be used to generate gene read counts, pairwise comparison matrices will be created and DESeq2 analysis will be performed.
-Include a failsafe, if a job fails, it will be requeued once in case of a hardware failure.
+Options:
+* --help : Display help message.
+* --version : Display version number.
 
-# Options
+### Usage
 
-Can call trimmomatic and FastQC.
-Settings can be modified by using a customized config_RNAseq.ini file.
+```sh
+sh_RNAseq.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
+```
 
-## sh_bowtie2_AlignAll.sh (deprecated)
 
-Usage: sh_bowtie2_AlignAll.sh </path/to/fastq(.gz)/folder> </path/to/Aligned(.bam)/destination/folder> [/path/to/config/file.ini]
+## sh_CRISPR.sh
 
-# Description
+This script will process the fastq(.gz) files generated in a typical CRISPR screen using either [casTLE](https://bitbucket.org/dmorgens/castle/) or [MAGeCK](https://sourceforge.net/projects/mageck/).
+* If using casTLE, a reference file of all the indices will be automatically created using bowtie (NOT bowtie2). It will then analyze the screen and generate basic graphs.
+* If using MAGeCK, counts, tests, mle and pathway analysis will be performed. It will also run the [R](https://www.r-project.org) package "[MAGeCKFlute](https://bioconductor.org/packages/release/bioc/html/MAGeCKFlute.html)" and in all cases generate basic graphs.
 
-This script will convert fastq files to bowtie2 aligned .bam files.
-Optional: Can call a trimming program (trimmomatic, Trim Galore or your own script).
-This script will, for all samples in input folder:
-- convert .fastq or .fastq.gz files to .sam.
-- align .sam to reference genome.
-- convert .sam to .bam.
-- Sort and index .bam file.
+See the [CRISPR.ini](https://github.com/emc2cube/Bioinformatics/blob/master/config_CRISPR.ini) configuration file for all available options and settings.
 
-## sh_gatkSNPcalling.sh (deprecated)
+Options:
+* --help : Display help message.
+* --version : Display version number.
 
-Usage: sh_gatkSNPcalling.sh </path/to/Aligned(.bam)/destination/folder> </path/to/SNPsCalled/folder> [/path/to/config/file.ini]
+Dependencies:
+[csvkit](https://csvkit.readthedocs.io/en/latest/) should be installed on your system in a location included in your $PATH.
 
-# Description
 
-This script will process aligned .bam files.
-Optional: Will first remove duplicate reads.
-This script will, for all samples:
-- perform a local realignment around known indels.
-- perform a quality score recalibration.
-- generate .g.vcf file using HaplotypeCaller.
-- Optional: stop here
-- perform joint genotyping
-- Filter variants using VQSR.
-- annotate using annovar.
-- do some cleaning on .csv for an easy downloadable file.
-- Optional: Can trigger an IFTTT event using the maker channel.
+### Usage
 
+```sh
+sh_CRISPR.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
+```
 
-## sh_FastQToSNPsCall.sh (deprecated)
+### Python 3.6 compatibility
 
-Usage: sh_FastQToSNPsCall.sh </path/to/fastq(.gz)/folder> </path/to/Aligned(.bam)/destination/folder> </path/to/SNPsCalled/folder> [/path/to/config/file.ini]
+For easy integration alongside MAGeCK, or any other modern tool, a python 3.6+ compatible version of casTLE is included.
+This is based on [casTLE commit 981d6d8](https://bitbucket.org/dmorgens/castle/commits/981d6d877c0fe3ee233e9fd977b13800987a032c) and may not be up to date.
+You still need to download the whole [casTLE repository](https://bitbucket.org/dmorgens/castle/) even if you end up switching the scripts with their python 3.6+ compatible version.
 
-# Description
 
-Will call sh_bowtie2_AlignAll.sh to convert fastq to aligned .bam and then launch sh_gatkSNPcalling.sh to call SNPs with GATK and annotate them with ANNOVAR.
+## sh_md5alldir.sh
 
+This script will process all sub-directories of the input folders and for each of them will create a <directory_name>.md5 file if it does not exist yet, or check <directory> files against the existing <directory_name>.md5 file.
 
-Utilities scripts:
-------------------
+Options:
+* -f or --force : even if there is already a <directory>.md5 file, it will be replaced by a new <directory>.md5 file.
+* --help : Display help message.
+* --version : Display version number.
 
+### Usage
 
-## sh_md5alldir.sh
+```sh
+sh_md5alldir.sh </path/to/dir/> [OPTIONS]
+```
+
+
+## sh_sha1alldir.sh
+
+This script will process all sub-directories of the input folders and for each of them will create a <directory_name>.sha1 file if it does not exist yet, or check <directory> files against the existing <directory_name>.sha1 file.
+
+Options:
+* -f or --force : even if there is already a <directory>.sha1 file, it will be replaced by a new <directory>.sha1 file.
+* --help : Display help message.
+* --version : Display version number.
+
+### Usage
+
+```sh
+sh_sha1alldir.sh </path/to/dir/> [OPTIONS]
+```
+
+
+## sh_ACMGfilter.sh
+
+This script will look for an annovar .snps.exome_summary.csv file and generate a list of all SNPs found in the ACMG guidelines in a new ACMG_genes.csv file.
+This file can be directly sent to a clinician for incidental findings reports, if required.
+
+Options:
+* --help : Display help message.
+* --version : Display version number.
+
+### Usage
+
+```sh
+sh_ACMGfilter.sh </path/to/.csv/containing/folder> [/path/to/destination/folder]
+```
+
+
+## sh_mergeFastQ.sh
+
+Simple script to consolidate fragmented .fastq files from different sequencing lanes.
+Original files will be backed up in a FastQbackup folder.
+
+Options:
+* --help : Display help message.
+* --version : Display version number.
+
+### Usage
+
+```sh
+sh_mergeFastQ.sh </path/to/fastq(.gz)/folder>
+```
+
+
+## Author(s) contributions
+
+👤 **Julien Couthouis**
+
+*Initial work and releases*
+
+* Linkedin: [@jcouthouis](https://www.linkedin.com/in/jcouthouis/)
+* Github: [@emc2cube](https://github.com/emc2cube)
+
+
+## Show your support
 
-Usage: sh_md5alldir.sh </path/to/dir/> [-options, -? or --help for help]
+Give a ![GitHub stars](https://img.shields.io/github/stars/emc2cube/Bioinformatics?style=social) if this project helped you!
 
-# Description
 
-This script will process all sub-directories of the input folders and for each of them
-will create a <directory_name>.md5 file if it does not exist yet, or check <directory> files
-against the existing <directory_name>.md5 file.
+## License
 
-# Options:
+Copyright © 2019 [Julien Couthouis](https://github.com/emc2cube).
 
--f or --force : even if a <directory>.md5 file is detected, will replace it by a fresh one
-and will not check files against it.
+This project is [EUPL-1.2](https://github.com/emc2cube/Bioinformatics/blob/master/LICENSE) licensed.
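
The create-or-check behaviour that the README describes for sh_md5alldir.sh can be sketched as follows. This is a minimal illustration and not the script's actual code: the function name `md5_dir` is hypothetical, the checksum file is assumed to sit next to the directory, and GNU `md5sum` is assumed to be available.

```sh
# Hypothetical sketch: create <directory>.md5 if it is missing,
# otherwise verify the directory's files against the existing checksum file.
md5_dir() {
    target="${1%/}"            # strip a trailing slash, if any
    sumfile="${target}.md5"    # assumed location: next to the directory
    if [ -f "$sumfile" ]; then
        # Existing checksums: report only files that fail verification.
        md5sum -c --quiet "$sumfile"
    else
        # No checksum file yet: hash every regular file in the directory.
        find "$target" -type f -exec md5sum {} + > "$sumfile"
    fi
}
```

Paths are recorded relative to the working directory, so in this sketch the check has to be run from the same place the checksums were created. A `--force` style option would simply remove `$sumfile` before calling the function.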
diff --git a/casTLE_Scripts_py36.tgz b/casTLE_Scripts_py36.tgz
diff --git a/config_CRISPR.ini b/config_CRISPR.ini
@@ -15,7 +15,7 @@
 #
 # Maximum number of threads (or CPUs) to request and allocate to programs.
 # In some case less than this value may automatically be allowed.
-threads=`nproc --all --ignore=1`
+threads=$(nproc --all --ignore=1)
 #
 # Maximum amount of memory (in GB) to request and allocate to programs.
 # In some case less than this value may automatically be allowed.
@@ -44,28 +44,36 @@ SLURMqos=""
 # Use it to load modules for example
 customcmd=""
 #
+# day0-label
+# Specify the label for control sample (usually day 0 or plasmid).
+# If using MAGeCK it will also turn on the negative selection QC for every other sample label.
+# The negative selection QC will compare each other sample with day0 sample, and thus estimate the degree of negative selections in essential genes.
+# If using casTLE, only the first TWO replicates will be used.
+# day0="Plasmid,Control_t0_rep1,Control_t0_rep2,Treated_t0_rep1,Treated_t0_rep2"
+day0=""
 #
-## bowtie options
-#
-# path to bowtie, bowtie-build needs to be in the same folder (probably is the case)
-bowtie="/usr/local/bin/bowtie"
+# Tests groups
+# Enter your sample names (not including ".fastq" or ".fastq.gz") for comparisons.
+# Separate replicates with a comma and groups with a space.
+# If using casTLE, only the first TWO replicates will be used.
+# testgroups="Control_rep1,Control_rep2 Treated_condA_rep1,Treated_condA_rep2 Treated_condB_rep1,Treated_condB_rep2"
+testgroups=""
 #
-# Type of screen. Will be used to create Indices for the guides.
-screentype="Cas9-10"
 #
-# Name of the guides index file. Will be saved in the Indices folder.
-# It will overwrite any files with this name prefix.
-outputbowtieindex=""
-#
-# Oligo file location.
-# Leave empty if it was previously used and the corresponding Index are already generated for this type of screen.
-oligofile=""
+## casTLE options
+# download the last version from https://bitbucket.org/dmorgens/castle/
 #
+# Use casTLE?
+# 0 = No ; 1 = Yes
+usecastle="0"
 #
-## casTLE options
+# Python version?
+# Are you using python2.7 (original) or python3 (included in this repo) casTLE scripts?
+# If you also use MAGeCK python3 is REQUIRED
+# use "python" or "python3"
+python="python3"
 #
 # casTLE folder location
-# download the last version from https://bitbucket.org/dmorgens/castle/
 castlepath="/home/user/scripts/dmorgens-castle/"
 #
 # Number of permutations to generate p-values.
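
The `testgroups` format introduced in this hunk (replicates separated by commas, groups separated by spaces) could be split in shell roughly as shown here. This is a sketch under assumptions rather than the code used by sh_CRISPR.sh: the helper name `first_two_replicates` and the sample names are hypothetical, and the two-replicate cap mirrors the casTLE note in the config comments.

```sh
# Hypothetical helper: keep at most the first two comma-separated replicates.
# cut prints a line without any comma unchanged, so single-replicate groups pass through.
first_two_replicates() {
    printf '%s\n' "$1" | cut -d',' -f1-2
}

# Illustrative value; the real one comes from config_CRISPR.ini.
testgroups="Control_rep1,Control_rep2,Control_rep3 Treated_rep1,Treated_rep2"

# $testgroups is left unquoted on purpose so the shell splits groups on spaces.
for group in $testgroups; do
    echo "group: $(first_two_replicates "$group")"
done
```

Because groups are split on whitespace, sample names containing spaces would not survive this format.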
@@ -87,10 +95,45 @@ graphformat="pdf"
 # 0 = No ; 1 = Yes
 mouse="0"
 #
-# Enter your sample names (not including .fastq or .fastq.gz) for comparisons.
-# You should only have 4 samples, organized in 2 pairs
-# analyzecounts="Untreated1,Treated1 Untreated2,Treated2"
-analyzecounts=""
+## bowtie options
+# path to bowtie, bowtie-build needs to be in the same folder (probably is the case)
+bowtie="/usr/local/bin/bowtie"
+#
+# Type of screen. Will be used to create Indices for the guides.
+screentype="Cas9-10"
+#
+# Name of the guides index file. Will be saved in the Indices folder.
+# It will overwrite any files with this name prefix.
+outputbowtieindex=""
+#
+# Oligo file location.
+# Leave empty if it was previously used and the corresponding Index are already generated for this type of screen.
+oligofile=""
+#
+#
+## MAGeCK options
+#
+# Use MAGeCK?
+# 0 = No ; 1 = Yes
+usemageck="0"
+#
+# MAGeCK list of sgRNA names (see https://sourceforge.net/p/mageck/wiki/input/#sgrna-library-file ) location.
+magecksgRNAlibrary=""
+#
+# Use the reverse complement of the MAGeCK list of sgRNA names
+# 0 = No ; 1 = Yes
+mageckrevcomplib="0"
+#
+# MAGeCK list of control sgRNA names (see https://sourceforge.net/p/mageck/wiki/input/#negative-control-sgrna-list ) location.
+mageckcontrolsgrna=""
+#
+# GMT file for MAGeCK pathway analysis (see https://sourceforge.net/p/mageck/wiki/input/#pathway-file-gmt ) location
+gmtfile=""
+#
+# Matrix file for mle analysis ( see https://sourceforge.net/p/mageck/wiki/input/#design-matrix-file ) location.
+# While this is optional, it is highly recommended, as the mle tool is otherwise very prone to crashing (it can still crash with a matrix, but less often).
+# By default will look for a "matrix.txt" file stored with the FastQ files.
+matrixfile=$([ -f ${dir}/matrix.txt ] && echo "${dir}/matrix.txt")
 #
 #
 ## IFTTT options
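
The `matrixfile` default above sets the variable to the path of `matrix.txt` when that file exists and leaves it empty otherwise. The sketch below shows the same idea as a variant, with the path quoted (so directories containing spaces work) and written with `if` so the assignment returns a zero status even when the file is absent, which matters if a script runs under `set -e`. The names `missing` and `found` are illustrative, and `dir` here is a temporary directory standing in for the FastQ folder.

```sh
# Demonstration directory (hypothetical stand-in for the FastQ input folder).
dir=$(mktemp -d)

# matrix.txt does not exist yet: nothing is echoed, the variable stays empty.
missing=$(if [ -f "${dir}/matrix.txt" ]; then echo "${dir}/matrix.txt"; fi)
echo "before: '${missing}'"

# Create the file and evaluate the same expression again.
touch "${dir}/matrix.txt"
found=$(if [ -f "${dir}/matrix.txt" ]; then echo "${dir}/matrix.txt"; fi)
echo "after: '${found}'"

# Clean up the demonstration directory.
rm -r "${dir}"
```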

diff --git a/config_RNAseq.ini b/config_RNAseq.ini
@@ -15,7 +15,7 @@
 #
 # Maximum number of threads (or CPUs) to request and allocate to programs.
 # In some case less than this value may automatically be allowed.
-threads=`nproc --all --ignore=1`
+threads=$(nproc --all --ignore=1)
 #
 # Maximum amount of memory (in GB) to request and allocate to programs.
 # In some case less than this value may automatically be allowed.

diff --git a/config_WES.ini b/config_WES.ini
@@ -15,7 +15,7 @@
 #
 # Maximum number of threads (or CPUs) to request and allocate to programs.
 # In some case less than this value may automatically be allowed.
-threads=`nproc --all --ignore=1`
+threads=$(nproc --all --ignore=1)
 #
 # Maximum amount of memory (in GB) to request and allocate to programs.
 # In some case less than this value may automatically be allowed.
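
All three config files switch `threads` from backticks to `$(...)`. Both forms are command substitution and produce the same value; the practical difference is that `$(...)` nests without escaping. A small sketch, using `echo` in place of `nproc` so that it runs anywhere; the variable names are illustrative only.

```sh
# Both forms capture the command's standard output.
old_style=`echo 4`
new_style=$(echo 4)

# Nesting requires escaped backticks in the old form, but not with $(...).
nested_old=`echo \`echo inner\``
nested_new=$(echo $(echo inner))

echo "$old_style $new_style $nested_old $nested_new"
# prints "4 4 inner inner"
```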

diff --git a/package.json b/package.json
@@ -1,18 +1,23 @@
 {
   "name": "emc2cube-bioinformatics",
-  "version": "1.0.0",
-  "description": "High throughput sequencing scripts: bowtie2, GATK, etc...",
+  "version": "2.0.0",
+  "description": "Set of high throughput sequencing analysis scripts to quickly generate and queue jobs on SLURM-based HPC clusters",
   "repository": {
     "type": "git",
     "url": "git+https://github.com/emc2cube/Bioinformatics.git"
   },
   "keywords": [
     "NGS",
+    "Next Generation Sequencing",
     "SLURM",
+    "HPC",
     "pipeline",
     "WES",
+    "Whole Exome Sequencing",
     "RNAseq",
-    "CRISPR"
+    "RNA Sequencing",
+    "CRISPR",
+    "CRISPR screens"
   ],
   "author": "Julien Couthouis",
   "license": "EUPL-1.2+",