Version 2.0.0 release
Release v2.0.0 - Bring the MAGeCK

* CRISPR pipeline now with [MAGeCK](https://sourceforge.net/projects/mageck/) support.
* CRISPR pipeline casTLE now supports more than a single comparison. Still limited to a maximum of 2 replicates.
* A few fixes and cleaning done on other pipelines
* Now with better documentation!
emc2cube committed Nov 1, 2019
1 parent aaad41b commit 07aa495
Showing 17 changed files with 1,889 additions and 2,044 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -7,6 +7,7 @@ configs
*.back
*.bak
*.old
*2.sh
Backup

# Compiled source #
195 changes: 130 additions & 65 deletions README.md
@@ -1,103 +1,168 @@
Bioinformatics
==============
# Bioinformatics pipelines, [SLURM](https://slurm.schedmd.com/overview.html) friendly

High throughput sequencing scripts: bowtie2, GATK, etc...
![GitHub package.json version](https://img.shields.io/github/package-json/v/emc2cube/Bioinformatics)
![GitHub top language](https://img.shields.io/github/languages/top/emc2cube/Bioinformatics?color=green)
![GitHub](https://img.shields.io/github/license/emc2cube/Bioinformatics?color=yellow)
[![Runs on Sherlock](https://img.shields.io/badge/Runs_on-Sherlock-red)](https://www.sherlock.stanford.edu)

> Set of high throughput sequencing analysis scripts to quickly generate and queue jobs on [SLURM](https://slurm.schedmd.com/overview.html)-based HPC clusters, such as [Stanford's Sherlock](https://www.sherlock.stanford.edu)🕵🏻‍♂️️
>
> Most scripts include some sort of failsafe: if a job fails it will be requeued once. This is useful in case of unexpected node failure.
>
> Currently available pipelines:
> * Whole Exome Sequencing
> * RNA Sequencing
> * CRISPR screens
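
The "requeued once" failsafe described above can be sketched as follows. This is a minimal illustration and not the code used by the pipelines: `should_requeue` and `run_with_failsafe` are hypothetical helper names, and the sketch assumes the job was submitted so that SLURM allows requeueing. It relies on `SLURM_RESTART_COUNT`, which SLURM sets on jobs that have been restarted, and on `scontrol requeue`.

```sh
#!/bin/sh
# Hypothetical sketch of a "requeue once" failsafe (not the pipelines' own code).

# Succeeds only if the job has not been requeued before.
# $1: number of previous restarts (0 on the first run).
should_requeue() {
    [ "${1:-0}" -lt 1 ]
}

# Run a command; if it fails inside a SLURM job that was never restarted,
# ask SLURM to requeue that job. Always reports the command's failure.
run_with_failsafe() {
    "$@" && return 0
    if [ -n "${SLURM_JOB_ID:-}" ] && should_requeue "${SLURM_RESTART_COUNT:-0}"; then
        scontrol requeue "${SLURM_JOB_ID}"
    fi
    return 1
}

# Example use inside a job script (placeholder command):
#   run_with_failsafe my_alignment_step --input sample.fastq.gz
```

Outside of a SLURM job `SLURM_JOB_ID` is unset, so the wrapper only reports the failure and never calls `scontrol`.
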
Workflow scripts:
-----------------

## sh_WES.sh

## sh_WES.sh (SLURM compatible)
This script will process fastq(.gz) files and align them to a reference genome using bowtie2.
It will then use Picard and GATK following the June 2016 best practices workflow.
SNPs will then be annotated using ANNOVAR.

Usage: sh_WES.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
See the [WES.ini](https://github.com/emc2cube/Bioinformatics/blob/master/config_WES.ini) configuration file for all available options and settings.

# Description
Options:
* --help : Display help message.
* --version : Display version number.

This script will process fastq(.gz) files and align them to a reference genome using bowtie2.
It will then use Picard and GATK according to the June 2016 best practices workflow.
SNPs will then be annotated with ANNOVAR.
Includes a failsafe: if a job fails, it will be requeued once in case of a hardware failure.
### Usage

# Options
```sh
sh_WES.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
```

Can call trimmomatic, FastQC and compute coverage.
Settings can be modified by using a customized config_WES.ini file.

## sh_RNAseq.sh (SLURM compatible)
## sh_RNAseq.sh

Usage: sh_RNAseq.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
This script will process fastq(.gz) files and align them to a reference genome using either STAR (recommended), hisat2 or tophat2.
If STAR is used then RSEM will also be used and differential expression will be analyzed using DESeq2.
Differential expression can also be computed using cufflinks (cufflinks is largely deprecated and should be avoided unless you are trying to reproduce old results).

# Description
See the [RNAseq.ini](https://github.com/emc2cube/Bioinformatics/blob/master/config_RNAseq.ini) configuration file for all available options and settings.

This script will process fastq(.gz) files and align them to a reference genome using either STAR (recommended), hisat2 or tophat2.
Differential expression will then be computed using cufflinks.
If STAR is used then RSEM will also be used to generate gene read counts, pairwise comparison matrices will be created and DESeq2 analysis will be performed.
Includes a failsafe: if a job fails, it will be requeued once in case of a hardware failure.
Options:
* --help : Display help message.
* --version : Display version number.

# Options
### Usage

Can call trimmomatic and FastQC.
Settings can be modified by using a customized config_RNAseq.ini file.
```sh
sh_RNAseq.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
```

## sh_bowtie2_AlignAll.sh (deprecated)

Usage: sh_bowtie2_AlignAll.sh </path/to/fastq(.gz)/folder> </path/to/Aligned(.bam)/destination/folder> [/path/to/config/file.ini]
## sh_CRISPR.sh

# Description
This script will process the fastq(.gz) files generated in a typical CRISPR screen using either [casTLE](https://bitbucket.org/dmorgens/castle/) or [MAGeCK](https://sourceforge.net/projects/mageck/).
* If using casTLE, a reference file of all the indices will be automatically created using bowtie (NOT bowtie2). It will then analyze the screen and generate basic graphs.
* If using MAGeCK, count, test, mle and pathway analyses will be performed. It will also run the [R](https://www.r-project.org) package "[MAGeCKFlute](https://bioconductor.org/packages/release/bioc/html/MAGeCKFlute.html)" and in all cases generate basic graphs.

This script will convert fastq files to bowtie2 aligned .bam files.
Optional: Can call a trimming program (trimmomatic, Trim Galore or your own script).
This script will, for all samples in input folder:
- convert .fastq or .fastq.gz files to .sam.
- align .sam to reference genome.
- convert .sam to .bam.
- Sort and index .bam file.
See the [CRISPR.ini](https://github.com/emc2cube/Bioinformatics/blob/master/config_CRISPR.ini) configuration file for all available options and settings.

## sh_gatkSNPcalling.sh (deprecated)
Options:
* --help : Display help message.
* --version : Display version number.

Usage: sh_gatkSNPcalling.sh </path/to/Aligned(.bam)/destination/folder> </path/to/SNPsCalled/folder> [/path/to/config/file.ini]
Dependencies:
[csvkit](https://csvkit.readthedocs.io/en/latest/) should be installed on your system in a location included in your $PATH.

# Description

This script will process aligned .bam files.
Optional: Will first remove duplicate reads.
This script will, for all samples:
- perform a local realignment around known indels.
- perform a quality score recalibration.
- generate .g.vcf file using HaplotypeCaller.
- Optional: stop here
- perform joint genotyping
- Filter variants using VQSR.
- annotate using annovar.
- do some cleaning on .csv for an easy downloadable file.
- Optional: Can trigger an IFTTT event using the maker channel.
### Usage

```sh
sh_CRISPR.sh </path/to/fastq(.gz)/folder> </path/to/destination/folder> [/path/to/config/file.ini]
```

## sh_FastQToSNPsCall.sh (deprecated)
### Python 3.6 compatibility

Usage: sh_FastQToSNPsCall.sh </path/to/fastq(.gz)/folder> </path/to/Aligned(.bam)/destination/folder> </path/to/SNPsCalled/folder> [/path/to/config/file.ini]
For easy integration alongside MAGeCK, or any other modern tool, a Python 3.6+ compatible version of casTLE is included.
This is based on [casTLE commit 981d6d8](https://bitbucket.org/dmorgens/castle/commits/981d6d877c0fe3ee233e9fd977b13800987a032c) and may not be up to date.
You still need to download the whole [casTLE repository](https://bitbucket.org/dmorgens/castle/) even if you end up replacing the scripts with their Python 3.6+ compatible versions.

# Description

Will call sh_bowtie2_AlignAll.sh to convert fastq to aligned .bam and then launch sh_gatkSNPcalling.sh to call SNPs with GATK and annotate them with ANNOVAR.
## sh_md5alldir.sh

This script will process all sub-directories of the input folders and for each of them will create a <directory_name>.md5 file if it does not exist yet, or check <directory> files against the existing <directory_name>.md5 file.

Utilities scripts:
------------------
Options:
* -f or --force : even if there is already a <directory>.md5 file, it will be replaced by a new <directory>.md5 file.
* --help : Display help message.
* --version : Display version number.

### Usage

## sh_md5alldir.sh
```sh
sh_md5alldir.sh </path/to/dir/> [OPTIONS]
```
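
The create-or-check behaviour described above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the function name `md5_all_dirs` is hypothetical, the checksum file is assumed to live inside each sub-directory, and GNU `md5sum` is assumed to be available.

```sh
#!/bin/sh
# Hypothetical sketch of the create-or-check logic (not sh_md5alldir.sh itself).
# $1: parent folder; $2: 1 to force regeneration (like -f), default 0.
md5_all_dirs() {
    root="$1"
    force="${2:-0}"
    for sub in "${root}"/*/; do
        [ -d "${sub}" ] || continue
        name=$(basename "${sub}")
        (
            cd "${sub}" || exit 1
            if [ "${force}" = "1" ] || [ ! -f "${name}.md5" ]; then
                # Hash every regular file except the checksum file itself.
                find . -type f ! -name "${name}.md5" -exec md5sum {} + > "${name}.md5"
                echo "${name}: created"
            elif md5sum --quiet -c "${name}.md5"; then
                echo "${name}: OK"
            else
                echo "${name}: FAILED"
            fi
        )
    done
}
```

Each sub-directory is handled in a subshell so that the `cd` does not leak into the next iteration. The same pattern would apply to SHA-1 by swapping in `sha1sum`.
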


## sh_sha1alldir.sh

This script will process all sub-directories of the input folders and for each of them will create a <directory_name>.sha1 file if it does not exist yet, or check <directory> files against the existing <directory_name>.sha1 file.

Options:
* -f or --force : even if there is already a <directory>.sha1 file, it will be replaced by a new <directory>.sha1 file.
* --help : Display help message.
* --version : Display version number.

### Usage

```sh
sh_sha1alldir.sh </path/to/dir/> [OPTIONS]
```


## sh_ACMGfilter.sh

This script will look for an annovar .snps.exome_summary.csv file and generate a list of all SNPs found in the ACMG guidelines in a new ACMG_genes.csv file.
This file can be directly sent to a clinician for incidental findings reports, if required.

Options:
* --help : Display help message.
* --version : Display version number.

### Usage

```sh
sh_ACMGfilter.sh </path/to/.csv/containing/folder> [/path/to/destination/folder]
```


## sh_mergeFastQ.sh

Simple script to consolidate fragmented .fastq files from different sequencing lanes.
Original files will be backed up in a FastQbackup folder.

Options:
* --help : Display help message.
* --version : Display version number.

### Usage

```sh
sh_mergeFastQ.sh </path/to/fastq(.gz)/folder>
```
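
As an illustration of the lane consolidation described above, here is a minimal sketch. It is not the script's actual code: `merge_lanes` is a hypothetical name, and the `<sample>_L00N_<rest>` file naming is an assumed Illumina-style pattern that may not match your files. Concatenating gzip files with `cat` yields a valid gzip stream, so the same loop works for `.fastq.gz`.

```sh
#!/bin/sh
# Hypothetical sketch of merging per-lane FASTQ files (not sh_mergeFastQ.sh itself).
# Assumes files are named <sample>_L00N_<rest>, e.g. S1_L002_R1.fastq.gz.
merge_lanes() {
    dir="$1"
    mkdir -p "${dir}/FastQbackup"
    for lane_file in "${dir}"/*_L00[0-9]_*.fastq*; do
        [ -f "${lane_file}" ] || continue
        base=$(basename "${lane_file}")
        # Drop the lane token: S1_L002_R1.fastq.gz -> S1_R1.fastq.gz
        merged=$(printf '%s\n' "${base}" | sed 's/_L00[0-9]_/_/')
        # Globs expand in sorted order, so lanes are appended L001, L002, ...
        cat "${lane_file}" >> "${dir}/${merged}"
        # Keep the original per-lane file in the backup folder.
        mv "${lane_file}" "${dir}/FastQbackup/"
    done
}
```

Because the loop appends, running it twice on restored lane files would duplicate reads in the merged output; a real script would need to guard against that.
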


## Author(s) contributions

👤 **Julien Couthouis**

*Initial work and releases*

* Linkedin: [@jcouthouis](https://www.linkedin.com/in/jcouthouis/)
* Github: [@emc2cube](https://github.com/emc2cube)


## Show your support

Usage: sh_md5alldir.sh </path/to/dir/> [-options, -? or --help for help]
Give a ![GitHub stars](https://img.shields.io/github/stars/emc2cube/Bioinformatics?style=social) if this project helped you!

# Description

This script will process all sub-directories of the input folders and for each of them
will create a <directory_name>.md5 file if it does not exist yet, or check <directory> files
against the existing <directory_name>.md5 file.
## License

# Options:
Copyright © 2019 [Julien Couthouis](https://github.com/emc2cube).

-f or --force : even if a <directory>.md5 file is detected, will replace it by a fresh one
and will not check files against it.
This project is [EUPL-1.2](https://github.com/emc2cube/Bioinformatics/blob/master/LICENSE) licensed.
Binary file added casTLE_Scripts_py36.tgz
Binary file not shown.
83 changes: 63 additions & 20 deletions config_CRISPR.ini
100755 → 100644
@@ -15,7 +15,7 @@
#
# Maximum number of threads (or CPUs) to request and allocate to programs.
# In some cases less than this value may automatically be allowed.
threads=`nproc --all --ignore=1`
threads=$(nproc --all --ignore=1)
#
# Maximum amount of memory (in GB) to request and allocate to programs.
# In some cases less than this value may automatically be allowed.
@@ -44,28 +44,36 @@ SLURMqos=""
# Use it to load modules for example
customcmd=""
#
# day0-label
# Specify the label for control sample (usually day 0 or plasmid).
# If using MAGeCK it will also turn on the negative selection QC for every other sample label.
# The negative selection QC will compare each other sample with day0 sample, and thus estimate the degree of negative selections in essential genes.
# If using casTLE, only the first TWO replicates will be used.
# day0="Plasmid,Control_t0_rep1,Control_t0_rep2,Treated_t0_rep1,Treated_t0_rep2"
day0=""
#
## bowtie options
#
# path to bowtie, bowtie-build needs to be in the same folder (probably is the case)
bowtie="/usr/local/bin/bowtie"
# Tests groups
# Enter your sample names (not including ".fastq" or ".fastq.gz") for comparisons.
# Separate replicates with a comma and groups with a space.
# If using casTLE, only the first TWO replicates will be used.
# testgroups="Control_rep1,Control_rep2 Treated_condA_rep1,Treated_condA_rep2 Treated_condB_rep1,Treated_condB_rep2"
testgroups=""
#
# Type of screen. Will be used to create Indices for the guides.
screentype="Cas9-10"
#
# Name of the guides index file. Will be saved in the Indices folder.
# It will overwrite any files with this name prefix.
outputbowtieindex=""
#
# Oligo file location.
# Leave empty if it was previously used and the corresponding Index are already generated for this type of screen.
oligofile=""
## casTLE options
# Download the latest version from https://bitbucket.org/dmorgens/castle/
#
# Use casTLE?
# 0 = No ; 1 = Yes
usecastle="0"
#
## casTLE options
# Python version?
# Are you using python2.7 (original) or python3 (included in this repo) casTLE scripts?
# If you also use MAGeCK python3 is REQUIRED
# use "python" or "python3"
python="python3"
#
# casTLE folder location
# download the last version from https://bitbucket.org/dmorgens/castle/
castlepath="/home/user/scripts/dmorgens-castle/"
#
# Number of permutations to generate p-values.
@@ -87,10 +95,45 @@ graphformat="pdf"
# 0 = No ; 1 = Yes
mouse="0"
#
# Enter your sample names (not including .fastq or .fastq.gz) for comparisons.
# You should only have 4 samples, organized in 2 pairs
# analyzecounts="Untreated1,Treated1 Untreated2,Treated2"
analyzecounts=""
## bowtie options
# Path to bowtie; bowtie-build needs to be in the same folder (this is usually the case).
bowtie="/usr/local/bin/bowtie"
#
# Type of screen. Will be used to create Indices for the guides.
screentype="Cas9-10"
#
# Name of the guides index file. Will be saved in the Indices folder.
# It will overwrite any files with this name prefix.
outputbowtieindex=""
#
# Oligo file location.
# Leave empty if it was previously used and the corresponding index has already been generated for this type of screen.
oligofile=""
#
#
## MAGeCK options
#
# Use MAGeCK?
# 0 = No ; 1 = Yes
usemageck="0"
#
# MAGeCK list of sgRNA names (see https://sourceforge.net/p/mageck/wiki/input/#sgrna-library-file ) location.
magecksgRNAlibrary=""
#
# Use the reverse complement of the MAGeCK list of sgRNA names
# 0 = No ; 1 = Yes
mageckrevcomplib="0"
#
# MAGeCK list of control sgRNA names (see https://sourceforge.net/p/mageck/wiki/input/#negative-control-sgrna-list ) location.
mageckcontrolsgrna=""
#
# GMT file for MAGeCK pathway analysis (see https://sourceforge.net/p/mageck/wiki/input/#pathway-file-gmt ) location
gmtfile=""
#
# Matrix file for mle analysis (see https://sourceforge.net/p/mageck/wiki/input/#design-matrix-file ) location.
# While this is optional, it is highly recommended: without it the mle tool is very prone to crashing (it may still crash with a matrix, but less often).
# By default will look for a "matrix.txt" file stored with the FastQ files.
matrixfile=$([ -f "${dir}/matrix.txt" ] && echo "${dir}/matrix.txt")
#
#
## IFTTT options
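
The `testgroups` format documented in config_CRISPR.ini above (replicates separated by commas, groups separated by spaces) can be illustrated with a minimal sketch. `list_groups` and its `max_reps` argument are hypothetical names introduced here, not part of sh_CRISPR.sh; passing `2` mimics the documented casTLE limit of two replicates per group.

```sh
#!/bin/sh
# Hypothetical sketch: walk a testgroups-style value (not the pipeline's own parser).
# $1: value such as "Control_rep1,Control_rep2 Treated_rep1,Treated_rep2"
# $2: maximum replicates to keep per group (0 = keep all, the default).
list_groups() {
    testgroups="$1"
    max_reps="${2:-0}"
    n=0
    # Left unquoted on purpose so the shell splits the value on spaces.
    for group in ${testgroups}; do
        n=$((n + 1))
        if [ "${max_reps}" -gt 0 ]; then
            # Keep only the first max_reps comma-separated replicates.
            group=$(printf '%s\n' "${group}" | cut -d, -f1-"${max_reps}")
        fi
        echo "group ${n}: $(printf '%s\n' "${group}" | tr ',' ' ')"
    done
}

# Example: list_groups "C_rep1,C_rep2 A_rep1,A_rep2,A_rep3" 2
```

Because groups are split on whitespace, sample names containing spaces cannot be expressed in this format.
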
2 changes: 1 addition & 1 deletion config_RNAseq.ini
@@ -15,7 +15,7 @@
#
# Maximum number of threads (or CPUs) to request and allocate to programs.
# In some cases less than this value may automatically be allowed.
threads=`nproc --all --ignore=1`
threads=$(nproc --all --ignore=1)
#
# Maximum amount of memory (in GB) to request and allocate to programs.
# In some cases less than this value may automatically be allowed.
2 changes: 1 addition & 1 deletion config_WES.ini
@@ -15,7 +15,7 @@
#
# Maximum number of threads (or CPUs) to request and allocate to programs.
# In some cases less than this value may automatically be allowed.
threads=`nproc --all --ignore=1`
threads=$(nproc --all --ignore=1)
#
# Maximum amount of memory (in GB) to request and allocate to programs.
# In some cases less than this value may automatically be allowed.
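
The `threads=` change made in the three config files swaps backticks for `$(...)`. Both forms capture the output of `nproc`, so the resulting value should be unchanged; `$(...)` is generally preferred because it nests without escaping. A small sketch to check this on a given machine (the variable names `threads_old` and `threads_new` are only for illustration):

```sh
#!/bin/sh
# Compare the old and new command-substitution syntax from the config files.
# --all counts every installed processor; --ignore=1 leaves one out,
# and nproc never reports fewer than 1.
threads_old=`nproc --all --ignore=1`
threads_new=$(nproc --all --ignore=1)
echo "backticks: ${threads_old}, \$(...): ${threads_new}"
```
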
11 changes: 8 additions & 3 deletions package.json
@@ -1,18 +1,23 @@
{
"name": "emc2cube-bioinformatics",
"version": "1.0.0",
"description": "High throughput sequencing scripts: bowtie2, GATK, etc...",
"version": "2.0.0",
"description": "Set of high throughput sequencing analysis scripts to quickly generate and queue jobs on SLURM-based HPC clusters",
"repository": {
"type": "git",
"url": "git+https://github.com/emc2cube/Bioinformatics.git"
},
"keywords": [
"NGS",
"Next Generation Sequencing",
"SLURM",
"HPC",
"pipeline",
"WES",
"Whole Exome Sequencing",
"RNAseq",
"CRISPR"
"RNA Sequencing",
"CRISPR",
"CRISPR screens"
],
"author": "Julien Couthouis",
"license": "EUPL-1.2+",