STARRPeaker

Uniform processing pipeline and peak caller for STARR-seq data

Changes from upstream

Note: This repository is based on gersteinlab/starrpeaker version 1.0

Make compatible with Python 3
Make number of CPUs used configurable
Allow running linearfold_v executable directly, bypassing its linearfold Python 2 front end
Remove dependency on scikit-learn, by using z-score calculation from scipy instead

Dependencies (version tested)

Python 3 (v3.8.5)
pysam (v0.16.0.1)
pybedtools (v0.8.1)
pyBigWig (v0.3.18)
numpy (v1.20.1)
scipy (v1.6.1)
pandas (v1.2.3)
statsmodels (v0.12.2)

Note that these dependencies are updated relative to the Python 2 associated ones from upstream.

Installation (updated)

Create a conda environment with Python 3

conda create -n starrpeaker python=3 scipy statsmodels pybedtools pysam pyBigWig
conda activate starrpeaker
pip install git+https://github.com/imbforge/starrpeaker
starrpeaker -h

The sections below were not changed from upstream, except for the 'Usage' options.

Preprocessing

Few notes on how alignment (BAM) files were prepared

For each biological replicates in FASTQ format

Aligned paired-end reads using BWA mem (v0.7.17)
Removed duplicates using picard (v2.9.0)
Filtered alignments using SAMtools (v1.9) with the following arguments

samtools view -F 3852 -f 2 -q 40

# -F: exclude FLAG 3852
#    4 read unmapped (0x4)
#    8 mate unmapped (0x8)
#  256 not primary alignment (0x100)
#  512 read fails platform/vendor quality checks (0x200)
# 1024 read is PCR or optical duplicate (0x400)
# 2048 supplementary alignment (0x800)

# -f: require FLAG 2
#    2 properly aligned

# -q: exclude MAPQ less than 40

Merged biological replicates using SAMtools

Inputs

Input alignment (BAM) file (STARR-seq input)
Output alignment (BAM) file (STARR-seq output)
Covariates (BigWig) file(s)
Chrom Size file (i.e., https://hgdownload-test.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes)
Blacklist (BED) file (i.e., https://www.encodeproject.org/files/ENCFF419RSJ/)

Covariates

The peak calling algorithm models STARR-seq fragment coverage across the genome using multiple covariates to correct for potential sequencing bias. It is recommended to include potential confounding variables into the model. These include but not limited to GC-content, mappability tracks, and so on.

The following covariates have been precomputed for use with STARRPeaker:

Download link to covariates

Sources:

GRCh38 genome: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/
GRCh38 GC content: https://hgdownload.soe.ucsc.edu/gbdb/hg38/bbi/gc5BaseBw/gc5Base.bw
GRCh38 mappability: (computed using gem-library)
GRCh38 RNA folding energy: (computed using linearfold; see MS for details)
hg19 genome: https://www.encodeproject.org/files/male.hg19/
hg19 GC content: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/gc5Base/hg19.gc5Base.txt.gz
hg19 mappability: (computed using gem-library)
hg19 RNA folding energy: (computed using linearfold; see MS for details)

Usage

usage: starrpeaker.py [-h] --prefix PREFIX --chromsize CHROMSIZE --blacklist
                      BLACKLIST -i INPUT -o OUTPUT [--length LENGTH]
                      [--step STEP] [--cov COV [COV ...]] [--min MIN]
                      [--max MAX] [--readstart] [--strand STRAND]
                      [--threshold THRESHOLD] [--mode MODE] [--mincov MINCOV]
                      [--eq EQ] [--se] [--cpus CPUS]

STARRPeaker

required arguments:
  --prefix PREFIX             Output File Prefix
  --chromsize CHROMSIZE       Chrom Sizes
  --blacklist BLACKLIST       Blacklist Region in BED format
  -i INPUT, --input INPUT     Input BAM File
  -o OUTPUT, --output OUTPUT  STARR-seq BAM File

optional arguments:
  -h, --help                  show this help message and exit
  --length LENGTH             Bin Length (default: 500)
  --step STEP                 Step Size (default: 100)
  --cov COV [COV ...]         Covariate BigWig Files
  --min MIN                   Minimum Template Size (default: 200)
  --max MAX                   Maximum Template Size (default: 1000)
  --readstart                 Use Read Start Position instead of Fragment Center
  --strand STRAND             Use all/fwd/rev Stranded Fragments (default: all)
  --threshold THRESHOLD       Adjusted P-value Threshold (default: 0.05)
  --mode MODE                 Mode [1 - using input as covariate (default), 2 - using input as offset]
  --mincov MINCOV             Minimum Coverage (default: 10)
  --eq EQ                     Extreme Quantile to Remove (default: 1e-5)
  --se                        Use Single-End instead of Paired-end Sequencing
  --cpus CPUS                 Number of CPUs for parallel processing (default=1)

Example

starrpeaker --prefix <prefix for output files> --chromsize <hg38.chrom.sizes> --blacklist <blacklist_GRCh38.bed> --cov <covariate 1: gc content> <covariate 2: mappability> <covariate 3: conservation> --input <input.bam> --output <output.bam> --threshold 0.05

Outputs Files

prefix.bin.bed: Genomic bin BED file
prefix.bam.bct: Alignment counts in BST format (1st col: input, 2nd col: output, 3rd col: normalized input)
prefix.cov.tsv: Covariate matrix in TSV format
prefix.input.bw: Input fragment coverage in bigWig format
prefix.output.bw: Output fragment coverage in bigWig format
prefix.peak.bed: Initial peak calls (before centering and merging)
prefix.peak.final.bed: Final peak calls
prefix.peak.pval.bw: P-value track in bigWig format (-log10)
prefix.peak.qval.bw: Q-value track in bigWig format (-log10)

Final Peak Call Format (BED6+4)

Column 1: Chromosome
Column 2: Start position
Column 3: End position
Column 4: Name (peak rank based on score, 1 being the highest rank)
Column 5: Score (integer value of "100 * fold change", maxed at 1000 per BED format specification)
Column 6: Strand
Column 7: Fold change (output/normalized-input)
Column 8: Output fragment coverage
Column 9: -log10 of P-value
Column 10: -log10 of Q-value (Benjamini-Hochberg False Discovery Rate, FDR)

BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Name		Name	Last commit message	Last commit date
Latest commit History 138 Commits
data		data
starrpeaker		starrpeaker
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

STARRPeaker

Changes from upstream

Dependencies (version tested)

Installation (updated)

Preprocessing

Inputs

Covariates

Usage

Example

Outputs Files

Final Peak Call Format (BED6+4)

About

Releases

Packages

Languages

License

imbforge/starrpeaker

Folders and files

Latest commit

History

Repository files navigation

STARRPeaker

Changes from upstream

Dependencies (version tested)

Installation (updated)

Preprocessing

Inputs

Covariates

Usage

Example

Outputs Files

Final Peak Call Format (BED6+4)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages