Skip to content

jflucier/Gevry_pipelines

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gevry pipeline User Manual


Contents


Requirements

All pipelines are self contained. The only requirements needed is Apptainer. The apptainer executable "singularity" should be available in your path.

Note: On interactive node run module load StdEnv/2020 apptainer/1.1.5 . You can include this command in your ~/.bashrc file


Installation

If you are running on ip34, the installation is already available. Skip to next section in documentation.

To install Gevry_pipelines you need to:

  • Create a clone of the repository:

    git clone https://github.com/jflucier/Gevry_pipelines.git

    Note: Creating a clone of the repository requires Github to be installed.

  • For convenience, set environment variable G_PIPELINES in your ~/.bashrc:

    export G_PIPELINES=/path/to/Gevry_pipelines

    Note: On ip34, Gevry pipelines path is /home/def-gevrynic/programs/Gevry_pipelines

  • Go to $G_PIPELINES/containers and run these commands:

module load StdEnv/2020 apptainer/1.1.5
cd $G_PIPELINES/containers
singularity build --force --fakeroot scenicplus.sif scenicplus.def

Run scenicplus pipelines

Scenicplus was developped based on the following tutorial available here. It is composed of 5 python scripts. For each of these scripts, you can access script help by providing -h option like example below:

$ singularity exec \
-B /fast/:/fast/ \
-B /home:/home \
-e $G_PIPELINES/containers/scenicplus.sif \
/miniconda3/envs/scenicplus/bin/python3 -W ignore scrna-seq_preprocess_scanpy.py -h
INFO:    underlay of /usr/share/zoneinfo/America/New_York required more than 50 (51) bind mounts
usage: scrna-seq_preprocess_scanpy.py [-h] -w WORKDIR -i INPUT [--cpu [CPU]] [--mem [MEM]] [--qc_min_genes [QC_MIN_GENES]] [--qc_min_cells [QC_MIN_CELLS]] [--filter_mito [FILTER_MITO]]
                                      [--filter_n_counts [FILTER_N_COUNTS]] [--norm_min_mean [NORM_MIN_MEAN]] [--norm_max_mean [NORM_MAX_MEAN]] [--norm_min_disp [NORM_MIN_DISP]]
                                      [--norm_max_value [NORM_MAX_VALUE]] [--ct_annot_n_neighbors [CT_ANNOT_N_NEIGHBORS]] [--ct_annot_leiden_res [CT_ANNOT_LEIDEN_RES]]

optional arguments:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
                        your working directory
  -i INPUT, --input INPUT
                        your h5 input file
  --cpu [CPU]           Number of cpu to use
  --mem [MEM]           Max memery usage in Gigabyte
  --qc_min_genes [QC_MIN_GENES]
                        Only keep cells with at least <<min_genes>> genes expressed
  --qc_min_cells [QC_MIN_CELLS]
                        Only keep genes which are expressed in at least <<min_cells>> cells
  --filter_mito [FILTER_MITO]
                        Filter based on mitochondrial counts
  --filter_n_counts [FILTER_N_COUNTS]
                        Filter based on total counts
  --norm_min_mean [NORM_MIN_MEAN]
                        Normalisation min mean value
  --norm_max_mean [NORM_MAX_MEAN]
                        Normalisation max mean value
  --norm_min_disp [NORM_MIN_DISP]
                        Normalisation min dispersion value
  --norm_max_value [NORM_MAX_VALUE]
                        Clip (truncate) to this value after scaling
  --ct_annot_n_neighbors [CT_ANNOT_N_NEIGHBORS]
                        The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller
                        values result in more local data being preserved.
  --ct_annot_leiden_res [CT_ANNOT_LEIDEN_RES]
                        parameter value controlling the coarseness of the clustering. Higher values lead to more clusters.

Most default values are the same as used in the tutorial. Lets now look at each step in more details:

Step1 scRNA-seq preprocessing using Scanpy

This step processes single cell rna-seq. An example call would be:


> singularity exec \
-B /fast/:/fast/ \
-B /home:/home \
-e $G_PIPELINES/containers/scenicplus.sif \
/miniconda3/envs/scenicplus/bin/python3 -W ignore scrna-seq_preprocess_scanpy.py \
-w $G_PIPELINES/scenicplus/test \
-i $G_PIPELINES/scenicplus/test/data/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5

The following parameters are mandatory:

  • workdir: your working directory
  • input: your h5 input file

These options are optional. Please make sure default values are ok prior to running

  • cpu: Number of cpu to use (default 24)
  • mem: Max memory usage in Gigabyte (default 30)
  • qc_min_genes: Only keep cells with at least <<min_genes>> genes expressed (default 200)
  • qc_min_cells: Only keep genes which are expressed in at least <<min_cells>> cells (default 3)
  • filter_mito: Filter based on mitochondrial counts (default 25)
  • filter_n_counts: Filter based on total counts (default 4300)
  • norm_min_mean: Normalisation min mean value (default 0.0125)
  • norm_max_mean: Normalisation max mean value (default 3)
  • norm_min_disp: Normalisation min dispersion value (default 0.5)
  • norm_max_value: Clip (truncate) to this value after scaling (default 10)
  • ct_annot_n_neighbors: The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved (default 10)
  • ct_annot_leiden_res: Parameter value controlling the coarseness of the clustering. Higher values lead to more clusters (default 0.8)

Notice In its current form cell type annotation uses the preprocessed data from the Scanpy tutorial as reference (see here for more information. Please contact to change this.

Step2 scATAC-seq preprocessing using pycisTopic

This step processes single cell atac-seq. An example call would be:

singularity exec \
-B /fast/:/fast/ \
-B /home:/home \
-e $G_PIPELINES/containers/scenicplus.sif \
/miniconda3/envs/scenicplus/bin/python3 -W ignore scatac-seq_preprocess_pycistopic.py \
-w $G_PIPELINES/scenicplus/test \
--frag_file $G_PIPELINES/scenicplus/test/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz \
--tmp /fast_tmp --sample '10x_pbmc'

The following parameters are mandatory:

  • workdir: your working directory
  • frag_file: Path to ATAC fragments file
  • sample: The sample id

These options are optional. Please make sure default values are ok prior to running

  • tmp: Temp directory (default /tmp). If running on ip34, make sur you set --tmp to /fast_tmp
  • scrna: Scanpy scRNA data (default <>/scRNA/adata.h5ad)
  • cpu: Number of cpu to use (default 24)
  • shift; To set an arbitrary shift in bp. For finding enriched cutting sites(such as in ATAC - seq) a shift of 73bp is recommended (default 73)
  • ext_size: To extend reads in 5’->3’ direction to fix-sized fragment. For ATAC-seq data, a extension of 146 bp is recommended (default 146)
  • q_value: The q-value (minimum FDR) cutoff to call significant regions (default 0.05)
  • peak_half_width: Number of base pairs that each summit will be extended in each direction (default 250)
  • blacklist_regions: Path to bed file containing blacklist regions (Amemiya et al., 2019) (default <>/data/hg38-blacklist.v2.bed)
  • overwrite: Recalculate all steps even if they completed sucessfully.
  • specie: Species from from which genome size will be inputted to MACS2, options are: homo_sapiens, mus_musculus, drosophila_melanogaster (default homo_sapiens)

Step3 Motif enrichment analysis using pycistarget

This step identifies motif enrichment. An example call would be:

singularity exec \
-B /fast/:/fast/ \
-B /home:/home \
-e $G_PIPELINES/containers/scenicplus.sif \
/miniconda3/envs/scenicplus/bin/python3 -W ignore motif_enrichment_pycistarget.py \
-w $G_PIPELINES/scenicplus/test \
--tmp /fast_tmp

The following parameters are mandatory:

  • workdir: your working directory

These options are optional. Please make sure default values are ok prior to running

  • tmp: Temp directory (default /tmp). If running on ip34, make sur you set --tmp to /fast_tmp
  • specie: Species from which genomic coordinates come from, options are: homo_sapiens, mus_musculus, drosophila_melanogaster (default homo_sapiens)
  • otsu: Path to region bin topic otsu pickle (default <>/scATAC/candidate_enhancers/region_bin_topics_otsu.pkl)<>/scATAC/candidate_enhancers/region_bin_topics_otsu.pkl
  • top3k: Path to region bin topic top3k pickle (default <>/scATAC/candidate_enhancers/region_bin_topics_top3k.pkl)<>/scATAC/candidate_enhancers/region_bin_topics_top3k.pkl
  • markers: Path to marker dictionary pickle (default <>/scATAC/candidate_enhancers/markers_dict.pkl)<>/scATAC/candidate_enhancers/markers_dict.pkl
  • scores_db: Path to score feather file (default <>/data/hg38_screen_v10_clust.regions_vs_motifs.scores.feather)
  • rank_db: Path to ranking feather file (default <>/data/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather)
  • motifs: Path to motif annotation table (default <>/data/motifs-v10-nr.hgnc-m0.00001-o0.0.tbl)
  • motifs_version: Motif annotation version (default v10nr_clust)
  • cpu: Number of cpu to use (default 24)
  • overwrite: Recalculate all steps even if they completed sucessfully.

Step4 Inferring enhancer-driven Gene Regulatory Networks using SCENICplus

This step identifies eGRN. An example call would be:

singularity exec \
-B /fast/:/fast/ \
-B /home:/home \
-e $G_PIPELINES/containers/scenicplus.sif \
/miniconda3/envs/scenicplus/bin/python3 -W ignore infer_enhancer-driven_gene.py \
-w $G_PIPELINES/scenicplus/test \
-tf $G_PIPELINES/scenicplus/test/data/utoronto_human_tfs_v_1.01.txt \
--tmp /fast_tmp \
--cpu 20 --sample '10x_pbmc'

The following parameters are mandatory:

  • workdir: your working directory
  • tf_file: Path to file containing genes symbols that are TFs
  • sample: The sample id

These options are optional. Please make sure default values are ok prior to running

  • scrna: Scanpy scRNA data (default <>/scRNA/adata.h5ad)
  • cistopic: cistopic object data pickle data (default <>/scATAC/cistopic_obj2.pkl)
  • menr: menr object data pickle data (default <>/motifs/menr.pkl)
  • tmp: Temp directory (default /tmp). If running on ip34, make sur you set --tmp to /fast_tmp
  • cpu: Number of cpu to use (default 24)
  • specie: Species from which data comes from. options are: homo_sapiens, mus_musculus, drosophila_melanogaster (default homo_sapiens)
  • overwrite: Recalculate all steps even if they completed sucessfully.

Step5 Downstream analysis

Still in development. Need more info to develop

About

Pipelines for Gevry labo

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors