# DIT-HAP Pipeline

DIT-HAP (Diploid for Insertional Mutagenesis by Transposon and Haploid for Analysis of Phenotype) is a comprehensive Snakemake workflow for analyzing piggyBac transposon insertion sequencing data. The pipeline processes paired-end sequencing reads to identify transposon insertion sites, performs quality control, and conducts depletion analysis to assess gene essentiality.
## Table of Contents

- Overview
- Features
- Installation
- Quick Start
- Usage
- Configuration
- Workflow Architecture
- Output Structure
- Examples
- Troubleshooting
- Citation
## Overview

DIT-HAP is designed for genome-wide transposon mutagenesis experiments using the piggyBac transposon system. The workflow:
- Processes paired-end sequencing reads containing piggyBac junction sequences
- Separates reads by junction orientation (PBL/PBR)
- Maps reads to the reference genome with high-stringency filtering
- Identifies insertion sites and annotates them with gene information
- Performs depletion analysis with curve fitting to determine gene essentiality
- Generates comprehensive reports for quality control and analysis
## Features

- **Modular Architecture**: Organized into distinct processing stages for easy maintenance
- **High-Stringency Filtering**: Configurable filtering thresholds for mapping quality, alignment, and read pairing
- **Multiple Analysis Modes**: Support for high-density (HD) and low-density (LD) experiments, generation-based vs. raw analysis, and haploid/diploid organisms
- **Curve-Fitting Models**: Logistic, Richards, and sigmoid models for depletion analysis
- **Biological Replicates**: Optional DESeq2 integration for replicate analysis
- **Comprehensive QC**: MultiQC integration and custom quality control reports
- **Automated Data Retrieval**: Automatic download of reference genomes and annotations from PomBase
## Installation

### Prerequisites

- **Python**: ≥ 3.8
- **Snakemake**: ≥ 8.0.0
- **Conda**: Miniconda or Anaconda
- **System**: Linux/Unix environment with ≥ 8 GB RAM recommended
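To confirm the core tools are available before proceeding, a quick check can help; the snippet below is a hypothetical helper, not part of the pipeline:

```python
# Hypothetical pre-flight check for the prerequisites above.
import shutil
import subprocess
import sys

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
for tool in ("snakemake", "conda"):
    if shutil.which(tool) is None:
        raise SystemExit(f"{tool} not found on PATH")
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    print(tool, result.stdout.strip() or result.stderr.strip())
```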
### Clone the Repository

```bash
git clone https://github.com/DIT-HAP/DIT_HAP_pipeline.git
cd DIT_HAP_pipeline
```

### Set Up Conda Environments

The workflow uses modular conda environments that are created automatically:
```bash
# Create all environments at once
snakemake --use-conda --conda-create-envs-only

# Or use mamba for faster installation
snakemake --use-conda --conda-frontend mamba --conda-create-envs-only
```

### Manual Installation (optional)

If you prefer to install dependencies manually:
```bash
# Core environment
conda create -n dit-hap -c conda-forge -c bioconda python=3.10 snakemake=8.0.0 pandas

# Additional environments will be created automatically when needed
```

## Quick Start

### 1. Prepare a Sample Sheet

Create a tab-separated sample sheet with the following format:
```text
Sample   Timepoint  Condition  read1                            read2
sample1  YES0       wildtype   /path/to/sample1_R1.fastq.gz     /path/to/sample1_R2.fastq.gz
sample1  YES3       wildtype   /path/to/sample1_t3_R1.fastq.gz  /path/to/sample1_t3_R2.fastq.gz
```

### 2. Configure the Workflow

Copy and modify an existing configuration file:
```bash
cp config/config_HD_generationPLUS1.yaml config/config_my_experiment.yaml
```

Edit the configuration file to set:

- `project_name`: your experiment name
- `sample_sheet`: path to your sample sheet
- `time_points`: time points for depletion analysis
- Adapter sequences for your piggyBac system
### 3. Run the Workflow

```bash
# Dry run to check the workflow
snakemake -n --use-conda

# Run the full workflow
snakemake --use-conda --cores 16

# Run with a specific configuration
snakemake --configfile config/config_my_experiment.yaml --use-conda --cores 16
```

## Usage

### Basic Usage

```bash
# Run the entire workflow
snakemake --use-conda --cores [number_of_cores]
# Run with specific configuration
snakemake --configfile config/config_HD_generationPLUS1.yaml --use-conda --cores 16
# Create workflow visualization
snakemake --dag | dot -Tpdf > workflow_diagram.pdf
# Lint the workflow for common issues
snakemake --lint
# Run specific rules
snakemake fastp_preprocessing --use-conda --cores 8
snakemake bwa_mapping --use-conda --cores 8
snakemake multiqc --use-conda --cores 4
```

### Advanced Options

```bash
# Use a temporary directory for intermediate files
export TMPDIR=/path/to/temp
snakemake --use-conda --cores 16
# Resume after an interruption (re-run incomplete jobs)
snakemake --use-conda --cores 16 --rerun-incomplete
# Keep intermediate files for debugging
snakemake --use-conda --cores 16 --notemp
# Capture a detailed log file via redirection
snakemake --use-conda --cores 16 --printshellcmds 2>&1 | tee logs/snakemake_$(date +%Y%m%d_%H%M%S).log
```
## Configuration

### Configuration Files

Configuration files are located in the `config/` directory:

- `config_HD_generationPLUS1.yaml` - High density, generation +1 analysis
- `config_LD_generationPLUS1.yaml` - Low density, generation +1 analysis
- `config_HD_generationRAW.yaml` - High density, raw analysis
- `config_LD_generationRAW.yaml` - Low density, raw analysis
- `config_HD_diploid.yaml` - High density, diploid organism
- `config_LD_haploid.yaml` - Low density, haploid organism

### Key Configuration Options
```yaml
# Project settings
project_name: "my_experiment"
sample_sheet: "config/sample_sheet.tsv"

# Adapter sequences for piggyBac
adapter_sequence: "CTGTCTCTTATACACATCT"
PBL_adapter: "CATGCGTCAATTTTACGCAGACTATCTTTCTAGGG"
PBR_adapter: "ACGCATGATTATCTTTAACGTACGTCACAATATGATTATCTTTCTAGGG"

# Read filtering thresholds
aligned_read_filtering:
  read_1_filtering:
    mapq_threshold: 20
    nm_threshold: 3
  read_2_filtering:
    mapq_threshold: 40
    nm_threshold: 15

# Depletion analysis time points
time_points:
  - 0
  - 3.352
  - 6.588
  - 10.104
  - 13.480

# Biological replicates
use_DEseq2_for_biological_replicates: true
```
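Before launching a run, it can help to confirm the configuration parses and points at real files. A minimal sketch (a hypothetical helper, not part of the pipeline; assumes PyYAML, which ships with Snakemake's environment):

```python
# Hypothetical pre-flight check: confirm the config parses and the
# sample sheet it references exists. Key names match the example above.
import yaml
from pathlib import Path

with open("config/config_my_experiment.yaml") as fh:
    cfg = yaml.safe_load(fh)

assert Path(cfg["sample_sheet"]).is_file(), "sample_sheet path does not exist"
print(cfg["project_name"], cfg["time_points"])
```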
### Sample Sheet Format

The sample sheet must contain the following columns:

| Column | Description | Example |
|---|---|---|
| Sample | Sample identifier | sample1 |
| Timepoint | Time point or condition | YES0, YES3 |
| Condition | Experimental condition | wildtype, treatment |
| read1 | Path to R1 fastq file | /path/to/R1.fastq.gz |
| read2 | Path to R2 fastq file | /path/to/R2.fastq.gz |
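It is worth checking the sheet programmatically before launching the workflow. A minimal sketch using pandas (a hypothetical helper script, not part of the pipeline; `REQUIRED` mirrors the columns above):

```python
# Hypothetical sample-sheet check: verifies the required columns exist
# and that every FASTQ path points at a real file.
from pathlib import Path

import pandas as pd

REQUIRED = ["Sample", "Timepoint", "Condition", "read1", "read2"]

sheet = pd.read_csv("config/sample_sheet.tsv", sep="\t")
missing = [c for c in REQUIRED if c not in sheet.columns]
if missing:
    raise SystemExit(f"Missing required columns: {missing}")
for col in ("read1", "read2"):
    absent = [p for p in sheet[col] if not Path(p).is_file()]
    if absent:
        raise SystemExit(f"{col}: files not found: {absent}")
print(f"OK: {len(sheet)} rows validated")
```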
## Workflow Architecture

The DIT-HAP pipeline consists of four main modules:

### 1. Preparation
- Downloads reference genome and annotations from PomBase
- Indexes genome for BWA mapping
- Creates samtools indices
### 2. Preprocessing

- Quality control with Fastp
- Adapter trimming and junction classification (PBL/PBR separation; see the sketch after this list)
- Read mapping with BWA-MEM2
- Read parsing and filtering
- Insertion site extraction
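To illustrate what the junction classification step does conceptually (the pipeline itself uses cutadapt; this toy function only checks for an exact adapter prefix and ignores mismatch tolerance):

```python
# Toy illustration of PBL/PBR junction classification. The adapter
# sequences come from the example configuration above; real reads
# require mismatch-tolerant matching, which cutadapt handles.
PBL_ADAPTER = "CATGCGTCAATTTTACGCAGACTATCTTTCTAGGG"
PBR_ADAPTER = "ACGCATGATTATCTTTAACGTACGTCACAATATGATTATCTTTCTAGGG"

def classify_junction(read_seq: str) -> str:
    """Label a read by the piggyBac junction it starts with, if any."""
    if read_seq.startswith(PBL_ADAPTER):
        return "PBL"
    if read_seq.startswith(PBR_ADAPTER):
        return "PBR"
    return "unclassified"

print(classify_junction(PBL_ADAPTER + "ACGTACGTACGT"))  # -> PBL
```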
### 3. Depletion Analysis

- Gene-level aggregation of insertion data
- Hard filtering of low-count insertions
- Curve fitting with multiple models (logistic, Richards, sigmoid); see the sketch after this list
- Statistical analysis of depletion curves
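As a rough illustration of what fitting a depletion curve involves (a sketch with SciPy; the counts are hypothetical and the pipeline's actual model parameterizations and fitting code may differ):

```python
# Illustrative depletion-curve fit; not the pipeline's implementation.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Logistic model: L / (1 + exp(-k * (t - t0)))."""
    return L / (1 + np.exp(-k * (t - t0)))

# Hypothetical normalized insertion counts for one gene; the time
# points match the example configuration above.
time_points = np.array([0, 3.352, 6.588, 10.104, 13.480])
counts = np.array([1.00, 0.85, 0.40, 0.15, 0.05])

# k < 0 yields a decreasing (depletion) curve.
params, _ = curve_fit(logistic, time_points, counts, p0=[1.0, -0.5, 6.0])
print(dict(zip(["L", "k", "t0"], params.round(3))))
```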
### 4. Quality Control

- MultiQC report generation
- Insertion density analysis
- Read count distribution analysis
- PBL/PBR correlation analysis
- Insertion orientation analysis
### Data Flow

```text
Raw Reads → Fastp QC → PBL/PBR Separation → BWA Mapping → Read Filtering →
Insertion Site Extraction → Gene Annotation → Hard Filtering → Curve Fitting →
Statistical Analysis → Reports
```
## Output Structure

Results are organized in a hierarchical directory structure:
```text
DIT_HAP_pipeline/
├── results/{project_name}/
│   ├── 01_fastp_preprocessing/
│   ├── 02_cutadapt_junction_classification/
│   ├── 03_bwa_mapping/
│   ├── 04_parse_bam_to_tsv/
│   ├── 05_aligned_read_filtering/
│   ├── 06_extract_insertion_sites/
│   ├── 07_annotation_concatenation/
│   ├── 08_hard_filtering/
│   ├── 09_insertion_level_depletion_analysis/
│   ├── 10_gene_level_depletion_analysis/
│   ├── 11_insertion_level_curve_fitting/
│   ├── 12_gene_level_curve_fitting/
│   └── 13_final_results/
├── reports/{project_name}/
│   ├── multiqc/
│   ├── mapping_filtering_statistics/
│   ├── PBL_PBR_correlation_analysis/
│   ├── read_count_distribution_analysis/
│   ├── insertion_orientation_analysis/
│   └── insertion_density_analysis/
├── logs/{project_name}/
│   ├── preparation/
│   ├── preprocessing/
│   ├── depletion_analysis/
│   └── quality_control/
└── resources/pombase_data/{release_version}/
    ├── genome_sequence_and_features/
    ├── Gene_metadata/
    └── Protein_features/
```
### Key Output Files

- `results/{project_name}/13_final_results/` - Final depletion analysis results
- `reports/{project_name}/multiqc/quality_control_multiqc_report.html` - Comprehensive QC report
- `reports/{project_name}/mapping_filtering_statistics/mapping_filtering_statistics.tsv` - Mapping statistics
- `results/{project_name}/12_gene_level_curve_fitting/gene_level_fitting_statistics.tsv` - Gene-level curve fitting results
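The key outputs are plain TSV files and can be inspected directly. For example (the path assumes `project_name: "my_experiment"`; column names depend on the pipeline version, so this snippet only loads and previews the table):

```python
# Preview the gene-level curve fitting results; no column names assumed.
import pandas as pd

stats = pd.read_csv(
    "results/my_experiment/12_gene_level_curve_fitting/"
    "gene_level_fitting_statistics.tsv",
    sep="\t",
)
print(stats.columns.tolist())
print(stats.head())
```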
## Examples

### Preset Configurations

```bash
# Use the HD generation +1 configuration
snakemake --configfile config/config_HD_generationPLUS1.yaml --use-conda --cores 24

# Use the LD haploid configuration
snakemake --configfile config/config_LD_haploid.yaml --use-conda --cores 16
```

### Individual Stages

```bash
# Run preprocessing and mapping only
snakemake bwa_mapping --use-conda --cores 12

# Run only quality control
snakemake multiqc --use-conda --cores 4

# Run only depletion analysis
snakemake gene_level_curve_fitting --use-conda --cores 8
```

### Custom Time Points

For a custom experiment with different time points:
```yaml
# In your config file
time_points:
  - 0
  - 2
  - 4
  - 6
  - 8
  - 12
  - 24
```

## Troubleshooting

### Common Issues

- **Memory issues**: Reduce `chunk_size` in the configuration or use fewer cores
- **Conda environment errors**: Use `--conda-frontend mamba` for faster dependency resolution
- **Mapping failures**: Check adapter sequences and reference genome integrity
- **Sample sheet errors**: Ensure the file is tab-separated and the column names are correct
### Debugging

```bash
# Enable verbose logging
snakemake --use-conda --cores 8 --printshellcmds --reason

# Keep temporary files for inspection
snakemake --use-conda --cores 8 --notemp

# Print detailed error messages
snakemake --use-conda --cores 8 --show-failed-logs
```

### Log Locations

- Snakemake logs: `logs/{project_name}/`
- Rule-specific logs: `logs/{project_name}/{stage}/`
- MultiQC report: `reports/{project_name}/multiqc/`
## Citation

If you use DIT-HAP in your research, please cite:

> DIT-HAP: Diploid for Insertional Mutagenesis by Transposon and Haploid for Analysis of Phenotype
> [Yusheng Yang et al., Year]

GitHub repository: https://github.com/DIT-HAP/DIT_HAP_pipeline
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.
## Support

For questions and support:
- Open an issue on GitHub
- Check the Snakemake documentation
- Review the workflow logs for detailed error messages
---

*DIT-HAP Pipeline - comprehensive analysis of piggyBac transposon insertion sequencing data*