# DIT-HAP Pipeline

DIT-HAP (Diploid for Insertional Mutagenesis by Transposon and Haploid for Analysis of Phenotype) is a comprehensive Snakemake workflow for analyzing piggyBac transposon insertion sequencing data. The pipeline processes paired-end sequencing reads to identify transposon insertion sites, performs quality control, and conducts depletion analysis to assess gene essentiality.
## Table of Contents

- Overview
- Features
- Installation
- Quick Start
- Usage
- Configuration
- Workflow Architecture
- Output Structure
- Examples
- Troubleshooting
- Citation
## Overview

DIT-HAP is designed for genome-wide transposon mutagenesis experiments using the piggyBac transposon system. The workflow:
- Processes paired-end sequencing reads containing piggyBac junction sequences
- Separates reads by junction orientation (PBL/PBR)
- Maps reads to the reference genome with high-stringency filtering
- Identifies insertion sites and annotates them with gene information
- Performs depletion analysis with curve fitting to determine gene essentiality
- Generates comprehensive reports for quality control and analysis
## Features

- **Modular Architecture**: Organized into distinct processing stages for easy maintenance
- **High-Stringency Filtering**: Configurable filtering thresholds for mapping quality, alignment, and read pairing
- **Multiple Analysis Modes**: Support for high-density (HD) and low-density (LD) experiments, generation-based vs. raw analysis, and haploid/diploid organisms
- **Curve-Fitting Models**: Logistic, Richards, and sigmoid models for depletion analysis
- **Biological Replicates**: Optional DESeq2 integration for replicate analysis
- **Comprehensive QC**: MultiQC integration and custom quality control reports
- **Automated Data Retrieval**: Automatic download of reference genomes and annotations from PomBase
## Installation

### Prerequisites

- **Python**: ≥ 3.8
- **Snakemake**: ≥ 8.0.0
- **Conda**: Miniconda or Anaconda
- **System**: Linux/Unix environment with ≥ 8 GB RAM recommended
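To confirm the core tools are available before proceeding, a quick check can help; the snippet below is a hypothetical helper, not part of the pipeline:

```python
# Hypothetical pre-flight check for the prerequisites above.
import shutil
import subprocess
import sys

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
for tool in ("snakemake", "conda"):
    if shutil.which(tool) is None:
        raise SystemExit(f"{tool} not found on PATH")
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    print(tool, result.stdout.strip() or result.stderr.strip())
```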
### Clone the Repository

```bash
git clone https://github.com/DIT-HAP/DIT_HAP_pipeline.git
cd DIT_HAP_pipeline
```

### Set Up Conda Environments

The workflow uses modular conda environments that are created automatically:
```bash
# Create all environments at once
snakemake --use-conda --conda-create-envs-only

# Or use mamba for faster installation
snakemake --use-conda --conda-frontend mamba --conda-create-envs-only
```

### Manual Installation (optional)

If you prefer to install dependencies manually:
```bash
# Core environment
conda create -n dit-hap -c conda-forge -c bioconda python=3.10 snakemake=8.0.0 pandas

# Additional environments will be created automatically when needed
```

## Quick Start

### 1. Prepare a Sample Sheet

Create a tab-separated sample sheet with the following format:
```text
Sample   Timepoint  Condition  read1                            read2
sample1  YES0       wildtype   /path/to/sample1_R1.fastq.gz     /path/to/sample1_R2.fastq.gz
sample1  YES3       wildtype   /path/to/sample1_t3_R1.fastq.gz  /path/to/sample1_t3_R2.fastq.gz
```

### 2. Configure the Workflow

Copy and modify an existing configuration file:
```bash
cp config/config_HD_generationPLUS1.yaml config/config_my_experiment.yaml
```

Edit the configuration file to set:

- `project_name`: your experiment name
- `sample_sheet`: path to your sample sheet
- `time_points`: time points for depletion analysis
- Adapter sequences for your piggyBac system
### 3. Run the Workflow

```bash
# Dry run to check the workflow
snakemake -n --use-conda

# Run the full workflow
snakemake --use-conda --cores 16

# Run with a specific configuration
snakemake --configfile config/config_my_experiment.yaml --use-conda --cores 16
```

## Usage

### Basic Usage

```bash
# Run the entire workflow
snakemake --use-conda --cores [number_of_cores]
# Run with specific configuration
snakemake --configfile config/config_HD_generationPLUS1.yaml --use-conda --cores 16
# Create workflow visualization
snakemake --dag | dot -Tpdf > workflow_diagram.pdf
# Lint the workflow for common issues
snakemake --lint
# Run specific rules
snakemake fastp_preprocessing --use-conda --cores 8
snakemake bwa_mapping --use-conda --cores 8
snakemake multiqc --use-conda --cores 4
```

### Advanced Options

```bash
# Use a temporary directory for intermediate files
export TMPDIR=/path/to/temp
snakemake --use-conda --cores 16
# Resume after an interruption (re-run incomplete jobs)
snakemake --use-conda --cores 16 --rerun-incomplete
# Keep intermediate files for debugging
snakemake --use-conda --cores 16 --notemp
# Capture a detailed log file via redirection
snakemake --use-conda --cores 16 --printshellcmds 2>&1 | tee logs/snakemake_$(date +%Y%m%d_%H%M%S).log
```
## Configuration

### Configuration Files

Configuration files are located in the `config/` directory:

- `config_HD_generationPLUS1.yaml` - High density, generation +1 analysis
- `config_LD_generationPLUS1.yaml` - Low density, generation +1 analysis
- `config_HD_generationRAW.yaml` - High density, raw analysis
- `config_LD_generationRAW.yaml` - Low density, raw analysis
- `config_HD_diploid.yaml` - High density, diploid organism
- `config_LD_haploid.yaml` - Low density, haploid organism

### Key Configuration Options
```yaml
# Project settings
project_name: "my_experiment"
sample_sheet: "config/sample_sheet.tsv"

# Adapter sequences for piggyBac
adapter_sequence: "CTGTCTCTTATACACATCT"
PBL_adapter: "CATGCGTCAATTTTACGCAGACTATCTTTCTAGGG"
PBR_adapter: "ACGCATGATTATCTTTAACGTACGTCACAATATGATTATCTTTCTAGGG"

# Read filtering thresholds
aligned_read_filtering:
  read_1_filtering:
    mapq_threshold: 20
    nm_threshold: 3
  read_2_filtering:
    mapq_threshold: 40
    nm_threshold: 15

# Depletion analysis time points
time_points:
  - 0
  - 3.352
  - 6.588
  - 10.104
  - 13.480

# Biological replicates
use_DEseq2_for_biological_replicates: true
```
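Before launching a run, it can help to confirm the configuration parses and points at real files. A minimal sketch (a hypothetical helper, not part of the pipeline; assumes PyYAML, which ships with Snakemake's environment):

```python
# Hypothetical pre-flight check: confirm the config parses and the
# sample sheet it references exists. Key names match the example above.
import yaml
from pathlib import Path

with open("config/config_my_experiment.yaml") as fh:
    cfg = yaml.safe_load(fh)

assert Path(cfg["sample_sheet"]).is_file(), "sample_sheet path does not exist"
print(cfg["project_name"], cfg["time_points"])
```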
### Sample Sheet Format

The sample sheet must contain the following columns:

| Column | Description | Example |
|---|---|---|
| Sample | Sample identifier | sample1 |
| Timepoint | Time point or condition | YES0, YES3 |
| Condition | Experimental condition | wildtype, treatment |
| read1 | Path to R1 fastq file | /path/to/R1.fastq.gz |
| read2 | Path to R2 fastq file | /path/to/R2.fastq.gz |
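It is worth checking the sheet programmatically before launching the workflow. A minimal sketch using pandas (a hypothetical helper script, not part of the pipeline; `REQUIRED` mirrors the columns above):

```python
# Hypothetical sample-sheet check: verifies the required columns exist
# and that every FASTQ path points at a real file.
from pathlib import Path

import pandas as pd

REQUIRED = ["Sample", "Timepoint", "Condition", "read1", "read2"]

sheet = pd.read_csv("config/sample_sheet.tsv", sep="\t")
missing = [c for c in REQUIRED if c not in sheet.columns]
if missing:
    raise SystemExit(f"Missing required columns: {missing}")
for col in ("read1", "read2"):
    absent = [p for p in sheet[col] if not Path(p).is_file()]
    if absent:
        raise SystemExit(f"{col}: files not found: {absent}")
print(f"OK: {len(sheet)} rows validated")
```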
## Workflow Architecture

The DIT-HAP pipeline consists of four main modules:

### 1. Preparation
- Downloads reference genome and annotations from PomBase
- Indexes genome for BWA mapping
- Creates samtools indices
### 2. Preprocessing

- Quality control with Fastp
- Adapter trimming and junction classification (PBL/PBR separation; see the sketch after this list)
- Read mapping with BWA-MEM2
- Read parsing and filtering
- Insertion site extraction
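To illustrate what the junction classification step does conceptually (the pipeline itself uses cutadapt; this toy function only checks for an exact adapter prefix and ignores mismatch tolerance):

```python
# Toy illustration of PBL/PBR junction classification. The adapter
# sequences come from the example configuration above; real reads
# require mismatch-tolerant matching, which cutadapt handles.
PBL_ADAPTER = "CATGCGTCAATTTTACGCAGACTATCTTTCTAGGG"
PBR_ADAPTER = "ACGCATGATTATCTTTAACGTACGTCACAATATGATTATCTTTCTAGGG"

def classify_junction(read_seq: str) -> str:
    """Label a read by the piggyBac junction it starts with, if any."""
    if read_seq.startswith(PBL_ADAPTER):
        return "PBL"
    if read_seq.startswith(PBR_ADAPTER):
        return "PBR"
    return "unclassified"

print(classify_junction(PBL_ADAPTER + "ACGTACGTACGT"))  # -> PBL
```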
### 3. Depletion Analysis

- Gene-level aggregation of insertion data
- Hard filtering of low-count insertions
- Curve fitting with multiple models (logistic, Richards, sigmoid); see the sketch after this list
- Statistical analysis of depletion curves
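As a rough illustration of what fitting a depletion curve involves (a sketch with SciPy; the counts are hypothetical and the pipeline's actual model parameterizations and fitting code may differ):

```python
# Illustrative depletion-curve fit; not the pipeline's implementation.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Logistic model: L / (1 + exp(-k * (t - t0)))."""
    return L / (1 + np.exp(-k * (t - t0)))

# Hypothetical normalized insertion counts for one gene; the time
# points match the example configuration above.
time_points = np.array([0, 3.352, 6.588, 10.104, 13.480])
counts = np.array([1.00, 0.85, 0.40, 0.15, 0.05])

# k < 0 yields a decreasing (depletion) curve.
params, _ = curve_fit(logistic, time_points, counts, p0=[1.0, -0.5, 6.0])
print(dict(zip(["L", "k", "t0"], params.round(3))))
```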
### 4. Quality Control

- MultiQC report generation
- Insertion density analysis
- Read count distribution analysis
- PBL/PBR correlation analysis
- Insertion orientation analysis
### Data Flow

```text
Raw Reads → Fastp QC → PBL/PBR Separation → BWA Mapping → Read Filtering →
Insertion Site Extraction → Gene Annotation → Hard Filtering → Curve Fitting →
Statistical Analysis → Reports
```
## Output Structure

Results are organized in a hierarchical directory structure:
```text
DIT_HAP_pipeline/
├── results/{project_name}/
│   ├── 01_fastp_preprocessing/
│   ├── 02_cutadapt_junction_classification/
│   ├── 03_bwa_mapping/
│   ├── 04_parse_bam_to_tsv/
│   ├── 05_aligned_read_filtering/
│   ├── 06_extract_insertion_sites/
│   ├── 07_annotation_concatenation/
│   ├── 08_hard_filtering/
│   ├── 09_insertion_level_depletion_analysis/
│   ├── 10_gene_level_depletion_analysis/
│   ├── 11_insertion_level_curve_fitting/
│   ├── 12_gene_level_curve_fitting/
│   └── 13_final_results/
├── reports/{project_name}/
│   ├── multiqc/
│   ├── mapping_filtering_statistics/
│   ├── PBL_PBR_correlation_analysis/
│   ├── read_count_distribution_analysis/
│   ├── insertion_orientation_analysis/
│   └── insertion_density_analysis/
├── logs/{project_name}/
│   ├── preparation/
│   ├── preprocessing/
│   ├── depletion_analysis/
│   └── quality_control/
└── resources/pombase_data/{release_version}/
    ├── genome_sequence_and_features/
    ├── Gene_metadata/
    └── Protein_features/
```
### Key Output Files

- `results/{project_name}/13_final_results/` - Final depletion analysis results
- `reports/{project_name}/multiqc/quality_control_multiqc_report.html` - Comprehensive QC report
- `reports/{project_name}/mapping_filtering_statistics/mapping_filtering_statistics.tsv` - Mapping statistics
- `results/{project_name}/12_gene_level_curve_fitting/gene_level_fitting_statistics.tsv` - Gene-level curve fitting results
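The key outputs are plain TSV files and can be inspected directly. For example (the path assumes `project_name: "my_experiment"`; column names depend on the pipeline version, so this snippet only loads and previews the table):

```python
# Preview the gene-level curve fitting results; no column names assumed.
import pandas as pd

stats = pd.read_csv(
    "results/my_experiment/12_gene_level_curve_fitting/"
    "gene_level_fitting_statistics.tsv",
    sep="\t",
)
print(stats.columns.tolist())
print(stats.head())
```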
## Examples

### Preset Configurations

```bash
# Use the HD generation +1 configuration
snakemake --configfile config/config_HD_generationPLUS1.yaml --use-conda --cores 24

# Use the LD haploid configuration
snakemake --configfile config/config_LD_haploid.yaml --use-conda --cores 16
```

### Individual Stages

```bash
# Run preprocessing and mapping only
snakemake bwa_mapping --use-conda --cores 12

# Run only quality control
snakemake multiqc --use-conda --cores 4

# Run only depletion analysis
snakemake gene_level_curve_fitting --use-conda --cores 8
```

### Custom Time Points

For a custom experiment with different time points:
```yaml
# In your config file
time_points:
  - 0
  - 2
  - 4
  - 6
  - 8
  - 12
  - 24
```

## Troubleshooting

### Common Issues

- **Memory issues**: Reduce `chunk_size` in the configuration or use fewer cores
- **Conda environment errors**: Use `--conda-frontend mamba` for faster dependency resolution
- **Mapping failures**: Check adapter sequences and reference genome integrity
- **Sample sheet errors**: Ensure the file is tab-separated and the column names are correct
### Debugging

```bash
# Enable verbose logging
snakemake --use-conda --cores 8 --printshellcmds --reason

# Keep temporary files for inspection
snakemake --use-conda --cores 8 --notemp

# Print detailed error messages
snakemake --use-conda --cores 8 --show-failed-logs
```

### Log Locations

- Snakemake logs: `logs/{project_name}/`
- Rule-specific logs: `logs/{project_name}/{stage}/`
- MultiQC report: `reports/{project_name}/multiqc/`
## Citation

If you use DIT-HAP in your research, please cite:

> DIT-HAP: Diploid for Insertional Mutagenesis by Transposon and Haploid for Analysis of Phenotype
> [Yusheng Yang et al., Year]

GitHub repository: https://github.com/DIT-HAP/DIT_HAP_pipeline
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.
## Support

For questions and support:
- Open an issue on GitHub
- Check the Snakemake documentation
- Review the workflow logs for detailed error messages
---

*DIT-HAP Pipeline - comprehensive analysis of piggyBac transposon insertion sequencing data*