DavidsonGroup · Shians · Feb 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -5,8 +5,5 @@
 /modules/fuscia
 /modules/flexiplex
 /modules/STAR
-<<<<<<< HEAD
-./.git/modules/STAR/objects/pack/pack-f67e728172242412d5bda22d0469194f2044564f.pack
-./modules/STAR/.git/objects/pack/pack-f67e728172242412d5bda22d0469194f2044564f.pack
-=======
->>>>>>> a4f53fdf61082f99d2cded4e28076871d04232ef
+CLAUDE.md
+.vscode/settings.json
diff --git a/.gitmodules b/.gitmodules
diff --git a/README.md b/README.md
@@ -1,6 +1,166 @@
-# [pears](https://davidsongroup.github.io/pears/)
-Pipeline for g*e*ne-fusion seArching in Rna Single-cell sequences 
-- Aim: combine and increase finding power of fusions in single-cell RNA-seq data
-- Adapted from: fuscia (Steven Foltz 2019) and flexiplex (Davidson et al. 2022)
+# PEARS
 
-- please run this only on VAST if you're running pears on the WEHI hpc
+**P**ipeline for g**e**ne-fusion se**a**rching in **R**na **S**ingle-cell sequences
+
+PEARS is a Nextflow DSL2 pipeline that detects gene fusions at single-cell resolution from 10X scRNA-seq data. It combines three complementary fusion-calling approaches — [FUSCIA](https://github.com/ding-lab/fuscia), [Flexiplex](https://github.com/DavidsonGroup/flexiplex), and [Arriba](https://github.com/suhrig/arriba) — and assigns cell barcodes to each detected fusion event, producing per-cell fusion calls.
+
+## Pipeline overview
+
+1. **Reference preparation** — Downloads genome FASTA and GTF annotation (or uses pre-built references).
+2. **Fusion target generation** — Builds search targets from a known fusions list using the reference annotation.
+3. **Alignment** — Aligns reads with STARsolo (chimeric-aware) and produces a BAM and single-cell count matrix.
+4. **Fusion detection** - Calls fusions using [FUSCIA](https://github.com/ding-lab/fuscia), [Flexiplex](https://github.com/DavidsonGroup/flexiplex), and [Arriba](https://github.com/suhrig/arriba) in parallel.
+5. **Formatting** — Consolidates results into three CSV files of per-cell fusion calls (`fuscia_fusion_calls.csv`, `flexiplex_fusion_calls.csv`, `arriba_fusion_calls.csv`).
+
+## Requirements
+
+- [Nextflow](https://www.nextflow.io/) (>= 22.10)
+- [Conda](https://docs.conda.io/) (environments are built automatically from `env/pears_env.yml`)
+
+## Usage
+
+Running locally:
+
+```bash
+nextflow run DavidsonGroup/pears \
+  --fastq_r1 "/path/to/Reads_R1.fastq.gz" \
+  --fastq_r2 "/path/to/Reads_R2.fastq.gz" \
+  --known_fusions_list "known_fusions.csv" \
+  --protocol "10x-3prime-v3" \
+  --genome_version "GRCh38+GENCODE44" \
+  -profile "local"
+```
+
+Running on SLURM cluster:
+
+```bash
+nextflow run DavidsonGroup/pears \
+  --fastq_r1 "/path/to/Reads_R1.fastq.gz" \
+  --fastq_r2 "/path/to/Reads_R2.fastq.gz" \
+  --known_fusions_list "known_fusions.csv" \
+  --protocol "10x-3prime-v3" \
+  --genome_version "GRCh38+GENCODE44" \
+  -profile "slurm"
+```
+
+## Arguments
+
+### Basic
+
+| Argument | Default | Description |
+|---|---|---|
+| `--fastq_r1` | — | Glob pattern or path to Read 1 FASTQ files (gzipped). |
+| `--fastq_r2` | — | Glob pattern or path to Read 2 FASTQ files (gzipped). |
+| `--known_fusions_list` | — | CSV file of known/candidate fusions to search for (see [Known fusions list format](#known-fusions-list-format)). |
+| `--protocol` | — | 10x Chromium chemistry preset (see [Protocol presets](#protocol-presets)). Sets the barcode whitelist and UMI length automatically. |
+| `--genome_version` | `GRCh38+GENCODE44` | Genome build to download. Available versions: `GRCh38+GENCODE40` through `GRCh38+GENCODE49`. |
+| `--out_dir` | `pears_output` | Directory for all pipeline outputs. |
+| `-profile` | — | Execution environment: `local` or `slurm`. |
+
+### Protocol presets
+
+`--protocol` sets the barcode whitelist and UMI length for the given 10x chemistry. These values can be individually overridden with `--barcode_include_list` and `--umi_len` (see [Read structure overrides](#read-structure-overrides)).
+
+| Preset | Chemistry | UMI length | Barcode whitelist |
+|---|---|---|---|
+| `10x-3prime-v2` | 3' Gene Expression v2 | 10 bp | 737K-august-2016 |
+| `10x-3prime-v3` | 3' Gene Expression v3/v3.1 | 12 bp | 3M-february-2018 |
+| `10x-3prime-v4` | 3' Gene Expression v4 | 12 bp | 3M-3pgex-may-2023 |
+| `10x-5prime-v2` | 5' Gene Expression v1/v2 | 10 bp | 737K-august-2016 |
+| `10x-5prime-v3` | 5' Gene Expression v3 | 12 bp | 3M-5pgex-jan-2023 |
+
+### Read structure overrides
+
+`--barcode_include_list` and `--umi_len` override the corresponding values set by `--protocol`. If `--protocol` is omitted entirely, **both** must be provided.
+
+| Argument | Default | Overrides |
+|---|---|---|
+| `--barcode_include_list` | *set by `--protocol`* | Barcode whitelist. Path to a custom whitelist file (can be gzipped). |
+| `--umi_len` | *set by `--protocol`* | UMI length in bases. |
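The override rules described above can be sketched in Python. This is only an illustration of the documented behaviour, not the pipeline's own Nextflow logic; the function name `resolve_read_structure` and the `PRESETS` dictionary are hypothetical, and the whitelist values are the names from the presets table rather than real file paths.

```python
# Preset table copied from the README: protocol -> (barcode whitelist name, UMI length in bp).
PRESETS = {
    "10x-3prime-v2": ("737K-august-2016", 10),
    "10x-3prime-v3": ("3M-february-2018", 12),
    "10x-3prime-v4": ("3M-3pgex-may-2023", 12),
    "10x-5prime-v2": ("737K-august-2016", 10),
    "10x-5prime-v3": ("3M-5pgex-jan-2023", 12),
}


def resolve_read_structure(protocol=None, barcode_include_list=None, umi_len=None):
    """Return (whitelist, umi_len) after applying the documented override rules.

    Hypothetical helper: explicit values win over the preset, and both
    explicit values are required when no protocol is given.
    """
    if protocol is None:
        if barcode_include_list is None or umi_len is None:
            raise ValueError(
                "without --protocol, both --barcode_include_list and --umi_len are required"
            )
        return barcode_include_list, umi_len
    if protocol not in PRESETS:
        raise ValueError(f"unknown protocol preset: {protocol}")
    preset_whitelist, preset_umi_len = PRESETS[protocol]
    whitelist = barcode_include_list if barcode_include_list is not None else preset_whitelist
    return whitelist, umi_len if umi_len is not None else preset_umi_len


preset_only = resolve_read_structure("10x-3prime-v3")
umi_overridden = resolve_read_structure("10x-3prime-v3", umi_len=10)
```

Here `preset_only` takes both values from the table, while `umi_overridden` keeps the preset whitelist and replaces only the UMI length.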
+
+### Pre-built reference overrides
+
+By default, the pipeline downloads the genome specified by `--genome_version` and builds the STAR index automatically. To skip this, provide **all three** arguments below — `--genome_version` is then ignored.
+
+| Argument | Default | Overrides |
+|---|---|---|
+| `--ref_fasta` | *downloaded* | Genome FASTA. Must be provided together with `--ref_gtf` and `--star_genome_index`. |
+| `--ref_gtf` | *downloaded* | GTF annotation. Must be provided together with `--ref_fasta` and `--star_genome_index`. |
+| `--star_genome_index` | *built from download* | STAR genome index directory. Must be provided together with `--ref_fasta` and `--ref_gtf`. |
+
+### Tool parameters
+
+| Argument | Default | Description |
+|---|---|---|
+| `--flexiplex_searchlen` | `20` | Length of fusion junction sequence to search for (2x actual overlap). |
+| `--flexiplex_demultiplex_options` | *auto-generated* | Flexiplex demultiplexing options string. When not set, auto-generated as `-b "?{barcode_len}" -u "?{umi_len}" -e 1 -f 0` where barcode length is read from the whitelist file and UMI length comes from `--protocol` or `--umi_len`. Setting this explicitly overrides the auto-generated value. |
+| `--fuscia_mapqual` | `30` | Minimum mapping quality for FUSCIA read extraction. |
+| `--fuscia_up` | `1000` | Upstream search distance (bp) when no gene annotation is available. |
+| `--fuscia_down` | `1000` | Downstream search distance (bp) when no gene annotation is available. |
+
+## Known fusions list format
+
+The `--known_fusions_list` input is a CSV with the following required columns:
+
+| Column | Description |
+|---|---|
+| `fusion genes` | Fusion gene pair separated by `--` (e.g. `BCAS4--BCAS3`). |
+| `chrom1` | Chromosome of gene 1 (e.g. `chr20`). |
+| `base1` | Breakpoint position of gene 1. |
+| `strand1` | Strand of gene 1 (`+` or `-`). |
+| `chrom2` | Chromosome of gene 2 (e.g. `chr17`). |
+| `base2` | Breakpoint position of gene 2. |
+| `strand2` | Strand of gene 2 (`+` or `-`). |
+
+Additional columns (e.g. `classification`) are ignored. This format is compatible with [JAFFA](https://github.com/Oshlack/JAFFA) output. Fusions involving mitochondrial genes (`MT-`) are automatically filtered out.
+
+Example:
+
+```
+fusion genes,chrom1,base1,strand1,chrom2,base2,strand2,classification
+BCAS4--BCAS3,chr20,50795173,+,chr17,61368327,+,HighConfidence
+RPS6KB1--VMP1,chr17,59914703,+,chr17,59839768,+,HighConfidence
+SLC25A24--NBPF6,chr1,108161182,-,chr1,108470597,+,HighConfidence
+```
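A quick pre-flight check of a known fusions list might look like the sketch below. It assumes the required columns listed above and treats a fusion as mitochondrial when either gene name starts with `MT-`; how the pipeline itself applies that filter is not shown in this diff, so the prefix test is an assumption. The function name `load_known_fusions` and the `MT-CO1` row are invented for illustration.

```python
import csv
import io

# Required columns as documented in the README table.
REQUIRED_COLUMNS = [
    "fusion genes", "chrom1", "base1", "strand1", "chrom2", "base2", "strand2",
]


def load_known_fusions(handle):
    """Read a known-fusions CSV, keeping required columns and skipping MT- fusions."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
    rows = []
    for row in reader:
        genes = row["fusion genes"].split("--")
        if len(genes) != 2:
            raise ValueError(f"expected GENE1--GENE2, got: {row['fusion genes']}")
        # Assumption: a mitochondrial gene is recognised by its MT- prefix.
        if any(gene.startswith("MT-") for gene in genes):
            continue
        # Extra columns such as `classification` are dropped.
        rows.append({column: row[column] for column in REQUIRED_COLUMNS})
    return rows


# In-memory stand-in for a real file; the second row is made up to show the filter.
example = io.StringIO(
    "fusion genes,chrom1,base1,strand1,chrom2,base2,strand2,classification\n"
    "BCAS4--BCAS3,chr20,50795173,+,chr17,61368327,+,HighConfidence\n"
    "MT-CO1--BCAS3,chrM,7000,+,chr17,61368327,+,LowConfidence\n"
)
fusions = load_known_fusions(example)
```

Running a check like this before launching the pipeline surfaces a malformed header early instead of partway through a long alignment job.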
+
+## Output
+
+Results are written to `--out_dir` (default `pears_output/`):
+
+| File/Directory | Description |
+|---|---|
+| `fuscia_fusion_calls.csv` | Per-cell fusion calls from FUSCIA. |
+| `flexiplex_fusion_calls.csv` | Per-cell fusion calls from Flexiplex. |
+| `arriba_fusion_calls.csv` | Per-cell fusion calls from Arriba. |
+| `STARsolo/` | BAM alignment, index, and single-cell count matrix. |
+| `fuscia_out/` | Per-fusion FUSCIA discordant read files. |
+| `flexiplex_out/` | Per-fusion Flexiplex barcode files. |
+| `arriba_out/` | Arriba fusion table and per-fusion barcode files. |
+| `fusion_targets.csv` | Generated fusion target coordinates and sequences. |
+| `nextflow_report.html` | Nextflow execution report. |
+| `nextflow_trace.txt` | Nextflow process trace log. |
+
+### Fusion calls CSV format
+
+The three fusion calls CSVs (`fuscia_fusion_calls.csv`, `flexiplex_fusion_calls.csv`, `arriba_fusion_calls.csv`) share the same format:
+
+| Column | Description |
+|---|---|
+| `cell_barcode` | 10x cell barcode (CB tag) identifying the single cell. The trailing `-1` suffix is stripped. |
+| `molecular_barcode` | Unique Molecular Identifier (UMI / UB tag) distinguishing distinct RNA molecules from PCR duplicates within the same cell. |
+| `fusion` | Detected gene fusion, formatted as `GENE1--GENE2` (names taken from the `--known_fusions_list` input). |
+
+Example:
+
+```
+cell_barcode,molecular_barcode,fusion
+CCACAAAAGGTTCTTG,CAGGGATCAGTA,JPH1--NCOA2
+CAGGGCTCACTTGGGC,TGATAGGAATCG,JPH1--NCOA2
+GTGTGGCGTGGCCCAT,GGTAATCAGCAA,KIAA1429--RP11-586K2.1
+```
+
+Each row represents one unique observation of a fusion transcript in a specific cell. Rows are deduplicated — each (cell_barcode, molecular_barcode, fusion) triple appears at most once. Multiple rows with different cell barcodes for the same fusion indicate independent cells harbouring that fusion. Multiple rows with the same cell barcode but different UMIs for the same fusion indicate multiple distinct fusion transcript molecules captured in that cell, providing stronger evidence. Fusions detected by more than one tool in the same cell are higher confidence and can be identified by cross-referencing the three CSVs.
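The cross-referencing step mentioned above could be done along these lines. This is a minimal sketch using only the standard library, assuming the three-column format documented for the fusion calls CSVs; the helper name `tools_per_cell_fusion` is hypothetical, and the in-memory rows (including which tool reported them) are invented stand-ins for the real output files.

```python
import csv
import io
from collections import defaultdict


def tools_per_cell_fusion(calls_by_tool):
    """Map (cell_barcode, fusion) to the set of tools that reported it."""
    support = defaultdict(set)
    for tool, handle in calls_by_tool.items():
        for row in csv.DictReader(handle):
            support[(row["cell_barcode"], row["fusion"])].add(tool)
    return support


header = "cell_barcode,molecular_barcode,fusion\n"
# Stand-ins for fuscia_fusion_calls.csv, flexiplex_fusion_calls.csv, arriba_fusion_calls.csv.
calls = {
    "fuscia": io.StringIO(header + "CCACAAAAGGTTCTTG,CAGGGATCAGTA,JPH1--NCOA2\n"),
    "flexiplex": io.StringIO(
        header
        + "CCACAAAAGGTTCTTG,CAGGGATCAGTA,JPH1--NCOA2\n"
        + "CAGGGCTCACTTGGGC,TGATAGGAATCG,JPH1--NCOA2\n"
    ),
    "arriba": io.StringIO(header),
}

support = tools_per_cell_fusion(calls)
# Keep only cell/fusion pairs reported by at least two tools.
multi_tool = {key: sorted(tools) for key, tools in support.items() if len(tools) >= 2}
```

With real data, the `io.StringIO` objects would be replaced by open file handles on the three CSVs in the output directory.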
+
+## Credits
+
+Adapted from [FUSCIA](https://github.com/ding-lab/fuscia) (Steven Foltz, 2019) and [Flexiplex](https://github.com/DavidsonGroup/flexiplex) (Davidson et al., 2022).
diff --git a/assets/3M-3pgex-may-2023.txt.gz b/assets/3M-3pgex-may-2023.txt.gz
diff --git a/assets/3M-5pgex-jan-2023.txt.gz b/assets/3M-5pgex-jan-2023.txt.gz
diff --git a/modules/barcodes/737K-august-2016.txt → assets/3M-february-2018.txt.gz b/modules/barcodes/737K-august-2016.txt → assets/3M-february-2018.txt.gz
diff --git a/assets/737K-august-2016.txt.gz b/assets/737K-august-2016.txt.gz
diff --git a/assets/visium-v1.txt.gz b/assets/visium-v1.txt.gz
diff --git a/assets/visium-v2.txt.gz b/assets/visium-v2.txt.gz
diff --git a/assets/visium-v3.txt.gz b/assets/visium-v3.txt.gz
diff --git a/assets/visium-v4.txt.gz b/assets/visium-v4.txt.gz
diff --git a/assets/visium-v5.txt.gz b/assets/visium-v5.txt.gz
diff --git a/bin/calculate_read_length.py b/bin/calculate_read_length.py
@@ -0,0 +1,89 @@
+#!/usr/bin/env python3
+"""
+Calculate the mode of R2 read lengths from FASTQ files.
+Emits a warning if >10% of reads differ from the mode.
+
+Usage:
+    calculate_read_length.py <fastq_file> [<fastq_file> ...]
+
+Output:
+    Prints the mode read length to stdout.
+    Warnings are printed to stderr.
+"""
+
+import gzip
+import sys
+from collections import Counter
+from pathlib import Path
+
+
+def open_fastq(filepath):
+    """Open a FASTQ file, handling gzip if needed."""
+    if str(filepath).endswith('.gz'):
+        return gzip.open(filepath, 'rt')
+    return open(filepath, 'r')
+
+
+def get_read_lengths(fastq_files, max_reads=10000):
+    """Extract read lengths from the first max_reads of FASTQ files."""
+    lengths = []
+    reads_counted = 0
+
+    for fq in sorted(fastq_files):
+        if reads_counted >= max_reads:
+            break
+        with open_fastq(fq) as f:
+            line_num = 0
+            for line in f:
+                line_num += 1
+                # Sequence is on line 2 of each 4-line FASTQ record
+                if line_num % 4 == 2:
+                    lengths.append(len(line.strip()))
+                    reads_counted += 1
+                    if reads_counted >= max_reads:
+                        break
+    return lengths
+
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: calculate_read_length.py <fastq_file> [<fastq_file> ...]", file=sys.stderr)
+        sys.exit(1)
+
+    fastq_files = [Path(f) for f in sys.argv[1:]]
+
+    # Validate files exist
+    for fq in fastq_files:
+        if not fq.exists():
+            print(f"ERROR: File not found: {fq}", file=sys.stderr)
+            sys.exit(1)
+
+    # Get read lengths
+    lengths = get_read_lengths(fastq_files, max_reads=10000)
+
+    if not lengths:
+        print("ERROR: Could not read any sequences from FASTQ files", file=sys.stderr)
+        sys.exit(1)
+
+    # Calculate mode
+    length_counts = Counter(lengths)
+    mode_length, mode_count = length_counts.most_common(1)[0]
+    total_reads = len(lengths)
+
+    # Calculate percentage of reads matching mode
+    mode_percentage = (mode_count / total_reads) * 100
+    non_mode_percentage = 100 - mode_percentage
+
+    # Warn if >10% of reads differ from mode
+    if non_mode_percentage > 10:
+        print(f"WARNING: {non_mode_percentage:.1f}% of reads have length different from mode ({mode_length}bp)", file=sys.stderr)
+        print(f"         Length distribution: {dict(length_counts.most_common(5))}", file=sys.stderr)
+
+    print(f"R2 read length mode: {mode_length}bp ({mode_percentage:.1f}% of {total_reads} reads sampled)", file=sys.stderr)
+
+    # Output the mode length to stdout
+    print(mode_length)
+
+
+if __name__ == "__main__":
+    main()
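The sampling logic in this script can be exercised on a tiny synthetic FASTQ. The sketch below is an alternative single-file formulation using `itertools.islice` rather than the script's line counter, written only to illustrate the mode and non-mode percentage calculation; the function name `sample_mode_length` and the toy records are hypothetical.

```python
import gzip
import os
import tempfile
from collections import Counter
from itertools import islice


def sample_mode_length(path, max_reads=10000):
    """Return (mode read length, percent of sampled reads not at the mode)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Sequence lines are the 2nd line of every 4-line record: index 1, step 4.
        seq_lines = islice(handle, 1, None, 4)
        lengths = Counter(len(line.strip()) for line in islice(seq_lines, max_reads))
    if not lengths:
        raise ValueError("no sequences found")
    mode_length, mode_count = lengths.most_common(1)[0]
    non_mode_pct = 100 * (1 - mode_count / sum(lengths.values()))
    return mode_length, non_mode_pct


# Toy data: three 8 bp reads and one 4 bp read.
records = [("r1", "ACGTACGT"), ("r2", "ACGTACGT"), ("r3", "ACGTACGT"), ("r4", "ACGT")]
with tempfile.TemporaryDirectory() as tmp:
    fastq = os.path.join(tmp, "toy_R2.fastq.gz")
    with gzip.open(fastq, "wt") as out:
        for name, seq in records:
            out.write(f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n")
    mode_length, non_mode_pct = sample_mode_length(fastq)
```

In this toy file a quarter of the reads differ from the mode, which is above the 10% threshold at which the script prints its warning.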
diff --git a/subworkflows/collate.R → bin/collate.R b/subworkflows/collate.R → bin/collate.R
diff --git a/bin/download_references.py b/bin/download_references.py
@@ -0,0 +1,65 @@
+#!/usr/bin/env python
+
+import argparse
+import urllib.request
+import gzip
+
+GENOME_LINKS = {
+    "GRCh38": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/GRCh38.primary_assembly.genome.fa.gz"
+}
+
+ANNO_LINKS = {
+    "GENCODE40": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz",
+    "GENCODE41": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz",
+    "GENCODE42": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.annotation.gtf.gz",
+    "GENCODE43": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz",
+    "GENCODE44": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz",
+    "GENCODE45": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz",
+    "GENCODE46": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz",
+    "GENCODE47": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.annotation.gtf.gz",
+    "GENCODE48": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz",
+    "GENCODE49": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/gencode.v49.annotation.gtf.gz"
+}
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Download reference genomes for PEARS")
+    parser.add_argument("--reference", required=True, help="Reference to download")
+    args = parser.parse_args()
+
+    # Reference should be specified in the format "GENOME+ANNO", e.g. "GRCh38+GENCODE49"
+    if "+" not in args.reference:
+        # SystemExit prints the message to stderr and exits non-zero, so the calling process sees the failure
+        raise SystemExit("Error: Reference should be specified in the format 'GENOME+ANNO', e.g. 'GRCh38+GENCODE49'")
+
+    genome, anno = args.reference.split("+", 1)
+    if genome not in GENOME_LINKS:
+        # Fail with a non-zero exit code rather than returning silently
+        raise SystemExit(f"Error: Genome '{genome}' not found. Available genomes: {', '.join(GENOME_LINKS.keys())}")
+
+    if anno not in ANNO_LINKS:
+        # Same failure path for an unknown annotation release
+        raise SystemExit(f"Error: Annotation '{anno}' not found. Available annotations: {', '.join(ANNO_LINKS.keys())}")
+
+    genome_link = GENOME_LINKS[genome]
+    anno_link = ANNO_LINKS[anno]
+
+    genome_name = genome_link.split("/")[-1]
+    anno_name = anno_link.split("/")[-1]
+
+    print(f"Downloading genome from {genome_link}...")
+    urllib.request.urlretrieve(genome_link, genome_name)
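+    with gzip.open(genome_name, "rb") as f_in, open(genome_name.replace(".gz", ""), "wb") as f_out:
+        for chunk in iter(lambda: f_in.read(1 << 20), b""):  # stream in 1 MiB chunks; the genome is too large to hold in memory
+            f_out.write(chunk)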
+    print("Download complete.")
+
+    print(f"Downloading annotation from {anno_link}...")
+    urllib.request.urlretrieve(anno_link, anno_name)
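+    with gzip.open(anno_name, "rb") as f_in, open(anno_name.replace(".gz", ""), "wb") as f_out:
+        for chunk in iter(lambda: f_in.read(1 << 20), b""):  # stream in 1 MiB chunks rather than reading the whole file
+            f_out.write(chunk)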
+    print("Download complete.")
+
+if __name__ == "__main__":
+    main()
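The `GENOME+ANNO` parsing in this script could also be factored into a small function that raises on bad input, which makes it easier to unit test. The sketch below is one possible shape, not the script's actual structure; `parse_reference`, `GENOMES`, and `ANNOTATIONS` are hypothetical names, and the sets only mirror the keys of `GENOME_LINKS` and `ANNO_LINKS` above.

```python
def parse_reference(reference, genomes, annotations):
    """Split 'GENOME+ANNO' and validate both parts against the known keys."""
    genome, sep, anno = reference.partition("+")
    if not sep or not genome or not anno or "+" in anno:
        raise ValueError(
            f"reference must look like 'GENOME+ANNO', e.g. 'GRCh38+GENCODE49'; got '{reference}'"
        )
    if genome not in genomes:
        raise ValueError(f"genome '{genome}' not found; available: {', '.join(sorted(genomes))}")
    if anno not in annotations:
        raise ValueError(
            f"annotation '{anno}' not found; available: {', '.join(sorted(annotations))}"
        )
    return genome, anno


# Stand-ins for the keys of GENOME_LINKS and ANNO_LINKS (GENCODE 40 through 49).
GENOMES = {"GRCh38"}
ANNOTATIONS = {f"GENCODE{version}" for version in range(40, 50)}

parsed = parse_reference("GRCh38+GENCODE44", GENOMES, ANNOTATIONS)
```

A caller such as `main()` could catch the `ValueError`, print it, and exit with a non-zero status.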
diff --git a/subworkflows/extract_arriba.py → bin/extract_arriba.py b/subworkflows/extract_arriba.py → bin/extract_arriba.py