Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
80 commits
Select commit Hold shift + click to select a range
edb632a
Fixed arriba process
Shians Feb 4, 2026
a25fef9
Fixed flexiplex process
Shians Feb 4, 2026
88b53ac
Fixed fuscia process
Shians Feb 4, 2026
637e872
Fixed masterdata process
Shians Feb 4, 2026
aae1af6
Fixed STARsolo process
Shians Feb 4, 2026
a86a519
Fixed main process
Shians Feb 4, 2026
c816f56
Fix formatting in config
Shians Feb 4, 2026
e7096c7
Updated STARSolo process to use parameters
Shians Feb 4, 2026
3e2185f
Fixed deprecated uppercase Channel usage
Shians Feb 4, 2026
01d7b00
Updated gitignore
Shians Feb 4, 2026
5086a0e
Updated gitignore
Shians Feb 4, 2026
262082c
Changed formatFuscia to take input
Shians Feb 4, 2026
c7798f0
Added labels for resource allocation
Shians Feb 4, 2026
bf1ef3c
Changed output capture for runFuscia to be more specific
Shians Feb 4, 2026
5193154
Changed variables names to snake_case
Shians Feb 4, 2026
368c2fa
Updated param names for clarity
Shians Feb 4, 2026
ad462c3
Changed process names to camelCase
Shians Feb 4, 2026
68b7f5a
Updated environment file
Shians Feb 4, 2026
72f5f40
Removed tool init script in favour of conda env file
Shians Feb 4, 2026
28433d8
Fixed indentation
Shians Feb 4, 2026
6d222f2
Moved python script into bin
Shians Feb 4, 2026
b139e09
Removed git submodules
Shians Feb 4, 2026
115cbc3
Moved barcodes into assets folder
Shians Feb 4, 2026
baa0829
Moved fuscia code to bin
Shians Feb 4, 2026
73e8c47
Moved fuscia code into bin
Shians Feb 4, 2026
a5bcbd1
Updated processes to directly call programs without projectDir
Shians Feb 4, 2026
e4b90ce
Restructured directories
Shians Feb 5, 2026
b7d86bc
Compressed inclusion lists
Shians Feb 5, 2026
930537c
Added handling for gzipped inclusion list
Shians Feb 5, 2026
8d9a49a
Moved download_references.sh to bin
Shians Feb 5, 2026
4ca827c
Added explicit sh call
Shians Feb 5, 2026
2bb60de
Added default environment
Shians Feb 5, 2026
ccfc9ac
Added parameter validation
Shians Feb 5, 2026
23ed910
Added necessary genome and reference versions
Shians Feb 5, 2026
dc04022
Changed R2 read length to be calculated
Shians Feb 5, 2026
c5ee859
Changed index to be compiled based on genome version param
Shians Feb 5, 2026
f59ffba
Removed arity argument
Shians Feb 5, 2026
b49f280
Added named outputs for STARsolo process
Shians Feb 5, 2026
c1566be
Explicit DSL2
Shians Feb 5, 2026
dae8e8f
Renamed files
Shians Feb 5, 2026
6d6ff22
Updated file names
Shians Feb 5, 2026
f389167
Restricted conda channels
Shians Feb 5, 2026
755e65d
Updated env file versions
Shians Feb 5, 2026
b0235a7
Changed files to download as is
Shians Feb 5, 2026
244a26d
Updated parameter defaults and documentation
Shians Feb 5, 2026
b93f711
Update conda env creation options
Shians Feb 5, 2026
b0fc388
Updated script calls
Shians Feb 5, 2026
a7210b9
Changed reference download to use internal code
Shians Feb 6, 2026
05ffa1f
Added options for providing existing references and index
Shians Feb 6, 2026
d4e30c0
Passing reference fasta and gtf into arriba as arguments
Shians Feb 6, 2026
382d826
Remove redundant genome indexing process
Shians Feb 6, 2026
3dcc705
Changed mapqual to argument to pass into fuscia process
Shians Feb 6, 2026
a22507f
Renamed masterdata to fusion_targets
Shians Feb 6, 2026
21059f8
Changed direct params access inside processes to arguments
Shians Feb 6, 2026
95792bd
Split formatFlexiplex and formatArriba
Shians Feb 6, 2026
7e63dbc
Changed variables to snake_case
Shians Feb 6, 2026
10ec12a
Fixed genFusionTargets
Shians Feb 6, 2026
d7d1043
Fixed output name of gen_fusion_targets.py
Shians Feb 6, 2026
4305885
Added tracing and reporting by default
Shians Feb 7, 2026
d024090
Added trace overwrite and defaultBranch
Shians Feb 7, 2026
d523b66
Added protocol presets
Shians Feb 7, 2026
46d14b5
Updated downloaded genome reference folder to use _ instead of +
Shians Feb 8, 2026
36bc8cf
Updated variable syntax for consistency
Shians Feb 8, 2026
ee6889b
Changed execution permission
Shians Feb 8, 2026
9466b0d
Changed variable usage in processed to be consistent
Shians Feb 8, 2026
2415760
Moved output location of STAR_index
Shians Feb 8, 2026
7d4d55b
Changed samtools path as it's now managed by conda
Shians Feb 8, 2026
b139657
Refactored output formatting to external python script
Shians Feb 9, 2026
791fb6e
Removed comment
Shians Feb 9, 2026
dd26018
Added automatic generation of flexiplex arguments
Shians Feb 9, 2026
f57b6bb
Fixed flexiplex pattern generation
Shians Feb 9, 2026
0f48d02
Changed input/output declarations for consistent syntax
Shians Feb 10, 2026
6d5b3b2
Removed excess comments
Shians Feb 10, 2026
b3a11d8
Changed fusion caller output summary file names
Shians Feb 10, 2026
5d7b280
Updated README
Shians Feb 10, 2026
8b9985e
Updated README
Shians Feb 10, 2026
ae90052
Updated README
Shians Feb 10, 2026
ca351b5
Fixed FUSICA link
Shians Feb 10, 2026
7d3b1e4
Updated resource tiers
Shians Feb 10, 2026
f4632ca
Updated README
Shians Feb 10, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 2 additions & 5 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,5 @@
/modules/fuscia
/modules/flexiplex
/modules/STAR
<<<<<<< HEAD
./.git/modules/STAR/objects/pack/pack-f67e728172242412d5bda22d0469194f2044564f.pack
./modules/STAR/.git/objects/pack/pack-f67e728172242412d5bda22d0469194f2044564f.pack
=======
>>>>>>> a4f53fdf61082f99d2cded4e28076871d04232ef
CLAUDE.md
.vscode/settings.json
9 changes: 0 additions & 9 deletions .gitmodules

This file was deleted.

170 changes: 165 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,166 @@
# [pears](https://davidsongroup.github.io/pears/)
Pipeline for g*e*ne-fusion seArching in Rna Single-cell sequences
- Aim: combine and increase finding power of fusions in single-cell RNA-seq data
- Adapted from: fuscia (Steven Foltz 2019) and flexiplex (Davidson et al. 2022)
# PEARS

- please run this only on VAST if you're running pears on the WEHI hpc
**P**ipeline for g**e**ne-fusion se**a**rching in **R**na **S**ingle-cell sequences

PEARS is a Nextflow DSL2 pipeline that detects gene fusions at single-cell resolution from 10X scRNA-seq data. It combines three complementary fusion-calling approaches — [FUSCIA](https://github.com/ding-lab/fuscia), [Flexiplex](https://github.com/DavidsonGroup/flexiplex), and [Arriba](https://github.com/suhrig/arriba) — and assigns cell barcodes to each detected fusion event, producing per-cell fusion calls.

## Pipeline overview

1. **Reference preparation** — Downloads genome FASTA and GTF annotation (or uses pre-built references).
2. **Fusion target generation** — Builds search targets from a known fusions list using the reference annotation.
3. **Alignment** — Aligns reads with STARsolo (chimeric-aware) and produces a BAM and single-cell count matrix.
4. **Fusion detection** - Calls fusions using [FUSCIA](https://github.com/ding-lab/fuscia), [Flexiplex](https://github.com/DavidsonGroup/flexiplex), and [Arriba](https://github.com/suhrig/arriba) in parallel.
5. **Formatting** — Consolidates results into three CSV files of per-cell fusion calls (`fuscia_fusion_calls.csv`, `flexiplex_fusion_calls.csv`, `arriba_fusion_calls.csv`).

## Requirements

- [Nextflow](https://www.nextflow.io/) (>= 22.10)
- [Conda](https://docs.conda.io/) (environments are built automatically from `env/pears_env.yml`)

## Usage

Running locally:

```bash
nextflow run DavidsonLab/pears \
--fastq_r1 "/path/to/Reads_R1.fastq.gz" \
--fastq_r2 "/path/to/Reads_R2.fastq.gz" \
--known_fusions_list "known_fusions.csv" \
--protocol "10x-3prime-v3" \
--genome_version "GRCh38+GENCODE44" \
-profile "local"
```

Running on SLURM cluster:

```bash
nextflow run DavidsonLab/pears \
--fastq_r1 "/path/to/Reads_R1.fastq.gz" \
--fastq_r2 "/path/to/Reads_R2.fastq.gz" \
--known_fusions_list "known_fusions.csv" \
--protocol "10x-3prime-v3" \
--genome_version "GRCh38+GENCODE44" \
-profile "slurm"
```

## Arguments

### Basic

| Argument | Default | Description |
|---|---|---|
| `--fastq_r1` | — | Glob pattern or path to Read 1 FASTQ files (gzipped). |
| `--fastq_r2` | — | Glob pattern or path to Read 2 FASTQ files (gzipped). |
| `--known_fusions_list` | — | CSV file of known/candidate fusions to search for (see [Known fusions list format](#known-fusions-list-format)). |
| `--protocol` | — | 10x Chromium chemistry preset (see [Protocol presets](#protocol-presets)). Sets the barcode whitelist and UMI length automatically. |
| `--genome_version` | `GRCh38+GENCODE44` | Genome build to download. Available versions: `GRCh38+GENCODE40` through `GRCh38+GENCODE49`. |
| `--out_dir` | `pears_output` | Directory for all pipeline outputs. |
| `-profile` | — | Execution environment: `local` or `slurm`. |

### Protocol presets

`--protocol` sets the barcode whitelist and UMI length for the given 10x chemistry. These values can be individually overridden with `--barcode_include_list` and `--umi_len` (see [Read structure overrides](#read-structure-overrides)).

| Preset | Chemistry | UMI length | Barcode whitelist |
|---|---|---|---|
| `10x-3prime-v2` | 3' Gene Expression v2 | 10 bp | 737K-august-2016 |
| `10x-3prime-v3` | 3' Gene Expression v3/v3.1 | 12 bp | 3M-february-2018 |
| `10x-3prime-v4` | 3' Gene Expression v4 | 12 bp | 3M-3pgex-may-2023 |
| `10x-5prime-v2` | 5' Gene Expression v1/v2 | 10 bp | 737K-august-2016 |
| `10x-5prime-v3` | 5' Gene Expression v3 | 12 bp | 3M-5pgex-jan-2023 |

### Read structure overrides

`--barcode_include_list` and `--umi_len` override the corresponding values set by `--protocol`. If `--protocol` is omitted entirely, **both** must be provided.

| Argument | Default | Overrides |
|---|---|---|
| `--barcode_include_list` | *set by `--protocol`* | Barcode whitelist. Path to a custom whitelist file (can be gzipped). |
| `--umi_len` | *set by `--protocol`* | UMI length in bases. |

### Pre-built reference overrides

By default, the pipeline downloads the genome specified by `--genome_version` and builds the STAR index automatically. To skip this, provide **all three** arguments below — `--genome_version` is then ignored.

| Argument | Default | Overrides |
|---|---|---|
| `--ref_fasta` | *downloaded* | Genome FASTA. Must be provided together with `--ref_gtf` and `--star_genome_index`. |
| `--ref_gtf` | *downloaded* | GTF annotation. Must be provided together with `--ref_fasta` and `--star_genome_index`. |
| `--star_genome_index` | *built from download* | STAR genome index directory. Must be provided together with `--ref_fasta` and `--ref_gtf`. |

### Tool parameters

| Argument | Default | Description |
|---|---|---|
| `--flexiplex_searchlen` | `20` | Length of fusion junction sequence to search for (2x actual overlap). |
| `--flexiplex_demultiplex_options` | *auto-generated* | Flexiplex demultiplexing options string. When not set, auto-generated as `-b "?{barcode_len}" -u "?{umi_len}" -e 1 -f 0` where barcode length is read from the whitelist file and UMI length comes from `--protocol` or `--umi_len`. Setting this explicitly overrides the auto-generated value. |
| `--fuscia_mapqual` | `30` | Minimum mapping quality for FUSCIA read extraction. |
| `--fuscia_up` | `1000` | Upstream search distance (bp) when no gene annotation is available. |
| `--fuscia_down` | `1000` | Downstream search distance (bp) when no gene annotation is available. |

## Known fusions list format

The `--known_fusions_list` input is a CSV with the following required columns:

| Column | Description |
|---|---|
| `fusion genes` | Fusion gene pair separated by `--` (e.g. `BCAS4--BCAS3`). |
| `chrom1` | Chromosome of gene 1 (e.g. `chr20`). |
| `base1` | Breakpoint position of gene 1. |
| `strand1` | Strand of gene 1 (`+` or `-`). |
| `chrom2` | Chromosome of gene 2 (e.g. `chr17`). |
| `base2` | Breakpoint position of gene 2. |
| `strand2` | Strand of gene 2 (`+` or `-`). |

Additional columns (e.g. `classification`) are ignored. This format is compatible with [JAFFA](https://github.com/Oshlack/JAFFA) output. Fusions involving mitochondrial genes (`MT-`) are automatically filtered out.

Example:

```
fusion genes,chrom1,base1,strand1,chrom2,base2,strand2,classification
BCAS4--BCAS3,chr20,50795173,+,chr17,61368327,+,HighConfidence
RPS6KB1--VMP1,chr17,59914703,+,chr17,59839768,+,HighConfidence
SLC25A24--NBPF6,chr1,108161182,-,chr1,108470597,+,HighConfidence
```

## Output

Results are written to `--out_dir` (default `pears_output/`):

| File/Directory | Description |
|---|---|
| `fuscia_fusion_calls.csv` | Per-cell fusion calls from FUSCIA. |
| `flexiplex_fusion_calls.csv` | Per-cell fusion calls from Flexiplex. |
| `arriba_fusion_calls.csv` | Per-cell fusion calls from Arriba. |
| `STARsolo/` | BAM alignment, index, and single-cell count matrix. |
| `fuscia_out/` | Per-fusion FUSCIA discordant read files. |
| `flexiplex_out/` | Per-fusion Flexiplex barcode files. |
| `arriba_out/` | Arriba fusion table and per-fusion barcode files. |
| `fusion_targets.csv` | Generated fusion target coordinates and sequences. |
| `nextflow_report.html` | Nextflow execution report. |
| `nextflow_trace.txt` | Nextflow process trace log. |

### Fusion calls CSV format

The three fusion calls CSVs (`fuscia_fusion_calls.csv`, `flexiplex_fusion_calls.csv`, `arriba_fusion_calls.csv`) share the same format:

| Column | Description |
|---|---|
| `cell_barcode` | 10x cell barcode (CB tag) identifying the single cell. The trailing `-1` suffix is stripped. |
| `molecular_barcode` | Unique Molecular Identifier (UMI / UB tag) distinguishing distinct RNA molecules from PCR duplicates within the same cell. |
| `fusion` | Detected gene fusion, formatted as `GENE1--GENE2` (names taken from the `--known_fusions_list` input). |

Example:

```
cell_barcode,molecular_barcode,fusion
CCACAAAAGGTTCTTG,CAGGGATCAGTA,JPH1--NCOA2
CAGGGCTCACTTGGGC,TGATAGGAATCG,JPH1--NCOA2
GTGTGGCGTGGCCCAT,GGTAATCAGCAA,KIAA1429--RP11-586K2.1
```

Each row represents one unique observation of a fusion transcript in a specific cell. Rows are deduplicated — each (cell_barcode, molecular_barcode, fusion) triple appears at most once. Multiple rows with different cell barcodes for the same fusion indicate independent cells harbouring that fusion. Multiple rows with the same cell barcode but different UMIs for the same fusion indicate multiple distinct fusion transcript molecules captured in that cell, providing stronger evidence. Fusions detected by more than one tool in the same cell are higher confidence and can be identified by cross-referencing the three CSVs.

## Credits

Adapted from [FUSCIA](https://github.com/ding-lab/fuscia) (Steven Foltz, 2019) and [Flexiplex](https://github.com/DavidsonGroup/flexiplex) (Davidson et al., 2022).
Binary file added assets/3M-3pgex-may-2023.txt.gz
Binary file not shown.
Binary file added assets/3M-5pgex-jan-2023.txt.gz
Binary file not shown.
Binary file not shown.
Binary file added assets/737K-august-2016.txt.gz
Binary file not shown.
Binary file added assets/visium-v1.txt.gz
Binary file not shown.
Binary file added assets/visium-v2.txt.gz
Binary file not shown.
Binary file added assets/visium-v3.txt.gz
Binary file not shown.
Binary file added assets/visium-v4.txt.gz
Binary file not shown.
Binary file added assets/visium-v5.txt.gz
Binary file not shown.
89 changes: 89 additions & 0 deletions bin/calculate_read_length.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""
Calculate the mode of R2 read lengths from FASTQ files.
Emits a warning if >10% of reads differ from the mode.

Usage:
calculate_read_length.py <fastq_file> [<fastq_file> ...]

Output:
Prints the mode read length to stdout.
Warnings are printed to stderr.
"""

import gzip
import sys
from collections import Counter
from pathlib import Path


def open_fastq(filepath):
"""Open a FASTQ file, handling gzip if needed."""
if str(filepath).endswith('.gz'):
return gzip.open(filepath, 'rt')
return open(filepath, 'r')


def get_read_lengths(fastq_files, max_reads=10000):
"""Extract read lengths from the first max_reads of FASTQ files."""
lengths = []
reads_counted = 0

for fq in sorted(fastq_files):
if reads_counted >= max_reads:
break
with open_fastq(fq) as f:
line_num = 0
for line in f:
line_num += 1
# Sequence is on line 2 of each 4-line FASTQ record
if line_num % 4 == 2:
lengths.append(len(line.strip()))
reads_counted += 1
if reads_counted >= max_reads:
break
return lengths


def main():
if len(sys.argv) < 2:
print("Usage: calculate_read_length.py <fastq_file> [<fastq_file> ...]", file=sys.stderr)
sys.exit(1)

fastq_files = [Path(f) for f in sys.argv[1:]]

# Validate files exist
for fq in fastq_files:
if not fq.exists():
print(f"ERROR: File not found: {fq}", file=sys.stderr)
sys.exit(1)

# Get read lengths
lengths = get_read_lengths(fastq_files, max_reads=10000)

if not lengths:
print("ERROR: Could not read any sequences from FASTQ files", file=sys.stderr)
sys.exit(1)

# Calculate mode
length_counts = Counter(lengths)
mode_length, mode_count = length_counts.most_common(1)[0]
total_reads = len(lengths)

# Calculate percentage of reads matching mode
mode_percentage = (mode_count / total_reads) * 100
non_mode_percentage = 100 - mode_percentage

# Warn if >10% of reads differ from mode
if non_mode_percentage > 10:
print(f"WARNING: {non_mode_percentage:.1f}% of reads have length different from mode ({mode_length}bp)", file=sys.stderr)
print(f" Length distribution: {dict(length_counts.most_common(5))}", file=sys.stderr)

print(f"R2 read length mode: {mode_length}bp ({mode_percentage:.1f}% of {total_reads} reads sampled)", file=sys.stderr)

# Output the mode length to stdout
print(mode_length)


if __name__ == "__main__":
main()
File renamed without changes.
65 changes: 65 additions & 0 deletions bin/download_references.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
#!/usr/bin/env python

import argparse
import urllib.request
import gzip

GENOME_LINKS = {
"GRCh38": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/GRCh38.primary_assembly.genome.fa.gz"
}

ANNO_LINKS = {
"GENCODE40": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz",
"GENCODE41": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.annotation.gtf.gz",
"GENCODE42": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.annotation.gtf.gz",
"GENCODE43": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz",
"GENCODE44": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz",
"GENCODE45": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz",
"GENCODE46": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz",
"GENCODE47": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.annotation.gtf.gz",
"GENCODE48": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz",
"GENCODE49": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_49/gencode.v49.annotation.gtf.gz"
}


def main():
parser = argparse.ArgumentParser(description="Download reference genomes for PEARS")
parser.add_argument("--reference", required=True, help="Reference to download")
args = parser.parse_args()

# Reference should be specified in the format "GENOME+ANNO", e.g. "GRCh38+GENCODE49"
if "+" not in args.reference:
print("Error: Reference should be specified in the format 'GENOME+ANNO', e.g. 'GRCh38+GENCODE49'")
return

genome, anno = args.reference.split("+")
if genome not in GENOME_LINKS:
print(f"Error: Genome '{genome}' not found. Available genomes: {', '.join(GENOME_LINKS.keys())}")
return

if anno not in ANNO_LINKS:
print(f"Error: Annotation '{anno}' not found. Available annotations: {', '.join(ANNO_LINKS.keys())}")
return

genome_link = GENOME_LINKS[genome]
anno_link = ANNO_LINKS[anno]

genome_name = genome_link.split("/")[-1]
anno_name = anno_link.split("/")[-1]

print(f"Downloading genome from {genome_link}...")
urllib.request.urlretrieve(genome_link, genome_name)
with gzip.open(genome_name, "rb") as f_in:
with open(genome_name.replace(".gz", ""), "wb") as f_out:
f_out.write(f_in.read())
print("Download complete.")

print(f"Downloading annotation from {anno_link}...")
urllib.request.urlretrieve(anno_link, anno_name)
with gzip.open(anno_name, "rb") as f_in:
with open(anno_name.replace(".gz", ""), "wb") as f_out:
f_out.write(f_in.read())
print("Download complete.")

if __name__ == "__main__":
main()
File renamed without changes.
Loading