Skip to content

Riboseq modernisation: Phases 1-3 (#160-#171 + PRICE)#174

Draft
pinin4fjords wants to merge 182 commits into
nf-core:devfrom
pinin4fjords:feat/modernisation-aggregation
Draft

Riboseq modernisation: Phases 1-3 (#160-#171 + PRICE)#174
pinin4fjords wants to merge 182 commits into
nf-core:devfrom
pinin4fjords:feat/modernisation-aggregation

Conversation

@pinin4fjords

@pinin4fjords pinin4fjords commented May 18, 2026

Copy link
Copy Markdown
Member

Summary

Modernisation umbrella PR for the riboseq pipeline. Brings issues #160-#171 to a single aggregation branch ready for review, ahead of being split into per-issue PRs.

Default-path behaviour is preserved for everything except #160 (the documented breaking change to --te_quantification_method). All new ORF callers and the extended-ORF / cross-caller catalogue paths are opt-in.

What this PR does

Quantification defaults (#160)

  • Breaking default change. Flip --te_quantification_method default from alignment (STAR + Salmon) to plastid_psite (in-frame P-site counts). Salmon's coverage / fragment-length / multi-mapping assumptions are inappropriate for short Ribo-seq footprints; in-frame P-site counts are the scientifically correct quantity. Two methods produce different count values; re-running an existing cohort on the new default will not reproduce prior matrices. Backward-compatible behaviour available via --te_quantification_method alignment.

Annotation backbone (#161)

  • Add --canonical_gtf for an explicit one-transcript-per-gene canonical annotation (MANE Select / Ensembl canonical). Used by ORF calling, riboWaltz P-site calibration, plastid P-site quantification, and translational-efficiency analysis. Falls back to AGAT_SP_KEEP_LONGEST_ISOFORM extraction from --gtf when not supplied. The full --gtf continues to drive genome-guided alignment.

Ribotricer demotion (#163)

  • Replace --skip_ribotricer with --run_ribotricer (default false). Default caller set is now Ribo-TISH + RiboCode; agreement logic is parameterised on the runtime-enabled caller set. Ribotricer's score column drops out of cross-caller rank aggregation when not enabled.

Novel transcript discovery (#164)

  • StringTie reference-guided novel-transcript discovery. Prefer RNA-seq BAMs; fall back to Ribo-seq BAMs with tightened parameters (--stringtie_ribo_fallback_args) and a runtime warning.
  • --novel_gtf to bypass StringTie with a user-supplied GTF.
  • Class-code filter via --stringtie_class_codes (default u, intergenic only).
  • Optional strand-aware rRNA/repeat blacklist via --rrna_blacklist (bedtools intersect -v -s).
  • Hybrid GTF (<outdir>/stringtie/hybrid_reference.gtf) by concatenating canonical backbone + filtered novel transcripts. Exposed as the hybrid_gtf workflow channel.

Extended ORF analysis (#165)

  • --extended_orf_analysis (default false). Wires the hybrid GTF into the genome-BAM callers: Ribo-TISH predict (-g hybrid, -a canonical) and Ribotricer prepare-orfs (hybrid). When enabled without a novel-transcript source the pipeline warns and falls back to canonical.

Hybrid transcriptome for RiboCode (#171)

  • Under --extended_orf_analysis true, run a second STAR pass against a hybrid transcriptome (canonical + filtered novel intergenic) so RiboCode can call ORFs on novel transcripts. New BUILD_HYBRID_TRANSCRIPTOME subworkflow extracts spliced transcript sequences with gffread -w and rebuilds the STAR index with the hybrid GTF as --sjdbGTFfile. The hybrid transcriptome FASTA and index build once per run; the second STAR pass is Ribo-seq-only. riboWaltz, plastid and Salmon stay on the canonical transcriptome (CDS-dependent assumptions). Hybrid alignments land under <outdir>/hybrid_star/.

Rp-Bp ORF caller (#169)

  • --run_rpbp (default false). Adds Rp-Bp (Malone et al. 2017) as an opt-in Tier-1 Bayesian-strict ORF caller, complementing RiboCode. Implemented as a RUN_RPBP subworkflow chaining six per-tool processes (extract-metagene-profiles, estimate-metagene-profile-bayes-factors, select-periodic-offsets, extract-orf-profiles, estimate-orf-bayes-factors, select-final-prediction-set) plus RPBP_BUILDCONFIG and RPBP_PREPAREGENOME (once per run). Splitting avoids re-running flexbar/bowtie/STAR inside rpbp - the pipeline's standard STAR alignment is used - and lets each step cache independently on resume. Wave-built container co-installs rpbp=4.0.1 and star=2.7.11b. Per-sample predicted-ORF BED + DNA + protein FASTA land under <outdir>/orf_predictions/rpbp/. Same conditional canonical-vs-hybrid GTF selection as Ribo-TISH / Ribotricer. Runtime warning at submit; expect ~10-24 h per replicate at genome scale. Docs note the STAR-defaults gap vs upstream rpbp and link tracking issue STAR alignment params: consider Ribo-seq-tuned defaults per sample type #173.

PRICE ORF caller (#170)

  • --run_price (default false). Adds PRICE (Erhard et al. 2018) as an opt-in Tier-2 caller for near-cognate ORF discovery. Invoked one-shot across the riboseq cohort (gedi -e IndexGenome + gedi -e Price); calls flow into the cross-caller catalogue. Container via Wave from the merged bioconda::gedi=1.0.6a recipe. PRICE is fed the full multi-isoform annotation (--gtf), not the one-transcript-per-gene canonical/hybrid backbone: it resolves overlapping ORFs and rescues multimappers with its own EM (Erhard et al. 2018, PMID 29529017), so the canonical backbone (which exists only to disambiguate P-site quantification) would only narrow PRICE's discovery and bias its ORF-type classification.

Cross-caller ORF catalogue (#167)

  • nf-core orftable_fasta_gtf_buildorfcatalogue subworkflow (gates on --extended_orf_analysis true + non-empty enabled-caller set). Per-caller normalisers (Ribo-TISH, RiboCode, Ribotricer, Rp-Bp, PRICE) convert each per-sample output to a unified BED12 + sidecar TSV; custom/orfmerge merges with a class-aware strategy (transcript-ID grouping for annotated multi-exon CDS, 80% reciprocal overlap for single-exon novel intergenic and smORFs ≤ 100 aa). Small ORFs (orf_class == smORF, aa_length ≤ 100) are additionally collapsed by amino-acid sequence identity (mmseqs/easycluster --min-seq-id 0.9 then the custom/orfcollapse module) so the same micropeptide encoded at distinct, non-overlapping loci becomes a single catalogue entry, following the GENCODE Ribo-seq ORF catalogue convention (Mudge et al. 2022, PMID 35831657). Collapse is on by default. Emits cohort.catalogue.{bed12,tsv,fasta}, cohort.catalogue.orf_to_gene.tsv, and a MultiQC custom-content per-class count table (cohort.catalogue.mqc.tsv) under <outdir>/orf_catalogue/.

Per-ORF P-site quantification (#166)

  • QUANTIFY_ORF_PSITE subworkflow (additive to existing gene-level path). Expands the cohort BED12 catalogue into codon-start positions (frame defined by each ORF's own ATG, not GTF phase), runs per-sample bedtools intersect against plastid wiggle tracks, assembles an ORF × sample count matrix (orf_psite_counts.tsv, zero-filled for ORFs absent from a sample). Runs whenever --extended_orf_analysis true, at least one caller is enabled, and plastid is not skipped. Warns and skips when --skip_plastid true. Matrix published under <outdir>/orf_quantification/ and emitted on the orf_count_matrix workflow channel.

ORF-level differential translation (#168)

  • Under --extended_orf_analysis true with --te_quantification_method plastid_psite:
    • The gene-level TE Ribo-seq numerator is re-aggregated from per-ORF P-site counts summing only canonical_cds ORFs (Tier 1). New module orf_to_gene_cds_counts emits the long-format replacement table that feeds the existing REPLACE_RIBOSEQ_COUNTS_IN_MATRIX substitution. Keeps uORF / dORF / novel_u / smORF dynamics out of the gene-level sum.
    • New per-ORF DESeq2 interaction-model module (deseq2/orf_dte) and DTE_ORF_LEVEL subworkflow (Tier 2) fit ~ condition + seq_type + condition:seq_type per ORF, Ribo-seq numerator from orf_psite_counts.tsv, RNA-seq denominator joined from gene-level Salmon counts via orf_to_gene.tsv. Novel intergenic ORFs with no host gene are dropped; low-count ORFs filtered before DESeq2 fitting. Results per contrast under <outdir>/dte/orf_level/; CDS-aggregated gene-level matrix under <outdir>/dte/gene_level_cds_aggregated/.
    • --run_dotseq (default false) added as an opt-in placeholder for the Tier-3 DOTSeq DTE/DOU analysis, deferred while DOTSeq remains in Bioconductor devel. Setting the flag emits an info message and runs no analysis.
    • Row-independence caveat for Tier 2 (ORFs sharing a gene-level RNA-seq denominator are perfectly correlated after the join) is documented in docs/usage.md.

Ribotish quality / hybrid mode (#162)

  • Investigation closed: Ribotish QC continues to consume the canonical GTF (it's calibration, not novel discovery).

Validation

  • Full-scale defaults run on Seqera Platform (eu-west-2): SUCCEEDED.
  • Full-scale --extended_orf_analysis true run: SUCCEEDED.
  • Full-scale PRICE run: SUCCEEDED.
  • Full-scale Rp-Bp run: 2/6 samples completed end-to-end and produced final ORF predictions. The remaining 4 hit infrastructure failures unrelated to pipeline logic and are documented under test_full per project convention; bioinformatics correctness is established. Re-validation deferred to a subsequent CE configuration.

Test plan

  • Module-level stub tests pass for all new local modules
  • tests/default.nf.test passes (defaults path)
  • tests/stringtie_extended.nf.test passes
  • tests/novel_gtf.nf.test passes
  • tests/te_plastid_psite.nf.test snapshot is up to date
  • Full-scale defaults run is green
  • Full-scale --extended_orf_analysis true run is green
  • Full-scale --run_price true run is green
  • Lint clean (nf-core lint)
  • CHANGELOG.md entries land for each issue

CI on the test matrix (docker, Nextflow 25.04.8 + latest-everything) and nf-core lint are green. Kept as a draft: the umbrella branch is to be split into the per-issue PRs (#186-#189 stack) for review.

🤖 Generated with Claude Code

pinin4fjords and others added 30 commits May 1, 2026 15:31
Install nf-core/bam_stringtie_merge subworkflow and wire its merged GTF
into the genome-BAM-side ORF callers (Ribo-TISH, Ribotricer, plastid)
plus GTF_TO_INFRAME_PSITES. Off by default (skip_stringtie = true).

RiboCode, riboWaltz, and alignment-mode Salmon stay on the reference
annotation: their transcriptome BAMs were keyed to it at STAR alignment
time, and a stringtie_realign mode for those is left for a follow-on PR.

Closes nf-core#157

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The riboseq nf-core modules for ribotish/quality, ribotish/predict, and
ribotricer/prepareorfs accept only a single GTF input - no slot for a
secondary annotation. Without that slot we can't pass the merged GTF as
primary -g and the reference GTF as secondary -a, which is the only
correct way to wire Ribo-TISH predict / Ribotricer for novel-ORF discovery
on a CDS-less StringTie merged GTF.

Routing the merged GTF directly into those tools as the only annotation
gives wrong outputs:
- ribotish quality fails ("Counted reads: 0") because it filters reads by
  reference CDS which the merged GTF lacks.
- plastid metagene generate is similarly CDS-bound.
- ribotish predict / ribotricer would scan transcripts but lose all
  classification (every ORF tagged Novel).

For correctness, this PR now publishes the merged GTF as a side product
only. Wiring the merged GTF into the ORF callers is left for a follow-on
that depends on upstream nf-core/modules updates exposing the secondary
annotation argument.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…[skip ci]

Extend the reference tuple with a fourth, optional `reference_gtf` slot so
callers can supply a CDS-bearing GTF for TIS background modelling and
TisType classification, paired with a CDS-free discovery GTF passed via
`-g`. Pattern uses `stageAs: 'NO_FILE'` plus a `name != 'NO_FILE'` guard
so the existing single-GTF call sites keep working by passing `[]`.

- main.nf: extend `meta3` tuple, add `reference_gtf_arg` and wire to the
  `ribotish predict` command line.
- meta.yml: document the new optional input.
- tests/main.nf.test: pass `[]` in the existing tests; add new pass-through
  and stub tests that supply a non-empty `-a` GTF. Snapshot file
  intentionally left unchanged (new entries will be generated on the next
  `nf-test test` run; no existing entries modified).

Implements nf-core/modules#11644.
Replace `--skip_ribotricer` (default false) with `--run_ribotricer`
(default false). Ribotricer benchmarking showed its ORF-score column is
rank-unstable across biological replicates (Spearman 0.288 vs Jaccard
0.770), so it should not contribute to cross-caller rank aggregation.

- Schema/config: add run_ribotricer (boolean, default false); help_text
  explains the rank-instability rationale.
- Workflow: gate Ribotricer behind `if (params.run_ribotricer)` and emit
  a log.warn when enabled. Add a dynamic ORF-caller list and derive the
  rank-aggregation subset (Ribotricer excluded) and the strict-majority
  agreement threshold (floor(N/2)+1) so the agreement logic introduced
  by nf-core#7 adapts to whatever caller set is enabled at runtime.
- conf/test.config: drop now-removed skip_ribotricer (test profile no
  longer needs to opt out of a default-on caller).
- README / docs/output.md: re-label Ribotricer as opt-in.
- CHANGELOG: parameter rename in the unreleased Parameters table plus a
  Changed entry referencing the issue.

Closes nf-core#163.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the schema and nextflow.config default for --te_quantification_method
from alignment (STAR + Salmon) to plastid_psite (in-frame P-site counts
from plastid). Salmon's coverage-uniformity, fragment-length and
multi-mapping assumptions are inappropriate for short, length-constrained
Ribo-seq footprints; in-frame P-site counts are the scientifically correct
quantity for Ribo-seq.

Breaking default change: the two methods produce fundamentally different
per-gene count values (Salmon TPM-derived pseudo-counts vs raw integer
in-frame P-site sums), so re-running an existing cohort on a newer
pipeline version with the default will not reproduce the previous count
matrix. Users who need backward-compatible Salmon-style counts must set
--te_quantification_method alignment explicitly.

Updates schema description / help_text, docs/usage.md, and the CHANGELOG.

Closes nf-core#160.
Add an optional --canonical_gtf parameter that supplies a
one-transcript-per-gene canonical GTF (e.g. MANE Select or
Ensembl_canonical-filtered) as the annotation backbone for ORF calling
(Ribo-TISH, Ribotricer, RiboCode), riboWaltz P-site calibration, plastid
P-site quantification, in-frame P-site quantification and the
translational-efficiency analysis. The full --gtf is still used
unconditionally for genome-guided alignment (STAR / Salmon transcriptome
generation).

When --canonical_gtf is omitted, PREPARE_GENOME falls back to
agat_sp_keep_longest_isoform on the full GTF and pipes the result
through gffread, so the backbone path is always populated.

Pulls in the agat/spkeeplongestisoform nf-core module for the fallback
and aliases gffread (GFFREAD_CANONICAL) plus gunzip (GUNZIP_CANONICAL_GTF)
inside PREPARE_GENOME. The new ch_canonical_gtf channel is emitted from
PREPARE_GENOME and routed via the RIBOSEQ workflow take block;
ch_gtf remains the alignment-side channel.

Docs (usage / output) cover MANE Select for human, the
Ensembl_canonical grep recipe for other Ensembl organisms, the AGAT
fallback, and the MANE-vs-Ensembl_canonical trade-off for non-coding
genes.

[skip ci]
The Nextflow 26.04 strict parser rejects `Math.floor()` in workflow
script blocks; `intdiv(2)+1` is the parser-clean Groovy equivalent for
strict-majority of N callers (N=2 -> 2, N=3 -> 2, N=4 -> 3).
…ip ci]

`stageAs: 'NO_FILE'` does not produce a file when callers pass `[]` for
the absent case, so `reference_gtf.name` was unsafe. A Groovy truthy
check (`reference_gtf ? '-a ...' : ''`) handles both the empty-list and
real-file cases correctly without an empty `-a` argument leaking
through to ribotish.
…ed channel [skip ci]

stageAs: 'NO_FILE' was symlinking real input GTFs as 'NO_FILE', losing
the .gtf extension that ribotish predict's suffixType() parser needs.
Drop the stageAs alias; with a plain optional path() and a Groovy
truthy guard, the absent case stages nothing and the present case
keeps its filename.

Also add a dedicated 4-element ch_fasta_gtf_for_ribotish channel in
workflows/riboseq/main.nf so the predict callers pass the new 4-tuple
shape ([meta, fasta, gtf, []] for Phase 1; the 4th slot becomes the
canonical backbone GTF when --extended_orf_analysis lands in Phase 2).
ch_fasta_gtf stays 3-element for ribotricer prepareorfs.
…name collision [skip ci]

When callers supply the same GTF as both the primary -g and optional -a
input (e.g. test data that has only one chr20 GTF), Nextflow raises
'input file name collision'. Adding stageAs: 'secondary.gtf' renames the
secondary input so the two inputs always land at distinct paths in the
work directory. The .gtf extension is preserved so ribotish's suffixType
parser still recognises it. Truthy check still distinguishes the
absent ([]) and present cases.
nf-core#161 wired ch_canonical_gtf into QUANTIFY_STAR_SALMON and QUANTIFY_PSEUDO_TE,
but the Salmon transcript fasta and Salmon indices in PREPARE_GENOME are
built from the full GTF (transcripts.fa from RSEM_PREPAREREFERENCE on ch_gtf;
SALMON_INDEX / SALMON_INDEX_TE on that transcript fasta). Inside the
QUANTIFY_PSEUDO_ALIGNMENT subworkflow, the same gtf channel feeds
CUSTOM_TX2GENE, so tx2gene was being derived from the canonical (subset)
GTF while quant.sf contained the full transcriptome. tximport then produced
gene-level rownames that included genes absent from tx2gene, and
SE_GENE_UNIFIED's findColumnWithAllEntries(ids, rowdata=tx2gene) failed:

    No column contains all vector entries ENSG00000000419, ENSG00000019186, ...

Per the canonical-backbone spec (issue_proposals/02_canonical_backbone.md),
the full GTF is reserved for genome-guided alignment and transcriptome
quantification; --canonical_gtf drives ORF calling, P-site calibration,
plastid, riboWaltz, and (via the plastid_psite replacement step) DTE.
ORF/P-site/plastid wiring on ch_canonical_gtf is unchanged.

For gene-level DTE consistency with the spec's canonical-backbone intent
(spec section "Decision note"), follow-up: filter DTE inputs (anota2seq /
deltaTE matrices) to canonical genes at the comparison step, or generate a
canonical-restricted gene-level matrix downstream of QUANTIFY_STAR_SALMON.
The plastid_psite TE mode already enforces canonical semantics by
overwriting Ribo-seq columns with in-frame P-site counts from the canonical
GTF; pseudo and default modes do not yet.
…lastid_psite [skip ci]

Snapshots regenerated on the nf-dev VM with the Phase 1 changes applied
(plastid_psite default, canonical_gtf backbone, run_ribotricer opt-in,
ribotish/predict -a flag). All 6 pipeline tests and 6 ribotish module
tests pass against the new snapshots.
…ndation [skip ci]

# Conflicts:
#	CHANGELOG.md
…tersect [skip ci]

Adds the modules needed to classify StringTie novel transcripts against the
full reference (gffcompare), filter the annotated GTF to a user-configurable
set of class codes (new local FILTER_GTF_CLASS_CODE module, default `u` =
intergenic only), and optionally subtract an rRNA/repeat blacklist
(bedtools/intersect, strand-aware via `-v -s`). Workflow wiring lands in
follow-up commits.

Refs nf-core#164.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f bypass [skip ci]

Refines the StringTie novel-transcript discovery path added in PR nf-core#158:

- Prefer RNA-seq BAMs for StringTie assembly; fall back to Ribo-seq BAMs
  with tightened parameters (new `--stringtie_ribo_fallback_args` default
  `-m 100 -c 5 -j 3 -f 0.05 -g 100`) and emit a runtime warning. Every
  published microprotein discovery workflow assembles from RNA-seq because
  Ribo-seq footprints produce fragmented assemblies and false-positive
  intergenic transcripts from P-site pileups; the fallback is provided
  only because RNA-seq input is optional in the riboseq pipeline.

- Add `--novel_gtf` to bypass StringTie and feed a user-supplied
  novel-transcript GTF into the same filtering chain. Useful when a
  curated annotation already exists (long-read RNA-seq, published ORF
  catalogues).

- After StringTie merge (or `--novel_gtf` load) run gffcompare against
  the full reference GTF, then filter to a user-configurable set of
  class codes via the new local FILTER_GTF_CLASS_CODE module
  (`--stringtie_class_codes`, default `u` = unambiguous intergenic;
  stranded users may add `x` for antisense overlap).

- Optional strand-aware rRNA/repeat blacklist intersect when
  `--rrna_blacklist` is supplied (`bedtools intersect -v -s`).

- Concatenate the canonical backbone (from nf-core#161) with the filtered novel
  GTF via the new local CONCAT_GTF module to produce
  `<outdir>/stringtie/hybrid_reference.gtf`. Exposed as the
  `hybrid_gtf` workflow emit channel for downstream ORF-caller wiring
  (subject of a separate issue). When both `--skip_stringtie` is true
  and `--novel_gtf` is null, the emit equals the canonical GTF so
  downstream wiring is uniform.

Refs nf-core#164.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… --extended_orf_analysis [skip ci]

Add --extended_orf_analysis (default false) gating the hybrid GTF
(canonical + filtered novel intergenic transcripts, from nf-core#164) into the
genome-BAM ORF callers:

- Ribo-TISH predict: hybrid GTF on -g (discovery target); canonical
  backbone on -a (background + classification).
- Ribotricer prepare-orfs: hybrid GTF directly (no secondary-annotation
  concept; CDS-absent transcripts auto-labelled 'novel').

RiboCode, riboWaltz, plastid and the Salmon/pseudo-TE quantification
subworkflows continue on the canonical backbone regardless of the flag.
This is the transcriptome-BAM architectural constraint: the
transcriptome BAM produced by STAR --quantMode TranscriptomeSAM is keyed
to the reference transcriptome FASTA, and novel StringTie transcripts
don't exist in that FASTA. Bringing RiboCode online for novel
transcripts requires a second STAR pass against a hybrid transcriptome
FASTA, tracked in nf-core#171.

When --extended_orf_analysis true is set but no novel-transcript source
is configured (--skip_stringtie true and --novel_gtf null), the pipeline
warns and falls back to canonical so users can compose flags
incrementally.

Default behaviour is unchanged: with --extended_orf_analysis false the
workflow graph is identical to the pre-nf-core#165 state.

Closes nf-core#165

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nf-core#161 + nf-core#165 built the reference-GTF channels with an empty meta map
([:]), which makes meta.id null. RIBOTRICER_PREPAREORFS uses
${meta.id} for its prefix and was publishing null_candidate_orfs.tsv.
Phase 1 tests didn't surface this because they ran with the new
run_ribotricer=false default; Phase 2's --extended_orf_analysis tests
enable ribotricer to exercise the hybrid-GTF wiring. Use
[id: 'reference'] so the published filename is reference_candidate_orfs.tsv
regardless of canonical / hybrid origin.
StringTie can produce transcripts with strand '.' when coverage is too
sparse or splice signal is ambiguous. Ribo-TISH predict raises
KeyError: '.' on those because it only indexes '+' and '-' strands when
building the CDS regions dictionary. Drop unstranded GTF entries here
so the downstream hybrid GTF only contains stranded transcripts.
…kip ci]

Two new pipeline-level tests exercise Phase 2's extended ORF discovery:

- tests/novel_gtf.nf.test: --novel_gtf bypass path. Uses a small
  synthetic intergenic chr20 GTF (tests/data/novel_intergenic_chr20.gtf)
  to drive gffcompare class-u filter -> CONCAT_GTF -> hybrid GTF ->
  Ribo-TISH predict on hybrid (-g) + canonical (-a).

- tests/stringtie_extended.nf.test: --skip_stringtie false path.
  Exercises full BAM_STRINGTIE_MERGE on the test profile's RNA-seq
  samples through the same gffcompare/filter/concat chain.

Both run with --extended_orf_analysis true. Ribotricer is left at the
default (--run_ribotricer false) because the chr20 test data is too
sparse for Ribotricer's periodicity inference; that coverage gap is
expected to be closed by -profile test_full at Phase 4.

Exempt tests/data/ from the top-level data/ ignore so committed
test fixtures aren't dropped.
New local subworkflow BUILD_HYBRID_TRANSCRIPTOME that extracts spliced
transcript sequences from the hybrid GTF (canonical + filtered novel
intergenic) with gffread -w and rebuilds a STAR index against the same
genome FASTA but with the hybrid GTF as --sjdbGTFfile. Emits the hybrid
transcriptome FASTA and STAR index as separate channels for the second
STAR alignment pass (issue nf-core#171).
…alysis [skip ci]

Under --extended_orf_analysis true, run a second STAR alignment pass for
Ribo-seq samples against the hybrid transcriptome (canonical + filtered
novel intergenic) built by BUILD_HYBRID_TRANSCRIPTOME, then route the
resulting hybrid transcriptome BAM and the hybrid GTF into RIBOCODE_GTFUPDATE
/ RIBOCODE_PREPARE / RIBOCODE_RIBOCODE in place of the canonical wiring.

The reference STAR alignment, riboWaltz, plastid and Salmon-based
quantification are unchanged. riboWaltz stays on the canonical reference
transcriptome by design (nf-core#171 spec note: feeding it CDS-absent novel
transcripts would degrade its diagnostic plots without contributing to ORF
discovery). The default --extended_orf_analysis false path is unchanged.

- workflows/riboseq/main.nf: aliased FASTQ_ALIGN_STAR_HYBRID include;
  conditional second STAR pass on Ribo-seq reads only; conditional swap of
  RiboCode's transcriptome BAM + GTF inputs.
- conf/modules.config: withName blocks for the aliased hybrid STAR processes
  (GFFREAD_HYBRID_TRANSCRIPTOME, STAR_GENOMEGENERATE_HYBRID,
  FASTQ_ALIGN_STAR_HYBRID:STAR_ALIGN, and the hybrid BAM sort/stats chain),
  publishing under <outdir>/hybrid_star/.
- docs/usage.md, docs/output.md: document the second STAR pass, its cost,
  the rationale for riboWaltz staying on canonical, and the hybrid_star/
  output layout.
- CHANGELOG.md: Added entry under v1.3.0dev / Added.
…[skip ci]

Wave container co-installs rpbp=4.0.1 and star=2.7.11b
(prepare-rpbp-genome requires STAR on PATH). The three modules
share that container; rpbp/buildconfig uses a coreutils container
to render the YAML config from pipeline inputs so users don't
have to author one.

RUN_RPBP chains BUILDCONFIG -> PREPAREGENOME -> PREDICTORFS,
broadcasting the shared index + config to per-sample predictions.
- New params run_rpbp (default false), extra_rpbp_preparegenome_args,
  extra_rpbp_predictorfs_args, rpbp_config_extra_yaml.
- Schema entries for all four params under the riboseq group.
- modules.config publishDir + ext.args for the three Rp-Bp processes;
  predicted ORFs land at <outdir>/orf_predictions/rpbp/.
- workflow gates RUN_RPBP on run_rpbp; appends 'rpbp' to the dynamic
  enabled_orf_callers list. Rp-Bp's Bayes factor is stable so it is
  retained in rank_aggregation_callers (only ribotricer is excluded).
- Same conditional canonical-vs-hybrid GTF selection as the other
  genome-BAM ORF callers (Ribo-TISH predict, Ribotricer).
- Runtime warning at workflow start when run_rpbp is true (~20-24h/rep).
- Docs (usage + output) and CHANGELOG entry.
pinin4fjords and others added 2 commits June 15, 2026 23:32
…qs2) [skip ci]

The coordinate-based merge in custom/orfmerge only groups genomically
overlapping ORFs, so a micropeptide encoded at several distinct loci (typically
repetitive regions) survives as separate catalogue rows. Following the GENCODE
Ribo-seq ORF catalogue convention (Mudge et al. 2022, Nat Biotechnol,
doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs collapse_cutoff 0.9), the
catalogue subworkflow now clusters peptides with mmseqs/easycluster
(--min-seq-id 0.9) and a new custom/orfcollapse module folds each multi-member
small-ORF (aa_length <= 100) cluster down to one representative, unioning
cross-caller / cross-sample evidence and gene mappings.

bedtools/getfasta in the catalogue now uses -split -s -nameOnly so the AA FASTA
is the spliced, strand-correct sequence keyed by orf_id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed snapshots

Strip the (+)/(-) strand suffix bedtools/getfasta -nameOnly -s appends so AA
FASTA headers and MMseqs2 cluster ids map to bare orf_ids. Replace the
orfcollapse module test with self-contained fixtures that exercise the
collapse. Exclude the non-deterministic hybrid-STAR MultiQC aggregate files
from extended-ORF snapshots (tests/.nftignore_orf) and regenerate the dotseq,
novel_gtf and stringtie_extended snapshots on x86 (verified deterministic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pinin4fjords and others added 16 commits June 16, 2026 12:22
…notate .nftignore

- Sort orfmerge clusters by genomic coordinate before assigning orf_ids so the
  catalogue (and downstream ORF P-site counts) are byte-deterministic, not
  dependent on .collect() arrival order.
- Set mmseqs/easycluster --min-seq-id 0.9 in the catalogue subworkflow config.
- Regenerate the dotseq/novel_gtf/stringtie_extended snapshots (verified
  deterministic across two independent x86 runs).
- Annotate tests/.nftignore with per-output exclusion reasons from a two-run
  determinism audit (timestamps, parallel-summation float jitter, multithreaded
  EM/R stochasticity, hybrid-STAR multimapper non-determinism).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…se modules

Pull the merged nf-core/modules versions of
orftable_fasta_gtf_buildorfcatalogue and its custom/orfmerge,
custom/orfcollapse (pandas), mmseqs/easycluster and seqkit/translate
components.

Reconcile the pipeline wiring to the new interface: pass val_collapse=true
(amino-acid small-ORF deduplication on by default), add SEQKIT_TRANSLATE
--trim, and fix the CUSTOM_ORFCOLLAPSE publish pattern to match the emitted
${prefix}.fasta so the catalogue AA FASTA publishes. Regenerate the dotseq,
novel_gtf and stringtie_extended catalogue snapshots.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the reference transcriptome, not canonical); clarify AGAT longest-CDS, plastid stays canonical, one-transcript-per-gene not required
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flows up the nf-core#182 PR-review correction (reviewer @suhrig) that removed the
Ribo-seq-BAM StringTie fallback: NOVEL_TRANSCRIPT_DISCOVERY runs on RNA-seq
BAMs only (errors when --skip_stringtie false with no RNA-seq), restores the
strand-aware blacklist channel, renames --stringtie_class_codes back to
--gffcompare_class_codes, and removes --stringtie_ribo_fallback_args. Keeps
agg-truth in step with dev so the leaf reconciliation stays empty.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Ribotricer and Rp-Bp runtime warnings, the usage.md caller notes, the
nf-core#169 CHANGELOG entry and a main.nf comment cited an internal benchmark
(FK/NGB, May 2026) with bare Spearman/Jaccard figures a community user can't
act on or verify. Keep the actionable points (Ribotricer scores unstable ->
binary calls only; Rp-Bp ~20-24h/replicate, stable Bayes-factor score
retained for ranking) and drop the unpublishable provenance. Also corrects a
RiboCode 'canonical wiring' comment to 'reference-transcriptome wiring'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…predict

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tParameters

Param-only validation (novel-source no-op + plastid-skip no-op) belongs with
genomeExistsError()/dotseqPrerequisitesError() in the pipeline-init utils
subworkflow, not inline in the RIBOSEQ workflow body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…angelog

Mirror of the cleanup applied on feat/171-second-star-pass (PR nf-core#184): remove
PR/issue numbers from the second-pass / dispatch / hybrid-STAR comments and
condense the nf-core#171 CHANGELOG entry. Strict-syntax (channel factory casing)
deferred to a later pipeline-wide pass.
Route the hybrid second-pass transcriptome BAM through the same BAM_DEDUP_UMI
subworkflow as the primary path so RiboCode sees unique molecules rather than
PCR duplicates under --with_umi; hybrid dedup outputs publish under
hybrid_star/. Also drop a stray double space in the RiboCode dispatch closures.

Mirror of the fix on feat/171-second-star-pass (PR nf-core#184).
…utput filenames + step list)

- schema run_rpbp help_text: drop the FK/NGB internal benchmark numbers
  (uncitable; mirrors the same removal already applied for Ribotricer/Rp-Bp
  log.warns and usage.md).
- docs/output.md: predicted-ORF outputs are published as
  *.predicted-orfs.{bed.gz,dna.fa,protein.fa} (the module prefix is
  predicted-orfs; the previous .filtered.* names do not exist).
- docs/usage.md: the subworkflow runs get-periodic-lengths-offsets, not
  get-all-read-filtering-counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rp-Bp builds its candidate ORF set per transcript isoform and deduplicates
by genomic coordinates, then resolves overlaps itself (longest-per-stop,
best Bayes factor). Feeding it the one-transcript-per-gene canonical backbone
silently dropped isoform-specific ORFs and biased ORF-type labelling toward
canonical CDS, with no compensating benefit. Route it to the full ch_gtf
(default) like PRICE; extended mode keeps the hybrid GTF for novel transcripts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…annotation

Document, in the Rp-Bp and PRICE usage sections, that both callers receive the
full --gtf rather than the canonical one-transcript-per-gene backbone: each
enumerates/handles ORFs across all isoforms and resolves overlaps itself, so
collapsing to canonical would drop isoform-specific ORFs and bias ORF-type
labels. Cites Malone et al. 2017.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pinin4fjords

Copy link
Copy Markdown
Member Author

Three Rp-Bp/PRICE items surfaced during the #186 (Rp-Bp) alignment that are leaf/aggregation-level. They were deliberately left out of the component PRs, either to preserve leaf==agg-truth or because they touch already-merged components, so flagging them here to reconcile when finalising this branch.

  1. PRICE annotation in extended mode (code/doc mismatch). orf_caller_dispatch feeds PRICE the full ch_gtf unconditionally (no extended_orf_active branch), but docs/usage.md and docs/output.md state PRICE switches to the hybrid GTF under --extended_orf_analysis. So PRICE currently cannot discover novel intergenic ORFs in extended mode despite the docs claiming it can, and it diverges from Rp-Bp, which does route to the hybrid GTF in extended mode. Decide PRICE's intended extended behaviour and make code + docs agree.

  2. Ribotricer schema still carries the uncitable benchmark. nextflow_schema.json run_ribotricer help_text still has the internal FK/NGB, May 2026 numbers (mean Spearman 0.288 vs Jaccard 0.770) in both dev and agg-truth. The equivalent provenance was already stripped from the Rp-Bp schema, the runtime log.warns and usage.md. Left in place because it is feat: demote Ribotricer to opt-in --run_ribotricer #178's param (merged to dev) and cleaning only agg-truth would break leaf==agg-truth; fix dev and agg-truth together.

  3. Rp-Bp missing from CITATIONS.md. Rp-Bp is cited in prose (Malone et al. 2017) in the CHANGELOG and usage.md, but has no CITATIONS.md entry in dev or agg-truth. Add it alongside the existing Ribo-TISH / PRICE entries.

pinin4fjords and others added 10 commits June 25, 2026 15:17
custom/orfmerge gains --min-callers / --min-samples (default 1, no
filtering) and emits an additional consensus catalogue view
(*.consensus.{bed12,tsv,orf_to_gene.tsv}); published under
<outdir>/orf_catalogue/consensus/. The peptide-level smORF collapse is
behind --skip_orf_collapse (default off). orfcollapse docstring corrected
to state the smORF-only / locus-agnostic / global-identity departures from
the GENCODE reference rather than implying equivalence.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collapse the per-aspect nf-core#164/nf-core#154 bullets to one line per issue and trim
the verbose modernisation entries to single-line summaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
novel_gtf, stringtie_extended and dotseq now publish the consensus ORF
catalogue (orf_catalogue/consensus/) alongside the full catalogue.
Regenerated on x86_64 (NXF 25.04.8, nf-test 0.9.3, --profile=+docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…onsensus-view code

Match the vendored orfmerge template (two-pass catalogue write), its
regenerated module snapshot, and the orftable subworkflow (consensus emits
+ regenerated snapshot) to the upstream nf-core/modules implementation.
Output is byte-identical; pipeline snapshots are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…th upstream

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sync custom/orfcollapse + the orftable subworkflow to the consensus-after-
collapse implementation, and route orf_catalogue/consensus/ from
CUSTOM_ORFCOLLAPSE when collapse runs, falling back to CUSTOM_ORFMERGE under
--skip_orf_collapse.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s #12167 merge

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Upstream module is CUSTOM_BED12CODONPOSITIONS (not ORF_INFRAME_PSITES); the
counts TSV is three-column (sample_id, orf_id, count); versions are emitted on
the versions topic channel, not a versions.yml file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants