Riboseq modernisation: Phases 1-3 (#160-#171 + PRICE) #174
pinin4fjords wants to merge 182 commits into
Install nf-core/bam_stringtie_merge subworkflow and wire its merged GTF into the genome-BAM-side ORF callers (Ribo-TISH, Ribotricer, plastid) plus GTF_TO_INFRAME_PSITES. Off by default (skip_stringtie = true). RiboCode, riboWaltz, and alignment-mode Salmon stay on the reference annotation: their transcriptome BAMs were keyed to it at STAR alignment time, and a stringtie_realign mode for those is left for a follow-on PR. Closes nf-core#157 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The riboseq nf-core modules for ribotish/quality, ribotish/predict, and
ribotricer/prepareorfs accept only a single GTF input - no slot for a
secondary annotation. Without that slot we can't pass the merged GTF as
primary -g and the reference GTF as secondary -a, which is the only
correct way to wire Ribo-TISH predict / Ribotricer for novel-ORF discovery
on a CDS-less StringTie merged GTF.
Routing the merged GTF directly into those tools as the only annotation
gives wrong outputs:
- ribotish quality fails ("Counted reads: 0") because it filters reads by
reference CDS which the merged GTF lacks.
- plastid metagene generate is similarly CDS-bound.
- ribotish predict / ribotricer would scan transcripts but lose all
classification (every ORF tagged Novel).
For correctness, this PR now publishes the merged GTF as a side product
only. Wiring the merged GTF into the ORF callers is left for a follow-on
that depends on upstream nf-core/modules updates exposing the secondary
annotation argument.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…[skip ci] Extend the reference tuple with a fourth, optional `reference_gtf` slot so callers can supply a CDS-bearing GTF for TIS background modelling and TisType classification, paired with a CDS-free discovery GTF passed via `-g`. Pattern uses `stageAs: 'NO_FILE'` plus a `name != 'NO_FILE'` guard so the existing single-GTF call sites keep working by passing `[]`.
- main.nf: extend `meta3` tuple, add `reference_gtf_arg` and wire to the `ribotish predict` command line.
- meta.yml: document the new optional input.
- tests/main.nf.test: pass `[]` in the existing tests; add new pass-through and stub tests that supply a non-empty `-a` GTF.
Snapshot file intentionally left unchanged (new entries will be generated on the next `nf-test test` run; no existing entries modified). Implements nf-core/modules#11644.
Replace `--skip_ribotricer` (default false) with `--run_ribotricer` (default false). Ribotricer benchmarking showed its ORF-score column is rank-unstable across biological replicates (Spearman 0.288 vs Jaccard 0.770), so it should not contribute to cross-caller rank aggregation.
- Schema/config: add run_ribotricer (boolean, default false); help_text explains the rank-instability rationale.
- Workflow: gate Ribotricer behind `if (params.run_ribotricer)` and emit a log.warn when enabled. Add a dynamic ORF-caller list and derive the rank-aggregation subset (Ribotricer excluded) and the strict-majority agreement threshold (floor(N/2)+1) so the agreement logic introduced by nf-core#7 adapts to whatever caller set is enabled at runtime.
- conf/test.config: drop now-removed skip_ribotricer (test profile no longer needs to opt out of a default-on caller).
- README / docs/output.md: re-label Ribotricer as opt-in.
- CHANGELOG: parameter rename in the unreleased Parameters table plus a Changed entry referencing the issue.
Closes nf-core#163.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the schema and nextflow.config default for --te_quantification_method from alignment (STAR + Salmon) to plastid_psite (in-frame P-site counts from plastid). Salmon's coverage-uniformity, fragment-length and multi-mapping assumptions are inappropriate for short, length-constrained Ribo-seq footprints; in-frame P-site counts are the scientifically correct quantity for Ribo-seq. Breaking default change: the two methods produce fundamentally different per-gene count values (Salmon TPM-derived pseudo-counts vs raw integer in-frame P-site sums), so re-running an existing cohort on a newer pipeline version with the default will not reproduce the previous count matrix. Users who need backward-compatible Salmon-style counts must set --te_quantification_method alignment explicitly. Updates schema description / help_text, docs/usage.md, and the CHANGELOG. Closes nf-core#160.
Add an optional --canonical_gtf parameter that supplies a one-transcript-per-gene canonical GTF (e.g. MANE Select or Ensembl_canonical-filtered) as the annotation backbone for ORF calling (Ribo-TISH, Ribotricer, RiboCode), riboWaltz P-site calibration, plastid P-site quantification, in-frame P-site quantification and the translational-efficiency analysis. The full --gtf is still used unconditionally for genome-guided alignment (STAR / Salmon transcriptome generation). When --canonical_gtf is omitted, PREPARE_GENOME falls back to agat_sp_keep_longest_isoform on the full GTF and pipes the result through gffread, so the backbone path is always populated. Pulls in the agat/spkeeplongestisoform nf-core module for the fallback and aliases gffread (GFFREAD_CANONICAL) plus gunzip (GUNZIP_CANONICAL_GTF) inside PREPARE_GENOME. The new ch_canonical_gtf channel is emitted from PREPARE_GENOME and routed via the RIBOSEQ workflow take block; ch_gtf remains the alignment-side channel. Docs (usage / output) cover MANE Select for human, the Ensembl_canonical grep recipe for other Ensembl organisms, the AGAT fallback, and the MANE-vs-Ensembl_canonical trade-off for non-coding genes. [skip ci]
…t [skip ci] # Conflicts: # CHANGELOG.md
The Nextflow 26.04 strict parser rejects `Math.floor()` in workflow script blocks; `intdiv(2)+1` is the parser-clean Groovy equivalent for strict-majority of N callers (N=2 -> 2, N=3 -> 2, N=4 -> 3).
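The strict-majority threshold described above can be checked with a few lines of Python. This is only an illustrative sketch, not the pipeline's Groovy code: the function names and the caller labels are hypothetical, and `//` stands in for Groovy's `intdiv(2)`.

```python
def strict_majority(n_callers: int) -> int:
    """Smallest caller count that is strictly more than half of n_callers."""
    if n_callers < 1:
        raise ValueError("at least one caller must be enabled")
    # Integer division mirrors intdiv(2); adding 1 makes the majority strict.
    return n_callers // 2 + 1


def rank_aggregation_callers(enabled_callers, excluded=("ribotricer",)):
    """Callers whose scores may enter rank aggregation (input order preserved)."""
    return [c for c in enabled_callers if c not in excluded]


if __name__ == "__main__":
    enabled = ["ribotish", "ribocode", "ribotricer"]  # hypothetical labels
    print(strict_majority(len(enabled)))       # 2
    print(rank_aggregation_callers(enabled))   # ['ribotish', 'ribocode']
```

Deriving both values from the enabled-caller list means the threshold follows the caller set at runtime instead of being hard-coded.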
…ip ci] `stageAs: 'NO_FILE'` does not produce a file when callers pass `[]` for the absent case, so `reference_gtf.name` was unsafe. A Groovy truthy check (`reference_gtf ? '-a ...' : ''`) handles both the empty-list and real-file cases correctly without an empty `-a` argument leaking through to ribotish.
…ed channel [skip ci] stageAs: 'NO_FILE' was symlinking real input GTFs as 'NO_FILE', losing the .gtf extension that ribotish predict's suffixType() parser needs. Drop the stageAs alias; with a plain optional path() and a Groovy truthy guard, the absent case stages nothing and the present case keeps its filename. Also add a dedicated 4-element ch_fasta_gtf_for_ribotish channel in workflows/riboseq/main.nf so the predict callers pass the new 4-tuple shape ([meta, fasta, gtf, []] for Phase 1; the 4th slot becomes the canonical backbone GTF when --extended_orf_analysis lands in Phase 2). ch_fasta_gtf stays 3-element for ribotricer prepareorfs.
…name collision [skip ci] When callers supply the same GTF as both the primary -g and optional -a input (e.g. test data that has only one chr20 GTF), Nextflow raises 'input file name collision'. Adding stageAs: 'secondary.gtf' renames the secondary input so the two inputs always land at distinct paths in the work directory. The .gtf extension is preserved so ribotish's suffixType parser still recognises it. Truthy check still distinguishes the absent ([]) and present cases.
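The truthy guard discussed in the three commits above amounts to "emit `-a` only when a secondary GTF was actually supplied". A minimal Python sketch of that idea follows; the function name is hypothetical and only the `-g` / `-a` flags come from the commit messages.

```python
def annotation_args(primary_gtf, reference_gtf=None):
    """Build the -g / -a portion of a command line.

    reference_gtf may be None, '' or [] when absent; all of these are falsy,
    so no empty -a argument leaks into the command.
    """
    args = ["-g", str(primary_gtf)]
    if reference_gtf:  # truthy guard: skips None, '' and []
        args += ["-a", str(reference_gtf)]
    return args


if __name__ == "__main__":
    print(annotation_args("hybrid.gtf", []))               # ['-g', 'hybrid.gtf']
    print(annotation_args("hybrid.gtf", "secondary.gtf"))  # ['-g', 'hybrid.gtf', '-a', 'secondary.gtf']
```

Testing the value itself, rather than a property such as its file name, is what lets one code path cover both the absent and the present case.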
nf-core#161 wired ch_canonical_gtf into QUANTIFY_STAR_SALMON and QUANTIFY_PSEUDO_TE, but the Salmon transcript fasta and Salmon indices in PREPARE_GENOME are built from the full GTF (transcripts.fa from RSEM_PREPAREREFERENCE on ch_gtf; SALMON_INDEX / SALMON_INDEX_TE on that transcript fasta). Inside the QUANTIFY_PSEUDO_ALIGNMENT subworkflow, the same gtf channel feeds CUSTOM_TX2GENE, so tx2gene was being derived from the canonical (subset) GTF while quant.sf contained the full transcriptome. tximport then produced gene-level rownames that included genes absent from tx2gene, and SE_GENE_UNIFIED's findColumnWithAllEntries(ids, rowdata=tx2gene) failed: No column contains all vector entries ENSG00000000419, ENSG00000019186, ... Per the canonical-backbone spec (issue_proposals/02_canonical_backbone.md), the full GTF is reserved for genome-guided alignment and transcriptome quantification; --canonical_gtf drives ORF calling, P-site calibration, plastid, riboWaltz, and (via the plastid_psite replacement step) DTE. ORF/P-site/plastid wiring on ch_canonical_gtf is unchanged. For gene-level DTE consistency with the spec's canonical-backbone intent (spec section "Decision note"), follow-up: filter DTE inputs (anota2seq / deltaTE matrices) to canonical genes at the comparison step, or generate a canonical-restricted gene-level matrix downstream of QUANTIFY_STAR_SALMON. The plastid_psite TE mode already enforces canonical semantics by overwriting Ribo-seq columns with in-frame P-site counts from the canonical GTF; pseudo and default modes do not yet.
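The mismatch described above (quantification on the full transcriptome, tx2gene from a subset GTF) can be illustrated with a small consistency check. This is a hedged sketch with made-up gene and transcript IDs; it is not the `findColumnWithAllEntries` implementation.

```python
def genes_missing_from_tx2gene(quantified_gene_ids, tx2gene):
    """Return quantified gene IDs that have no entry in the tx2gene mapping.

    tx2gene: dict of transcript_id -> gene_id.
    """
    known_genes = set(tx2gene.values())
    return sorted(g for g in set(quantified_gene_ids) if g not in known_genes)


if __name__ == "__main__":
    # Hypothetical IDs: the mapping was built from a subset annotation,
    # while quantification covered every gene in the full annotation.
    subset_tx2gene = {"tx1": "geneA", "tx2": "geneB"}
    quantified = ["geneA", "geneB", "geneC"]
    print(genes_missing_from_tx2gene(quantified, subset_tx2gene))  # ['geneC']
```

A non-empty result is the condition under which a lookup that requires every ID to be present would fail.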
…lastid_psite [skip ci] Snapshots regenerated on the nf-dev VM with the Phase 1 changes applied (plastid_psite default, canonical_gtf backbone, run_ribotricer opt-in, ribotish/predict -a flag). All 6 pipeline tests and 6 ribotish module tests pass against the new snapshots.
…ndation [skip ci] # Conflicts: # CHANGELOG.md
…tersect [skip ci] Adds the modules needed to classify StringTie novel transcripts against the full reference (gffcompare), filter the annotated GTF to a user-configurable set of class codes (new local FILTER_GTF_CLASS_CODE module, default `u` = intergenic only), and optionally subtract an rRNA/repeat blacklist (bedtools/intersect, strand-aware via `-v -s`). Workflow wiring lands in follow-up commits. Refs nf-core#164. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f bypass [skip ci] Refines the StringTie novel-transcript discovery path added in PR nf-core#158:
- Prefer RNA-seq BAMs for StringTie assembly; fall back to Ribo-seq BAMs with tightened parameters (new `--stringtie_ribo_fallback_args` default `-m 100 -c 5 -j 3 -f 0.05 -g 100`) and emit a runtime warning. Every published microprotein discovery workflow assembles from RNA-seq because Ribo-seq footprints produce fragmented assemblies and false-positive intergenic transcripts from P-site pileups; the fallback is provided only because RNA-seq input is optional in the riboseq pipeline.
- Add `--novel_gtf` to bypass StringTie and feed a user-supplied novel-transcript GTF into the same filtering chain. Useful when a curated annotation already exists (long-read RNA-seq, published ORF catalogues).
- After StringTie merge (or `--novel_gtf` load) run gffcompare against the full reference GTF, then filter to a user-configurable set of class codes via the new local FILTER_GTF_CLASS_CODE module (`--stringtie_class_codes`, default `u` = unambiguous intergenic; stranded users may add `x` for antisense overlap).
- Optional strand-aware rRNA/repeat blacklist intersect when `--rrna_blacklist` is supplied (`bedtools intersect -v -s`).
- Concatenate the canonical backbone (from nf-core#161) with the filtered novel GTF via the new local CONCAT_GTF module to produce `<outdir>/stringtie/hybrid_reference.gtf`. Exposed as the `hybrid_gtf` workflow emit channel for downstream ORF-caller wiring (subject of a separate issue). When both `--skip_stringtie` is true and `--novel_gtf` is null, the emit equals the canonical GTF so downstream wiring is uniform.
Refs nf-core#164.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
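The class-code filtering step above can be sketched in Python. This is an assumption-laden illustration, not the FILTER_GTF_CLASS_CODE module: it assumes the `class_code` attribute sits on `transcript` records of the gffcompare-annotated GTF and that child records share the `transcript_id`; all function names are hypothetical.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attrs(field):
    """Parse a GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def filter_by_class_code(gtf_lines, keep_codes=("u",)):
    """Keep every record of transcripts whose class_code is in keep_codes."""
    keep_codes = set(keep_codes)
    records = [l.rstrip("\n") for l in gtf_lines if l.strip() and not l.startswith("#")]

    # Pass 1: collect transcript IDs with an accepted class code.
    keep_ids = set()
    for line in records:
        cols = line.split("\t")
        if cols[2] == "transcript":
            attrs = parse_attrs(cols[8])
            if attrs.get("class_code") in keep_codes:
                keep_ids.add(attrs["transcript_id"])

    # Pass 2: keep transcript and child (exon) records for those IDs.
    return [l for l in records
            if parse_attrs(l.split("\t")[8]).get("transcript_id") in keep_ids]
```

Passing `keep_codes=("u", "x")` would mirror the stranded case mentioned above, where antisense overlaps are also retained.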
…lter, hybrid GTF) [skip ci]
… --extended_orf_analysis [skip ci] Add --extended_orf_analysis (default false) gating the hybrid GTF (canonical + filtered novel intergenic transcripts, from nf-core#164) into the genome-BAM ORF callers:
- Ribo-TISH predict: hybrid GTF on -g (discovery target); canonical backbone on -a (background + classification).
- Ribotricer prepare-orfs: hybrid GTF directly (no secondary-annotation concept; CDS-absent transcripts auto-labelled 'novel').
RiboCode, riboWaltz, plastid and the Salmon/pseudo-TE quantification subworkflows continue on the canonical backbone regardless of the flag. This is the transcriptome-BAM architectural constraint: the transcriptome BAM produced by STAR --quantMode TranscriptomeSAM is keyed to the reference transcriptome FASTA, and novel StringTie transcripts don't exist in that FASTA. Bringing RiboCode online for novel transcripts requires a second STAR pass against a hybrid transcriptome FASTA, tracked in nf-core#171.
When --extended_orf_analysis true is set but no novel-transcript source is configured (--skip_stringtie true and --novel_gtf null), the pipeline warns and falls back to canonical so users can compose flags incrementally. Default behaviour is unchanged: with --extended_orf_analysis false the workflow graph is identical to the pre-nf-core#165 state.
Closes nf-core#165
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_orf_analysis [skip ci]
nf-core#161 + nf-core#165 built the reference-GTF channels with an empty meta map ([:]), which makes meta.id null. RIBOTRICER_PREPAREORFS uses ${meta.id} for its prefix and was publishing null_candidate_orfs.tsv. Phase 1 tests didn't surface this because they ran with the new run_ribotricer=false default; Phase 2's --extended_orf_analysis tests enable ribotricer to exercise the hybrid-GTF wiring. Use [id: 'reference'] so the published filename is reference_candidate_orfs.tsv regardless of canonical / hybrid origin.
StringTie can produce transcripts with strand '.' when coverage is too sparse or splice signal is ambiguous. Ribo-TISH predict raises KeyError: '.' on those because it only indexes '+' and '-' strands when building the CDS regions dictionary. Drop unstranded GTF entries here so the downstream hybrid GTF only contains stranded transcripts.
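Dropping unstranded records, as described above, is a one-column filter on the GTF strand field. A minimal Python sketch (the function name is hypothetical; the pipeline's own implementation may differ):

```python
def drop_unstranded(gtf_lines):
    """Keep comment lines and records whose strand column is '+' or '-'."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 7 (index 6) of a GTF record is the strand.
        if len(cols) >= 7 and cols[6] in ("+", "-"):
            kept.append(line)
    return kept
```

Applying the filter before the hybrid GTF is built means downstream tools only ever see `+` or `-` strands.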
…kip ci] Two new pipeline-level tests exercise Phase 2's extended ORF discovery:
- tests/novel_gtf.nf.test: --novel_gtf bypass path. Uses a small synthetic intergenic chr20 GTF (tests/data/novel_intergenic_chr20.gtf) to drive gffcompare class-u filter -> CONCAT_GTF -> hybrid GTF -> Ribo-TISH predict on hybrid (-g) + canonical (-a).
- tests/stringtie_extended.nf.test: --skip_stringtie false path. Exercises full BAM_STRINGTIE_MERGE on the test profile's RNA-seq samples through the same gffcompare/filter/concat chain.
Both run with --extended_orf_analysis true. Ribotricer is left at the default (--run_ribotricer false) because the chr20 test data is too sparse for Ribotricer's periodicity inference; that coverage gap is expected to be closed by -profile test_full at Phase 4. Exempt tests/data/ from the top-level data/ ignore so committed test fixtures aren't dropped.
New local subworkflow BUILD_HYBRID_TRANSCRIPTOME that extracts spliced transcript sequences from the hybrid GTF (canonical + filtered novel intergenic) with gffread -w and rebuilds a STAR index against the same genome FASTA but with the hybrid GTF as --sjdbGTFfile. Emits the hybrid transcriptome FASTA and STAR index as separate channels for the second STAR alignment pass (issue nf-core#171).
…alysis [skip ci] Under --extended_orf_analysis true, run a second STAR alignment pass for Ribo-seq samples against the hybrid transcriptome (canonical + filtered novel intergenic) built by BUILD_HYBRID_TRANSCRIPTOME, then route the resulting hybrid transcriptome BAM and the hybrid GTF into RIBOCODE_GTFUPDATE / RIBOCODE_PREPARE / RIBOCODE_RIBOCODE in place of the canonical wiring. The reference STAR alignment, riboWaltz, plastid and Salmon-based quantification are unchanged. riboWaltz stays on the canonical reference transcriptome by design (nf-core#171 spec note: feeding it CDS-absent novel transcripts would degrade its diagnostic plots without contributing to ORF discovery). The default --extended_orf_analysis false path is unchanged.
- workflows/riboseq/main.nf: aliased FASTQ_ALIGN_STAR_HYBRID include; conditional second STAR pass on Ribo-seq reads only; conditional swap of RiboCode's transcriptome BAM + GTF inputs.
- conf/modules.config: withName blocks for the aliased hybrid STAR processes (GFFREAD_HYBRID_TRANSCRIPTOME, STAR_GENOMEGENERATE_HYBRID, FASTQ_ALIGN_STAR_HYBRID:STAR_ALIGN, and the hybrid BAM sort/stats chain), publishing under <outdir>/hybrid_star/.
- docs/usage.md, docs/output.md: document the second STAR pass, its cost, the rationale for riboWaltz staying on canonical, and the hybrid_star/ output layout.
- CHANGELOG.md: Added entry under v1.3.0dev / Added.
…RiboCode parity [skip ci]
…[skip ci] Wave container co-installs rpbp=4.0.1 and star=2.7.11b (prepare-rpbp-genome requires STAR on PATH). The three modules share that container; rpbp/buildconfig uses a coreutils container to render the YAML config from pipeline inputs so users don't have to author one. RUN_RPBP chains BUILDCONFIG -> PREPAREGENOME -> PREDICTORFS, broadcasting the shared index + config to per-sample predictions.
- New params run_rpbp (default false), extra_rpbp_preparegenome_args, extra_rpbp_predictorfs_args, rpbp_config_extra_yaml.
- Schema entries for all four params under the riboseq group.
- modules.config publishDir + ext.args for the three Rp-Bp processes; predicted ORFs land at <outdir>/orf_predictions/rpbp/.
- workflow gates RUN_RPBP on run_rpbp; appends 'rpbp' to the dynamic enabled_orf_callers list. Rp-Bp's Bayes factor is stable so it is retained in rank_aggregation_callers (only ribotricer is excluded).
- Same conditional canonical-vs-hybrid GTF selection as the other genome-BAM ORF callers (Ribo-TISH predict, Ribotricer).
- Runtime warning at workflow start when run_rpbp is true (~20-24h/rep).
- Docs (usage + output) and CHANGELOG entry.
…qs2) [skip ci] The coordinate-based merge in custom/orfmerge only groups genomically overlapping ORFs, so a micropeptide encoded at several distinct loci (typically repetitive regions) survives as separate catalogue rows. Following the GENCODE Ribo-seq ORF catalogue convention (Mudge et al. 2022, Nat Biotechnol, doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs collapse_cutoff 0.9), the catalogue subworkflow now clusters peptides with mmseqs/easycluster (--min-seq-id 0.9) and a new custom/orfcollapse module folds each multi-member small-ORF (aa_length <= 100) cluster down to one representative, unioning cross-caller / cross-sample evidence and gene mappings. bedtools/getfasta in the catalogue now uses -split -s -nameOnly so the AA FASTA is the spliced, strand-correct sequence keyed by orf_id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
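The collapse step described above can be sketched as follows. This is a simplified illustration, not the `custom/orfcollapse` module: it assumes cluster membership arrives as (representative, member) pairs, that each catalogue row carries `aa_length`, `callers`, `samples` and `genes`, and that ORFs above the length cutoff are passed through untouched. All field and function names are hypothetical.

```python
from collections import defaultdict


def collapse_small_orfs(cluster_pairs, catalogue, max_aa=100):
    """Fold multi-member small-ORF clusters down to one representative row.

    cluster_pairs: iterable of (representative_id, member_id) tuples.
    catalogue: dict orf_id -> {"aa_length": int, "callers": set,
                               "samples": set, "genes": set}.
    """
    clusters = defaultdict(list)
    for rep, member in cluster_pairs:
        clusters[rep].append(member)

    collapsed = {}
    for rep, members in clusters.items():
        small = sorted(m for m in members if catalogue[m]["aa_length"] <= max_aa)
        # ORFs above the length cutoff are never collapsed.
        for m in members:
            if catalogue[m]["aa_length"] > max_aa:
                collapsed[m] = dict(catalogue[m])
        if len(small) < 2:
            for m in small:
                collapsed[m] = dict(catalogue[m])
            continue
        # Keep the cluster representative when it is itself a small ORF.
        keep = rep if rep in small else small[0]
        merged = {"aa_length": catalogue[keep]["aa_length"]}
        for field in ("callers", "samples", "genes"):
            # Union evidence across every collapsed member.
            merged[field] = set().union(*(catalogue[m][field] for m in small))
        collapsed[keep] = merged
    return collapsed
```

Unioning the evidence sets means the surviving row still records every caller, sample and gene that supported any of the collapsed loci.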
…ed snapshots Strip the (+)/(-) strand suffix bedtools/getfasta -nameOnly -s appends so AA FASTA headers and MMseqs2 cluster ids map to bare orf_ids. Replace the orfcollapse module test with self-contained fixtures that exercise the collapse. Exclude the non-deterministic hybrid-STAR MultiQC aggregate files from extended-ORF snapshots (tests/.nftignore_orf) and regenerate the dotseq, novel_gtf and stringtie_extended snapshots on x86 (verified deterministic). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…notate .nftignore
- Sort orfmerge clusters by genomic coordinate before assigning orf_ids so the catalogue (and downstream ORF P-site counts) are byte-deterministic, not dependent on .collect() arrival order.
- Set mmseqs/easycluster --min-seq-id 0.9 in the catalogue subworkflow config.
- Regenerate the dotseq/novel_gtf/stringtie_extended snapshots (verified deterministic across two independent x86 runs).
- Annotate tests/.nftignore with per-output exclusion reasons from a two-run determinism audit (timestamps, parallel-summation float jitter, multithreaded EM/R stochasticity, hybrid-STAR multimapper non-determinism).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
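The determinism fix above, sorting before numbering, can be shown with a short Python sketch. The ID format, field names and function name are hypothetical; only the idea of sorting by genomic coordinate before assigning IDs comes from the commit message.

```python
def assign_orf_ids(clusters, prefix="ORF"):
    """Assign sequential IDs after sorting clusters by genomic coordinate.

    clusters: iterable of dicts with chrom, start, end and strand keys.
    The sort is lexical on chrom, which is deterministic though not
    karyotype order.
    """
    ordered = sorted(
        clusters,
        key=lambda c: (c["chrom"], c["start"], c["end"], c["strand"]),
    )
    return {f"{prefix}_{i:06d}": c for i, c in enumerate(ordered, start=1)}
```

Because the IDs depend only on the coordinates, the same cluster receives the same ID regardless of the order in which per-sample inputs arrive.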
…se modules
Pull the merged nf-core/modules versions of
orftable_fasta_gtf_buildorfcatalogue and its custom/orfmerge,
custom/orfcollapse (pandas), mmseqs/easycluster and seqkit/translate
components.
Reconcile the pipeline wiring to the new interface: pass val_collapse=true
(amino-acid small-ORF deduplication on by default), add SEQKIT_TRANSLATE
--trim, and fix the CUSTOM_ORFCOLLAPSE publish pattern to match the emitted
${prefix}.fasta so the catalogue AA FASTA publishes. Regenerate the dotseq,
novel_gtf and stringtie_extended catalogue snapshots.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the reference transcriptome, not canonical); clarify AGAT longest-CDS, plastid stays canonical, one-transcript-per-gene not required
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flows up the nf-core#182 PR-review correction (reviewer @suhrig) that removed the Ribo-seq-BAM StringTie fallback: NOVEL_TRANSCRIPT_DISCOVERY runs on RNA-seq BAMs only (errors when --skip_stringtie false with no RNA-seq), restores the strand-aware blacklist channel, renames --stringtie_class_codes back to --gffcompare_class_codes, and removes --stringtie_ribo_fallback_args. Keeps agg-truth in step with dev so the leaf reconciliation stays empty. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Ribotricer and Rp-Bp runtime warnings, the usage.md caller notes, the nf-core#169 CHANGELOG entry and a main.nf comment cited an internal benchmark (FK/NGB, May 2026) with bare Spearman/Jaccard figures a community user can't act on or verify. Keep the actionable points (Ribotricer scores unstable -> binary calls only; Rp-Bp ~20-24h/replicate, stable Bayes-factor score retained for ranking) and drop the unpublishable provenance. Also corrects a RiboCode 'canonical wiring' comment to 'reference-transcriptome wiring'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…predict Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tParameters Param-only validation (novel-source no-op + plastid-skip no-op) belongs with genomeExistsError()/dotseqPrerequisitesError() in the pipeline-init utils subworkflow, not inline in the RIBOSEQ workflow body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…angelog Mirror of the cleanup applied on feat/171-second-star-pass (PR nf-core#184): remove PR/issue numbers from the second-pass / dispatch / hybrid-STAR comments and condense the nf-core#171 CHANGELOG entry. Strict-syntax (channel factory casing) deferred to a later pipeline-wide pass.
Route the hybrid second-pass transcriptome BAM through the same BAM_DEDUP_UMI subworkflow as the primary path so RiboCode sees unique molecules rather than PCR duplicates under --with_umi; hybrid dedup outputs publish under hybrid_star/. Also drop a stray double space in the RiboCode dispatch closures. Mirror of the fix on feat/171-second-star-pass (PR nf-core#184).
…utput filenames + step list)
- schema run_rpbp help_text: drop the FK/NGB internal benchmark numbers
(uncitable; mirrors the same removal already applied for Ribotricer/Rp-Bp
log.warns and usage.md).
- docs/output.md: predicted-ORF outputs are published as
*.predicted-orfs.{bed.gz,dna.fa,protein.fa} (the module prefix is
predicted-orfs; the previous .filtered.* names do not exist).
- docs/usage.md: the subworkflow runs get-periodic-lengths-offsets, not
get-all-read-filtering-counts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rp-Bp builds its candidate ORF set per transcript isoform and deduplicates by genomic coordinates, then resolves overlaps itself (longest-per-stop, best Bayes factor). Feeding it the one-transcript-per-gene canonical backbone silently dropped isoform-specific ORFs and biased ORF-type labelling toward canonical CDS, with no compensating benefit. Route it to the full ch_gtf (default) like PRICE; extended mode keeps the hybrid GTF for novel transcripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…annotation Document, in the Rp-Bp and PRICE usage sections, that both callers receive the full --gtf rather than the canonical one-transcript-per-gene backbone: each enumerates/handles ORFs across all isoforms and resolves overlaps itself, so collapsing to canonical would drop isoform-specific ORFs and bias ORF-type labels. Cites Malone et al. 2017. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three Rp-Bp/PRICE items surfaced during the #186 (Rp-Bp) alignment that are leaf/aggregation-level. They were deliberately left out of the component PRs, either to preserve leaf==agg-truth or because they touch already-merged components, so flagging them here to reconcile when finalising this branch.
custom/orfmerge gains --min-callers / --min-samples (default 1, no
filtering) and emits an additional consensus catalogue view
(*.consensus.{bed12,tsv,orf_to_gene.tsv}); published under
<outdir>/orf_catalogue/consensus/. The peptide-level smORF collapse is
behind --skip_orf_collapse (default off). orfcollapse docstring corrected
to state the smORF-only / locus-agnostic / global-identity departures from
the GENCODE reference rather than implying equivalence.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
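The `--min-callers` / `--min-samples` consensus view from the commit above can be pictured as a simple filter over catalogue rows. This Python sketch uses hypothetical field names (`callers`, `samples`) and is not the `custom/orfmerge` template itself.

```python
def consensus_view(catalogue, min_callers=1, min_samples=1):
    """Return the catalogue rows supported by enough callers and samples.

    catalogue: dict orf_id -> {"callers": set, "samples": set, ...}.
    With both thresholds at 1 (the defaults) nothing is filtered out.
    """
    return {
        orf_id: row
        for orf_id, row in catalogue.items()
        if len(row["callers"]) >= min_callers and len(row["samples"]) >= min_samples
    }
```

Keeping the full catalogue and the consensus view as separate outputs lets users tighten the thresholds without losing the unfiltered evidence.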
Collapse the per-aspect nf-core#164/nf-core#154 bullets to one line per issue and trim the verbose modernisation entries to single-line summaries. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
novel_gtf, stringtie_extended and dotseq now publish the consensus ORF catalogue (orf_catalogue/consensus/) alongside the full catalogue. Regenerated on x86_64 (NXF 25.04.8, nf-test 0.9.3, --profile=+docker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…onsensus-view code Match the vendored orfmerge template (two-pass catalogue write), its regenerated module snapshot, and the orftable subworkflow (consensus emits + regenerated snapshot) to the upstream nf-core/modules implementation. Output is byte-identical; pipeline snapshots are unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…th upstream Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sync custom/orfcollapse + the orftable subworkflow to the consensus-after- collapse implementation, and route orf_catalogue/consensus/ from CUSTOM_ORFCOLLAPSE when collapse runs, falling back to CUSTOM_ORFMERGE under --skip_orf_collapse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s #12167 merge Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Upstream module is CUSTOM_BED12CODONPOSITIONS (not ORF_INFRAME_PSITES); the counts TSV is three-column (sample_id, orf_id, count); versions are emitted on the versions topic channel, not a versions.yml file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
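A three-column long-format table of this shape (sample_id, orf_id, count) can be pivoted into an ORF × sample matrix. The sketch below assumes a tab-separated file with a header row naming those three columns, and zero-fills ORF/sample combinations that have no row; the function name and the optional `all_orf_ids` argument are hypothetical.

```python
import csv
import io


def pivot_counts(tsv_text, all_orf_ids=None):
    """Pivot long-format counts into {orf_id: {sample_id: count}}.

    all_orf_ids optionally lists ORFs that must appear in the matrix even
    if no sample reported them (they receive zeros).
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    samples = sorted({r["sample_id"] for r in rows})
    orf_ids = sorted(set(all_orf_ids or ()) | {r["orf_id"] for r in rows})

    # Start from an all-zero matrix so absent combinations are zero-filled.
    matrix = {orf: {s: 0 for s in samples} for orf in orf_ids}
    for r in rows:
        matrix[r["orf_id"]][r["sample_id"]] += int(r["count"])
    return samples, matrix
```

Sorting both axes keeps the resulting matrix layout independent of input row order.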
Summary
Modernisation umbrella PR for the riboseq pipeline. Brings issues #160-#171 to a single aggregation branch ready for review, ahead of being split into per-issue PRs.
Default-path behaviour is preserved for everything except #160 (the documented breaking change to `--te_quantification_method`). All new ORF callers and the extended-ORF / cross-caller catalogue paths are opt-in.
What this PR does
Quantification defaults (#160)
- Flip the `--te_quantification_method` default from `alignment` (STAR + Salmon) to `plastid_psite` (in-frame P-site counts). Salmon's coverage / fragment-length / multi-mapping assumptions are inappropriate for short Ribo-seq footprints; in-frame P-site counts are the scientifically correct quantity. The two methods produce different count values; re-running an existing cohort on the new default will not reproduce prior matrices. Backward-compatible behaviour is available via `--te_quantification_method alignment`.
Annotation backbone (#161)
- `--canonical_gtf` for an explicit one-transcript-per-gene canonical annotation (MANE Select / Ensembl canonical). Used by ORF calling, riboWaltz P-site calibration, plastid P-site quantification, and translational-efficiency analysis. Falls back to `AGAT_SP_KEEP_LONGEST_ISOFORM` extraction from `--gtf` when not supplied. The full `--gtf` continues to drive genome-guided alignment.
Ribotricer demotion (#163)
- Replace `--skip_ribotricer` with `--run_ribotricer` (default `false`). Default caller set is now Ribo-TISH + RiboCode; agreement logic is parameterised on the runtime-enabled caller set. Ribotricer's score column drops out of cross-caller rank aggregation when not enabled.
Novel transcript discovery (#164)
- … `--stringtie_ribo_fallback_args`) and a runtime warning.
- `--novel_gtf` to bypass StringTie with a user-supplied GTF.
- `--stringtie_class_codes` (default `u`, intergenic only).
- `--rrna_blacklist` (`bedtools intersect -v -s`).
- … `<outdir>/stringtie/hybrid_reference.gtf`) by concatenating canonical backbone + filtered novel transcripts. Exposed as the `hybrid_gtf` workflow channel.
Extended ORF analysis (#165)
- `--extended_orf_analysis` (default `false`). Wires the hybrid GTF into the genome-BAM callers: Ribo-TISH `predict` (`-g` hybrid, `-a` canonical) and Ribotricer `prepare-orfs` (hybrid). When enabled without a novel-transcript source the pipeline warns and falls back to canonical.
Hybrid transcriptome for RiboCode (#171)
- Under `--extended_orf_analysis true`, run a second STAR pass against a hybrid transcriptome (canonical + filtered novel intergenic) so RiboCode can call ORFs on novel transcripts. New `BUILD_HYBRID_TRANSCRIPTOME` subworkflow extracts spliced transcript sequences with `gffread -w` and rebuilds the STAR index with the hybrid GTF as `--sjdbGTFfile`. The hybrid transcriptome FASTA and index build once per run; the second STAR pass is Ribo-seq-only. riboWaltz, plastid and Salmon stay on the canonical transcriptome (CDS-dependent assumptions). Hybrid alignments land under `<outdir>/hybrid_star/`.
Rp-Bp ORF caller (#169)
- `--run_rpbp` (default `false`). Adds Rp-Bp (Malone et al. 2017) as an opt-in Tier-1 Bayesian-strict ORF caller, complementing RiboCode. Implemented as a `RUN_RPBP` subworkflow chaining six per-tool processes (`extract-metagene-profiles`, `estimate-metagene-profile-bayes-factors`, `select-periodic-offsets`, `extract-orf-profiles`, `estimate-orf-bayes-factors`, `select-final-prediction-set`) plus `RPBP_BUILDCONFIG` and `RPBP_PREPAREGENOME` (once per run). Splitting avoids re-running flexbar/bowtie/STAR inside rpbp - the pipeline's standard STAR alignment is used - and lets each step cache independently on resume. Wave-built container co-installs `rpbp=4.0.1` and `star=2.7.11b`. Per-sample predicted-ORF BED + DNA + protein FASTA land under `<outdir>/orf_predictions/rpbp/`. Same conditional canonical-vs-hybrid GTF selection as Ribo-TISH / Ribotricer. Runtime warning at submit; expect ~10-24 h per replicate at genome scale. Docs note the STAR-defaults gap vs upstream rpbp and link tracking issue #173 (STAR alignment params: consider Ribo-seq-tuned defaults per sample type).
PRICE ORF caller (#170)
- `--run_price` (default `false`). Adds PRICE (Erhard et al. 2018) as an opt-in Tier-2 caller for near-cognate ORF discovery. Invoked one-shot across the riboseq cohort (`gedi -e IndexGenome` + `gedi -e Price`); calls flow into the cross-caller catalogue. Container via Wave from the merged `bioconda::gedi=1.0.6a` recipe. PRICE is fed the full multi-isoform annotation (`--gtf`), not the one-transcript-per-gene canonical/hybrid backbone: it resolves overlapping ORFs and rescues multimappers with its own EM (Erhard et al. 2018, PMID 29529017), so the canonical backbone (which exists only to disambiguate P-site quantification) would only narrow PRICE's discovery and bias its ORF-type classification.
Cross-caller ORF catalogue (#167)
- `orftable_fasta_gtf_buildorfcatalogue` subworkflow (gates on `--extended_orf_analysis true` + non-empty enabled-caller set). Per-caller normalisers (Ribo-TISH, RiboCode, Ribotricer, Rp-Bp, PRICE) convert each per-sample output to a unified BED12 + sidecar TSV; `custom/orfmerge` merges with a class-aware strategy (transcript-ID grouping for annotated multi-exon CDS, 80% reciprocal overlap for single-exon novel intergenic and smORFs ≤ 100 aa). Small ORFs (orf_class == smORF, aa_length ≤ 100) are additionally collapsed by amino-acid sequence identity (`mmseqs/easycluster` `--min-seq-id 0.9`, then the `custom/orfcollapse` module) so the same micropeptide encoded at distinct, non-overlapping loci becomes a single catalogue entry, following the GENCODE Ribo-seq ORF catalogue convention (Mudge et al. 2022, PMID 35831657). Collapse is on by default. Emits `cohort.catalogue.{bed12,tsv,fasta}`, `cohort.catalogue.orf_to_gene.tsv`, and a MultiQC custom-content per-class count table (`cohort.catalogue.mqc.tsv`) under `<outdir>/orf_catalogue/`.
Per-ORF P-site quantification (#166)
- `QUANTIFY_ORF_PSITE` subworkflow (additive to existing gene-level path). Expands the cohort BED12 catalogue into codon-start positions (frame defined by each ORF's own ATG, not GTF `phase`), runs per-sample `bedtools intersect` against plastid wiggle tracks, assembles an ORF × sample count matrix (`orf_psite_counts.tsv`, zero-filled for ORFs absent from a sample). Runs whenever `--extended_orf_analysis true`, at least one caller is enabled, and plastid is not skipped. Warns and skips when `--skip_plastid true`. Matrix published under `<outdir>/orf_quantification/` and emitted on the `orf_count_matrix` workflow channel.
ORF-level differential translation (#168)
- `--extended_orf_analysis true` with `--te_quantification_method plastid_psite`:
- … `canonical_cds` ORFs (Tier 1). New module `orf_to_gene_cds_counts` emits the long-format replacement table that feeds the existing `REPLACE_RIBOSEQ_COUNTS_IN_MATRIX` substitution. Keeps uORF / dORF / novel_u / smORF dynamics out of the gene-level sum.
- … `deseq2/orf_dte`) and `DTE_ORF_LEVEL` subworkflow (Tier 2) fit `~ condition + seq_type + condition:seq_type` per ORF, Ribo-seq numerator from `orf_psite_counts.tsv`, RNA-seq denominator joined from gene-level Salmon counts via `orf_to_gene.tsv`. Novel intergenic ORFs with no host gene are dropped; low-count ORFs filtered before DESeq2 fitting. Results per contrast under `<outdir>/dte/orf_level/`; CDS-aggregated gene-level matrix under `<outdir>/dte/gene_level_cds_aggregated/`.
- `--run_dotseq` (default `false`) added as an opt-in placeholder for the Tier-3 DOTSeq DTE/DOU analysis, deferred while DOTSeq remains in Bioconductor devel. Setting the flag emits an info message and runs no analysis.
- … `docs/usage.md`.
Ribotish quality / hybrid mode (#162)
Validation
- … `--extended_orf_analysis true` run: SUCCEEDED.
- … `test_full` per project convention; bioinformatics correctness is established. Re-validation deferred to a subsequent CE configuration.
Test plan
- `tests/default.nf.test` passes (defaults path)
- `tests/stringtie_extended.nf.test` passes
- `tests/novel_gtf.nf.test` passes
- `tests/te_plastid_psite.nf.test` snapshot is up to date
- … `--extended_orf_analysis true` run is green
- … `--run_price true` run is green
- … `nf-core lint`)
CI on the test matrix (docker, Nextflow `25.04.8` + `latest-everything`) and `nf-core lint` are green. Kept as a draft: the umbrella branch is to be split into the per-issue PRs (#186-#189 stack) for review.
🤖 Generated with Claude Code