Riboseq modernisation: Phases 1-3 (#160-#171 + PRICE) by pinin4fjords · Pull Request #174 · nf-core/riboseq

pinin4fjords · 2026-05-18T06:15:40Z

Summary

Modernisation umbrella PR for the riboseq pipeline. Brings issues #160-#171 to a single aggregation branch ready for review, ahead of being split into per-issue PRs.

Default-path behaviour is preserved for everything except #160 (the documented breaking change to --te_quantification_method). All new ORF callers and the extended-ORF / cross-caller catalogue paths are opt-in.

What this PR does

Quantification defaults (#160)

Breaking default change. Flip --te_quantification_method default from alignment (STAR + Salmon) to plastid_psite (in-frame P-site counts). Salmon's coverage / fragment-length / multi-mapping assumptions are inappropriate for short Ribo-seq footprints; in-frame P-site counts are the scientifically correct quantity. The two methods produce different count values (Salmon TPM-derived pseudo-counts vs raw integer in-frame P-site sums), so re-running an existing cohort on the new default will not reproduce prior matrices. Backward-compatible behaviour is available via --te_quantification_method alignment.

Annotation backbone (#161)

Add --canonical_gtf for an explicit one-transcript-per-gene canonical annotation (MANE Select / Ensembl canonical). Used by ORF calling, riboWaltz P-site calibration, plastid P-site quantification, and translational-efficiency analysis. Falls back to AGAT_SP_KEEP_LONGEST_ISOFORM extraction from --gtf when not supplied. The full --gtf continues to drive genome-guided alignment.

Ribotricer demotion (#163)

Replace --skip_ribotricer with --run_ribotricer (default false). Default caller set is now Ribo-TISH + RiboCode; agreement logic is parameterised on the runtime-enabled caller set. Ribotricer's score column drops out of cross-caller rank aggregation when not enabled.

Novel transcript discovery (#164)

StringTie reference-guided novel-transcript discovery. Prefer RNA-seq BAMs; fall back to Ribo-seq BAMs with tightened parameters (--stringtie_ribo_fallback_args) and a runtime warning.
--novel_gtf to bypass StringTie with a user-supplied GTF.
Class-code filter via --stringtie_class_codes (default u, intergenic only).
Optional strand-aware rRNA/repeat blacklist via --rrna_blacklist (bedtools intersect -v -s).
Hybrid GTF (<outdir>/stringtie/hybrid_reference.gtf) by concatenating canonical backbone + filtered novel transcripts. Exposed as the hybrid_gtf workflow channel.
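The class-code filter step can be illustrated with a minimal Python sketch. It is not the pipeline's FILTER_GTF_CLASS_CODE module; it assumes the gffcompare-annotated GTF carries `class_code` on transcript records only, so it first collects matching transcript IDs and then keeps every record belonging to them. The function name `filter_by_class_code` is hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
CLASS_CODE = re.compile(r'class_code "([^"]+)"')


def _fields(line):
    """Split a GTF record into its 9 columns; None for comments or malformed rows."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    return fields if len(fields) >= 9 else None


def filter_by_class_code(gtf_lines, keep_codes=("u",)):
    """Keep all records of transcripts whose class_code is in keep_codes."""
    lines = list(gtf_lines)

    # Pass 1: collect IDs of transcripts with an accepted class code.
    kept_ids = set()
    for line in lines:
        fields = _fields(line)
        if fields is None or fields[2] != "transcript":
            continue
        code = CLASS_CODE.search(fields[8])
        tid = TRANSCRIPT_ID.search(fields[8])
        if code and tid and code.group(1) in keep_codes:
            kept_ids.add(tid.group(1))

    # Pass 2: keep transcript and exon records that belong to those IDs.
    kept = []
    for line in lines:
        fields = _fields(line)
        if fields is None:
            continue
        tid = TRANSCRIPT_ID.search(fields[8])
        if tid and tid.group(1) in kept_ids:
            kept.append(line)
    return kept
```

With the default `keep_codes=("u",)` only intergenic transcripts survive; passing `("u", "x")` would additionally keep antisense-overlap transcripts.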

Extended ORF analysis (#165)

--extended_orf_analysis (default false). Wires the hybrid GTF into the genome-BAM callers: Ribo-TISH predict (-g hybrid, -a canonical) and Ribotricer prepare-orfs (hybrid). When enabled without a novel-transcript source the pipeline warns and falls back to canonical.

Hybrid transcriptome for RiboCode (#171)

Under --extended_orf_analysis true, run a second STAR pass against a hybrid transcriptome (canonical + filtered novel intergenic) so RiboCode can call ORFs on novel transcripts. New BUILD_HYBRID_TRANSCRIPTOME subworkflow extracts spliced transcript sequences with gffread -w and rebuilds the STAR index with the hybrid GTF as --sjdbGTFfile. The hybrid transcriptome FASTA and index build once per run; the second STAR pass is Ribo-seq-only. riboWaltz, plastid and Salmon stay on the canonical transcriptome (CDS-dependent assumptions). Hybrid alignments land under <outdir>/hybrid_star/.

Rp-Bp ORF caller (#169)

--run_rpbp (default false). Adds Rp-Bp (Malone et al. 2017) as an opt-in Tier-1 Bayesian-strict ORF caller, complementing RiboCode. Implemented as a RUN_RPBP subworkflow chaining six per-tool processes (extract-metagene-profiles, estimate-metagene-profile-bayes-factors, select-periodic-offsets, extract-orf-profiles, estimate-orf-bayes-factors, select-final-prediction-set) plus RPBP_BUILDCONFIG and RPBP_PREPAREGENOME (once per run). Splitting avoids re-running flexbar/bowtie/STAR inside rpbp (the pipeline's standard STAR alignment is used instead) and lets each step cache independently on resume. Wave-built container co-installs rpbp=4.0.1 and star=2.7.11b. Per-sample predicted-ORF BED + DNA + protein FASTA land under <outdir>/orf_predictions/rpbp/. Same conditional canonical-vs-hybrid GTF selection as Ribo-TISH / Ribotricer. Runtime warning at submit; expect ~20-24 h per replicate at genome scale. Docs note the STAR-defaults gap vs upstream rpbp and link tracking issue #173 (STAR alignment params: consider Ribo-seq-tuned defaults per sample type).

PRICE ORF caller (#170)

--run_price (default false). Adds PRICE (Erhard et al. 2018) as an opt-in Tier-2 caller for near-cognate ORF discovery. Invoked one-shot across the riboseq cohort (gedi -e IndexGenome + gedi -e Price); calls flow into the cross-caller catalogue. Container via Wave from the merged bioconda::gedi=1.0.6a recipe. PRICE is fed the full multi-isoform annotation (--gtf), not the one-transcript-per-gene canonical/hybrid backbone: it resolves overlapping ORFs and rescues multimappers with its own EM (Erhard et al. 2018, PMID 29529017), so the canonical backbone (which exists only to disambiguate P-site quantification) would only narrow PRICE's discovery and bias its ORF-type classification.

Cross-caller ORF catalogue (#167)

nf-core orftable_fasta_gtf_buildorfcatalogue subworkflow (gates on --extended_orf_analysis true + non-empty enabled-caller set). Per-caller normalisers (Ribo-TISH, RiboCode, Ribotricer, Rp-Bp, PRICE) convert each per-sample output to a unified BED12 + sidecar TSV; custom/orfmerge merges with a class-aware strategy (transcript-ID grouping for annotated multi-exon CDS, 80% reciprocal overlap for single-exon novel intergenic and smORFs ≤ 100 aa). Small ORFs (orf_class == smORF, aa_length ≤ 100) are additionally collapsed by amino-acid sequence identity (mmseqs/easycluster --min-seq-id 0.9 then the custom/orfcollapse module) so the same micropeptide encoded at distinct, non-overlapping loci becomes a single catalogue entry, following the GENCODE Ribo-seq ORF catalogue convention (Mudge et al. 2022, PMID 35831657). Collapse is on by default. Emits cohort.catalogue.{bed12,tsv,fasta}, cohort.catalogue.orf_to_gene.tsv, and a MultiQC custom-content per-class count table (cohort.catalogue.mqc.tsv) under <outdir>/orf_catalogue/.
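The 80% reciprocal-overlap criterion used for single-exon ORFs can be sketched as follows. This is an illustration only, not the custom/orfmerge code: it assumes 0-based half-open BED intervals on the same chromosome and strand, and the function name `reciprocal_overlap` is hypothetical.

```python
def reciprocal_overlap(a, b, min_frac=0.8):
    """True when the shared span covers at least min_frac of BOTH intervals.

    a and b are (start, end) tuples in 0-based, half-open BED coordinates.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= a_start or b_end <= b_start:
        raise ValueError("intervals must have positive length")

    # Length of the shared span; zero when the intervals are disjoint.
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))

    # The fraction is checked against each interval's own length, so a short
    # ORF nested inside a much longer one does not qualify.
    return (overlap / (a_end - a_start) >= min_frac
            and overlap / (b_end - b_start) >= min_frac)
```

Requiring the threshold in both directions is what keeps a small ORF that sits entirely inside a long one from being merged into it.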

Per-ORF P-site quantification (#166)

QUANTIFY_ORF_PSITE subworkflow (additive to existing gene-level path). Expands the cohort BED12 catalogue into codon-start positions (frame defined by each ORF's own ATG, not GTF phase), runs per-sample bedtools intersect against plastid wiggle tracks, assembles an ORF × sample count matrix (orf_psite_counts.tsv, zero-filled for ORFs absent from a sample). Runs whenever --extended_orf_analysis true, at least one caller is enabled, and plastid is not skipped. Warns and skips when --skip_plastid true. Matrix published under <outdir>/orf_quantification/ and emitted on the orf_count_matrix workflow channel.
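A minimal Python sketch of the codon-start expansion, under stated assumptions: the ORF is given as BED12 block fields, the frame starts at the first exonic base in transcript orientation, and for minus-strand ORFs the reported position is the genomic coordinate of each codon's first transcribed base. The function names and the dictionary standing in for a plastid wiggle track are hypothetical; the pipeline itself uses bedtools intersect.

```python
def codon_start_positions(chrom_start, strand, block_sizes, block_starts):
    """Genomic coordinates (0-based) of the first base of every codon.

    BED12 blockStarts are offsets relative to chromStart.
    """
    bases = []
    for size, offset in zip(block_sizes, block_starts):
        begin = chrom_start + offset
        bases.extend(range(begin, begin + size))

    # Walk the spliced ORF 5'->3': reverse the base order on the minus strand.
    if strand == "-":
        bases.reverse()

    # Every third base, starting at the ORF's own start codon.
    return bases[0::3]


def in_frame_count(codon_starts, psite_signal):
    """Sum P-site signal at codon starts; positions without signal count as 0."""
    return sum(psite_signal.get(pos, 0) for pos in codon_starts)


# Two-exon ORF on the plus strand: bases 100-103 and 110-114.
starts = codon_start_positions(100, "+", [4, 5], [0, 10])
signal = {100: 5, 101: 2, 103: 1, 112: 4}  # toy P-site track
print(starts, in_frame_count(starts, signal))
```

Because the expansion walks the spliced block list, a codon that spans an exon junction is still counted once, at the position of its first base.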

ORF-level differential translation (#168)

Under --extended_orf_analysis true with --te_quantification_method plastid_psite:
- The gene-level TE Ribo-seq numerator is re-aggregated from per-ORF P-site counts summing only canonical_cds ORFs (Tier 1). New module orf_to_gene_cds_counts emits the long-format replacement table that feeds the existing REPLACE_RIBOSEQ_COUNTS_IN_MATRIX substitution. Keeps uORF / dORF / novel_u / smORF dynamics out of the gene-level sum.
- New per-ORF DESeq2 interaction-model module (deseq2/orf_dte) and DTE_ORF_LEVEL subworkflow (Tier 2) fit ~ condition + seq_type + condition:seq_type per ORF, with the Ribo-seq numerator taken from orf_psite_counts.tsv and the RNA-seq denominator joined from gene-level Salmon counts via orf_to_gene.tsv. Novel intergenic ORFs with no host gene are dropped, and low-count ORFs are filtered before DESeq2 fitting. Results per contrast are published under <outdir>/dte/orf_level/; the CDS-aggregated gene-level matrix is published under <outdir>/dte/gene_level_cds_aggregated/.
- --run_dotseq (default false) added as an opt-in placeholder for the Tier-3 DOTSeq DTE/DOU analysis, deferred while DOTSeq remains in Bioconductor devel. Setting the flag emits an info message and runs no analysis.
- Row-independence caveat for Tier 2 (ORFs sharing a gene-level RNA-seq denominator are perfectly correlated after the join) is documented in docs/usage.md.
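The Tier 1 re-aggregation (gene-level sum over canonical_cds ORFs only) can be sketched in Python as below. This is not the orf_to_gene_cds_counts module; the in-memory dictionaries and the function name `aggregate_cds_counts` are hypothetical stand-ins for the catalogue TSV, the ORF-to-gene map and the per-ORF count matrix.

```python
from collections import defaultdict


def aggregate_cds_counts(orf_counts, orf_to_gene, orf_class):
    """Sum per-ORF P-site counts to gene level, using canonical_cds ORFs only.

    orf_counts:  {orf_id: {sample: count}}
    orf_to_gene: {orf_id: gene_id}
    orf_class:   {orf_id: class label}
    """
    gene_counts = defaultdict(lambda: defaultdict(int))
    for orf_id, per_sample in orf_counts.items():
        # uORF / dORF / smORF and other non-CDS classes stay out of the sum.
        if orf_class.get(orf_id) != "canonical_cds":
            continue
        gene_id = orf_to_gene.get(orf_id)
        if gene_id is None:  # no host gene, nothing to aggregate into
            continue
        for sample, count in per_sample.items():
            gene_counts[gene_id][sample] += count
    return {gene: dict(samples) for gene, samples in gene_counts.items()}
```

Restricting the sum to one class means a strongly translated uORF cannot inflate the apparent translation of its host gene's main CDS.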

Ribotish quality / hybrid mode (#162)

Investigation closed: Ribotish QC continues to consume the canonical GTF (it's calibration, not novel discovery).

Validation

Full-scale defaults run on Seqera Platform (eu-west-2): SUCCEEDED.
Full-scale --extended_orf_analysis true run: SUCCEEDED.
Full-scale PRICE run: SUCCEEDED.
Full-scale Rp-Bp run: 2/6 samples completed end-to-end and produced final ORF predictions. The remaining 4 hit infrastructure failures unrelated to pipeline logic and are documented under test_full per project convention; bioinformatics correctness is established. Re-validation deferred to a subsequent CE configuration.

Test plan

CI on the test matrix (docker, Nextflow 25.04.8 + latest-everything) and nf-core lint are green. Kept as a draft: the umbrella branch is to be split into the per-issue PRs (#186-#189 stack) for review.

🤖 Generated with Claude Code

Install nf-core/bam_stringtie_merge subworkflow and wire its merged GTF into the genome-BAM-side ORF callers (Ribo-TISH, Ribotricer, plastid) plus GTF_TO_INFRAME_PSITES. Off by default (skip_stringtie = true). RiboCode, riboWaltz, and alignment-mode Salmon stay on the reference annotation: their transcriptome BAMs were keyed to it at STAR alignment time, and a stringtie_realign mode for those is left for a follow-on PR. Closes nf-core#157 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The riboseq nf-core modules for ribotish/quality, ribotish/predict, and ribotricer/prepareorfs accept only a single GTF input - no slot for a secondary annotation. Without that slot we can't pass the merged GTF as primary -g and the reference GTF as secondary -a, which is the only correct way to wire Ribo-TISH predict / Ribotricer for novel-ORF discovery on a CDS-less StringTie merged GTF. Routing the merged GTF directly into those tools as the only annotation gives wrong outputs:
- ribotish quality fails ("Counted reads: 0") because it filters reads by reference CDS which the merged GTF lacks.
- plastid metagene generate is similarly CDS-bound.
- ribotish predict / ribotricer would scan transcripts but lose all classification (every ORF tagged Novel).
For correctness, this PR now publishes the merged GTF as a side product only. Wiring the merged GTF into the ORF callers is left for a follow-on that depends on upstream nf-core/modules updates exposing the secondary annotation argument.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…[skip ci] Extend the reference tuple with a fourth, optional `reference_gtf` slot so callers can supply a CDS-bearing GTF for TIS background modelling and TisType classification, paired with a CDS-free discovery GTF passed via `-g`. Pattern uses `stageAs: 'NO_FILE'` plus a `name != 'NO_FILE'` guard so the existing single-GTF call sites keep working by passing `[]`.
- main.nf: extend `meta3` tuple, add `reference_gtf_arg` and wire to the `ribotish predict` command line.
- meta.yml: document the new optional input.
- tests/main.nf.test: pass `[]` in the existing tests; add new pass-through and stub tests that supply a non-empty `-a` GTF. Snapshot file intentionally left unchanged (new entries will be generated on the next `nf-test test` run; no existing entries modified).
Implements nf-core/modules#11644.

Replace `--skip_ribotricer` (default false) with `--run_ribotricer` (default false). Ribotricer benchmarking showed its ORF-score column is rank-unstable across biological replicates (Spearman 0.288 vs Jaccard 0.770), so it should not contribute to cross-caller rank aggregation.
- Schema/config: add run_ribotricer (boolean, default false); help_text explains the rank-instability rationale.
- Workflow: gate Ribotricer behind `if (params.run_ribotricer)` and emit a log.warn when enabled. Add a dynamic ORF-caller list and derive the rank-aggregation subset (Ribotricer excluded) and the strict-majority agreement threshold (floor(N/2)+1) so the agreement logic introduced by nf-core#7 adapts to whatever caller set is enabled at runtime.
- conf/test.config: drop now-removed skip_ribotricer (test profile no longer needs to opt out of a default-on caller).
- README / docs/output.md: re-label Ribotricer as opt-in.
- CHANGELOG: parameter rename in the unreleased Parameters table plus a Changed entry referencing the issue.
Closes nf-core#163.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Flip the schema and nextflow.config default for --te_quantification_method from alignment (STAR + Salmon) to plastid_psite (in-frame P-site counts from plastid). Salmon's coverage-uniformity, fragment-length and multi-mapping assumptions are inappropriate for short, length-constrained Ribo-seq footprints; in-frame P-site counts are the scientifically correct quantity for Ribo-seq. Breaking default change: the two methods produce fundamentally different per-gene count values (Salmon TPM-derived pseudo-counts vs raw integer in-frame P-site sums), so re-running an existing cohort on a newer pipeline version with the default will not reproduce the previous count matrix. Users who need backward-compatible Salmon-style counts must set --te_quantification_method alignment explicitly. Updates schema description / help_text, docs/usage.md, and the CHANGELOG. Closes nf-core#160.

Add an optional --canonical_gtf parameter that supplies a one-transcript-per-gene canonical GTF (e.g. MANE Select or Ensembl_canonical-filtered) as the annotation backbone for ORF calling (Ribo-TISH, Ribotricer, RiboCode), riboWaltz P-site calibration, plastid P-site quantification, in-frame P-site quantification and the translational-efficiency analysis. The full --gtf is still used unconditionally for genome-guided alignment (STAR / Salmon transcriptome generation). When --canonical_gtf is omitted, PREPARE_GENOME falls back to agat_sp_keep_longest_isoform on the full GTF and pipes the result through gffread, so the backbone path is always populated. Pulls in the agat/spkeeplongestisoform nf-core module for the fallback and aliases gffread (GFFREAD_CANONICAL) plus gunzip (GUNZIP_CANONICAL_GTF) inside PREPARE_GENOME. The new ch_canonical_gtf channel is emitted from PREPARE_GENOME and routed via the RIBOSEQ workflow take block; ch_gtf remains the alignment-side channel. Docs (usage / output) cover MANE Select for human, the Ensembl_canonical grep recipe for other Ensembl organisms, the AGAT fallback, and the MANE-vs-Ensembl_canonical trade-off for non-coding genes. [skip ci]

…site [skip ci]

…t [skip ci] # Conflicts: # CHANGELOG.md

The Nextflow 26.04 strict parser rejects `Math.floor()` in workflow script blocks; `intdiv(2)+1` is the parser-clean Groovy equivalent for strict-majority of N callers (N=2 -> 2, N=3 -> 2, N=4 -> 3).
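The strict-majority arithmetic can be checked with a small Python sketch. Python's `//` stands in for Groovy's `intdiv` here (the two agree for non-negative integers); the function name `strict_majority` is hypothetical.

```python
def strict_majority(n_callers: int) -> int:
    """Smallest number of agreeing callers that exceeds half of n_callers."""
    if n_callers < 1:
        raise ValueError("at least one ORF caller must be enabled")
    # Integer division avoids any floating-point floor() call.
    return n_callers // 2 + 1


for n in (2, 3, 4):
    print(f"N={n} -> {strict_majority(n)}")
```

With two callers the threshold equals the caller count, so agreement means both callers must report the ORF.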

…ip ci] `stageAs: 'NO_FILE'` does not produce a file when callers pass `[]` for the absent case, so `reference_gtf.name` was unsafe. A Groovy truthy check (`reference_gtf ? '-a ...' : ''`) handles both the empty-list and real-file cases correctly without an empty `-a` argument leaking through to ribotish.

…ed channel [skip ci] stageAs: 'NO_FILE' was symlinking real input GTFs as 'NO_FILE', losing the .gtf extension that ribotish predict's suffixType() parser needs. Drop the stageAs alias; with a plain optional path() and a Groovy truthy guard, the absent case stages nothing and the present case keeps its filename. Also add a dedicated 4-element ch_fasta_gtf_for_ribotish channel in workflows/riboseq/main.nf so the predict callers pass the new 4-tuple shape ([meta, fasta, gtf, []] for Phase 1; the 4th slot becomes the canonical backbone GTF when --extended_orf_analysis lands in Phase 2). ch_fasta_gtf stays 3-element for ribotricer prepareorfs.

…name collision [skip ci] When callers supply the same GTF as both the primary -g and optional -a input (e.g. test data that has only one chr20 GTF), Nextflow raises 'input file name collision'. Adding stageAs: 'secondary.gtf' renames the secondary input so the two inputs always land at distinct paths in the work directory. The .gtf extension is preserved so ribotish's suffixType parser still recognises it. Truthy check still distinguishes the absent ([]) and present cases.

nf-core#161 wired ch_canonical_gtf into QUANTIFY_STAR_SALMON and QUANTIFY_PSEUDO_TE, but the Salmon transcript fasta and Salmon indices in PREPARE_GENOME are built from the full GTF (transcripts.fa from RSEM_PREPAREREFERENCE on ch_gtf; SALMON_INDEX / SALMON_INDEX_TE on that transcript fasta). Inside the QUANTIFY_PSEUDO_ALIGNMENT subworkflow, the same gtf channel feeds CUSTOM_TX2GENE, so tx2gene was being derived from the canonical (subset) GTF while quant.sf contained the full transcriptome. tximport then produced gene-level rownames that included genes absent from tx2gene, and SE_GENE_UNIFIED's findColumnWithAllEntries(ids, rowdata=tx2gene) failed:
No column contains all vector entries ENSG00000000419, ENSG00000019186, ...
Per the canonical-backbone spec (issue_proposals/02_canonical_backbone.md), the full GTF is reserved for genome-guided alignment and transcriptome quantification; --canonical_gtf drives ORF calling, P-site calibration, plastid, riboWaltz, and (via the plastid_psite replacement step) DTE. ORF/P-site/plastid wiring on ch_canonical_gtf is unchanged.
For gene-level DTE consistency with the spec's canonical-backbone intent (spec section "Decision note"), follow-up: filter DTE inputs (anota2seq / deltaTE matrices) to canonical genes at the comparison step, or generate a canonical-restricted gene-level matrix downstream of QUANTIFY_STAR_SALMON. The plastid_psite TE mode already enforces canonical semantics by overwriting Ribo-seq columns with in-frame P-site counts from the canonical GTF; pseudo and default modes do not yet.

…lastid_psite [skip ci] Snapshots regenerated on the nf-dev VM with the Phase 1 changes applied (plastid_psite default, canonical_gtf backbone, run_ribotricer opt-in, ribotish/predict -a flag). All 6 pipeline tests and 6 ribotish module tests pass against the new snapshots.

…ndation [skip ci] # Conflicts: # CHANGELOG.md

…tersect [skip ci] Adds the modules needed to classify StringTie novel transcripts against the full reference (gffcompare), filter the annotated GTF to a user-configurable set of class codes (new local FILTER_GTF_CLASS_CODE module, default `u` = intergenic only), and optionally subtract an rRNA/repeat blacklist (bedtools/intersect, strand-aware via `-v -s`). Workflow wiring lands in follow-up commits. Refs nf-core#164. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…f bypass [skip ci] Refines the StringTie novel-transcript discovery path added in PR nf-core#158:
- Prefer RNA-seq BAMs for StringTie assembly; fall back to Ribo-seq BAMs with tightened parameters (new `--stringtie_ribo_fallback_args` default `-m 100 -c 5 -j 3 -f 0.05 -g 100`) and emit a runtime warning. Every published microprotein discovery workflow assembles from RNA-seq because Ribo-seq footprints produce fragmented assemblies and false-positive intergenic transcripts from P-site pileups; the fallback is provided only because RNA-seq input is optional in the riboseq pipeline.
- Add `--novel_gtf` to bypass StringTie and feed a user-supplied novel-transcript GTF into the same filtering chain. Useful when a curated annotation already exists (long-read RNA-seq, published ORF catalogues).
- After StringTie merge (or `--novel_gtf` load) run gffcompare against the full reference GTF, then filter to a user-configurable set of class codes via the new local FILTER_GTF_CLASS_CODE module (`--stringtie_class_codes`, default `u` = unambiguous intergenic; stranded users may add `x` for antisense overlap).
- Optional strand-aware rRNA/repeat blacklist intersect when `--rrna_blacklist` is supplied (`bedtools intersect -v -s`).
- Concatenate the canonical backbone (from nf-core#161) with the filtered novel GTF via the new local CONCAT_GTF module to produce `<outdir>/stringtie/hybrid_reference.gtf`. Exposed as the `hybrid_gtf` workflow emit channel for downstream ORF-caller wiring (subject of a separate issue). When both `--skip_stringtie` is true and `--novel_gtf` is null, the emit equals the canonical GTF so downstream wiring is uniform.
Refs nf-core#164.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lter, hybrid GTF) [skip ci]

… --extended_orf_analysis [skip ci] Add --extended_orf_analysis (default false) gating the hybrid GTF (canonical + filtered novel intergenic transcripts, from nf-core#164) into the genome-BAM ORF callers:
- Ribo-TISH predict: hybrid GTF on -g (discovery target); canonical backbone on -a (background + classification).
- Ribotricer prepare-orfs: hybrid GTF directly (no secondary-annotation concept; CDS-absent transcripts auto-labelled 'novel').
RiboCode, riboWaltz, plastid and the Salmon/pseudo-TE quantification subworkflows continue on the canonical backbone regardless of the flag. This is the transcriptome-BAM architectural constraint: the transcriptome BAM produced by STAR --quantMode TranscriptomeSAM is keyed to the reference transcriptome FASTA, and novel StringTie transcripts don't exist in that FASTA. Bringing RiboCode online for novel transcripts requires a second STAR pass against a hybrid transcriptome FASTA, tracked in nf-core#171.
When --extended_orf_analysis true is set but no novel-transcript source is configured (--skip_stringtie true and --novel_gtf null), the pipeline warns and falls back to canonical so users can compose flags incrementally.
Default behaviour is unchanged: with --extended_orf_analysis false the workflow graph is identical to the pre-nf-core#165 state.
Closes nf-core#165
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_orf_analysis [skip ci]

nf-core#161 + nf-core#165 built the reference-GTF channels with an empty meta map ([:]), which makes meta.id null. RIBOTRICER_PREPAREORFS uses ${meta.id} for its prefix and was publishing null_candidate_orfs.tsv. Phase 1 tests didn't surface this because they ran with the new run_ribotricer=false default; Phase 2's --extended_orf_analysis tests enable ribotricer to exercise the hybrid-GTF wiring. Use [id: 'reference'] so the published filename is reference_candidate_orfs.tsv regardless of canonical / hybrid origin.

StringTie can produce transcripts with strand '.' when coverage is too sparse or splice signal is ambiguous. Ribo-TISH predict raises KeyError: '.' on those because it only indexes '+' and '-' strands when building the CDS regions dictionary. Drop unstranded GTF entries here so the downstream hybrid GTF only contains stranded transcripts.
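A minimal Python sketch of the unstranded-record filter described above. It is illustrative only and does not reproduce the pipeline's module; the function name `drop_unstranded` is hypothetical, and comment lines are passed through unchanged as an assumption.

```python
def drop_unstranded(gtf_lines):
    """Keep only GTF records whose strand column (7th field) is '+' or '-'."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)  # header / comment lines carry no strand
            continue
        fields = line.rstrip("\n").split("\t")
        # Records with strand '.' (or malformed rows) are dropped.
        if len(fields) >= 7 and fields[6] in ("+", "-"):
            kept.append(line)
    return kept
```

Filtering on the strand column of every record, not only transcript records, keeps exons consistent with their parent transcript, since StringTie writes the same strand on both.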

…kip ci] Two new pipeline-level tests exercise Phase 2's extended ORF discovery:
- tests/novel_gtf.nf.test: --novel_gtf bypass path. Uses a small synthetic intergenic chr20 GTF (tests/data/novel_intergenic_chr20.gtf) to drive gffcompare class-u filter -> CONCAT_GTF -> hybrid GTF -> Ribo-TISH predict on hybrid (-g) + canonical (-a).
- tests/stringtie_extended.nf.test: --skip_stringtie false path. Exercises full BAM_STRINGTIE_MERGE on the test profile's RNA-seq samples through the same gffcompare/filter/concat chain.
Both run with --extended_orf_analysis true. Ribotricer is left at the default (--run_ribotricer false) because the chr20 test data is too sparse for Ribotricer's periodicity inference; that coverage gap is expected to be closed by -profile test_full at Phase 4.
Exempt tests/data/ from the top-level data/ ignore so committed test fixtures aren't dropped.

New local subworkflow BUILD_HYBRID_TRANSCRIPTOME that extracts spliced transcript sequences from the hybrid GTF (canonical + filtered novel intergenic) with gffread -w and rebuilds a STAR index against the same genome FASTA but with the hybrid GTF as --sjdbGTFfile. Emits the hybrid transcriptome FASTA and STAR index as separate channels for the second STAR alignment pass (issue nf-core#171).

…alysis [skip ci] Under --extended_orf_analysis true, run a second STAR alignment pass for Ribo-seq samples against the hybrid transcriptome (canonical + filtered novel intergenic) built by BUILD_HYBRID_TRANSCRIPTOME, then route the resulting hybrid transcriptome BAM and the hybrid GTF into RIBOCODE_GTFUPDATE / RIBOCODE_PREPARE / RIBOCODE_RIBOCODE in place of the canonical wiring. The reference STAR alignment, riboWaltz, plastid and Salmon-based quantification are unchanged. riboWaltz stays on the canonical reference transcriptome by design (nf-core#171 spec note: feeding it CDS-absent novel transcripts would degrade its diagnostic plots without contributing to ORF discovery). The default --extended_orf_analysis false path is unchanged.
- workflows/riboseq/main.nf: aliased FASTQ_ALIGN_STAR_HYBRID include; conditional second STAR pass on Ribo-seq reads only; conditional swap of RiboCode's transcriptome BAM + GTF inputs.
- conf/modules.config: withName blocks for the aliased hybrid STAR processes (GFFREAD_HYBRID_TRANSCRIPTOME, STAR_GENOMEGENERATE_HYBRID, FASTQ_ALIGN_STAR_HYBRID:STAR_ALIGN, and the hybrid BAM sort/stats chain), publishing under <outdir>/hybrid_star/.
- docs/usage.md, docs/output.md: document the second STAR pass, its cost, the rationale for riboWaltz staying on canonical, and the hybrid_star/ output layout.
- CHANGELOG.md: Added entry under v1.3.0dev / Added.

…RiboCode parity [skip ci]

…[skip ci] Wave container co-installs rpbp=4.0.1 and star=2.7.11b (prepare-rpbp-genome requires STAR on PATH). The three modules share that container; rpbp/buildconfig uses a coreutils container to render the YAML config from pipeline inputs so users don't have to author one. RUN_RPBP chains BUILDCONFIG -> PREPAREGENOME -> PREDICTORFS, broadcasting the shared index + config to per-sample predictions.

- New params run_rpbp (default false), extra_rpbp_preparegenome_args, extra_rpbp_predictorfs_args, rpbp_config_extra_yaml.
- Schema entries for all four params under the riboseq group.
- modules.config publishDir + ext.args for the three Rp-Bp processes; predicted ORFs land at <outdir>/orf_predictions/rpbp/.
- workflow gates RUN_RPBP on run_rpbp; appends 'rpbp' to the dynamic enabled_orf_callers list. Rp-Bp's Bayes factor is stable so it is retained in rank_aggregation_callers (only ribotricer is excluded).
- Same conditional canonical-vs-hybrid GTF selection as the other genome-BAM ORF callers (Ribo-TISH predict, Ribotricer).
- Runtime warning at workflow start when run_rpbp is true (~20-24h/rep).
- Docs (usage + output) and CHANGELOG entry.

…qs2) [skip ci] The coordinate-based merge in custom/orfmerge only groups genomically overlapping ORFs, so a micropeptide encoded at several distinct loci (typically repetitive regions) survives as separate catalogue rows. Following the GENCODE Ribo-seq ORF catalogue convention (Mudge et al. 2022, Nat Biotechnol, doi:10.1038/s41587-022-01369-0; gencode-riboseqORFs collapse_cutoff 0.9), the catalogue subworkflow now clusters peptides with mmseqs/easycluster (--min-seq-id 0.9) and a new custom/orfcollapse module folds each multi-member small-ORF (aa_length <= 100) cluster down to one representative, unioning cross-caller / cross-sample evidence and gene mappings. bedtools/getfasta in the catalogue now uses -split -s -nameOnly so the AA FASTA is the spliced, strand-correct sequence keyed by orf_id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ed snapshots Strip the (+)/(-) strand suffix bedtools/getfasta -nameOnly -s appends so AA FASTA headers and MMseqs2 cluster ids map to bare orf_ids. Replace the orfcollapse module test with self-contained fixtures that exercise the collapse. Exclude the non-deterministic hybrid-STAR MultiQC aggregate files from extended-ORF snapshots (tests/.nftignore_orf) and regenerate the dotseq, novel_gtf and stringtie_extended snapshots on x86 (verified deterministic). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
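The header clean-up can be sketched in Python as below. This is an assumption-laden illustration, not the pipeline's code: the function name `strip_strand_suffix` and the example ORF identifier are hypothetical.

```python
import re

# Matches a trailing "(+)" or "(-)" as appended by bedtools getfasta -s.
STRAND_SUFFIX = re.compile(r"\([+-]\)$")


def strip_strand_suffix(header: str) -> str:
    """Return a FASTA header line with any trailing strand suffix removed."""
    return STRAND_SUFFIX.sub("", header.rstrip())


def clean_fasta(lines):
    """Strip the suffix from header lines; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield strip_strand_suffix(line) if line.startswith(">") else line


print(list(clean_fasta([">orf_000001(+)\n", "ATGGCCTAA\n"])))
```

Anchoring the pattern at the end of the line leaves any parentheses elsewhere in an identifier untouched.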

…notate .nftignore
- Sort orfmerge clusters by genomic coordinate before assigning orf_ids so the catalogue (and downstream ORF P-site counts) are byte-deterministic, not dependent on .collect() arrival order.
- Set mmseqs/easycluster --min-seq-id 0.9 in the catalogue subworkflow config.
- Regenerate the dotseq/novel_gtf/stringtie_extended snapshots (verified deterministic across two independent x86 runs).
- Annotate tests/.nftignore with per-output exclusion reasons from a two-run determinism audit (timestamps, parallel-summation float jitter, multithreaded EM/R stochasticity, hybrid-STAR multimapper non-determinism).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
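Why sorting before ID assignment gives deterministic output can be shown with a short Python sketch. The cluster dictionaries, the `orf_%06d` identifier format and the function name `assign_orf_ids` are all hypothetical; only the idea of sorting by genomic coordinate first comes from the commit message.

```python
def assign_orf_ids(clusters):
    """Assign sequential IDs after sorting by a total order on coordinates.

    clusters: iterable of dicts with chrom, start, end and strand keys.
    """
    ordered = sorted(
        clusters,
        key=lambda c: (c["chrom"], c["start"], c["end"], c["strand"]),
    )
    # IDs now depend only on the coordinates, not on the input order.
    return {f"orf_{i:06d}": c for i, c in enumerate(ordered, start=1)}


a = {"chrom": "chr20", "start": 500, "end": 800, "strand": "+"}
b = {"chrom": "chr20", "start": 100, "end": 400, "strand": "-"}
print(assign_orf_ids([a, b]) == assign_orf_ids([b, a]))
```

The sort key has to distinguish every pair of clusters; if two clusters could share chromosome, start, end and strand, a further tie-breaker would be needed to keep the order stable across runs.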

…se modules Pull the merged nf-core/modules versions of orftable_fasta_gtf_buildorfcatalogue and its custom/orfmerge, custom/orfcollapse (pandas), mmseqs/easycluster and seqkit/translate components. Reconcile the pipeline wiring to the new interface: pass val_collapse=true (amino-acid small-ORF deduplication on by default), add SEQKIT_TRANSLATE --trim, and fix the CUSTOM_ORFCOLLAPSE publish pattern to match the emitted ${prefix}.fasta so the catalogue AA FASTA publishes. Regenerate the dotseq, novel_gtf and stringtie_extended catalogue snapshots. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…the reference transcriptome, not canonical); clarify AGAT longest-CDS, plastid stays canonical, one-transcript-per-gene not required

…t ORF agreement

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… entry

@suhrig

Flows up the nf-core#182 PR-review correction (reviewer @suhrig) that removed the Ribo-seq-BAM StringTie fallback. NOVEL_TRANSCRIPT_DISCOVERY now runs on RNA-seq BAMs only and errors when --skip_stringtie false is set with no RNA-seq input. The change also restores the strand-aware blacklist channel, renames --stringtie_class_codes back to --gffcompare_class_codes, and removes --stringtie_ribo_fallback_args. Keeps agg-truth in step with dev so the leaf reconciliation stays empty. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The Ribotricer and Rp-Bp runtime warnings, the usage.md caller notes, the nf-core#169 CHANGELOG entry and a main.nf comment cited an internal benchmark (FK/NGB, May 2026) with bare Spearman/Jaccard figures a community user can't act on or verify. Keep the actionable points (Ribotricer scores unstable -> binary calls only; Rp-Bp ~20-24h/replicate, stable Bayes-factor score retained for ranking) and drop the unpublishable provenance. Also corrects a RiboCode 'canonical wiring' comment to 'reference-transcriptome wiring'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…predict Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tParameters Param-only validation (novel-source no-op + plastid-skip no-op) belongs with genomeExistsError()/dotseqPrerequisitesError() in the pipeline-init utils subworkflow, not inline in the RIBOSEQ workflow body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…angelog Mirror of the cleanup applied on feat/171-second-star-pass (PR nf-core#184): remove PR/issue numbers from the second-pass / dispatch / hybrid-STAR comments and condense the nf-core#171 CHANGELOG entry. Strict-syntax (channel factory casing) deferred to a later pipeline-wide pass.

Route the hybrid second-pass transcriptome BAM through the same BAM_DEDUP_UMI subworkflow as the primary path so RiboCode sees unique molecules rather than PCR duplicates under --with_umi; hybrid dedup outputs publish under hybrid_star/. Also drop a stray double space in the RiboCode dispatch closures. Mirror of the fix on feat/171-second-star-pass (PR nf-core#184).

…text [skip ci]

…utput filenames + step list)
- schema run_rpbp help_text: drop the FK/NGB internal benchmark numbers (uncitable; mirrors the same removal already applied for Ribotricer/Rp-Bp log.warns and usage.md).
- docs/output.md: predicted-ORF outputs are published as *.predicted-orfs.{bed.gz,dna.fa,protein.fa} (the module prefix is predicted-orfs; the previous .filtered.* names do not exist).
- docs/usage.md: the subworkflow runs get-periodic-lengths-offsets, not get-all-read-filtering-counts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Rp-Bp builds its candidate ORF set per transcript isoform and deduplicates by genomic coordinates, then resolves overlaps itself (longest-per-stop, best Bayes factor). Feeding it the one-transcript-per-gene canonical backbone silently dropped isoform-specific ORFs and biased ORF-type labelling toward canonical CDS, with no compensating benefit. Route it to the full ch_gtf (default) like PRICE; extended mode keeps the hybrid GTF for novel transcripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…annotation Document, in the Rp-Bp and PRICE usage sections, that both callers receive the full --gtf rather than the canonical one-transcript-per-gene backbone: each enumerates/handles ORFs across all isoforms and resolves overlaps itself, so collapsing to canonical would drop isoform-specific ORFs and bias ORF-type labels. Cites Malone et al. 2017. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pinin4fjords · 2026-06-25T10:59:34Z

Three Rp-Bp/PRICE items surfaced during the #186 (Rp-Bp) alignment that are leaf/aggregation-level. They were deliberately left out of the component PRs, either to preserve leaf==agg-truth or because they touch already-merged components, so flagging them here to reconcile when finalising this branch.

PRICE annotation in extended mode (code/doc mismatch). orf_caller_dispatch feeds PRICE the full ch_gtf unconditionally (no extended_orf_active branch), but docs/usage.md and docs/output.md state PRICE switches to the hybrid GTF under --extended_orf_analysis. So PRICE currently cannot discover novel intergenic ORFs in extended mode despite the docs claiming it can, and it diverges from Rp-Bp, which does route to the hybrid GTF in extended mode. Decide PRICE's intended extended behaviour and make code + docs agree.
Ribotricer schema still carries the uncitable benchmark. nextflow_schema.json run_ribotricer help_text still has the internal FK/NGB, May 2026 numbers (mean Spearman 0.288 vs Jaccard 0.770) in both dev and agg-truth. The equivalent provenance was already stripped from the Rp-Bp schema, the runtime log.warns and usage.md. It was left in place because the param belongs to #178 (feat: demote Ribotricer to opt-in --run_ribotricer, merged to dev), and cleaning only agg-truth would break leaf==agg-truth; fix dev and agg-truth together.
Rp-Bp missing from CITATIONS.md. Rp-Bp is cited in prose (Malone et al. 2017) in the CHANGELOG and usage.md, but has no CITATIONS.md entry in dev or agg-truth. Add it alongside the existing Ribo-TISH / PRICE entries.
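
To make the first item concrete, here is a minimal Python sketch of the annotation routing as described above. It is not the pipeline's Groovy code: the function name `gtf_for`, the caller keys, and the `price_follows_docs` switch are hypothetical and exist only to contrast the current code path with the documented one.

```python
def gtf_for(caller: str, extended: bool, price_follows_docs: bool = False) -> str:
    """Return which annotation a caller receives: 'full' or 'hybrid'.

    Illustrative model only. 'full' stands for ch_gtf and 'hybrid' for the
    hybrid GTF used under --extended_orf_analysis.
    """
    if caller == "rpbp":
        # Rp-Bp: full GTF by default, hybrid GTF in extended mode.
        return "hybrid" if extended else "full"
    if caller == "price":
        # Current code: full GTF unconditionally.
        # Documented behaviour: hybrid GTF in extended mode.
        if extended and price_follows_docs:
            return "hybrid"
        return "full"
    raise ValueError(f"unknown caller: {caller}")


# The mismatch: in extended mode PRICE and Rp-Bp currently diverge.
print(gtf_for("rpbp", extended=True))
print(gtf_for("price", extended=True))
print(gtf_for("price", extended=True, price_follows_docs=True))
```

Whichever behaviour is chosen for PRICE, the reconciliation amounts to making the code path and the docs select the same branch.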

custom/orfmerge gains --min-callers / --min-samples (default 1, no filtering) and emits an additional consensus catalogue view (*.consensus.{bed12,tsv,orf_to_gene.tsv}); published under <outdir>/orf_catalogue/consensus/. The peptide-level smORF collapse is behind --skip_orf_collapse (default off). orfcollapse docstring corrected to state the smORF-only / locus-agnostic / global-identity departures from the GENCODE reference rather than implying equivalence. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
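
One plausible reading of the --min-callers / --min-samples thresholds, sketched in Python. The record layout `(orf_id, caller, sample)` and the function name `consensus_orfs` are assumptions for illustration, not the orfmerge template's data model, and counting distinct callers and distinct samples per ORF is an assumed interpretation of the two flags.

```python
from collections import defaultdict


def consensus_orfs(calls, min_callers=1, min_samples=1):
    """Return sorted ORF ids supported by enough distinct callers and samples.

    `calls` is an iterable of (orf_id, caller, sample) tuples (hypothetical layout).
    With both thresholds at 1, every observed ORF passes, i.e. no filtering.
    """
    callers = defaultdict(set)
    samples = defaultdict(set)
    for orf_id, caller, sample in calls:
        callers[orf_id].add(caller)
        samples[orf_id].add(sample)
    return sorted(
        orf_id
        for orf_id in callers
        if len(callers[orf_id]) >= min_callers and len(samples[orf_id]) >= min_samples
    )


calls = [
    ("orf1", "ribotish", "s1"),
    ("orf1", "ribocode", "s1"),
    ("orf1", "ribotish", "s2"),
    ("orf2", "ribotish", "s1"),
]
print(consensus_orfs(calls))                 # defaults: no filtering
print(consensus_orfs(calls, min_callers=2))  # keeps only ORFs seen by two or more callers
```

Under this reading the defaults reproduce the full catalogue, which matches the stated "default 1, no filtering".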

novel_gtf, stringtie_extended and dotseq now publish the consensus ORF catalogue (orf_catalogue/consensus/) alongside the full catalogue. Regenerated on x86_64 (NXF 25.04.8, nf-test 0.9.3, --profile=+docker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…onsensus-view code Match the vendored orfmerge template (two-pass catalogue write), its regenerated module snapshot, and the orftable subworkflow (consensus emits + regenerated snapshot) to the upstream nf-core/modules implementation. Output is byte-identical; pipeline snapshots are unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Sync custom/orfcollapse + the orftable subworkflow to the consensus-after-collapse implementation, and route orf_catalogue/consensus/ from CUSTOM_ORFCOLLAPSE when collapse runs, falling back to CUSTOM_ORFMERGE under --skip_orf_collapse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Upstream module is CUSTOM_BED12CODONPOSITIONS (not ORF_INFRAME_PSITES); the counts TSV is three-column (sample_id, orf_id, count); versions are emitted on the versions topic channel, not a versions.yml file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
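
For downstream use, the long-format counts TSV can be pivoted into an ORF-by-sample matrix. The sketch below is a minimal Python illustration; it assumes the file has a header row naming the three columns (sample_id, orf_id, count) and that a missing ORF/sample pair means zero, neither of which is confirmed here. The function name `pivot_counts` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def pivot_counts(tsv_text):
    """Pivot long-format (sample_id, orf_id, count) rows into a matrix.

    Returns (samples, matrix): `samples` is the sorted list of sample ids and
    `matrix` maps each orf_id to a list of counts in that sample order.
    Assumes a header row; absent pairs are filled with 0 (an assumption).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    by_orf = defaultdict(dict)
    samples = set()
    for row in reader:
        by_orf[row["orf_id"]][row["sample_id"]] = int(row["count"])
        samples.add(row["sample_id"])
    ordered = sorted(samples)
    matrix = {
        orf_id: [counts.get(sample, 0) for sample in ordered]
        for orf_id, counts in sorted(by_orf.items())
    }
    return ordered, matrix


example = "sample_id\torf_id\tcount\ns1\torfA\t5\ns2\torfA\t7\ns1\torfB\t2\n"
samples, matrix = pivot_counts(example)
print(samples)
print(matrix)
```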

pinin4fjords and others added 30 commits May 1, 2026 15:31

Merge nf-core#160: flip default te_quantification_method to plastid_psite [skip ci]

898db55

Merge ribotish/predict -a flag enhancement [skip ci]

284c323

Merge nf-core#161: canonical annotation backbone [skip ci]

97876c5

Merge nf-core#163: demote Ribotricer to opt-in with dynamic caller set [skip ci]

00fda88

# Conflicts: # CHANGELOG.md

fix(workflow): use intdiv() for ORF-agreement majority calc [skip ci]

7604e85

The Nextflow 26.04 strict parser rejects `Math.floor()` in workflow script blocks; `intdiv(2)+1` is the parser-clean Groovy equivalent for strict-majority of N callers (N=2 -> 2, N=3 -> 2, N=4 -> 3).
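
The strict-majority arithmetic can be checked with a minimal Python sketch. The pipeline itself uses Groovy's `intdiv`; Python's `//` plays the same role for positive integers, and the function name here is hypothetical.

```python
import math


def strict_majority_threshold(n_callers: int) -> int:
    """Smallest caller count that is strictly more than half of n_callers."""
    if n_callers < 1:
        raise ValueError("need at least one caller")
    return n_callers // 2 + 1


for n in (2, 3, 4):
    # Integer division and floor(n / 2) + 1 agree for positive n.
    assert strict_majority_threshold(n) == math.floor(n / 2) + 1
    print(n, "->", strict_majority_threshold(n))
```

This prints the same mapping as the commit message: 2 -> 2, 3 -> 2, 4 -> 3.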

Merge feat/stringtie-novel-transcripts (PR nf-core#158) - Phase 2 foundation [skip ci]

180b95c

# Conflicts: # CHANGELOG.md

Merge nf-core#164: StringTie refinements (RNA-seq pref, gffcompare filter, hybrid GTF) [skip ci]

f713c32

Merge nf-core#165: wire hybrid GTF into ORF callers behind --extended_orf_analysis [skip ci]

92a5487

Merge nf-core#171: second STAR pass against hybrid transcriptome for RiboCode parity [skip ci]

6c29eff

pinin4fjords and others added 2 commits June 15, 2026 23:32

pinin4fjords mentioned this pull request Jun 16, 2026

Move STAR --runRNGseed 0 out of production config into test config #190

Open

pinin4fjords and others added 16 commits June 16, 2026 12:22

docs: correct canonical-GTF consumers (RiboCode/riboWaltz/Salmon use the reference transcriptome, not canonical); clarify AGAT longest-CDS, plastid stays canonical, one-transcript-per-gene not required

7ffc83d

docs(usage): document opt-in Ribotricer and precision-weighted default ORF agreement

7434018

docs: changelog nf-core#162 ribotish quality note; drop stale issue ref

b394975

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs: add nf-core#179 ribotish/predict secondary-annotation changelog entry

e5b7c61

docs(orf-dispatch): note the empty-payload -a contract for Ribo-TISH predict

a3da9ad

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix: add GEDI/PRICE citation, improve PRICE log.warn and schema help_text [skip ci]

40de999

pinin4fjords and others added 10 commits June 25, 2026 15:17

docs(changelog): trim entries to one-liners and dedupe granular bullets

8f6a3a6

Collapse the per-aspect nf-core#164/nf-core#154 bullets to one line per issue and trim the verbose modernisation entries to single-line summaries. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore(modules): sync orftable subworkflow snapshot (.catalogue.aa) with upstream

804f759

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore(modules): re-pin orf catalogue components to the nf-core/modules #12167 merge

029c809

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs: consensus view is filtered after the smORF collapse

0ebaf45

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

test: regenerate catalogue snapshots for the post-collapse consensus

ab14add

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pinin4fjords wants to merge 182 commits into nf-core:dev from pinin4fjords:feat/modernisation-aggregation
