
feat: cross-sample ORF catalogue #187

Merged: pinin4fjords merged 19 commits into dev from feat/167-orf-catalogue on Jun 26, 2026

pinin4fjords commented May 22, 2026

Summary

Builds a cohort-level cross-caller ORF catalogue under --extended_orf_analysis true. Each enabled caller's per-sample output (Ribo-TISH, RiboCode, Ribotricer, Rp-Bp, PRICE) is normalised to a unified BED12 + sidecar TSV. The normalised calls are then merged with a class-aware strategy (transcript-ID grouping for annotated multi-exon CDS; 80% reciprocal overlap for single-exon novel intergenic ORFs and smORFs ≤ 100 aa), recording cross-caller (called_by_* / score_*) and cross-sample (n_samples) evidence per ORF.
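
To make the 80% reciprocal-overlap rule concrete, here is a minimal Python sketch. It is not the custom/orfmerge implementation: the function names, the tuple layout, the half-open coordinates and the same-chromosome / same-strand requirement are all assumptions made for illustration.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions for half-open intervals [start, end)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    len_a = a_end - a_start
    len_b = b_end - b_start
    if len_a <= 0 or len_b <= 0:
        return 0.0
    # "Reciprocal" means the overlap must cover enough of BOTH intervals,
    # so the limiting value is the smaller fraction.
    return min(overlap / len_a, overlap / len_b)


def same_orf(a, b, threshold=0.8):
    """a and b are hypothetical (chrom, start, end, strand) tuples."""
    if a[0] != b[0] or a[3] != b[3]:
        # Assumption: calls on different chromosomes or strands never merge.
        return False
    return reciprocal_overlap(a[1], a[2], b[1], b[2]) >= threshold


call_a = ("chr1", 100, 200, "+")
call_b = ("chr1", 110, 205, "+")   # overlaps 90 bp: 90% of a, about 95% of b
call_c = ("chr1", 150, 300, "+")   # overlaps 50 bp: 50% of a, about 33% of c
```

Under these assumptions call_a and call_b would be treated as one ORF, while call_a and call_c would stay separate.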

smORFs are then peptide-level deduplicated with MMseqs2 (--min-seq-id 0.9 -c 0.8), folding micropeptides encoded at multiple loci down to one representative (GENCODE Ribo-seq ORF catalogue convention, Mudge et al. 2022); opt out with --skip_orf_collapse.
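
As a rough illustration of how the clustering thresholds above map onto an MMseqs2 call, the sketch below assembles an easy-cluster command line in Python. The helper name and the file names are hypothetical, and the real mmseqs/easycluster module may pass further arguments; only the two flags quoted above come from this PR.

```python
import shlex


def mmseqs_collapse_command(fasta, out_prefix, tmp_dir,
                            min_seq_id=0.9, coverage=0.8):
    """Build (but do not run) an MMseqs2 easy-cluster argument list."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),   # minimum sequence identity
        "-c", str(coverage),               # minimum alignment coverage
    ]


# Hypothetical input and output names, for illustration only.
cmd = mmseqs_collapse_command("smorfs.faa", "smorf_clusters", "tmp")
print(shlex.join(cmd))
```

The list could be handed to subprocess.run if MMseqs2 is installed; printing it first makes the thresholds easy to check.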

Outputs land under <outdir>/orf_catalogue/: *.catalogue.{bed12,tsv,orf_to_gene.tsv,fasta} plus a MultiQC custom-content per-class count table.

Consensus view

A consensus view is published under <outdir>/orf_catalogue/consensus/, filtered to ORFs supported by at least --orf_min_callers distinct callers and recurring in at least --orf_min_samples samples (both default 1, i.e. no filtering, so it equals the full catalogue out of the box). The filter runs after the smORF collapse, so it is the high-confidence subset of the de-redundified catalogue and a micropeptide folded across loci is judged on its combined cross-caller / cross-sample evidence. The full unfiltered catalogue is always published regardless; raising either threshold (e.g. --orf_min_callers 2) gives a consensus catalogue that tames downstream ORF-level multiple testing.
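
The consensus filter described above can be sketched in a few lines of Python. This is an illustration only: OrfRecord and its fields are hypothetical stand-ins for the catalogue's called_by_* and n_samples columns, not the data model used by custom/orfmerge or custom/orfcollapse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfRecord:
    orf_id: str
    callers: frozenset   # distinct callers supporting the ORF
    n_samples: int       # number of samples in which the ORF recurs


def consensus(records, min_callers=1, min_samples=1):
    """Keep ORFs meeting both thresholds; the defaults of 1 keep everything."""
    return [
        r for r in records
        if len(r.callers) >= min_callers and r.n_samples >= min_samples
    ]


# Hypothetical catalogue entries.
catalogue = [
    OrfRecord("orf_a", frozenset({"ribotish", "ribocode"}), 3),
    OrfRecord("orf_b", frozenset({"price"}), 1),
    OrfRecord("orf_c", frozenset({"ribotricer"}), 2),
]

full_view = consensus(catalogue)                   # identical to the catalogue
two_callers = consensus(catalogue, min_callers=2)  # only orf_a survives
```

With both thresholds at 1 the consensus view equals the full catalogue, matching the out-of-the-box behaviour described above; raising either threshold yields a subset.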

Components

Built from upstream nf-core/modules components (all pinned to master): the orftable_fasta_gtf_buildorfcatalogue subworkflow plus custom/orfnormalise, custom/orfmerge, custom/orfcollapse, mmseqs/easycluster, bedtools/getfasta and seqkit/translate.

The catalogue runs once per pipeline invocation, gated on --extended_orf_analysis true and a non-empty enabled-caller set; the default-off path is unchanged.

Closes #167

🤖 Generated with Claude Code

Gather per-sample, per-caller ORF predictions (Ribo-TISH, RiboCode,
Ribotricer, Rp-Bp, PRICE), normalise each to a unified BED12 + sidecar
TSV, then merge into a cohort-level catalogue with a class-aware
strategy (transcript-ID grouping for annotated multi-exon CDS, 80%
reciprocal overlap for single-exon novel intergenic and smORFs).
Emits orf_catalogue.{bed12,tsv}, orf_to_gene.tsv, and an AA FASTA
under <outdir>/orf_catalogue/, plus a MultiQC custom-content per-class
count table.

Implementation uses the upstream orftable_fasta_gtf_buildorfcatalogue
subworkflow (nf-core/modules#11740): CUSTOM_ORFNORMALISE per caller,
CUSTOM_ORFMERGE for cohort-level merge, BEDTOOLS_GETFASTA +
SEQKIT_TRANSLATE to produce the catalogue AA FASTA.

Per-caller prediction channels (ch_*_predictions) default to
Channel.empty() and are overridden inside each caller's if-block,
gating the catalogue invocation on extended_orf_active +
at-least-one-caller.

modules.json currently pins custom/orfnormalise, custom/orfmerge,
and the orftable_fasta_gtf_buildorfcatalogue subworkflow to
nf-core/modules#11740 (branch custom-orf-catalogue, sha 6597190c).
Once #11740 merges, run nf-core modules update / subworkflows update
to swap pins to master.

@nf-core-bot

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Base automatically changed from feat/169-rpbp to dev June 25, 2026 12:47
pinin4fjords and others added 5 commits June 25, 2026 14:20
Rebuild the cross-sample ORF catalogue slice on the reconciled dev base
(ORF_CALLER_DISPATCH structure). The catalogue's per-caller prediction
channels now come from ORF_CALLER_DISPATCH.out.* rather than the inline
caller blocks of the original branch base.

Catalogue components taken from the upstream-pinned form: custom/orfnormalise,
custom/orfmerge, custom/orfcollapse, mmseqs/easycluster, bedtools/getfasta,
seqkit/translate and the orftable_fasta_gtf_buildorfcatalogue subworkflow, all
pinned to nf-core/modules master.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

custom/orfmerge gains --min-callers / --min-samples (default 1, no
filtering) and emits an additional consensus catalogue view
(*.consensus.{bed12,tsv,orf_to_gene.tsv}) restricted to ORFs supported by
at least that many distinct callers and recurring in at least that many
samples. The full unfiltered catalogue is still published as before; the
consensus view is published under <outdir>/orf_catalogue/consensus/.
Surfaced via --orf_min_callers / --orf_min_samples; 2+ gives a consensus
catalogue that tames downstream ORF-level multiple testing.

The peptide-level smORF collapse is now behind --skip_orf_collapse
(default off, so collapse runs as before), wired to the catalogue
subworkflow's val_collapse argument.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The smORF-only restriction and locus-agnostic amino-acid clustering are
this pipeline's design choices, not properties of the GENCODE Ribo-seq
ORF consolidation. MMseqs2 global identity (--min-seq-id 0.9) approximates
rather than reproduces GENCODE's longest-shared-substring / P-site-overlap
collapse_cutoff 0.9. Docstring reworded to state both departures explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a --min-samples 2 case asserting the consensus catalogue is a
non-empty strict subset of the full catalogue and every retained ORF
meets the recurrence threshold, while the full catalogue is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

novel_gtf and stringtie_extended now produce the cross-sample ORF
catalogue (orf_catalogue/, including the consensus/ view and the
normalised/ per-caller inputs) under --extended_orf_analysis true.
Regenerated on x86_64 (NXF 25.04.8, nf-test 0.9.3, --profile=+docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions Bot commented Jun 25, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 9e6f6ef

  • ✅ 285 tests passed
  • ❔ 6 tests were ignored
  • ❗ 5 tests had warnings

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_if_empty_null - ifEmpty(null) found in /home/runner/work/riboseq/riboseq/subworkflows/local/prepare_genome/main.nf: versions = ch_versions.ifEmpty(null) // channel: [ versions.yml ]
  • schema_lint - Input mimetype is missing or empty

❔ Tests ignored:

  • nextflow_config - Config default ignored: params.ribo_database_manifest
  • nf_test_content - nf_test_content
  • files_unchanged - File ignored due to lint config: assets/nf-core-riboseq_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-riboseq_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-riboseq_logo_dark.png
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore

✅ Tests passed:

Run details

  • nf-core/tools version 3.5.1
  • Run at 2026-06-25 21:12:30

pinin4fjords and others added 6 commits June 25, 2026 15:23
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…onsensus-view code [skip ci]

Match the vendored orfmerge template (two-pass catalogue write), its
regenerated module snapshot, and the orftable subworkflow (consensus emits
+ regenerated snapshot) to the upstream nf-core/modules implementation, so
the eventual re-pin is a modules.json sha bump with no file changes. Output
is byte-identical, so the pipeline snapshots are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…th upstream [skip ci]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Sync custom/orfcollapse + the orftable subworkflow to the upstream
consensus-after-collapse implementation, and route the published
orf_catalogue/consensus/ from CUSTOM_ORFCOLLAPSE (the de-redundified
catalogue) when collapse runs, falling back to CUSTOM_ORFMERGE when
--skip_orf_collapse is set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s #12167 merge

custom/orfmerge, custom/orfcollapse and the orftable_fasta_gtf_buildorfcatalogue
subworkflow now match their pinned sha (76e959312e), clearing the module_changes
divergence carried while #12167 was in review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pinin4fjords marked this pull request as ready for review June 25, 2026 21:05

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pinin4fjords requested review from iraiosub and suhrig June 25, 2026 21:10
pinin4fjords merged commit 6ee9885 into dev Jun 26, 2026
35 checks passed
pinin4fjords deleted the feat/167-orf-catalogue branch June 26, 2026 08:14