diff --git a/CHANGELOG.md b/CHANGELOG.md index 661300d5..fe3a834b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -25,6 +25,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - [#165](https://github.com/nf-core/riboseq/issues/165) - Add `--extended_orf_analysis` (default `false`) which routes the hybrid GTF into the genome-BAM ORF callers (Ribo-TISH `predict`, Ribotricer `prepare-orfs`) so novel intergenic ORFs are within scope. RiboCode and transcriptome-BAM consumers stay on the canonical backbone. When the flag is set without a novel-transcript source, the pipeline warns and falls back to canonical ([@pinin4fjords](https://github.com/pinin4fjords)) - [#171](https://github.com/nf-core/riboseq/issues/171) - Under `--extended_orf_analysis true`, run a second STAR pass for Ribo-seq samples against the hybrid transcriptome so RiboCode can call ORFs on novel transcripts; RiboCode then consumes the hybrid transcriptome BAM and hybrid GTF. Outputs land under `/hybrid_star/`; the default-off path is unchanged ([@pinin4fjords](https://github.com/pinin4fjords)) - [#170](https://github.com/nf-core/riboseq/issues/170) - Add PRICE (Erhard et al. 2018) as an opt-in ORF caller via `--run_price` (default `false`). Invoked one-shot across the riboseq cohort (`gedi -e IndexGenome` + `gedi -e Price`); calls flow into the cross-caller ORF catalogue. Container via `bioconda::gedi=1.0.6a` ([@pinin4fjords](https://github.com/pinin4fjords)) +- [#169](https://github.com/nf-core/riboseq/issues/169) - Add Rp-Bp (Malone et al. 2017) as an opt-in ORF caller via `--run_rpbp`. Implemented via the upstream `fasta_gtf_bam_rpbp` subworkflow (per-tool processes for extract-metagene-profiles, estimate-metagene-profile-bayes-factors, select-periodic-offsets, extract-orf-profiles, estimate-orf-bayes-factors, select-final-prediction-set, plus a shared prepare-genome). Honours `--extended_orf_analysis` by feeding the hybrid GTF when active. 
Per-sample predicted-ORF BED + DNA + protein FASTA under `/orf_predictions/rpbp/`. Expect ~20-24h per replicate at genome-wide scale; pipeline emits a runtime warning when enabled ([@pinin4fjords](https://github.com/pinin4fjords)) ### `Fixed` @@ -82,6 +83,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | | `--run_price` | | | `--extra_price_indexgenome_args` | | | `--extra_price_price_args` | +| | `--run_rpbp` | +| | `--extra_rpbp_preparegenome_args` | +| | `--extra_rpbp_predictorfs_args` | ### `Dependencies` diff --git a/conf/modules.config b/conf/modules.config index 79104e23..1944c30c 100644 --- a/conf/modules.config +++ b/conf/modules.config @@ -1266,4 +1266,20 @@ process { ] } + withName: 'RPBP_PREPAREGENOME' { + ext.args = { params.extra_rpbp_preparegenome_args ?: '' } + } + withName: 'RPBP_ESTIMATEMETAGENEBAYESFACTORS|RPBP_ESTIMATEORFBAYESFACTORS' { + // Stan MCMC, 20-24h per replicate at genome scale; first-attempt headroom. + time = { 30.h * task.attempt } + } + withName: '.*:FASTA_GTF_BAM_RPBP:RPBP_SELECTFINALPREDICTIONSET' { + ext.args = { "--select-longest-by-stop --select-best-overlapping ${params.extra_rpbp_predictorfs_args ?: ''}".trim() } + publishDir = [ + path: { "${params.outdir}/orf_predictions/rpbp" }, + mode: params.publish_dir_mode, + pattern: "*.predicted-orfs.{bed.gz,dna.fa,protein.fa}" + ] + } + } diff --git a/docs/output.md b/docs/output.md index 747cee5e..72736583 100644 --- a/docs/output.md +++ b/docs/output.md @@ -37,6 +37,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [Ribo-TISH predict](#ribo-tish-predict) - [Ribotricer detect-orfs](#ribotricer-detect-orfs) - [RiboCode](#ribocode) + - [Rp-Bp](#rp-bp) + - [PRICE](#price) - [P-site identification](#p-site-identification) - [riboWaltz](#ribowaltz) - [plastid](#plastid) @@ -441,6 +443,20 @@ The `-f0_percent`, `-pv1`, and `-pv2` parameters belong to the **metaplots** ste If RiboCode is not needed for your 
analysis, you can skip it entirely with `--skip_ribocode`. +### Rp-Bp + +<details markdown="1">
+<summary>Output files</summary> + +- `orf_predictions/rpbp/` + - `*.predicted-orfs.bed.gz`: per-sample predicted-ORF BED with Bayes factor scores (column 5) after the final-prediction-set filter (`--select-longest-by-stop --select-best-overlapping`). + - `*.predicted-orfs.dna.fa`: per-sample predicted-ORF nucleotide FASTA matching the BED. + - `*.predicted-orfs.protein.fa`: per-sample predicted-ORF protein FASTA matching the BED. + +</details>
+ +Produced only when `--run_rpbp true` is set. Rp-Bp's Bayesian fit is slow (~20-24h per replicate at genome-wide scale); see [Rp-Bp in usage.md](usage.md#rp-bp-opt-in-overnight). Rp-Bp's Bayes factor is stable across replicates and is retained in the cross-caller rank-aggregation set. When `--extended_orf_analysis true` is set, Rp-Bp consumes the hybrid GTF and so reports novel intergenic ORFs alongside annotated ones. + ### PRICE
diff --git a/docs/usage.md b/docs/usage.md index a0746f23..4fe8d8c0 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -334,6 +334,22 @@ By default the pipeline calls ORFs with two tools, Ribo-TISH `predict` and RiboC Ribotricer is available as a third caller but is off by default. Enable it with `--run_ribotricer true` for broader recall, after which an ORF is agreed on a majority vote (2 of 3). It is opt-in because its ORF-score column is unstable across biological replicates even though its binary call set is reproducible. When enabled, its binary calls count toward agreement but its score is excluded from cross-caller rank aggregation, and the pipeline warns at runtime. +### Rp-Bp (opt-in, overnight) + +[Rp-Bp](https://github.com/dieterich-lab/rp-bp) (Malone et al., 2017) is a Bayesian-strict ORF caller that complements RiboCode's permissive canonical-CDS calls. It is the recommended second caller when statistical rigour matters more than turnaround time. Activate with `--run_rpbp true`. + +> :warning: **Runtime cost.** Rp-Bp's Bayesian MCMC fit dominates wall-clock and takes roughly **20-24 hours per replicate** at genome-wide scale. The pipeline emits a runtime warning when `--run_rpbp` is set. Plan compute time, queue limits and instance lifetimes accordingly. + +Rp-Bp's score column (Bayes factor) is stable and is retained in the cross-caller rank-aggregation set alongside RiboCode and Ribo-TISH; Ribotricer's score column is excluded due to known instability but Rp-Bp's is not. + +Rp-Bp runs through the upstream `nf-core/rpbp/*` modules driven by the `FASTA_GTF_BAM_RPBP` nf-core subworkflow, which orchestrates `prepare-rpbp-genome`, `extract-metagene-profiles`, `estimate-metagene-profile-bayes-factors`, `select-periodic-offsets`, `get-periodic-lengths-offsets`, `extract-orf-profiles`, `estimate-orf-bayes-factors` and `select-final-prediction-set` from your `--fasta` / `--gtf` inputs without you having to author a YAML config. 
Tool CLI overrides are exposed via `--extra_rpbp_preparegenome_args` and `--extra_rpbp_predictorfs_args`. + +Per-sample final-prediction outputs - filtered BED of predicted ORFs (with Bayes factor in column 5), plus matched nucleotide and protein FASTAs - are published under `/orf_predictions/rpbp/`. + +**Annotation.** Rp-Bp is given the full multi-isoform `--gtf` annotation, not the one-transcript-per-gene canonical backbone that the pipeline uses elsewhere to disambiguate P-site quantification. Rp-Bp enumerates candidate ORFs across every transcript isoform (deduplicating identical ORFs by genomic coordinate) and then resolves redundant and overlapping ORFs itself - the longest ORF per stop codon, then the highest Bayes factor among overlaps. Collapsing the annotation to one isoform per gene would silently remove ORFs that exist only on non-canonical isoforms (alternative-5'UTR uORFs, isoform-specific N-terminal extensions or truncations, retained-intron and alternative-exon ORFs) and bias the reported ORF types toward canonical CDS, with no compensating benefit; PRICE is handled the same way and for the same reason ([Malone et al., 2017](https://academic.oup.com/nar/article/45/6/2960/2953491)). Under `--extended_orf_analysis true` Rp-Bp instead receives the hybrid GTF, so novel transcripts are within discovery scope in the same way as Ribo-TISH `predict` and Ribotricer. + +> :information_source: **STAR alignment params vs upstream rpbp.** rpbp's own pipeline runs STAR with Ribo-seq-tuned settings (`outFilterMismatchNmax 1`, `outFilterMismatchNoverLmax 0.04`, `outFilterType BySJout`, `sjdbOverhang 33`, `winAnchorMultimapNmax 100`, `seedSearchStartLmaxOverLread 0.5`). We use the pipeline's standard STAR alignment (shared with the RNA-seq side of paired runs), which is more permissive. Practical impact: rpbp processes whatever alignments it gets, but periodicity / Bayes-factor distributions will differ from a standalone rpbp run on the same FASTQs. 
If you need bit-identical-to-standalone-rpbp output, override with `--extra_star_align_args '--outFilterMismatchNmax 1 --outFilterMismatchNoverLmax 0.04 --outFilterType BySJout --winAnchorMultimapNmax 100 --seedSearchStartLmaxOverLread 0.5'`. Note that `sjdbOverhang` is baked into the STAR index and cannot be changed post-hoc - it would require regenerating the index with `--sjdbOverhang 33`, and that change would only be appropriate for a Ribo-seq-only run (RNA-seq reads are too long for that setting). Tracked for future work: [#173](https://github.com/nf-core/riboseq/issues/173). + ### PRICE (opt-in) [PRICE](https://github.com/erhard-lab/gedi/wiki/Price) (Erhard et al., 2018) is a Bayesian ORF caller distributed as part of the [Gedi](https://github.com/erhard-lab/gedi) Java framework. Unlike the per-sample callers, PRICE estimates a shared codon-position model across the riboseq cohort by EM and is invoked one-shot rather than per-sample. Activate with `--run_price true`. @@ -342,6 +358,8 @@ Ribotricer is available as a third caller but is off by default. Enable it with The pipeline builds a binary `.oml` genome index via `gedi -e IndexGenome` once per run, then calls PRICE once across the cohort with the index plus the riboseq BAMs. PRICE's primary output is `${prefix}.orfs.tsv`, a table of all called ORFs with start-codon score, range score, p-value (uncorrected) and per-condition / total read counts. Tool CLI arguments can be appended via `--extra_price_indexgenome_args` and `--extra_price_price_args`. +**Annotation.** Like Rp-Bp, PRICE is given the full multi-isoform `--gtf` annotation rather than the one-transcript-per-gene canonical backbone: it resolves overlapping ORFs and rescues multimappers with its own EM, so restricting it to a single isoform per gene would only narrow ORF discovery and bias ORF-type classification toward canonical CDS. 
+ When `--extended_orf_analysis true` is set, PRICE's IndexGenome receives the hybrid GTF so ORFs on novel intergenic transcripts are within its discovery scope. PRICE's CLI banner reports `Price version 1.0.4` while the Bioconda package is `gedi 1.0.6a` (Price is one tool inside the Gedi umbrella). The pipeline captures the package version via `gedi -e Version` for `versions.yml`. diff --git a/modules.json b/modules.json index 7bc5d64a..0e66a382 100644 --- a/modules.json +++ b/modules.json @@ -200,6 +200,46 @@ "git_sha": "b59f74e059a49fce82f19fbf684e2876da85ee39", "installed_by": ["modules"] }, + "rpbp/estimatemetagenebayesfactors": { + "branch": "master", + "git_sha": "617c552da369f63371648679983483736e52f3b8", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/estimateorfbayesfactors": { + "branch": "master", + "git_sha": "bbd642353cae3464d405ab6ae7366532648164e7", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/extractmetageneprofiles": { + "branch": "master", + "git_sha": "df05573454925ff87ff6ea6c4afaef70c64b7248", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/extractorfprofiles": { + "branch": "master", + "git_sha": "ad585e77451e4638e876bcf895185f0ac7f85fae", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/getperiodiclengthsoffsets": { + "branch": "master", + "git_sha": "caba7d6afa153b377581449609109fcd43775cab", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/preparegenome": { + "branch": "master", + "git_sha": "50b83e9601f411ef670b67b205db78e707a88f01", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/selectfinalpredictionset": { + "branch": "master", + "git_sha": "c2b3bb2252858a79872a27c7918cdb4b2a43f778", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, + "rpbp/selectperiodicoffsets": { + "branch": "master", + "git_sha": "2f2f5662c0719221637811c6a6e443ec3414e0df", + "installed_by": ["fasta_gtf_bam_rpbp", "modules"] + }, "rsem/preparereference": { 
"branch": "master", "git_sha": "004e773fc35ebd24063ca4cbef057c94a24208aa", @@ -377,6 +417,11 @@ "git_sha": "2fc6aef2691483864904e31973ccafd2ed68fd56", "installed_by": ["subworkflows"] }, + "fasta_gtf_bam_rpbp": { + "branch": "master", + "git_sha": "2111854cad9111c2b8057a6e050e0656d41acf32", + "installed_by": ["subworkflows"] + }, "fastq_align_star": { "branch": "master", "git_sha": "cebe21bbd158c15c8fab172e37cfe97a239f4b77", diff --git a/modules/nf-core/rpbp/estimatemetagenebayesfactors/environment.yml b/modules/nf-core/rpbp/estimatemetagenebayesfactors/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/estimatemetagenebayesfactors/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/estimatemetagenebayesfactors/main.nf b/modules/nf-core/rpbp/estimatemetagenebayesfactors/main.nf new file mode 100644 index 00000000..bfb3b047 --- /dev/null +++ b/modules/nf-core/rpbp/estimatemetagenebayesfactors/main.nf @@ -0,0 +1,42 @@ +process RPBP_ESTIMATEMETAGENEBAYESFACTORS { + tag "$meta.id" + label 'process_medium' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? 
+ 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(profile_csv) + + output: + tuple val(meta), path("${prefix}.csv.gz"), emit: bayes_factors + tuple val("${task.process}"), val('rpbp'), eval('python -c "import rpbp; print(rpbp.__version__)"'), emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.metagene-bayes" + """ + RPBP_MODELS_BASE=\$(python3 -c "import os, inspect, rpbp; print(os.path.join(os.path.dirname(inspect.getfile(rpbp)), 'models'))") + PERIODIC=\$(ls "\$RPBP_MODELS_BASE"/periodic/*.stan) + NONPERIODIC=\$(ls "\$RPBP_MODELS_BASE"/nonperiodic/*.stan) + + estimate-metagene-profile-bayes-factors \\ + ${profile_csv} \\ + ${prefix}.csv.gz \\ + --periodic-models \$PERIODIC \\ + --nonperiodic-models \$NONPERIODIC \\ + --num-cpus ${task.cpus} \\ + ${args} + """ + + stub: + prefix = task.ext.prefix ?: "${meta.id}.metagene-bayes" + """ + echo "" | gzip > ${prefix}.csv.gz + """ +} diff --git a/modules/nf-core/rpbp/estimatemetagenebayesfactors/meta.yml b/modules/nf-core/rpbp/estimatemetagenebayesfactors/meta.yml new file mode 100644 index 00000000..5f7987b2 --- /dev/null +++ b/modules/nf-core/rpbp/estimatemetagenebayesfactors/meta.yml @@ -0,0 +1,79 @@ +name: "rpbp_estimatemetagenebayesfactors" +description: | + Score how strongly each per-read-length metagene profile shows the + 3-nucleotide periodicity expected of actively translating ribosomes. + For each candidate (read length, P-site offset) pair, Rp-Bp fits two + competing Bayesian models to the count window around annotated start + codons: a "periodic" model whose signal repeats every three + nucleotides, and a "non-periodic" background model. 
The Bayes factor + (ratio of the two marginal likelihoods) quantifies how much the data + prefer the periodic explanation. + + Returns one row per (length, offset) pair with the mean and variance of + the log Bayes factor across MCMC samples. Downstream, + `rpbp/selectperiodicoffsets` picks the best offset per length from + this table, and `rpbp/getperiodiclengthsoffsets` filters to the + high-confidence pairs that drive ORF-level scoring. + + Uses the Stan models bundled inside the rpbp Python package. +keywords: + - rpbp + - metagene + - bayes + - orf + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - profile_csv: + type: file + description: Metagene profile CSV produced by `rpbp/extractmetageneprofiles`. + pattern: "*.metagene-profile.csv.gz" + ontologies: [] +output: + bayes_factors: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.csv.gz": + type: file + description: Per-read-length metagene periodicity Bayes factors. 
+ pattern: "*.csv.gz" + ontologies: [] + versions_rpbp: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/estimatemetagenebayesfactors/tests/main.nf.test b/modules/nf-core/rpbp/estimatemetagenebayesfactors/tests/main.nf.test new file mode 100644 index 00000000..93ddba39 --- /dev/null +++ b/modules/nf-core/rpbp/estimatemetagenebayesfactors/tests/main.nf.test @@ -0,0 +1,70 @@ +nextflow_process { + + name "Test Process RPBP_ESTIMATEMETAGENEBAYESFACTORS" + script "../main.nf" + process "RPBP_ESTIMATEMETAGENEBAYESFACTORS" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/estimatemetagenebayesfactors" + + test("homo_sapiens chr20 - metagene bayes factors") { + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-profile.csv.gz", checkIfExists: true) + ]) + """ + } + } + + then { + def bayes_factors = file(process.out.bayes_factors[0][1]) + def rows = path(process.out.bayes_factors[0][1]).linesGzip + def header = rows[0].split(",") + // The p_periodic_*, p_nonperiodic_*, and bayes_factor_* columns are MCMC + // estimates whose floats differ between the conda and container toolchains; + // snapshot only the deterministic input-derived columns. 
+ def stable_cols = ["offset", "profile_sum", "profile_peak", "length"].collect { name -> header.findIndexOf { it == name } } + def stable = rows.collect { line -> def fields = line.split(","); stable_cols.collect { fields[it] }.join(",") }.join("\n") + def stable_md5 = java.security.MessageDigest.getInstance("MD5").digest(stable.bytes).encodeHex().toString() + assertAll( + { assert process.success }, + { assert snapshot( + bayes_factors.name, + rows.size(), + rows[0], + stable_md5, + process.out.findAll { key, val -> key.startsWith('versions') } + ).match() } + ) + } + } + + test("homo_sapiens chr20 - metagene bayes factors - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-profile.csv.gz", checkIfExists: true) + ]) + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/estimatemetagenebayesfactors/tests/main.nf.test.snap b/modules/nf-core/rpbp/estimatemetagenebayesfactors/tests/main.nf.test.snap new file mode 100644 index 00000000..4205ed46 --- /dev/null +++ b/modules/nf-core/rpbp/estimatemetagenebayesfactors/tests/main.nf.test.snap @@ -0,0 +1,52 @@ +{ + "homo_sapiens chr20 - metagene bayes factors": { + "content": [ + "test.metagene-bayes.csv.gz", + 379, + "offset,p_periodic_mean,p_periodic_var,p_nonperiodic_mean,p_nonperiodic_var,profile_sum,profile_peak,bayes_factor_mean,bayes_factor_var,length", + "a239cf6aa3f7d18e11e7d4a59383a2d6", + { + "versions_rpbp": [ + [ + "RPBP_ESTIMATEMETAGENEBAYESFACTORS", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:12:12.954610332", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - metagene bayes factors - stub": { + "content": [ + { + 
"bayes_factors": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene-bayes.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "versions_rpbp": [ + [ + "RPBP_ESTIMATEMETAGENEBAYESFACTORS", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-11T11:11:17.3733716", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} \ No newline at end of file diff --git a/modules/nf-core/rpbp/estimateorfbayesfactors/environment.yml b/modules/nf-core/rpbp/estimateorfbayesfactors/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/estimateorfbayesfactors/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/estimateorfbayesfactors/main.nf b/modules/nf-core/rpbp/estimateorfbayesfactors/main.nf new file mode 100644 index 00000000..4951e66c --- /dev/null +++ b/modules/nf-core/rpbp/estimateorfbayesfactors/main.nf @@ -0,0 +1,44 @@ +process RPBP_ESTIMATEORFBAYESFACTORS { + tag "$meta.id" + label 'process_high' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? 
+ 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(profiles) + tuple val(meta2), path(orfs_genomic_bed) + + output: + tuple val(meta), path("${prefix}.bed.gz"), emit: bayes_factors + tuple val("${task.process}"), val('rpbp'), eval('python -c "import rpbp; print(rpbp.__version__)"'), emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.bayes-factors" + """ + RPBP_MODELS_BASE=\$(python3 -c "import os, inspect, rpbp; print(os.path.join(os.path.dirname(inspect.getfile(rpbp)), 'models'))") + TRANSLATED=\$(ls "\$RPBP_MODELS_BASE"/translated/*.stan) + UNTRANSLATED=\$(ls "\$RPBP_MODELS_BASE"/untranslated/*.stan) + + estimate-orf-bayes-factors \\ + ${profiles} \\ + ${orfs_genomic_bed} \\ + ${prefix}.bed.gz \\ + --translated-models \$TRANSLATED \\ + --untranslated-models \$UNTRANSLATED \\ + --num-cpus ${task.cpus} \\ + ${args} + """ + + stub: + prefix = task.ext.prefix ?: "${meta.id}.bayes-factors" + """ + echo "" | gzip > ${prefix}.bed.gz + """ +} diff --git a/modules/nf-core/rpbp/estimateorfbayesfactors/meta.yml b/modules/nf-core/rpbp/estimateorfbayesfactors/meta.yml new file mode 100644 index 00000000..28faf19e --- /dev/null +++ b/modules/nf-core/rpbp/estimateorfbayesfactors/meta.yml @@ -0,0 +1,87 @@ +name: "rpbp_estimateorfbayesfactors" +description: | + Score every candidate ORF for evidence of active translation. For + each ORF, Rp-Bp fits two competing Bayesian models to its per-codon + P-site count vector: a "translated" model that expects P-site density + to concentrate at codon-start positions (the in-frame signal a + translating ribosome produces), and an "untranslated" / noise model + for the same data. 
The Bayes factor (ratio of marginal likelihoods) + quantifies how much the data favour the translated hypothesis. + + Emits a BED-style table with one row per ORF carrying genomic + coordinates plus the mean and variance of the log Bayes factor across + MCMC samples. Downstream, `rpbp/selectfinalpredictionset` applies + Bayes-factor, length and overlap rules to this table to produce the + final filtered prediction set. + + Uses the Stan models bundled inside the rpbp Python package. +keywords: + - rpbp + - orf + - bayes + - translation + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - profiles: + type: file + description: Per-ORF P-site profile matrix from `rpbp/extractorfprofiles`. + pattern: "*.profiles.mtx.gz" + ontologies: [] + - - meta2: + type: map + description: | + Groovy Map identifying the reference (e.g. `[ id:'reference' ]`). + - orfs_genomic_bed: + type: file + description: Per-ORF genomic BED from `rpbp/preparegenome`. + pattern: "*.orfs-genomic.annotated.bed.gz" + ontologies: [] +output: + bayes_factors: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.bed.gz": + type: file + description: Per-ORF translation Bayes factors (BED). 
+ pattern: "*.bed.gz" + ontologies: [] + versions_rpbp: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/estimateorfbayesfactors/tests/main.nf.test b/modules/nf-core/rpbp/estimateorfbayesfactors/tests/main.nf.test new file mode 100644 index 00000000..b3e90065 --- /dev/null +++ b/modules/nf-core/rpbp/estimateorfbayesfactors/tests/main.nf.test @@ -0,0 +1,78 @@ +nextflow_process { + + name "Test Process RPBP_ESTIMATEORFBAYESFACTORS" + script "../main.nf" + process "RPBP_ESTIMATEORFBAYESFACTORS" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/estimateorfbayesfactors" + + test("homo_sapiens chr20 - estimate orf bayes factors") { + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.profiles.mtx.gz", checkIfExists: true) + ]) + input[1] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz", checkIfExists: true) + ] + """ + } + } + + then { + def bayes_factors = file(process.out.bayes_factors[0][1]) + def rows = path(process.out.bayes_factors[0][1]).linesGzip + def header = rows[0].split("\t") + // These columns are MCMC estimates whose floats differ between the conda and + // container 
toolchains; exclude them and snapshot only the deterministic columns. + def mcmc_cols = ["#p_translated_mean", "#p_translated_var", "#p_background_mean", "#p_background_var", "#bayes_factor_mean", "#bayes_factor_var"] + def stable_idx = (0..<header.size()).findAll { !(header[it] in mcmc_cols) } + def stable = rows.collect { line -> def fields = line.split("\t"); stable_idx.collect { fields[it] }.join("\t") }.join("\n") + def stable_md5 = java.security.MessageDigest.getInstance("MD5").digest(stable.bytes).encodeHex().toString() + assertAll( + { assert process.success }, + { assert snapshot( + bayes_factors.name, + rows.size(), + rows[0], + stable_md5, + process.out.findAll { key, val -> key.startsWith('versions') } + ).match() } + ) + } + } + + test("homo_sapiens chr20 - estimate orf bayes factors - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.profiles.mtx.gz", checkIfExists: true) + ]) + input[1] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz", checkIfExists: true) + ] + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/estimateorfbayesfactors/tests/main.nf.test.snap b/modules/nf-core/rpbp/estimateorfbayesfactors/tests/main.nf.test.snap new file mode 100644 index 00000000..999a6567 --- /dev/null +++ b/modules/nf-core/rpbp/estimateorfbayesfactors/tests/main.nf.test.snap @@ -0,0 +1,52 @@ +{ + "homo_sapiens chr20 - estimate orf bayes factors": { + "content": [ + "test.bayes-factors.bed.gz", + 8219, + 
"#seqname\t#start\t#end\t#id\t#score\t#strand\t#thick_start\t#thick_end\t#color\t#num_exons\t#exon_lengths\t#exon_genomic_relative_starts\t#orf_num\t#orf_len\t#p_translated_mean\t#p_translated_var\t#p_background_mean\t#p_background_var\t#bayes_factor_mean\t#bayes_factor_var\t#chi_square_p\t#x_1_sum\t#x_2_sum\t#x_3_sum\t#profile_sum", + "1a17faf5a5771d0f306b3c9591d7f9b1", + { + "versions_rpbp": [ + [ + "RPBP_ESTIMATEORFBAYESFACTORS", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T17:00:18.24957747", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - estimate orf bayes factors - stub": { + "content": [ + { + "bayes_factors": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.bayes-factors.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "versions_rpbp": [ + [ + "RPBP_ESTIMATEORFBAYESFACTORS", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-11T11:51:10.205143393", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} \ No newline at end of file diff --git a/modules/nf-core/rpbp/extractmetageneprofiles/environment.yml b/modules/nf-core/rpbp/extractmetageneprofiles/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/extractmetageneprofiles/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/extractmetageneprofiles/main.nf b/modules/nf-core/rpbp/extractmetageneprofiles/main.nf new file mode 100644 index 00000000..bffce53e --- /dev/null +++ b/modules/nf-core/rpbp/extractmetageneprofiles/main.nf @@ -0,0 +1,38 @@ +process RPBP_EXTRACTMETAGENEPROFILES { + tag "$meta.id" + label 'process_medium' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? 
+ 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(bam), path(bai) + tuple val(meta2), path(transcript_bed) + + output: + tuple val(meta), path("${prefix}.csv.gz"), emit: metagene + tuple val("${task.process}"), val('rpbp'), eval('python -c "import rpbp; print(rpbp.__version__)"'), emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.metagene" + """ + extract-metagene-profiles \\ + ${bam} \\ + ${transcript_bed} \\ + ${prefix}.csv.gz \\ + --num-cpus ${task.cpus} \\ + ${args} + """ + + stub: + prefix = task.ext.prefix ?: "${meta.id}.metagene" + """ + echo "" | gzip > ${prefix}.csv.gz + """ +} diff --git a/modules/nf-core/rpbp/extractmetageneprofiles/meta.yml b/modules/nf-core/rpbp/extractmetageneprofiles/meta.yml new file mode 100644 index 00000000..f48b4774 --- /dev/null +++ b/modules/nf-core/rpbp/extractmetageneprofiles/meta.yml @@ -0,0 +1,90 @@ +name: "rpbp_extractmetageneprofiles" +description: | + Build per-read-length pileups of Ribo-seq read 5'-ends around annotated + start codons - the "metagene profile". For each read length, the profile + counts how many reads of that length have their 5' end at each position + in a window around every annotated start codon, summed across all + transcripts. Looking at the profile across the window reveals whether + reads of that length show the 3-nucleotide periodicity characteristic + of translating ribosomes. + + This per-length view matters because different ribosome footprint + lengths place the ribosomal P-site (the codon being decoded) at + different offsets from the read's 5' end, so each length needs its + own offset calibration. 
Output is consumed by + `rpbp/estimatemetagenebayesfactors`, which scores each (length, offset) + combination for periodicity. +keywords: + - rpbp + - metagene + - orf + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - bam: + type: file + description: Sorted Ribo-seq BAM. + pattern: "*.bam" + ontologies: [] + - bai: + type: file + description: BAM index. + pattern: "*.bai" + ontologies: [] + - - meta2: + type: map + description: | + Groovy Map identifying the reference (e.g. `[ id:'reference' ]`). + - transcript_bed: + type: file + description: Annotated transcripts BED produced by `rpbp/preparegenome`. + pattern: "*.annotated.bed.gz" + ontologies: [] +output: + metagene: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.csv.gz": + type: file + description: Per-read-length 5'-end metagene profile counts. 
+ pattern: "*.csv.gz" + ontologies: [] + versions_rpbp: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/extractmetageneprofiles/tests/main.nf.test b/modules/nf-core/rpbp/extractmetageneprofiles/tests/main.nf.test new file mode 100644 index 00000000..25b092d0 --- /dev/null +++ b/modules/nf-core/rpbp/extractmetageneprofiles/tests/main.nf.test @@ -0,0 +1,65 @@ +nextflow_process { + + name "Test Process RPBP_EXTRACTMETAGENEPROFILES" + script "../main.nf" + process "RPBP_EXTRACTMETAGENEPROFILES" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/extractmetageneprofiles" + + test("homo_sapiens chr20 - extract metagene") { + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam.bai", checkIfExists: true) + ]) + input[1] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.annotated.bed.gz", checkIfExists: true) + ] + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + 
test("homo_sapiens chr20 - extract metagene - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam.bai", checkIfExists: true) + ]) + input[1] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.annotated.bed.gz", checkIfExists: true) + ] + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/extractmetageneprofiles/tests/main.nf.test.snap b/modules/nf-core/rpbp/extractmetageneprofiles/tests/main.nf.test.snap new file mode 100644 index 00000000..f15d741e --- /dev/null +++ b/modules/nf-core/rpbp/extractmetageneprofiles/tests/main.nf.test.snap @@ -0,0 +1,58 @@ +{ + "homo_sapiens chr20 - extract metagene": { + "content": [ + { + "metagene": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene.csv.gz:md5,05669bf4f874538edc72a24da629b0da" + ] + ], + "versions_rpbp": [ + [ + "RPBP_EXTRACTMETAGENEPROFILES", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T15:18:56.914263745", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - extract metagene - stub": { + "content": [ + { + "metagene": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "versions_rpbp": [ + [ + "RPBP_EXTRACTMETAGENEPROFILES", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T15:19:02.14190141", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } 
+ } +} \ No newline at end of file diff --git a/modules/nf-core/rpbp/extractorfprofiles/environment.yml b/modules/nf-core/rpbp/extractorfprofiles/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/extractorfprofiles/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/extractorfprofiles/main.nf b/modules/nf-core/rpbp/extractorfprofiles/main.nf new file mode 100644 index 00000000..f2ebb0b5 --- /dev/null +++ b/modules/nf-core/rpbp/extractorfprofiles/main.nf @@ -0,0 +1,45 @@ +process RPBP_EXTRACTORFPROFILES { + tag "$meta.id" + label 'process_medium' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? + 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(bam), path(bai), path(lengths_offsets) + tuple val(meta2), path(orfs_genomic_bed) + tuple val(meta3), path(exons_bed) + + output: + tuple val(meta), path("${prefix}.mtx.gz"), emit: profiles + tuple val("${task.process}"), val('rpbp'), eval('python -c "import rpbp; print(rpbp.__version__)"'), emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.profiles" + """ + LENGTHS=\$(tail -n +2 ${lengths_offsets} | cut -f1 | tr '\\n' ' ') + OFFSETS=\$(tail -n +2 ${lengths_offsets} | cut -f2 | tr '\\n' ' ') + + extract-orf-profiles \\ + ${bam} \\ + ${orfs_genomic_bed} \\ + ${exons_bed} \\ + ${prefix}.mtx.gz \\ + --lengths \$LENGTHS \\ + --offsets \$OFFSETS \\ + --num-cpus ${task.cpus} \\ + ${args} + """ + + stub: + prefix = task.ext.prefix ?: "${meta.id}.profiles" + 
""" + echo "" | gzip > ${prefix}.mtx.gz + """ +} diff --git a/modules/nf-core/rpbp/extractorfprofiles/meta.yml b/modules/nf-core/rpbp/extractorfprofiles/meta.yml new file mode 100644 index 00000000..f751a82e --- /dev/null +++ b/modules/nf-core/rpbp/extractorfprofiles/meta.yml @@ -0,0 +1,104 @@ +name: "rpbp_extractorfprofiles" +description: | + Build a per-ORF P-site count vector for every candidate open reading + frame (ORF) in the catalogue. For each ORF, walks the spliced exons + in 3-nucleotide codon steps and counts the P-site positions + (read 5'-end coordinate plus the length-specific offset selected + upstream) that fall in each codon. Counts are summed across all read + lengths that passed the periodicity filter from + `rpbp/getperiodiclengthsoffsets`. + + The resulting per-ORF vectors are the input to Bayesian translation + scoring in `rpbp/estimateorfbayesfactors`: a translated ORF should + show P-site density concentrated at codon-start positions, while a + non-translated region should look flat or noisy. Emitted as a sparse + matrix (one row per ORF, columns indexed by codon position). +keywords: + - rpbp + - orf + - psite + - profile + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - bam: + type: file + description: Sorted Ribo-seq BAM. + pattern: "*.bam" + ontologies: [] + - bai: + type: file + description: BAM index. + pattern: "*.bai" + ontologies: [] + - lengths_offsets: + type: file + description: Per-read-length offsets TSV from `rpbp/getperiodiclengthsoffsets`. 
+ pattern: "*.periodic_lengths_offsets.tsv" + ontologies: [] + - - meta2: + type: map + description: | + Groovy Map identifying the reference (e.g. `[ id:'reference' ]`). + - orfs_genomic_bed: + type: file + description: Per-ORF genomic BED from `rpbp/preparegenome`. + pattern: "*.orfs-genomic.annotated.bed.gz" + ontologies: [] + - - meta3: + type: map + description: | + Groovy Map identifying the reference (e.g. `[ id:'reference' ]`). + - exons_bed: + type: file + description: Per-ORF exons BED from `rpbp/preparegenome`. + pattern: "*.orfs-exons.annotated.bed.gz" + ontologies: [] +output: + profiles: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.mtx.gz": + type: file + description: Per-ORF P-site profile sparse matrix. + pattern: "*.mtx.gz" + ontologies: [] + versions_rpbp: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/extractorfprofiles/tests/main.nf.test b/modules/nf-core/rpbp/extractorfprofiles/tests/main.nf.test new file mode 100644 index 00000000..0f57805e --- /dev/null +++ b/modules/nf-core/rpbp/extractorfprofiles/tests/main.nf.test @@ -0,0 +1,75 @@ +nextflow_process { + + name "Test Process RPBP_EXTRACTORFPROFILES" + script "../main.nf" + process "RPBP_EXTRACTORFPROFILES" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/extractorfprofiles" + + test("homo_sapiens chr20 - extract orf profiles") { + 
+ when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam.bai", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic_lengths_offsets.tsv", checkIfExists: true) + ]) + input[1] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz", checkIfExists: true) + ] + input[2] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-exons.annotated.bed.gz", checkIfExists: true) + ] + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - extract orf profiles - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam.bai", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic_lengths_offsets.tsv", checkIfExists: true) + ]) + input[1] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-genomic.annotated.bed.gz", checkIfExists: true) + ] + input[2] = [ + [ id:'reference' ], + 
file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/reference.orfs-exons.annotated.bed.gz", checkIfExists: true) + ] + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/extractorfprofiles/tests/main.nf.test.snap b/modules/nf-core/rpbp/extractorfprofiles/tests/main.nf.test.snap new file mode 100644 index 00000000..b0ba0bfc --- /dev/null +++ b/modules/nf-core/rpbp/extractorfprofiles/tests/main.nf.test.snap @@ -0,0 +1,58 @@ +{ + "homo_sapiens chr20 - extract orf profiles - stub": { + "content": [ + { + "profiles": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.profiles.mtx.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "versions_rpbp": [ + [ + "RPBP_EXTRACTORFPROFILES", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:44:43.898520726", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - extract orf profiles": { + "content": [ + { + "profiles": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.profiles.mtx.gz:md5,7e1ca4eecd50b189f5f4fc362a896d8b" + ] + ], + "versions_rpbp": [ + [ + "RPBP_EXTRACTORFPROFILES", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:44:38.531400548", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/environment.yml b/modules/nf-core/rpbp/getperiodiclengthsoffsets/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/main.nf b/modules/nf-core/rpbp/getperiodiclengthsoffsets/main.nf new file mode 100644 index 
00000000..ce9f55ab --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/main.nf @@ -0,0 +1,36 @@ +process RPBP_GETPERIODICLENGTHSOFFSETS { + tag "$meta.id" + label 'process_single' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? + 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(periodic_offsets) + + output: + tuple val(meta), path("${prefix}.tsv"), emit: lengths_offsets + path "versions.yml" , emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + task_ext_args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.lengths-offsets" + template 'get_periodic_lengths_and_offsets.py' + + stub: + prefix = task.ext.prefix ?: "${meta.id}.lengths-offsets" + """ + touch ${prefix}.tsv + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + python: \$(python --version | sed -e "s/Python //g") + rpbp: \$(python -c "import rpbp; print(rpbp.__version__)") + END_VERSIONS + """ +} diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/meta.yml b/modules/nf-core/rpbp/getperiodiclengthsoffsets/meta.yml new file mode 100644 index 00000000..b7453577 --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/meta.yml @@ -0,0 +1,69 @@ +name: "rpbp_getperiodiclengthsoffsets" +description: | + Filter the per-read-length P-site offset table down to the + (length, offset) pairs that will actually drive ORF-level scoring. + Drops read lengths whose metagene profile is too sparsely populated, + or whose periodicity Bayes factor is too low / too uncertain, so that + downstream P-site counting only uses read lengths with a clean + 3-nucleotide signal. 
+ + Wraps Rp-Bp's `get_periodic_lengths_and_offsets` Python helper directly. + Thresholds are configured via named flags in `ext.args`: + `--min-count` (default: 1000), `--min-bf-mean` (default: 5), + `--max-bf-var` (default: no limit), `--min-bf-likelihood` (default: 0.5). + Defaults mirror `rpbp.defaults.metagene_options`. +keywords: + - rpbp + - psite + - offset + - filter + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - periodic_offsets: + type: file + description: Selected periodic offsets CSV from `rpbp/selectperiodicoffsets`. + pattern: "*.periodic-offsets.csv.gz" + ontologies: [] +output: + lengths_offsets: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.tsv": + type: file + description: Per-read-length offsets (length, offset) that pass the periodicity filter. + pattern: "*.tsv" + ontologies: [] + versions_rpbp: + - versions.yml: + type: file + description: File containing software versions. + pattern: "versions.yml" + ontologies: + - edam: "http://edamontology.org/format_3750" # YAML +topics: + versions: + - versions.yml: + type: file + description: File containing software versions. 
+ pattern: "versions.yml" + ontologies: + - edam: "http://edamontology.org/format_3750" # YAML +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/templates/get_periodic_lengths_and_offsets.py b/modules/nf-core/rpbp/getperiodiclengthsoffsets/templates/get_periodic_lengths_and_offsets.py new file mode 100644 index 00000000..ae6f7c85 --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/templates/get_periodic_lengths_and_offsets.py @@ -0,0 +1,62 @@ +#!/usr/bin/env python3 +# Written by Jonathan Manning (@pinin4fjords). Released under the MIT license. +"""Wrap rpbp's `get_periodic_lengths_and_offsets` helper. + +Recreates the `rpbp_work/metagene-profiles/-unique.periodic-offsets.csv.gz` +layout the helper expects, calls it with thresholds passed in from the +Nextflow process, and writes the resulting per-read-length offsets to a TSV. +""" + +import argparse +import os +import platform +import shlex +import shutil + +import pandas as pd +import rpbp +import yaml +from rpbp.ribo_utils.utils import get_periodic_lengths_and_offsets + + +prefix = "${prefix}" + +parser = argparse.ArgumentParser() +parser.add_argument("--min-count", type=int, default=1000, dest="min_count") +parser.add_argument("--min-bf-mean", type=float, default=5.0, dest="min_bf_mean") +parser.add_argument("--max-bf-var", type=float, default=None, dest="max_bf_var") +parser.add_argument("--min-bf-likelihood", type=float, default=0.5, dest="min_bf_likelihood") +args = parser.parse_args(shlex.split("${task_ext_args}")) + +work_dir = os.path.join("rpbp_work", "metagene-profiles") +os.makedirs(work_dir, exist_ok=True) +shutil.copy( + "${periodic_offsets}", + os.path.join(work_dir, f"{prefix}-unique.periodic-offsets.csv.gz"), +) + +config = dict( + riboseq_data="rpbp_work", + min_metagene_profile_count=args.min_count, + min_metagene_bf_mean=args.min_bf_mean, + max_metagene_bf_var=args.max_bf_var, + 
min_metagene_bf_likelihood=args.min_bf_likelihood, +) + +lengths, offsets = get_periodic_lengths_and_offsets(config, prefix, is_unique=True) +if len(lengths) == 0: + raise SystemExit( + "No periodic read lengths passed filters; " + "check --min-count/--min-bf-mean thresholds and metagene Bayes-factor output." + ) +pd.DataFrame({"length": lengths, "offset": offsets}).to_csv( + f"{prefix}.tsv", + sep="\\t", + index=False, +) + +with open("versions.yml", "w") as f: + yaml.safe_dump( + {"${task.process}": {"python": platform.python_version(), "rpbp": rpbp.__version__}}, + f, + ) diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/main.nf.test b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/main.nf.test new file mode 100644 index 00000000..90066158 --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/main.nf.test @@ -0,0 +1,79 @@ +nextflow_process { + + name "Test Process RPBP_GETPERIODICLENGTHSOFFSETS" + script "../main.nf" + process "RPBP_GETPERIODICLENGTHSOFFSETS" + config "./nextflow.config" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/getperiodiclengthsoffsets" + + test("homo_sapiens chr20 - get periodic lengths offsets") { + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic-offsets.csv.gz", checkIfExists: true) + ]) + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - get periodic lengths offsets - loose thresholds") { + + config "./nextflow.loose.config" + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic-offsets.csv.gz", checkIfExists: 
true) + ]) + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - get periodic lengths offsets - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.periodic-offsets.csv.gz", checkIfExists: true) + ]) + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/main.nf.test.snap b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/main.nf.test.snap new file mode 100644 index 00000000..1749beea --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/main.nf.test.snap @@ -0,0 +1,74 @@ +{ + "homo_sapiens chr20 - get periodic lengths offsets - loose thresholds": { + "content": [ + { + "lengths_offsets": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.lengths-offsets.tsv:md5,008c79e476b229adecc2e1f786316278" + ] + ], + "versions_rpbp": [ + "versions.yml:md5,12a122fa44cf68a3c034a4d6a91206f3" + ] + } + ], + "timestamp": "2026-06-11T13:50:08.421404192", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - get periodic lengths offsets": { + "content": [ + { + "lengths_offsets": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.lengths-offsets.tsv:md5,e991e2a4df8ced041098f225958bccea" + ] + ], + "versions_rpbp": [ + "versions.yml:md5,12a122fa44cf68a3c034a4d6a91206f3" + ] + } + ], + "timestamp": "2026-06-11T13:50:02.858046905", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - get periodic lengths offsets - stub": { + "content": [ + { + 
"lengths_offsets": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.lengths-offsets.tsv:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "versions_rpbp": [ + "versions.yml:md5,9d3d1f722325148872149e3e634971d7" + ] + } + ], + "timestamp": "2026-06-11T13:50:12.680132079", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} \ No newline at end of file diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/nextflow.config b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/nextflow.config new file mode 100644 index 00000000..cdf1dd5b --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/nextflow.config @@ -0,0 +1,5 @@ +process { + withName: 'RPBP_GETPERIODICLENGTHSOFFSETS' { + ext.args = '--min-count 10 --min-bf-mean 1 --min-bf-likelihood 0.0' + } +} diff --git a/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/nextflow.loose.config b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/nextflow.loose.config new file mode 100644 index 00000000..c304da4d --- /dev/null +++ b/modules/nf-core/rpbp/getperiodiclengthsoffsets/tests/nextflow.loose.config @@ -0,0 +1,5 @@ +process { + withName: 'RPBP_GETPERIODICLENGTHSOFFSETS' { + ext.args = '--min-count 5' + } +} diff --git a/modules/nf-core/rpbp/preparegenome/environment.yml b/modules/nf-core/rpbp/preparegenome/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/preparegenome/main.nf b/modules/nf-core/rpbp/preparegenome/main.nf new file mode 100644 index 00000000..75bd3821 --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/main.nf @@ -0,0 +1,40 @@ +process RPBP_PREPAREGENOME { + tag "$meta.id" + label 'process_medium' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 
'apptainer'] && !task.ext.singularity_pull_docker_container ? + 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(fasta), path(gtf) + + output: + tuple val(meta), path("${prefix}.annotated.bed.gz") , emit: transcript_bed + tuple val(meta), path("${prefix}.orfs-genomic.annotated.bed.gz"), emit: orfs_genomic_bed + tuple val(meta), path("${prefix}.orfs-exons.annotated.bed.gz") , emit: orfs_exons_bed + path "versions.yml" , emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + task_ext_args = task.ext.args ?: '' + prefix = task.ext.prefix ?: meta.id + template 'prepare_rpbp_genome.py' + + stub: + prefix = task.ext.prefix ?: meta.id + """ + echo "" | gzip > ${prefix}.annotated.bed.gz + echo "" | gzip > ${prefix}.orfs-genomic.annotated.bed.gz + echo "" | gzip > ${prefix}.orfs-exons.annotated.bed.gz + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + python: \$(python --version | sed -e "s/Python //g") + rpbp: \$(python -c "import rpbp; print(rpbp.__version__)") + END_VERSIONS + """ +} diff --git a/modules/nf-core/rpbp/preparegenome/meta.yml b/modules/nf-core/rpbp/preparegenome/meta.yml new file mode 100644 index 00000000..916cf270 --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/meta.yml @@ -0,0 +1,104 @@ +name: "rpbp_preparegenome" +description: | + Build the per-ORF reference files that Rp-Bp's downstream scoring needs, + starting from a genome FASTA and an annotation GTF. Enumerates every + candidate open reading frame (ORF) in the annotation (annotated CDSs plus + alternative start codons within transcript exons), records their genomic + and per-exon coordinates, and labels them with the transcript and gene + they belong to. 
+ + Invokes Rp-Bp's `get_orfs` Python function directly, chaining the upstream + helpers `gtf-to-bed12`, `extract-bed-sequences`, `extract-orf-coordinates`, + `split-bed12-blocks` and `label-orfs`. Bypasses Rp-Bp's `prepare-rpbp-genome` + umbrella script, which would also build `bowtie2` (rRNA filtering) and + `STAR` (alignment) indices - neither is consumed by the Rp-Bp tools + wrapped here, since alignment is supplied externally as a BAM. + + A minimal `chrName.txt` (one contig name per line) is seeded from the + FASTA headers because `gtf-to-bed12` reads it via `--chr-name-file` to + control output sort order. + + Note: emits the `*.annotated.bed.gz` filenames produced by `get_orfs` + directly, rather than the `*.bed.gz`-renamed forms that the upstream + umbrella `prepare-rpbp-genome` script produces. The downstream module + outputs and consumers in this module set reference these names + explicitly, so the two are functionally equivalent. +keywords: + - rpbp + - orf + - prepare + - genome + - bed + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing identifiers, e.g. `[ id: 'reference' ]`. + - fasta: + type: file + description: Genome FASTA. + pattern: "*.{fa,fasta,fna}" + ontologies: [] + - gtf: + type: file + description: Annotation GTF (canonical or hybrid, depending on extended-ORF mode). + pattern: "*.gtf" + ontologies: [] +output: + transcript_bed: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.annotated.bed.gz": + type: file + description: Annotated transcripts BED used by `extract-metagene-profiles`. 
+ pattern: "*.annotated.bed.gz" + ontologies: [] + orfs_genomic_bed: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.orfs-genomic.annotated.bed.gz": + type: file + description: Per-ORF genomic BED used by `extract-orf-profiles` / `estimate-orf-bayes-factors`. + pattern: "*.orfs-genomic.annotated.bed.gz" + ontologies: [] + orfs_exons_bed: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.orfs-exons.annotated.bed.gz": + type: file + description: Per-ORF exons BED used by `extract-orf-profiles`. + pattern: "*.orfs-exons.annotated.bed.gz" + ontologies: [] + versions_rpbp: + - versions.yml: + type: file + description: File containing software versions. + pattern: "versions.yml" + ontologies: + - edam: "http://edamontology.org/format_3750" # YAML +topics: + versions: + - versions.yml: + type: file + description: File containing software versions. + pattern: "versions.yml" + ontologies: + - edam: "http://edamontology.org/format_3750" # YAML +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/preparegenome/templates/prepare_rpbp_genome.py b/modules/nf-core/rpbp/preparegenome/templates/prepare_rpbp_genome.py new file mode 100644 index 00000000..6d618488 --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/templates/prepare_rpbp_genome.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python3 +# Written by Jonathan Manning (@pinin4fjords). Released under the MIT license. + +"""Run rpbp's `get_orfs` reference-prep step. + +Seeds the index layout (`chrName.txt`, the `transcript-index/` subdir) and +then calls the rpbp Python API, standing in for rpbp's `prepare-rpbp-genome` +umbrella script (which would also build bowtie2/STAR indices that are not +consumed here). 
+""" + +import argparse +import gzip +import os +import platform +import shlex +import shutil + +import rpbp +import yaml +from pbiotools.misc import logging_utils +from rpbp.reference_preprocessing.prepare_rpbp_genome import get_orfs + + +def chr_names_from_fasta(fasta_path: str, out_path: str) -> None: + """Write one chromosome name per line from a (possibly gzipped) FASTA.""" + opener = gzip.open if fasta_path.endswith(".gz") else open + with opener(fasta_path, "rt") as fh, open(out_path, "w") as out: + for line in fh: + if line.startswith(">"): + # FASTA header: drop leading '>' and trim at first whitespace. + parts = line[1:].split() + if parts: + out.write(parts[0] + "\\n") + + +prefix = "${prefix}" +fasta = "${fasta}" +gtf = "${gtf}" + +base_path = "_rpbp_ref" +star_index = os.path.join(base_path, "star") +os.makedirs(os.path.join(base_path, "transcript-index"), exist_ok=True) +os.makedirs(star_index, exist_ok=True) +chr_names_from_fasta(fasta, os.path.join(star_index, "chrName.txt")) + +config = { + "genome_base_path": base_path, + "genome_name": prefix, + "fasta": fasta, + "star_index": star_index, +} + +# get_orfs takes config-level options (start_codons, stop_codons) via the config +# dict, not via argparse. Parse these from ext.args first, inject into config, +# then pass remaining tokens to rpbp's logging argparse. 
+ext_parser = argparse.ArgumentParser(add_help=False) +ext_parser.add_argument("--start-codons", nargs="+", dest="start_codons") +ext_parser.add_argument("--stop-codons", nargs="+", dest="stop_codons") +ext_known, rpbp_raw = ext_parser.parse_known_args(shlex.split("${task_ext_args}")) + +if ext_known.start_codons: + config["start_codons"] = ext_known.start_codons +if ext_known.stop_codons: + config["stop_codons"] = ext_known.stop_codons + +parser = argparse.ArgumentParser() +logging_utils.add_logging_options(parser) +rpbp_args = parser.parse_args(rpbp_raw) +rpbp_args.do_not_call = False +rpbp_args.overwrite = False +rpbp_args.num_cpus = int("${task.cpus}") + +get_orfs(gtf, rpbp_args, config, is_annotated=True, is_de_novo=False) + +for src in [ + os.path.join(base_path, f"{prefix}.annotated.bed.gz"), + os.path.join(base_path, "transcript-index", f"{prefix}.orfs-genomic.annotated.bed.gz"), + os.path.join(base_path, "transcript-index", f"{prefix}.orfs-exons.annotated.bed.gz"), +]: + shutil.move(src, os.path.basename(src)) +shutil.rmtree(base_path) + +with open("versions.yml", "w") as f: + yaml.safe_dump( + {"${task.process}": {"python": platform.python_version(), "rpbp": rpbp.__version__}}, + f, + ) diff --git a/modules/nf-core/rpbp/preparegenome/tests/main.nf.test b/modules/nf-core/rpbp/preparegenome/tests/main.nf.test new file mode 100644 index 00000000..691caecb --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/tests/main.nf.test @@ -0,0 +1,99 @@ +nextflow_process { + + name "Test Process RPBP_PREPAREGENOME" + script "../main.nf" + process "RPBP_PREPAREGENOME" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/preparegenome" + tag "gunzip" + + setup { + run("GUNZIP") { + script "modules/nf-core/gunzip/main.nf" + process { + """ + input[0] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz", checkIfExists: true) + ] + """ + } + } + } + + 
test("homo_sapiens chr20 - prepare genome") { + + when { + process { + """ + input[0] = GUNZIP.out.gunzip + .map { meta, fa -> [ + meta, + fa, + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.111_chr20.gtf", checkIfExists: true) + ] } + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - prepare genome - extended start codons") { + + config "./nextflow.config" + + when { + process { + """ + input[0] = GUNZIP.out.gunzip + .map { meta, fa -> [ + meta, + fa, + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.111_chr20.gtf", checkIfExists: true) + ] } + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - prepare genome - stub") { + + options '-stub' + + when { + process { + """ + input[0] = GUNZIP.out.gunzip + .map { meta, fa -> [ + meta, + fa, + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.111_chr20.gtf", checkIfExists: true) + ] } + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/preparegenome/tests/main.nf.test.snap b/modules/nf-core/rpbp/preparegenome/tests/main.nf.test.snap new file mode 100644 index 00000000..e6d5be67 --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/tests/main.nf.test.snap @@ -0,0 +1,116 @@ +{ + "homo_sapiens chr20 - prepare genome - stub": { + "content": [ + { + "orfs_exons_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-exons.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "orfs_genomic_bed": [ + [ + { + "id": "reference" + }, + 
"reference.orfs-genomic.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "transcript_bed": [ + [ + { + "id": "reference" + }, + "reference.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "versions_rpbp": [ + "versions.yml:md5,ffa8fa1f5a063e25a6379ebdf9534a2e" + ] + } + ], + "timestamp": "2026-06-11T09:30:38.216958", + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.4" + } + }, + "homo_sapiens chr20 - prepare genome": { + "content": [ + { + "orfs_exons_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-exons.annotated.bed.gz:md5,ace56b8464ca29aebbafa5409875ee82" + ] + ], + "orfs_genomic_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-genomic.annotated.bed.gz:md5,dd8cfaf09907bfb049ad6caaa2fd408e" + ] + ], + "transcript_bed": [ + [ + { + "id": "reference" + }, + "reference.annotated.bed.gz:md5,dad6b410693cfffe659a04742e806745" + ] + ], + "versions_rpbp": [ + "versions.yml:md5,d272365bca998c5eb461bad2bd142193" + ] + } + ], + "timestamp": "2026-06-11T08:57:59.111446774", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - prepare genome - extended start codons": { + "content": [ + { + "orfs_exons_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-exons.annotated.bed.gz:md5,dc592daa549c6364855e2e1573cb9779" + ] + ], + "orfs_genomic_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-genomic.annotated.bed.gz:md5,1747fef9c864bfde9a2fe7bd8b2c5012" + ] + ], + "transcript_bed": [ + [ + { + "id": "reference" + }, + "reference.annotated.bed.gz:md5,dad6b410693cfffe659a04742e806745" + ] + ], + "versions_rpbp": [ + "versions.yml:md5,d272365bca998c5eb461bad2bd142193" + ] + } + ], + "timestamp": "2026-06-11T09:03:46.644466938", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} \ No newline at end of file diff --git a/modules/nf-core/rpbp/preparegenome/tests/nextflow.config b/modules/nf-core/rpbp/preparegenome/tests/nextflow.config new file mode 100644 
index 00000000..60158bfe --- /dev/null +++ b/modules/nf-core/rpbp/preparegenome/tests/nextflow.config @@ -0,0 +1,5 @@ +process { + withName: 'RPBP_PREPAREGENOME' { + ext.args = '--start-codons ATG CTG TTG' + } +} diff --git a/modules/nf-core/rpbp/selectfinalpredictionset/environment.yml b/modules/nf-core/rpbp/selectfinalpredictionset/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/selectfinalpredictionset/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/selectfinalpredictionset/main.nf b/modules/nf-core/rpbp/selectfinalpredictionset/main.nf new file mode 100644 index 00000000..e1fbc4e3 --- /dev/null +++ b/modules/nf-core/rpbp/selectfinalpredictionset/main.nf @@ -0,0 +1,43 @@ +process RPBP_SELECTFINALPREDICTIONSET { + tag "$meta.id" + label 'process_single' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? 
+ 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(bayes_factors) + tuple val(meta2), path(genome_fasta) + + output: + tuple val(meta), path("${prefix}.bed.gz") , emit: predicted + tuple val(meta), path("${prefix}.dna.fa") , emit: dna_fasta + tuple val(meta), path("${prefix}.protein.fa") , emit: protein_fasta + tuple val("${task.process}"), val('rpbp'), eval('python -c "import rpbp; print(rpbp.__version__)"'), emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.predicted-orfs" + """ + select-final-prediction-set \\ + ${bayes_factors} \\ + ${genome_fasta} \\ + ${prefix}.bed.gz \\ + ${prefix}.dna.fa \\ + ${prefix}.protein.fa \\ + ${args} + """ + + stub: + prefix = task.ext.prefix ?: "${meta.id}.predicted-orfs" + """ + echo "" | gzip > ${prefix}.bed.gz + touch ${prefix}.dna.fa + touch ${prefix}.protein.fa + """ +} diff --git a/modules/nf-core/rpbp/selectfinalpredictionset/meta.yml b/modules/nf-core/rpbp/selectfinalpredictionset/meta.yml new file mode 100644 index 00000000..1742fc58 --- /dev/null +++ b/modules/nf-core/rpbp/selectfinalpredictionset/meta.yml @@ -0,0 +1,102 @@ +name: "rpbp_selectfinalpredictionset" +description: | + Produce the final filtered set of predicted translated ORFs from the + per-ORF Bayes factor table. Applies the standard Rp-Bp prediction + rules: a minimum Bayes-factor cutoff (favouring translated over + untranslated), a minimum ORF length, and overlap resolution so that + among overlapping candidates only the highest-scoring representative + is kept. 
+ + Emits three files describing the same prediction set: a BED of ORF + genomic coordinates plus score, a FASTA of ORF DNA sequences + (extracted from the genome FASTA), and a FASTA of the corresponding + translated protein sequences. This is the terminal step of the Rp-Bp + per-sample chain. +keywords: + - rpbp + - orf + - bayes + - prediction + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - bayes_factors: + type: file + description: Per-ORF Bayes factor BED from `rpbp/estimateorfbayesfactors`. + pattern: "*.bayes-factors.bed.gz" + ontologies: [] + - - meta2: + type: map + description: | + Groovy Map identifying the reference (e.g. `[ id:'reference' ]`). + - genome_fasta: + type: file + description: Genome FASTA (matching the one used at `rpbp/preparegenome`). + pattern: "*.{fa,fasta,fna}" + ontologies: [] +output: + predicted: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.bed.gz": + type: file + description: Final filtered predicted-ORF BED. + pattern: "*.bed.gz" + ontologies: [] + dna_fasta: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.dna.fa": + type: file + description: DNA FASTA of predicted ORFs. + pattern: "*.dna.fa" + ontologies: [] + protein_fasta: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.protein.fa": + type: file + description: Protein FASTA of predicted ORFs. 
+ pattern: "*.protein.fa" + ontologies: [] + versions_rpbp: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/selectfinalpredictionset/tests/main.nf.test b/modules/nf-core/rpbp/selectfinalpredictionset/tests/main.nf.test new file mode 100644 index 00000000..570f740a --- /dev/null +++ b/modules/nf-core/rpbp/selectfinalpredictionset/tests/main.nf.test @@ -0,0 +1,73 @@ +nextflow_process { + + name "Test Process RPBP_SELECTFINALPREDICTIONSET" + script "../main.nf" + process "RPBP_SELECTFINALPREDICTIONSET" + config "./nextflow.config" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/selectfinalpredictionset" + tag "gunzip" + + setup { + run("GUNZIP") { + script "modules/nf-core/gunzip/main.nf" + process { + """ + input[0] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz", checkIfExists: true) + ] + """ + } + } + } + + test("homo_sapiens chr20 - select final prediction set") { + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.bayes-factors.bed.gz", checkIfExists: true) + ]) + input[1] = GUNZIP.out.gunzip.first() + """ + } + } + + then { + assertAll( + { assert process.success }, + { 
assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - select final prediction set - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.bayes-factors.bed.gz", checkIfExists: true) + ]) + input[1] = GUNZIP.out.gunzip.first() + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/selectfinalpredictionset/tests/main.nf.test.snap b/modules/nf-core/rpbp/selectfinalpredictionset/tests/main.nf.test.snap new file mode 100644 index 00000000..b4940bb5 --- /dev/null +++ b/modules/nf-core/rpbp/selectfinalpredictionset/tests/main.nf.test.snap @@ -0,0 +1,98 @@ +{ + "homo_sapiens chr20 - select final prediction set": { + "content": [ + { + "dna_fasta": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.dna.fa:md5,81bed68c719073b4ba565d039b04aa24" + ] + ], + "predicted": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.bed.gz:md5,86c595dc0568e1136d5e95363c705203" + ] + ], + "protein_fasta": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.protein.fa:md5,69699124b1694b2c5b471fa9664a9099" + ] + ], + "versions_rpbp": [ + [ + "RPBP_SELECTFINALPREDICTIONSET", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:45:47.740094071", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - select final prediction set - stub": { + "content": [ + { + "dna_fasta": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.dna.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + 
"predicted": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "protein_fasta": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.protein.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "versions_rpbp": [ + [ + "RPBP_SELECTFINALPREDICTIONSET", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:45:53.949001689", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} diff --git a/modules/nf-core/rpbp/selectfinalpredictionset/tests/nextflow.config b/modules/nf-core/rpbp/selectfinalpredictionset/tests/nextflow.config new file mode 100644 index 00000000..dabe2b61 --- /dev/null +++ b/modules/nf-core/rpbp/selectfinalpredictionset/tests/nextflow.config @@ -0,0 +1,5 @@ +process { + withName: 'RPBP_SELECTFINALPREDICTIONSET' { + ext.args = '--select-longest-by-stop --select-best-overlapping' + } +} diff --git a/modules/nf-core/rpbp/selectperiodicoffsets/environment.yml b/modules/nf-core/rpbp/selectperiodicoffsets/environment.yml new file mode 100644 index 00000000..7e2261ff --- /dev/null +++ b/modules/nf-core/rpbp/selectperiodicoffsets/environment.yml @@ -0,0 +1,5 @@ +channels: + - conda-forge + - bioconda +dependencies: + - bioconda::rpbp=4.0.1 diff --git a/modules/nf-core/rpbp/selectperiodicoffsets/main.nf b/modules/nf-core/rpbp/selectperiodicoffsets/main.nf new file mode 100644 index 00000000..1f62ce22 --- /dev/null +++ b/modules/nf-core/rpbp/selectperiodicoffsets/main.nf @@ -0,0 +1,35 @@ +process RPBP_SELECTPERIODICOFFSETS { + tag "$meta.id" + label 'process_single' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ? 
+ 'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/14/146c3f15abf184a5ec13531d2a040ba7b9235c1091723aa37c7a119817411367/data' : + 'community.wave.seqera.io/library/rpbp:4.0.1--71297b462026e13b' }" + + input: + tuple val(meta), path(bayes_factors) + + output: + tuple val(meta), path("${prefix}.csv.gz"), emit: periodic + tuple val("${task.process}"), val('rpbp'), eval('python -c "import rpbp; print(rpbp.__version__)"'), emit: versions_rpbp, topic: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + prefix = task.ext.prefix ?: "${meta.id}.offsets" + """ + select-periodic-offsets \\ + ${bayes_factors} \\ + ${prefix}.csv.gz \\ + ${args} + """ + + stub: + prefix = task.ext.prefix ?: "${meta.id}.offsets" + """ + echo "" | gzip > ${prefix}.csv.gz + """ +} diff --git a/modules/nf-core/rpbp/selectperiodicoffsets/meta.yml b/modules/nf-core/rpbp/selectperiodicoffsets/meta.yml new file mode 100644 index 00000000..28e63050 --- /dev/null +++ b/modules/nf-core/rpbp/selectperiodicoffsets/meta.yml @@ -0,0 +1,77 @@ +name: "rpbp_selectperiodicoffsets" +description: | + Pick the single best P-site offset for each read length from the + per-(length, offset) Bayes factor table produced upstream. For each + read length, the offset with the highest periodicity Bayes-factor mean + is selected - this is the offset that, when added to a read's 5' end, + is estimated to land closest to the ribosomal P-site (the codon being + decoded). Downstream, these offsets are used to convert raw read + 5'-end coordinates into P-site positions when counting reads against + candidate ORFs. + + Emits one row per read length (length, best offset, supporting Bayes + factor statistics). The next step + (`rpbp/getperiodiclengthsoffsets`) filters this table to the high-quality + pairs that pass user-specified count / signal thresholds before + P-site counting in `rpbp/extractorfprofiles`. 
+keywords: + - rpbp + - psite + - offset + - orf + - riboseq +tools: + - "rpbp": + description: "Rp-Bp - Bayesian inference of ribosome profiling data for identifying translated open reading frames" + homepage: "https://github.com/dieterich-lab/rp-bp" + documentation: "https://rp-bp.readthedocs.io" + tool_dev_url: "https://github.com/dieterich-lab/rp-bp" + doi: "10.1093/nar/gkw1350" + licence: + - "MIT" + identifier: "" +input: + - - meta: + type: map + description: | + Groovy Map containing sample information, e.g. `[ id:'sample1' ]`. + - bayes_factors: + type: file + description: Metagene periodicity Bayes factors CSV from `rpbp/estimatemetagenebayesfactors`. + pattern: "*.metagene-periodicity-bayes-factors.csv.gz" + ontologies: [] +output: + periodic: + - - meta: + type: map + description: Groovy Map inherited from input meta. + - "${prefix}.csv.gz": + type: file + description: Selected periodic offsets per read length. + pattern: "*.csv.gz" + ontologies: [] + versions_rpbp: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +topics: + versions: + - - ${task.process}: + type: string + description: The name of the process + - rpbp: + type: string + description: The name of the tool + - python -c "import rpbp; print(rpbp.__version__)": + type: eval + description: The expression to obtain the version of the tool +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/modules/nf-core/rpbp/selectperiodicoffsets/tests/main.nf.test b/modules/nf-core/rpbp/selectperiodicoffsets/tests/main.nf.test new file mode 100644 index 00000000..643e6255 --- /dev/null +++ b/modules/nf-core/rpbp/selectperiodicoffsets/tests/main.nf.test @@ -0,0 +1,55 @@ +nextflow_process { + + name "Test Process RPBP_SELECTPERIODICOFFSETS" + script "../main.nf" + 
process "RPBP_SELECTPERIODICOFFSETS" + + tag "modules" + tag "modules_nfcore" + tag "rpbp" + tag "rpbp/selectperiodicoffsets" + + test("homo_sapiens chr20 - select periodic offsets") { + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz", checkIfExists: true) + ]) + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } + + test("homo_sapiens chr20 - select periodic offsets - stub") { + + options '-stub' + + when { + process { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/rpbp/SRX11780888_chr20.metagene-periodicity-bayes-factors.csv.gz", checkIfExists: true) + ]) + """ + } + } + + then { + assertAll( + { assert process.success }, + { assert snapshot(sanitizeOutput(process.out)).match() } + ) + } + } +} diff --git a/modules/nf-core/rpbp/selectperiodicoffsets/tests/main.nf.test.snap b/modules/nf-core/rpbp/selectperiodicoffsets/tests/main.nf.test.snap new file mode 100644 index 00000000..401f74af --- /dev/null +++ b/modules/nf-core/rpbp/selectperiodicoffsets/tests/main.nf.test.snap @@ -0,0 +1,58 @@ +{ + "homo_sapiens chr20 - select periodic offsets": { + "content": [ + { + "periodic": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.offsets.csv.gz:md5,63de43036af73a7cc396ad5a8395eea8" + ] + ], + "versions_rpbp": [ + [ + "RPBP_SELECTPERIODICOFFSETS", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:43:02.361389569", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + }, + "homo_sapiens chr20 - select periodic offsets - stub": { + "content": [ + { + "periodic": [ + [ + { + "id": "test", 
+ "single_end": true, + "strandedness": "forward" + }, + "test.offsets.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "versions_rpbp": [ + [ + "RPBP_SELECTPERIODICOFFSETS", + "rpbp", + "4.0.1" + ] + ] + } + ], + "timestamp": "2026-06-10T16:43:07.649531735", + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.3" + } + } +} diff --git a/nextflow.config b/nextflow.config index 4ca8109a..2a2cd5f0 100644 --- a/nextflow.config +++ b/nextflow.config @@ -147,6 +147,11 @@ params { extra_price_indexgenome_args = null extra_price_price_args = null + // Rp-Bp ORF caller (opt-in, overnight) + run_rpbp = false + extra_rpbp_preparegenome_args = null + extra_rpbp_predictorfs_args = null + // MultiQC options multiqc_config = null multiqc_title = null diff --git a/nextflow_schema.json b/nextflow_schema.json index 5210bbff..ba55ba58 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -790,6 +790,22 @@ "description": "Opt in to running PRICE as an additional ORF caller (Bayesian, slow).", "help_text": "PRICE (Probabilistic inference of codon activities, Erhard et al. 2018) is a Bayesian ORF caller distributed as part of the Gedi Java framework (`gedi -e Price`). It is opt-in because the EM fit dominates runtime at full scale. PRICE is invoked once across the riboseq cohort (it estimates a shared codon-position model) so the whole BAM cohort is staged into a single task; expect higher disk I/O than per-sample callers on local executors. PRICE joins the cross-caller ORF catalogue alongside Ribo-TISH, RiboCode and Ribotricer.", "fa_icon": "fas fa-fast-forward" + }, + "run_rpbp": { + "type": "boolean", + "description": "Opt in to running Rp-Bp as an additional ORF caller (Bayesian, slow).", + "help_text": "Rp-Bp (Malone et al. 2017) is a Bayesian-strict ORF caller that complements RiboCode's permissive canonical-CDS calls. Runtime is on the order of 20-24h per replicate at genome-wide scale because the MCMC fit dominates, hence it is opt-in. Plan accordingly. 
Rp-Bp's score column (Bayes factor) participates in the cross-caller rank aggregation alongside RiboCode and Ribo-TISH.", + "fa_icon": "fas fa-fast-forward" + }, + "extra_rpbp_preparegenome_args": { + "type": "string", + "description": "Extra arguments for the Rp-Bp genome-preparation step (`get_orfs`): `--start-codons`, `--stop-codons`, and Rp-Bp logging options.", + "fa_icon": "fas fa-terminal" + }, + "extra_rpbp_predictorfs_args": { + "type": "string", + "description": "Extra CLI arguments passed to `select-final-prediction-set`.", + "fa_icon": "fas fa-terminal" } } }, diff --git a/subworkflows/local/orf_caller_dispatch/main.nf b/subworkflows/local/orf_caller_dispatch/main.nf index 900ce236..a358f825 100644 --- a/subworkflows/local/orf_caller_dispatch/main.nf +++ b/subworkflows/local/orf_caller_dispatch/main.nf @@ -2,14 +2,14 @@ // Conditional ORF-caller dispatch for the riboseq pipeline. // // Runs each enabled caller against the appropriate annotation: when extended -// ORF analysis is active, genome-BAM callers (Ribo-TISH predict, Ribotricer, -// PRICE) receive the hybrid GTF and RiboCode receives the hybrid transcriptome -// BAM + hybrid GTF. Otherwise everything stays on the canonical backbone. -// Ribo-TISH additionally takes the canonical backbone via -a for background + -// classification. +// ORF analysis is active, genome-BAM callers (Ribo-TISH predict, Ribotricer, +// Rp-Bp, PRICE) receive the hybrid GTF and RiboCode receives the hybrid +// transcriptome BAM + hybrid GTF. Otherwise callers stay on the canonical +// backbone, except Rp-Bp, which takes the full multi-isoform GTF. Ribo-TISH also +// takes the canonical backbone via -a for background + classification. // -// Per-caller gating (params.skip_ribotish / params.run_ribotricer / -// params.run_price) lives here. +// Per-caller gating (params.skip_* / params.run_*) lives here; the +// downstream catalogue gating still lives at the call site. 
// // Emits one prediction channel per caller plus collected versions and // multiqc files. Predictions are empty channels for callers that did not run.
@@ -24,6 +24,7 @@ include { RIBOCODE_GTFUPDATE } from '../../../modul include { RIBOCODE_PREPARE } from '../../../modules/nf-core/ribocode/prepare' include { RIBOCODE_METAPLOTS } from '../../../modules/nf-core/ribocode/metaplots' include { RIBOCODE_RIBOCODE } from '../../../modules/nf-core/ribocode/ribocode' +include { FASTA_GTF_BAM_RPBP } from '../../nf-core/fasta_gtf_bam_rpbp/main' include { GEDI_INDEXGENOME } from '../../../modules/nf-core/gedi/indexgenome/main' include { GEDI_PRICE } from '../../../modules/nf-core/gedi/price/main' @@ -41,8 +42,8 @@ workflow ORF_CALLER_DISPATCH { main: - ch_versions = channel.empty() - ch_multiqc_files = channel.empty() + ch_versions = Channel.empty() + ch_multiqc_files = Channel.empty() // Annotation channels. Canonical for ORF calling / P-site / DTE; the full // ch_gtf is reserved for genome-guided alignment elsewhere in the pipeline. @@ -64,7 +65,7 @@ workflow ORF_CALLER_DISPATCH { // // Ribo-TISH // - ch_ribotish_predictions = channel.empty() + ch_ribotish_predictions = Channel.empty() if (!params.skip_ribotish) { RIBOTISH_QUALITY_RIBOSEQ( ch_bams_for_analysis, @@ -115,7 +116,7 @@ workflow ORF_CALLER_DISPATCH { // // Ribotricer // - ch_ribotricer_predictions = channel.empty() + ch_ribotricer_predictions = Channel.empty() if (params.run_ribotricer) { log.warn "Ribotricer is enabled via --run_ribotricer. Its per-ORF scores are unstable across biological replicates, so its binary calls contribute to cross-caller agreement but its scores are excluded from the rank aggregation." @@ -136,10 +137,36 @@ workflow ORF_CALLER_DISPATCH { ch_ribotricer_predictions = RIBOTRICER_DETECTORFS.out.orfs } + // + // Rp-Bp + // + ch_rpbp_predictions = Channel.empty() + if (params.run_rpbp) { + log.warn "Rp-Bp is enabled via --run_rpbp. Expect roughly 20-24h per replicate at genome-wide scale because the Bayesian MCMC fit dominates; plan compute accordingly. 
Its score column (Bayes factor) is stable and is retained in the cross-caller rank aggregation." + + // Rp-Bp enumerates candidate ORFs per transcript isoform and resolves + // redundancy/overlaps itself (longest-per-stop, best Bayes factor per + // overlap; Malone et al. 2017), so it takes the full multi-isoform + // annotation rather than the one-transcript-per-gene backbone, which + // exists to disambiguate P-site quantification. Restricting it to + // canonical would drop isoform-specific ORFs and bias ORF-type + // classification to canonical CDS. In extended mode it takes the hybrid + // GTF to bring novel transcripts into scope. + def ch_rpbp_annotation = extended_orf_active ? + ch_fasta_gtf_extended : + ch_fasta.combine(ch_gtf).map { fasta, gtf -> [ [id: 'reference'], fasta, gtf ] }.first() + + FASTA_GTF_BAM_RPBP( + ch_bams_for_analysis, + ch_rpbp_annotation + ) + ch_rpbp_predictions = FASTA_GTF_BAM_RPBP.out.predicted + } + // // PRICE // - ch_price_predictions = channel.empty() + ch_price_predictions = Channel.empty() if (params.run_price) { log.warn "PRICE is enabled via --run_price. PRICE (Erhard et al. 2018) estimates a shared cohort-level codon-position model via EM and is opt-in because its genome-wide runtime is substantial. Plan compute accordingly." @@ -174,7 +201,7 @@ workflow ORF_CALLER_DISPATCH { // // RiboCode // - ch_ribocode_predictions = channel.empty() + ch_ribocode_predictions = Channel.empty() if (!params.skip_ribocode) { // RiboCode requires transcriptome-coordinate BAMs. 
When extended-ORF // analysis is active, swap in the hybrid transcriptome BAM @@ -239,6 +266,7 @@ workflow ORF_CALLER_DISPATCH { ribotish_predictions = ch_ribotish_predictions // [ meta, predictions.txt ] or empty ribocode_predictions = ch_ribocode_predictions // [ meta, orf.txt ] or empty ribotricer_predictions = ch_ribotricer_predictions // [ meta, orfs ] or empty + rpbp_predictions = ch_rpbp_predictions // [ meta, predicted ] or empty price_predictions = ch_price_predictions // [ meta, orfs.tsv ] or empty multiqc_files = ch_multiqc_files versions = ch_versions diff --git a/subworkflows/local/orf_caller_dispatch/meta.yml b/subworkflows/local/orf_caller_dispatch/meta.yml index 1982e4a2..ec06d4fc 100644 --- a/subworkflows/local/orf_caller_dispatch/meta.yml +++ b/subworkflows/local/orf_caller_dispatch/meta.yml @@ -2,17 +2,19 @@ # yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json name: "orf_caller_dispatch" description: | - Conditional dispatch of ORF callers (Ribo-TISH, Ribotricer, RiboCode) based - on params.skip_* / params.run_* flags. Routes the genome-BAM callers - (Ribo-TISH predict, Ribotricer) to the hybrid GTF when extended-ORF analysis - is active, otherwise to the canonical backbone. Emits one prediction channel - per caller (empty when that caller is disabled). + Conditional dispatch of ORF callers (Ribo-TISH, Ribotricer, Rp-Bp, PRICE, + RiboCode) based on params.skip_* / params.run_* flags. Routes annotation + inputs to the hybrid GTF / hybrid transcriptome BAM when extended-ORF + analysis is active, otherwise to the canonical backbone. Emits one + prediction channel per caller (empty when that caller is disabled). 
keywords: - riboseq - orf calling - ribotish - ribocode - ribotricer + - rpbp + - price components: - ribotish/quality - ribotish/predict @@ -22,6 +24,9 @@ components: - ribocode/prepare - ribocode/metaplots - ribocode/ribocode + - fasta_gtf_bam_rpbp + - gedi/indexgenome + - gedi/price input: - ch_bams_for_analysis: description: Per-sample Ribo-seq genome BAM + BAI @@ -29,6 +34,9 @@ input: - ch_transcriptome_bam: description: Canonical transcriptome BAMs (all sample types - subworkflow filters to riboseq for RiboCode) pattern: "*.bam" + - ch_hybrid_transcriptome_bam: + description: Hybrid transcriptome BAMs (Ribo-seq only) - used by RiboCode when extended-ORF analysis is active + pattern: "*.bam" - ch_fasta: description: Genome FASTA pattern: "*.fasta" @@ -39,10 +47,10 @@ input: description: Canonical + filtered novel hybrid GTF (equals canonical when no novel source configured) pattern: "*.gtf" - ch_gtf: - description: Full multi-isoform reference GTF (used by RiboCode) + description: Full multi-isoform reference GTF (used by RiboCode and Rp-Bp when not extended) pattern: "*.gtf" - extended_orf_active: - description: Whether to route the genome-BAM callers to the hybrid annotation + description: Whether to route callers to the hybrid annotation output: - ribotish_predictions: description: Per-sample Ribo-TISH predictions (empty when --skip_ribotish) @@ -53,6 +61,12 @@ output: - ribotricer_predictions: description: Per-sample Ribotricer ORF calls (empty unless --run_ribotricer) pattern: "*.tsv" + - rpbp_predictions: + description: Per-sample Rp-Bp ORF predictions (empty unless --run_rpbp) + pattern: "*.bed.gz" + - price_predictions: + description: Cohort-level PRICE ORF TSV (empty unless --run_price) + pattern: "*.tsv" - multiqc_files: description: MultiQC-facing files (e.g.
Ribo-TISH quality distributions) pattern: "*" diff --git a/subworkflows/nf-core/fasta_gtf_bam_rpbp/main.nf b/subworkflows/nf-core/fasta_gtf_bam_rpbp/main.nf new file mode 100644 index 00000000..6935178a --- /dev/null +++ b/subworkflows/nf-core/fasta_gtf_bam_rpbp/main.nf @@ -0,0 +1,74 @@ +include { RPBP_PREPAREGENOME } from '../../../modules/nf-core/rpbp/preparegenome/main' +include { RPBP_EXTRACTMETAGENEPROFILES } from '../../../modules/nf-core/rpbp/extractmetageneprofiles/main' +include { RPBP_ESTIMATEMETAGENEBAYESFACTORS } from '../../../modules/nf-core/rpbp/estimatemetagenebayesfactors/main' +include { RPBP_SELECTPERIODICOFFSETS } from '../../../modules/nf-core/rpbp/selectperiodicoffsets/main' +include { RPBP_GETPERIODICLENGTHSOFFSETS } from '../../../modules/nf-core/rpbp/getperiodiclengthsoffsets/main' +include { RPBP_EXTRACTORFPROFILES } from '../../../modules/nf-core/rpbp/extractorfprofiles/main' +include { RPBP_ESTIMATEORFBAYESFACTORS } from '../../../modules/nf-core/rpbp/estimateorfbayesfactors/main' +include { RPBP_SELECTFINALPREDICTIONSET } from '../../../modules/nf-core/rpbp/selectfinalpredictionset/main' + +workflow FASTA_GTF_BAM_RPBP { + + take: + ch_bam // channel: [ val(meta), path(bam), path(bai) ] - Ribo-seq BAMs + ch_fasta_gtf // channel (single value): [ val(meta), path(fasta), path(gtf) ] + + main: + + RPBP_PREPAREGENOME(ch_fasta_gtf) + + ch_transcript_bed = RPBP_PREPAREGENOME.out.transcript_bed.first() + ch_orfs_genomic_bed = RPBP_PREPAREGENOME.out.orfs_genomic_bed.first() + ch_orfs_exons_bed = RPBP_PREPAREGENOME.out.orfs_exons_bed.first() + ch_genome_fasta = ch_fasta_gtf.map { meta, fasta, _gtf -> [ meta, fasta ] }.first() + + RPBP_EXTRACTMETAGENEPROFILES ( + ch_bam, + ch_transcript_bed + ) + + RPBP_ESTIMATEMETAGENEBAYESFACTORS ( + RPBP_EXTRACTMETAGENEPROFILES.out.metagene + ) + + RPBP_SELECTPERIODICOFFSETS ( + RPBP_ESTIMATEMETAGENEBAYESFACTORS.out.bayes_factors + ) + + RPBP_GETPERIODICLENGTHSOFFSETS ( + RPBP_SELECTPERIODICOFFSETS.out.periodic 
+ ) + + ch_extract_in = ch_bam + .join(RPBP_GETPERIODICLENGTHSOFFSETS.out.lengths_offsets, by: 0) + + RPBP_EXTRACTORFPROFILES ( + ch_extract_in, + ch_orfs_genomic_bed, + ch_orfs_exons_bed + ) + + RPBP_ESTIMATEORFBAYESFACTORS ( + RPBP_EXTRACTORFPROFILES.out.profiles, + ch_orfs_genomic_bed + ) + + RPBP_SELECTFINALPREDICTIONSET ( + RPBP_ESTIMATEORFBAYESFACTORS.out.bayes_factors, + ch_genome_fasta + ) + + emit: + transcript_bed = RPBP_PREPAREGENOME.out.transcript_bed // channel: [ val(meta), path(*.annotated.bed.gz) ] + orfs_genomic_bed = RPBP_PREPAREGENOME.out.orfs_genomic_bed // channel: [ val(meta), path(*.orfs-genomic.annotated.bed.gz) ] + orfs_exons_bed = RPBP_PREPAREGENOME.out.orfs_exons_bed // channel: [ val(meta), path(*.orfs-exons.annotated.bed.gz) ] + metagene = RPBP_EXTRACTMETAGENEPROFILES.out.metagene // channel: [ val(meta), path(*.metagene.csv.gz) ] + metagene_bf = RPBP_ESTIMATEMETAGENEBAYESFACTORS.out.bayes_factors // channel: [ val(meta), path(*.metagene-bayes.csv.gz) ] + periodic = RPBP_SELECTPERIODICOFFSETS.out.periodic // channel: [ val(meta), path(*.offsets.csv.gz) ] + lengths_offsets = RPBP_GETPERIODICLENGTHSOFFSETS.out.lengths_offsets // channel: [ val(meta), path(*.lengths-offsets.tsv) ] + orf_profiles = RPBP_EXTRACTORFPROFILES.out.profiles // channel: [ val(meta), path(*.profiles.mtx.gz) ] + orf_bayes = RPBP_ESTIMATEORFBAYESFACTORS.out.bayes_factors // channel: [ val(meta), path(*.bayes-factors.bed.gz) ] + predicted = RPBP_SELECTFINALPREDICTIONSET.out.predicted // channel: [ val(meta), path(*.predicted-orfs.bed.gz) ] + dna_fasta = RPBP_SELECTFINALPREDICTIONSET.out.dna_fasta // channel: [ val(meta), path(*.predicted-orfs.dna.fa) ] + protein_fasta = RPBP_SELECTFINALPREDICTIONSET.out.protein_fasta // channel: [ val(meta), path(*.predicted-orfs.protein.fa) ] +} diff --git a/subworkflows/nf-core/fasta_gtf_bam_rpbp/meta.yml b/subworkflows/nf-core/fasta_gtf_bam_rpbp/meta.yml new file mode 100644 index 00000000..655b47d7 --- /dev/null +++ 
b/subworkflows/nf-core/fasta_gtf_bam_rpbp/meta.yml @@ -0,0 +1,108 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "fasta_gtf_bam_rpbp" +description: | + End-to-end translated-ORF discovery from ribosome profiling (Ribo-seq) data + with Rp-Bp (Malone et al. 2017). Ribo-seq sequences the short mRNA + footprints protected by translating ribosomes; Rp-Bp uses a Bayesian model + to score which open reading frames (ORFs) on the transcriptome show the + characteristic 3-nucleotide periodicity of active translation. + + Given a genome FASTA, annotation GTF and one or more pre-aligned Ribo-seq + BAMs, the subworkflow: + 1. Builds the per-ORF reference (rpbp/preparegenome) once per run - + enumerates candidate ORFs from the GTF. + 2. For each sample, builds per-read-length metagene profiles around + annotated start codons (extract-metagene-profiles), scores their + three-nucleotide periodicity with Bayesian models + (estimate-metagene-profile-bayes-factors), picks the best P-site + offset per periodic read length (select-periodic-offsets), and + filters down to the high-confidence read-length/offset pairs that + will drive ORF-level scoring (get-periodic-lengths-offsets). + 3. Builds per-ORF P-site count vectors with those selected offsets + (extract-orf-profiles), scores each ORF as "translated" vs + "untranslated" with a Bayesian model (estimate-orf-bayes-factors), + and emits the final filtered set of predicted translated ORFs as + BED + DNA FASTA + protein FASTA (select-final-prediction-set). + + Each Rp-Bp step is a separate Nextflow process so `-resume` only re-runs + what changed, and STAR/bowtie alignment is supplied via the BAM input + (this subworkflow does not realign). 
+keywords: + - rpbp + - orf + - bayesian + - translation + - riboseq +components: + - rpbp/preparegenome + - rpbp/extractmetageneprofiles + - rpbp/estimatemetagenebayesfactors + - rpbp/selectperiodicoffsets + - rpbp/getperiodiclengthsoffsets + - rpbp/extractorfprofiles + - rpbp/estimateorfbayesfactors + - rpbp/selectfinalpredictionset +input: + - ch_bam: + description: | + Per-sample Ribo-seq BAM with index. Filter to Ribo-seq libraries + upstream. + Structure: [ val(meta), path(bam), path(bai) ] + - ch_fasta_gtf: + description: | + Single-value channel of genome FASTA + annotation GTF. The GTF may + be canonical or a hybrid (e.g. extended-ORF mode). + Structure: [ val(meta), path(fasta), path(gtf) ] +output: + - transcript_bed: + description: | + Annotated transcripts BED produced by `rpbp/preparegenome`. + Structure: [ val(meta), path(*.annotated.bed.gz) ] + - orfs_genomic_bed: + description: | + Per-ORF genomic BED produced by `rpbp/preparegenome`. + Structure: [ val(meta), path(*.orfs-genomic.annotated.bed.gz) ] + - orfs_exons_bed: + description: | + Per-ORF exons BED produced by `rpbp/preparegenome`. + Structure: [ val(meta), path(*.orfs-exons.annotated.bed.gz) ] + - metagene: + description: | + Per-sample metagene profiles. + Structure: [ val(meta), path(*.metagene.csv.gz) ] + - metagene_bf: + description: | + Per-sample metagene periodicity Bayes factors. + Structure: [ val(meta), path(*.metagene-bayes.csv.gz) ] + - periodic: + description: | + Per-sample selected periodic offsets. + Structure: [ val(meta), path(*.offsets.csv.gz) ] + - orf_profiles: + description: | + Per-sample per-ORF P-site profile matrix. + Structure: [ val(meta), path(*.profiles.mtx.gz) ] + - lengths_offsets: + description: | + Per-sample selected lengths/offsets actually used by extract-orf-profiles. + Structure: [ val(meta), path(*.lengths-offsets.tsv) ] + - orf_bayes: + description: | + Per-sample per-ORF translation Bayes factors. 
+ Structure: [ val(meta), path(*.bayes-factors.bed.gz) ] + - predicted: + description: | + Per-sample final filtered predicted-ORF BED. + Structure: [ val(meta), path(*.predicted-orfs.bed.gz) ] + - dna_fasta: + description: | + Per-sample predicted-ORF DNA FASTA. + Structure: [ val(meta), path(*.predicted-orfs.dna.fa) ] + - protein_fasta: + description: | + Per-sample predicted-ORF protein FASTA. + Structure: [ val(meta), path(*.predicted-orfs.protein.fa) ] +authors: + - "@pinin4fjords" +maintainers: + - "@pinin4fjords" diff --git a/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test b/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test new file mode 100644 index 00000000..5d7a9b3f --- /dev/null +++ b/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test @@ -0,0 +1,100 @@ +nextflow_workflow { + + name "Test Subworkflow FASTA_GTF_BAM_RPBP" + script "../main.nf" + workflow "FASTA_GTF_BAM_RPBP" + config "./nextflow.config" + + tag "subworkflows" + tag "subworkflows_nfcore" + tag "subworkflows/fasta_gtf_bam_rpbp" + tag "fasta_gtf_bam_rpbp" + tag "rpbp" + tag "rpbp/preparegenome" + tag "rpbp/extractmetageneprofiles" + tag "rpbp/estimatemetagenebayesfactors" + tag "rpbp/selectperiodicoffsets" + tag "rpbp/getperiodiclengthsoffsets" + tag "rpbp/extractorfprofiles" + tag "rpbp/estimateorfbayesfactors" + tag "rpbp/selectfinalpredictionset" + tag "gunzip" + + setup { + run("GUNZIP") { + script "modules/nf-core/gunzip/main.nf" + process { + """ + input[0] = [ + [ id:'reference' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz", checkIfExists: true) + ] + """ + } + } + } + + test("homo_sapiens chr20 - end-to-end rpbp") { + + when { + workflow { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam", checkIfExists: true), + 
file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam.bai", checkIfExists: true) + ]) + input[1] = GUNZIP.out.gunzip.map { meta, fa -> [ + meta, + fa, + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.111_chr20.gtf", checkIfExists: true) + ] } + """ + } + } + + then { + assertAll( + { assert workflow.success }, + { assert snapshot( + file(workflow.out.predicted[0][1]).name, + file(workflow.out.dna_fasta[0][1]).name, + file(workflow.out.protein_fasta[0][1]).name, + file(workflow.out.orf_bayes[0][1]).name, + file(workflow.out.transcript_bed[0][1]).name, + file(workflow.out.orfs_genomic_bed[0][1]).name, + file(workflow.out.orfs_exons_bed[0][1]).name + ).match() } + ) + } + } + + test("homo_sapiens chr20 - end-to-end rpbp - stub") { + + options "-stub" + + when { + workflow { + """ + input[0] = Channel.of([ + [ id:'test', single_end:true, strandedness:'forward' ], + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam", checkIfExists: true), + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/aligned_reads/SRX11780888_chr20.bam.bai", checkIfExists: true) + ]) + input[1] = GUNZIP.out.gunzip.map { meta, fa -> [ + meta, + fa, + file(params.modules_testdata_base_path + "genomics/homo_sapiens/riboseq_expression/Homo_sapiens.GRCh38.111_chr20.gtf", checkIfExists: true) + ] } + """ + } + } + + then { + assertAll( + { assert workflow.success }, + { assert snapshot(workflow.out).match() } + ) + } + } +} diff --git a/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test.snap b/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test.snap new file mode 100644 index 00000000..17a9cae1 --- /dev/null +++ b/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/main.nf.test.snap @@ -0,0 +1,257 @@ +{ + "homo_sapiens chr20 - end-to-end rpbp - stub": { + "content": [ + { + 
"0": [ + [ + { + "id": "reference" + }, + "reference.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "1": [ + [ + { + "id": "reference" + }, + "reference.orfs-genomic.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "10": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.dna.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "11": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.protein.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "2": [ + [ + { + "id": "reference" + }, + "reference.orfs-exons.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "3": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "4": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene-bayes.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "5": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.offsets.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "6": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.lengths-offsets.tsv:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "7": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.profiles.mtx.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "8": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.bayes-factors.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "9": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "dna_fasta": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + 
"test.predicted-orfs.dna.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "lengths_offsets": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.lengths-offsets.tsv:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "metagene": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "metagene_bf": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.metagene-bayes.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "orf_bayes": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.bayes-factors.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "orf_profiles": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.profiles.mtx.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "orfs_exons_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-exons.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "orfs_genomic_bed": [ + [ + { + "id": "reference" + }, + "reference.orfs-genomic.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "periodic": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.offsets.csv.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "predicted": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ], + "protein_fasta": [ + [ + { + "id": "test", + "single_end": true, + "strandedness": "forward" + }, + "test.predicted-orfs.protein.fa:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "transcript_bed": [ + [ + { + "id": "reference" + }, + "reference.annotated.bed.gz:md5,68b329da9893e34099c7d8ad5cb9c940" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.10.4" + }, + "timestamp": "2026-06-12T16:30:54.993206" + }, + 
"homo_sapiens chr20 - end-to-end rpbp": { + "content": [ + "test.predicted-orfs.bed.gz", + "test.predicted-orfs.dna.fa", + "test.predicted-orfs.protein.fa", + "test.bayes-factors.bed.gz", + "reference.annotated.bed.gz", + "reference.orfs-genomic.annotated.bed.gz", + "reference.orfs-exons.annotated.bed.gz" + ], + "meta": { + "nf-test": "0.9.5", + "nextflow": "26.04.1" + }, + "timestamp": "2026-05-20T10:29:22.715089396" + } +} \ No newline at end of file diff --git a/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/nextflow.config b/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/nextflow.config new file mode 100644 index 00000000..9e812865 --- /dev/null +++ b/subworkflows/nf-core/fasta_gtf_bam_rpbp/tests/nextflow.config @@ -0,0 +1,8 @@ +process { + withName: 'RPBP_GETPERIODICLENGTHSOFFSETS' { + ext.args = '--min-count 10 --min-bf-mean 1 --min-bf-likelihood 0.0' + } + withName: 'RPBP_SELECTFINALPREDICTIONSET' { + ext.args = '--select-longest-by-stop --select-best-overlapping' + } +} diff --git a/workflows/riboseq/main.nf b/workflows/riboseq/main.nf index b100933f..1441bb80 100644 --- a/workflows/riboseq/main.nf +++ b/workflows/riboseq/main.nf @@ -421,16 +421,18 @@ workflow RIBOSEQ { // Dynamic ORF-caller set for cross-caller agreement. // The enabled list reflects which callers ran at runtime; the agreement // threshold and rank-aggregation set are derived from it so the logic - // works whether 2 (default) or 3 callers are active. + // works whether 2 (default) or 3+ callers are active. // def enabled_orf_callers = [] if (!params.skip_ribotish) { enabled_orf_callers << 'ribotish' } if (!params.skip_ribocode) { enabled_orf_callers << 'ribocode' } if ( params.run_ribotricer) { enabled_orf_callers << 'ribotricer' } + if ( params.run_rpbp) { enabled_orf_callers << 'rpbp' } if ( params.run_price) { enabled_orf_callers << 'price' } // Ribotricer contributes binary calls only; its scores are excluded from - // the cross-caller rank aggregation due to known rank instability. 
+ // the cross-caller rank aggregation due to known rank instability. Rp-Bp's + // Bayes factor is stable and is retained for ranking. def rank_aggregation_callers = enabled_orf_callers - 'ribotricer' // Strict-majority of enabled callers (floor(N/2)+1): N=2 -> 2 (both must // agree), N=3 -> 2 (majority). Adapts as the caller set grows.