Merged
24 commits
79c735b
add READSUBMIT workflow
ochkalova Apr 29, 2026
0648352
delete incorrect config for reads test
ochkalova May 12, 2026
8cfcfd7
multiple fixes
ochkalova May 12, 2026
e232200
Merge branch 'dev' into feat/readsubmit
ochkalova May 12, 2026
651ca32
update multiqc in READSUBMIT
ochkalova May 12, 2026
00307d0
remove --fasta-dir because it doesn't work for fastqs staged to diffe…
ochkalova May 13, 2026
5678294
disable publishDir for CREATE_READS_MANIFEST
ochkalova May 14, 2026
2555546
Perform create_reads_manifest with python script rather than bash com…
timrozday-mgnify May 18, 2026
e84e7d9
update modules and workflows to use new webin_cli_hander.py arguments…
timrozday-mgnify May 18, 2026
43587ea
Tidy up parsing of samplesheet to avoid positional indexing
timrozday-mgnify May 18, 2026
88f2cfb
Add workflow checks that a study is provided otherwise pipeline could…
timrozday-mgnify May 18, 2026
84f625c
Document and enforce a one-study-only requirement
timrozday-mgnify May 18, 2026
0bb3540
Added extra validation of ENA fields using ENA docs
timrozday-mgnify May 18, 2026
5aa4bf1
Change bin/create_reads_manifest.py to keep the full fastq read path.…
timrozday-mgnify May 19, 2026
a9ba936
Tidy up flags and minor fixes
timrozday-mgnify May 19, 2026
852d89e
Remove test_upload argument from CREATE_READS_MANIFEST since it doesn…
timrozday-mgnify May 19, 2026
5bde07a
Move bin/create_reads_manifest.py to mgnify-pipelines-toolkit
timrozday-mgnify May 19, 2026
85541b5
Update mgnify-pipelines-toolkit version to 1.5.1
KateSakharova Jun 4, 2026
b26b07d
add single quotes for metadata values
ochkalova Jun 4, 2026
486ae28
update pipeline metromap
ochkalova Jun 4, 2026
dd7feec
update snapshots
ochkalova Jun 4, 2026
acf1ee0
update containers
ochkalova Jun 4, 2026
40db261
update samplesheet path
ochkalova Jun 4, 2026
1d94d7d
update .nftignore
ochkalova Jun 4, 2026
39 changes: 36 additions & 3 deletions README.md
@@ -22,11 +22,12 @@
## Introduction

**nf-core/seqsubmit** is a Nextflow pipeline for submitting sequence data to [ENA](https://www.ebi.ac.uk/ena/browser/home).
Currently, the pipeline supports three submission modes, each routed to a dedicated workflow and requiring its own input samplesheet structure:
Currently, the pipeline supports four submission modes, each routed to a dedicated workflow and requiring its own input samplesheet structure:

- `mags` for Metagenome Assembled Genomes (MAGs) submission with `GENOMESUBMIT` workflow
- `bins` for bins submission with `GENOMESUBMIT` workflow
- `metagenomic_assemblies` for assembly submission with `ASSEMBLYSUBMIT` workflow
- `reads` for raw sequencing reads submission with `READSUBMIT` workflow

![seqsubmit workflow diagram](assets/seqsubmit_schema.png)

@@ -123,6 +124,38 @@ assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
> [!IMPORTANT]
> **Samplesheet column requirements**: All columns shown in the example above must be present in your samplesheet, even if some values are empty. Columns must be in exactly the same order as shown.

### `reads` mode (`READSUBMIT`)

The input must follow `assets/schema_input_reads.json`.

Required columns:

- `sample`
- `sample_accession`
- `fastq_1`
- `fastq_2` (the column is required; leave the value empty for single-end reads)
- `platform`
- `instrument`
- `library_source`
- `library_selection`
- `library_strategy`

Optional columns:

- `insert_size`
- `library_name`
- `description`

Example `samplesheet_reads.csv`:

```csv
sample,sample_accession,fastq_1,fastq_2,platform,instrument,library_source,library_selection,library_strategy,insert_size,library_name,description
illumina_run_001,SAMEA1234567,data/reads_R1.fastq.gz,data/reads_R2.fastq.gz,ILLUMINA,Illumina HiSeq 2000,GENOMIC,RANDOM,WGS,500,HiSeq_library_001,Illumina sequencing of sample XYZ
```

> [!IMPORTANT]
> **Samplesheet column requirements**: All columns shown in the example above must be present in your samplesheet, even if some values are empty. Columns must be in exactly the same order as shown.
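
The column requirement above can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline: `check_header` is a hypothetical helper, and the expected column order is copied from the example `samplesheet_reads.csv` shown above.

```python
import csv
import io

# Column order copied from the example samplesheet_reads.csv in the README.
EXPECTED_COLUMNS = [
    "sample", "sample_accession", "fastq_1", "fastq_2", "platform",
    "instrument", "library_source", "library_selection", "library_strategy",
    "insert_size", "library_name", "description",
]


def check_header(csv_text: str) -> list[str]:
    """Return a list of problems found in the header row (empty if none).

    Hypothetical helper for illustration; the pipeline performs its own
    validation against assets/schema_input_reads.json.
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    if header != EXPECTED_COLUMNS:
        return ["header differs from the expected order (reordered or extra columns)"]
    return []
```

Calling `check_header(open("samplesheet_reads.csv").read())` returns an empty list when the header matches the example exactly.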

## Usage

> [!NOTE]
@@ -142,7 +175,7 @@ The `mags`/`bins` workflow requires databases for completeness/contamination est

| Parameter | Description |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------- |
| `--mode` | Type of the data to be submitted. Options: `[mags, bins, metagenomic_assemblies]` |
| `--mode` | Type of the data to be submitted. Options: `[mags, bins, metagenomic_assemblies, reads]` |
| `--input` | Path to the samplesheet describing the data to be submitted |
| `--outdir` | Path to the output directory for pipeline results |
| `--submission_study` OR `--study_metadata` | ENA study accession (PRJ/ERP) to submit the data to OR metadata file in JSON/TSV/CSV format to register new study |
@@ -161,7 +194,7 @@ General command template:
```bash
nextflow run nf-core/seqsubmit \
-profile <docker/singularity/...> \
--mode <mags|bins|metagenomic_assemblies> \
--mode <mags|bins|metagenomic_assemblies|reads> \
--input <samplesheet.csv> \
--centre_name <your_centre> \
--submission_study <your_study> \
223 changes: 223 additions & 0 deletions assets/schema_input_reads.json
@@ -0,0 +1,223 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/nf-core/seqsubmit/main/assets/schema_input_reads.json",
"title": "nf-core/seqsubmit pipeline - params.input schema",
"description": "Schema for the sample sheet provided with params.input if params.mode is set to 'reads'",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample must be provided and cannot contain spaces",
"meta": ["id"],
"description": "Unique experiment/run name"
},
"sample_accession": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample accession must be provided and cannot contain spaces",
"description": "ENA sample accession of the sample used to generate the reads"
},
"fastq_1": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.(fq|fastq)(\\.gz)?$",
"errorMessage": "FASTQ file must have extension '.fq' or '.fastq' (optionally gzipped)",
"description": "Forward reads FASTQ file (single-end or paired-end)"
},
"fastq_2": {
"anyOf": [
{
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.(fq|fastq)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
],
"errorMessage": "FASTQ file for reverse reads must have extension '.fq' or '.fastq' (optionally gzipped)",
"description": "Reverse reads FASTQ file if paired-end. Leave empty for single-end reads"
},
"platform": {
"type": "string",
"enum": [
"BGISEQ",
"CAPILLARY",
"DNBSEQ",
"ELEMENT",
"GENAPSYS",
"GENEMIND",
"HELICOS",
"ILLUMINA",
"ION_TORRENT",
"LS454",
"OXFORD_NANOPORE",
"PACBIO_SMRT",
"TAPESTRI",
"ULTIMA",
"VELA_DIAGNOSTICS"
],
"description": "Sequencing platform. Must be one of the ENA controlled vocabulary values listed in the enum."
},
"instrument": {
"type": "string",
"pattern": "^[^\\n]+$",
"errorMessage": "Instrument must be provided and cannot span multiple lines",
"description": "Sequencer model (e.g., 'Illumina HiSeq 2000', 'PacBio Sequel')"
},
"library_source": {
"type": "string",
"enum": [
"GENOMIC",
"GENOMIC SINGLE CELL",
"TRANSCRIPTOMIC",
"TRANSCRIPTOMIC SINGLE CELL",
"METAGENOMIC",
"METATRANSCRIPTOMIC",
"SYNTHETIC",
"VIRAL RNA",
"OTHER"
],
"description": "Library source. Must be one of the ENA controlled vocabulary values listed in the enum."
},
"library_selection": {
"type": "string",
"enum": [
"RANDOM",
"PCR",
"RANDOM PCR",
"RT-PCR",
"HMPR",
"MF",
"repeat fractionation",
"size fractionation",
"MSLL",
"cDNA",
"cDNA_randomPriming",
"cDNA_oligo_dT",
"PolyA",
"Oligo-dT",
"Inverse rRNA",
"Inverse rRNA selection",
"ChIP",
"ChIP-Seq",
"MNase",
"DNase",
"Hybrid Selection",
"Reduced Representation",
"Restriction Digest",
"5-methylcytidine antibody",
"MBD2 protein methyl-CpG binding domain",
"CAGE",
"RACE",
"MDA",
"padlock probes capture method",
"other",
"unspecified"
],
"description": "Library selection. Must be one of the ENA controlled vocabulary values listed in the enum."
},
"library_strategy": {
"type": "string",
"enum": [
"WGS",
"WGA",
"WXS",
"RNA-Seq",
"snRNA-seq",
"ssRNA-seq",
"miRNA-Seq",
"ncRNA-Seq",
"FL-cDNA",
"EST",
"Hi-C",
"ATAC-seq",
"WCS",
"RAD-Seq",
"CLONE",
"POOLCLONE",
"AMPLICON",
"CLONEEND",
"FINISHING",
"ChIP-Seq",
"MNase-Seq",
"Ribo-Seq",
"DNase-Hypersensitivity",
"Bisulfite-Seq",
"CTS",
"ChM-Seq",
"GBS",
"MRE-Seq",
"MeDIP-Seq",
"MBD-Seq",
"NOMe-Seq",
"Tn-Seq",
"VALIDATION",
"FAIRE-seq",
"SELEX",
"RIP-Seq",
"ChIA-PET",
"Synthetic-Long-Read",
"Targeted-Capture",
"Tethered Chromatin Conformation Capture",
"OTHER"
],
"description": "Library strategy. Must be one of the ENA controlled vocabulary values listed in the enum."
},
"insert_size": {
"anyOf": [
{
"type": "number",
"minimum": 0
},
{
"type": "string",
"maxLength": 0
}
],
"errorMessage": "Insert size must be a positive number or empty",
"description": "Fragment/insert size for paired-end reads (optional)"
},
"library_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "string",
"maxLength": 0
}
],
"description": "Descriptive library name (optional)"
},
"description": {
"anyOf": [
{
"type": "string"
},
{
"type": "string",
"maxLength": 0
}
],
"description": "Free-text description of the experiment (optional)"
}
},
"required": [
"sample",
"sample_accession",
"fastq_1",
"platform",
"instrument",
"library_source",
"library_selection",
"library_strategy"
]
}
}
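
Reviewer sketch (not part of the PR): two of the schema rules above can be reproduced with the Python standard library to spot-check values before running the pipeline. The FASTQ pattern is copied from the `fastq_1`/`fastq_2` entries; `fastq_ok` and `insert_size_ok` are hypothetical helper names, and the `exists: true` file check is deliberately not replicated here.

```python
import re

# Pattern copied from the fastq_1 / fastq_2 entries of the schema.
FASTQ_PATTERN = re.compile(r"^\S+\.(fq|fastq)(\.gz)?$")


def fastq_ok(value: str, allow_empty: bool = False) -> bool:
    """Check a FASTQ path; set allow_empty=True for fastq_2 (single-end reads)."""
    if value == "":
        return allow_empty
    return FASTQ_PATTERN.fullmatch(value) is not None


def insert_size_ok(value: str) -> bool:
    """Accept an empty value or a number >= 0, mirroring `minimum: 0`."""
    if value == "":
        return True
    try:
        return float(value) >= 0
    except ValueError:
        return False
```

Note that `minimum: 0` accepts an insert size of zero, so the rule is "non-negative" rather than strictly positive.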
Binary file modified assets/seqsubmit_schema.png
2 changes: 1 addition & 1 deletion conf/modules.config
@@ -176,7 +176,7 @@ process {
]
}

withName: 'REGISTERSTUDY|GENERATE_ASSEMBLY_MANIFEST' {
withName: 'REGISTERSTUDY|GENERATE_ASSEMBLY_MANIFEST|CREATE_READS_MANIFEST' {
publishDir = [
enabled: false
]
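
Reviewer sketch (not part of the PR): the `withName` selector above is a pattern with `|` alternation, so adding `CREATE_READS_MANIFEST` extends the set of processes whose `publishDir` is disabled. The snippet below illustrates the alternation with Python's `re` module as an analogy only; Nextflow's own matching may differ in detail, for example for fully qualified process names, and `selector_matches` is a hypothetical helper.

```python
import re

# Selector string copied from the updated conf/modules.config line.
SELECTOR = "REGISTERSTUDY|GENERATE_ASSEMBLY_MANIFEST|CREATE_READS_MANIFEST"


def selector_matches(process_name: str, selector: str = SELECTOR) -> bool:
    """Return True if the whole process name matches one of the alternatives."""
    return re.fullmatch(selector, process_name) is not None
```

Under this reading, `selector_matches("CREATE_READS_MANIFEST")` is true while an unrelated name such as `MULTIQC` does not match.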
34 changes: 34 additions & 0 deletions conf/test_reads_paired.config
@@ -0,0 +1,34 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defines input files and everything required to run a fast and simple pipeline test.

Use as follows:
nextflow run nf-core/seqsubmit -profile test_reads_paired,<docker/singularity> --outdir <OUTDIR>

----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 2,
memory: '8.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test --mode reads profile'
config_profile_description = 'Minimal test profile for reads submission'

// Input data
input = params.pipelines_testdata_base_path + 'seqsubmit/samplesheets/samplesheet_reads.csv'
outdir = 'test_output'

mode = "reads"
submission_study = "PRJEB98843"
centre_name = "TEST_CENTER"

test_upload = true
}
16 changes: 15 additions & 1 deletion docs/output.md
@@ -8,7 +8,7 @@ The directories listed below will be created in the results directory (set with

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and performs automated submission of sequence data to ENA. Exact steps and generated outputs depend on the data type and `--mode` executed (`mags`, `bins` or `metagenomic_assemblies`).
The pipeline is built using [Nextflow](https://www.nextflow.io/) and performs automated submission of sequence data to ENA. Exact steps and generated outputs depend on the data type and `--mode` executed (`mags`, `bins`, `metagenomic_assemblies` or `reads`).

## `mags` and `bins` outputs

@@ -59,6 +59,20 @@ Assembly study registration, manifest generation, and Webin-CLI submission are e
> Users should read the ENA documentation on referencing submitted data: \
> metagenomic assemblies: https://ena-docs.readthedocs.io/en/latest/submit/assembly/metagenome/primary.html#assigned-accession-numbers

## `reads` outputs

When `--mode reads` is used, results are written under `reads/`.

<details markdown="1">
<summary>Output files</summary>

- `reads/`
  - `upload/reads_accessions.tsv`: run accessions assigned to submitted reads.

</details>

The workflow also runs manifest generation and Webin-CLI submission, but their intermediate outputs are not currently published to `--outdir`.
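
A minimal sketch for loading the accessions file after a run, assuming it sits at `<outdir>/reads/upload/reads_accessions.tsv` and is tab-separated with a header row. The column names are not documented here, so the hypothetical helper `load_accessions` does not rely on any of them.

```python
import csv
from pathlib import Path


def load_accessions(outdir: str) -> list[dict[str, str]]:
    """Read reads_accessions.tsv into a list of dicts keyed by the header row.

    Assumes a tab-separated file with a header line; adjust if the real
    layout differs.
    """
    path = Path(outdir) / "reads" / "upload" / "reads_accessions.tsv"
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each returned dictionary corresponds to one line of the file, so the result can be inspected or joined back to the samplesheet by whichever column identifies the sample.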

## Common outputs

### MultiQC