5,228 changes: 0 additions & 5,228 deletions .github/data/card_db/aro_index.tsv

This file was deleted.

13 changes: 5 additions & 8 deletions .github/test_config.yaml
@@ -4,17 +4,14 @@ fastq_dir: ".github/data/fastq"

min_similarity: "0.8" # threshold to pre-filter blast hits by percentage identity

silva:
download_path_seq: ".github/data/silva_db/sub_silva_seq_RNA.fasta.gz"
download_path_tax: ".github/data/silva_db/taxmap_slv_ssu_ref_nr_138.2.txt.gz"

card:
download_path: ".github/data/card_db/card_seq.tar.bz2"
download_silva: ".github/data/silva_db/sub_silva_seq_RNA.fasta.gz"
download_card: ".github/data/card_db/card_seq.tar.bz2"

num_parts: 1 # number of chunks the fastqs are split into
max_threads: 4

github_action_test: True
similarity_search_mode: "test" # Put here "test" or "full" for strand/s to be included in the similarity search

seq_tech: "Illumina" # Put here "Illumina" or "ONT"
seq_tech: "Illumina" # Put here "Illumina" or "ONT"

add_uniref_targets: False
2,058 changes: 1,333 additions & 725 deletions .test_steps/test_ERMA.ipynb

Large diffs are not rendered by default.

126 changes: 79 additions & 47 deletions README.md
@@ -3,16 +3,16 @@
[![Snakemake](https://img.shields.io/badge/snakemake-≥9.0-brightgreen.svg)](https://snakemake.bitbucket.io)
[![Snakemake CI](https://github.com/IKIM-Essen/ERMA/actions/workflows/snakemake-ci.yml/badge.svg)](https://github.com/IKIM-Essen/ERMA/actions/workflows/snakemake-ci.yml)

This **Snakemake**-based pipeline processes sequencing reads from epicPCR experiments to link antimicrobial resistance (AMR) or other target genes with genus-specific 16S rRNA gene sequences from prokaryotic cells. It orchestrates all workflow steps, including database downloads, sequence alignment, read filtering, and the generation of visual reports, ensuring reproducibility and streamlined analysis.

## Feature overview
- Auto-downloads and prepares SILVA and CARD databases for use in the analysis
- Auto-downloads user-specified targets (UniRef) and adds them to the CARD database
- Performs Diamond/Usearch sequence alignments against both databases
- Integrates similarity search results to identify linked AMR and microbial markers
- Generates filtered results based on alignment quality and similarity thresholds
- Produces graphical outputs such as genus abundance plots, boxplots, and tables
- Generates an HTML report summarizing the analysis

## Prerequisites

@@ -21,14 +21,13 @@ This **Snakemake**-based pipeline processes sequencing reads from epicPCR experi
This pipeline can be deployed with snakedeploy. We recommend this route only for users who already have experience with snakemake and snakedeploy; otherwise, use the full installation guide below.

For usage:
1. Make sure snakedeploy and snakemake≥v9 are installed
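If either tool is missing, one option is a conda/mamba installation (the channel setup is an assumption; `pip install snakedeploy` also works):
```bash
# Assumed installation route via the conda-forge and bioconda channels
mamba install -c conda-forge -c bioconda "snakemake>=9" snakedeploy
```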
2. Deploy the minimum workflow distribution (adjust the destination as needed)
```bash
snakedeploy deploy-workflow https://github.com/IKIM-Essen/ERMA --dest . --branch snakedeploy
```
3. Prepare the setup
```bash
mkdir -p data/fastq
cp "files-to-analyze" data/fastq
```
4. Start the pipeline (change the config if necessary)
@@ -48,51 +47,80 @@ Alternatively, follow the official [Snakemake installation](https://snakemake.re

Install Dependencies: This pipeline uses conda environments to manage its dependencies. Snakemake will automatically create and manage these environments when run with the `--use-conda` flag (default in profile).

## Usage Instructions
### Preparing Databases

First, clone the pipeline repository to your local machine:
ERMA relies on two primary reference databases: SILVA for taxonomic assignment and CARD for antimicrobial resistance gene detection. Both databases are handled directly through the workflow configuration and can be fetched automatically or provided locally, depending on user requirements.

```bash
git clone https://github.com/IKIM-essen/ERMA.git
cd ERMA
```
#### Automatic Download (Default)

By default, ERMA downloads both databases using the URLs defined in the configuration file:

```yaml
download_silva: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz"
download_card: "https://card.mcmaster.ca/download/0/broadstreet-v4.0.1.tar.bz2"
```
Prepare Data Folder: You need to place your raw sequencing files (fastq.gz format) in the data/fastq/ directory or change this path to the desired directory.

Modify the Config File: Open the config/config.yaml file and change the base_dir parameter to the base directory where the pipeline is located. The config file should look like this:
When these entries remain set to URLs, the workflow retrieves and processes the databases automatically during execution; no additional preparation steps are required. Once downloaded, the files are reused on subsequent runs.

#### Using Local Database Files

If you prefer to handle the downloads manually or already have the required databases available on disk, the workflow can operate entirely from local paths. To do so, replace the URLs in the configuration file with absolute file paths:

```yaml
runname: "ERMA_runname123"
# setting up base directory and location of input and output. Generally, no changes needed here.
base_dir: "."
fastq_dir: "data/fastq" # copy target fastq.gz files in ERMA/data/fastq or change this path
outdir: "results" # Output directory of the final report
download_silva: "/path/to/local/SILVA_138.2.fasta.gz"
download_card: "/path/to/local/broadstreet-v4.0.1.tar.bz2"
```

min_similarity: "0.8" # threshold to filter blast hits by percentage identity
min_abundance: "0.01" # genera with lower abundance will be binned as "Other" in stacked bar abundance plot
Once local paths are set, ERMA skips the remote downloads and uses the supplied files, with no modification of the pipeline structure.
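If you want to stage the files yourself, the default URLs from the configuration can be downloaded directly, for example (the target directory is a placeholder):

```bash
# Pre-download both databases to a local directory (paths are examples)
mkdir -p /path/to/local/dbs
wget -P /path/to/local/dbs "https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz"
wget -P /path/to/local/dbs "https://card.mcmaster.ca/download/0/broadstreet-v4.0.1.tar.bz2"
```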

silva:
download_path_seq: "path/to/silva_db"
download_path_tax: "path/to/silva_taxmap"
#### Adding UniRef Targets of Interest (Optional)

card:
download_path: "https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2"
ERMA provides an optional mechanism to incorporate UniRef clusters representing additional genes or functions relevant to the analysis. This process is fully automated and controlled through the following section of the configuration file:

num_parts: 1 # number of subfiles the fastqs are split into
max_threads: 16
```yaml
add_uniref_targets: False
uniprot_cluster: "100"
uniprot_targets: ["int1", "inti1", "class_1_integron"]
max_entry_count: 1000
low_freq_threshold: 0.01
```

To activate UniRef integration, set:

similarity_search_mode: "full" # Put here "test" or "full" for strand/s to be included in the similarity search
```yaml
add_uniref_targets: True
```

### Preprocessing ###
# if data is already in format 'one fastq.gz per sample', this section can be ignored
Then adjust the uniprot_targets list to include any desired search terms or gene names. ERMA downloads the corresponding UniRef entries, filters them according to the configured uniprot_cluster level (e.g., UniRef100), and integrates them directly into the CARD-derived reference database used for similarity searches.
This option enables users to extend the AMR and functional screening capabilities of ERMA without requiring manual reference curation.
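For orientation, the retrieval step can be pictured as a query against the public UniProt REST API; the sketch below is an illustration under that assumption, not ERMA's actual implementation (endpoint, query syntax, and output file name are assumed):

```bash
# Fetch UniRef100 members matching one search term; identity:1.0 selects the
# UniRef100 clustering level, and size loosely mirrors max_entry_count above.
curl -s "https://rest.uniprot.org/uniref/search?query=inti1+AND+identity:1.0&format=fasta&size=500" \
    -o uniref100_inti1.fasta
```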

seq_tech: "Illumina" # Put here "Illumina" or "ONT" befor using rule prepare_fastqs
#### Manual Addition of New Targets (Alternative Workflow)

# In case of Demultiplexing ONT data, provide information for this section
ONT:
fastq_pass_path: "data/ONT/fastq_pass" # copy your fastq_pass folder here
sample_name_path: "data/ONT/barcode-rename.csv" # change this file with your barcode-sample name combinations
target_fragment_length: 1250 # Length of the theoretical fragment after nested PCR
filter_intervall: 0.1 # +/- interval used to filter too large/small fragments; 0.1 filters in a +/- 10% interval
Users who prefer to add custom targets manually can follow this approach (a sketch follows after the list):
1. Retrieve the protein FASTA sequences for the target gene(s) from UniProt, CARD, or any other external source
2. Append the new sequences to the CARD protein FASTA file used by ERMA (protein_fasta_protein_homolog_model.fasta)
3. Re-compress the previously extracted files into the same archive format as before
4. Modify the workflow configuration to point to the updated local database files

This approach offers full control over the source, format, and curation of custom targets while keeping the surrounding pipeline unchanged.
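A minimal sketch of these steps, assuming the flat layout of the CARD broadstreet archive (all file names except protein_fasta_protein_homolog_model.fasta are placeholders):

```bash
# Unpack the CARD archive, append custom targets, and repackage it
mkdir -p card_custom
tar -xjf broadstreet-v4.0.1.tar.bz2 -C card_custom
cat my_custom_targets.fasta >> card_custom/protein_fasta_protein_homolog_model.fasta
tar -cjf broadstreet-v4.0.1-custom.tar.bz2 -C card_custom .
# Then point download_card in config/config.yaml to the new archive's absolute path
```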

## Pipeline Usage

First, clone the pipeline repository to your local machine:

```bash
git clone https://github.com/IKIM-essen/ERMA.git
cd ERMA
```
Prepare the data folder: Place your raw sequencing files (fastq.gz format) in the data/fastq/ directory, or change this path in the config to the desired directory.
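For example (the source path is a placeholder):

```bash
mkdir -p data/fastq
cp /path/to/sequencing/run/*.fastq.gz data/fastq/
```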

Modify the Config File: Open the config/config.yaml file and change the base_dir parameter to the base directory where the pipeline is located. The most relevant parameters for the standard user are:

```yaml
runname: "runname123"
min_similarity: "0.9" # threshold to filter blast hits by percentage identity
num_parts: 1 # number of subfiles the fastqs are split into (prevents crashing with very large input)
max_threads: 16
seq_tech: "Illumina" # Put here "Illumina" or "ONT" according to used technology (only important for using rule prepare_fastqs)
```

### Illumina input
@@ -124,7 +152,7 @@ When starting with raw ONT output, this routine can be used to demultiplex the s
5. Run from the ERMA root folder:

```bash
snakemake prepare_fastqs --cores N
```

This will execute a Python script that filters and demultiplexes the ONT output into one fastq file per sample.
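As an illustration, the read-length window implied by target_fragment_length and filter_intervall can be computed as follows (a sketch of the arithmetic only, not ERMA's actual script):

```bash
# target_fragment_length=1250 with filter_intervall=0.1 keeps reads within +/- 10%
target=1250
intervall=0.1
lower=$(awk -v t="$target" -v i="$intervall" 'BEGIN { printf "%d", t * (1 - i) }')
upper=$(awk -v t="$target" -v i="$intervall" 'BEGIN { printf "%d", t * (1 + i) }')
echo "Keeping reads between ${lower} and ${upper} bp"  # 1125 and 1375
```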
@@ -148,19 +176,23 @@ snakemake --profile profile

### Testing

For testing the workflow, you can copy the provided dummy data:

```
cp .github/data/fastq/test_epic_data.fastq.gz data/fastq/
```

In this case, the similarity search mode in the config file can be changed to "test", which searches only one of the two strands.
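One way to switch the mode without opening an editor (assuming the default config line shown in the configuration above):

```bash
sed -i 's/similarity_search_mode: "full"/similarity_search_mode: "test"/' config/config.yaml
```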

## Additional Notes

The pipeline is designed to handle large sequencing datasets in parallel, so it is recommended to run it on a machine with sufficient computational resources. To run the pipeline on machines with fewer resources, split the fastq files or the tables into smaller chunks to prevent RAM overflow by increasing num_parts in the config file.
If any errors occur during the pipeline run, Snakemake will provide detailed logs, allowing you to debug and troubleshoot any issues. You are most welcome to create an Issue when running into problems.

## Connected 16S Analysis

We acknowledge that many users perform a regular 16S experiment in addition to the epicPCR experiment. However, we have decided not to include an extra analysis feature for this in ERMA. We encourage users to use RiboSnake (10.46471/gigabyte.132), a validated, automated, reproducible QIIME2-based pipeline implemented in Snakemake for analysing 16S rRNA gene amplicon sequencing data.

## License

This project is licensed under the MIT License.
48 changes: 22 additions & 26 deletions config/config.yaml
@@ -1,45 +1,41 @@
runname: "test"

# Setting up base directory and location of fastq files. Generally, no changes needed here.

base_dir: "."
fastq_dir: "data/fastq" # Copy target fastq.gz files in ERMA/data/fastq or change this path
outdir: "results" # Output directory of the final report

similarity_search_mode: "full" # Put here "test" or "full" for strand/s to be included in the similarity search

num_parts: 1 # number of subfiles the fastqs are split into (prevents crashing with very large input)
max_threads: 16

min_similarity: "0.80" # threshold to pre-filter blast hits by percentage identity
min_abundance: "0.01" # genera with lower abundance will be binned as "Other" in stacked bar abundance plot

silva:
download_path_seq: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz"
#download_path_seq: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz"
#download_path_seq: "https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva.fasta.gz"
#download_path_seq: "https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/SILVA_128_SSURef_Nr99_tax_silva.fasta.gz"
download_path_tax: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.2.txt.gz"
#download_path_tax: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz"
#download_path_tax: "https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.txt.gz"
#download_path_tax: "https://www.arb-silva.de/fileadmin/silva_databases/release_128/Exports/taxonomy/taxmap_slv_ssu_ref_nr_128.txt.gz"

card:
#download_path: "https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2"
download_path: "https://card.mcmaster.ca/download/0/broadstreet-v4.0.1.tar.bz2"

add_uniref_targets:
using_mixed_db: "no"
uniprot_cluster: "100"
uniprot_targets: ["int1","inti1"]
max_entry_count: 500

num_parts: 1 # number of chunks the fastqs are split into
max_threads: 16
input_validation: True

similarity_search_mode: "full" # Put here "test" or "full" for strand/s to be included in the similarity search
#################
### Databases ###
#################

input_validation: "yes"
download_silva: "https://www.arb-silva.de/fileadmin/silva_databases/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz"
download_card: "https://card.mcmaster.ca/download/0/broadstreet-v4.0.1.tar.bz2"
# Use the default links, different versions of the same database or insert an absolute path to a local distribution ("/path/to/database")

add_uniref_targets: False # True for activating the feature of adding additional targets to the card database
uniprot_cluster: "100" # Chooses which Uniref Cluster is used for the targets
uniprot_targets: ["int1","inti1","class_1_integron"]
max_entry_count: 1000 # Only use the best N database entries for every target
low_freq_threshold: 0.01 # prevents low-frequency entries from overloading two summary plots

#####################
### Preprocessing ###
#####################

# if data is already in format 'one fastq.gz per sample', this section can be ignored

seq_tech: "Illumina" # Put here "Illumina" or "ONT" befor using rule prepare_fastqs
seq_tech: "Illumina" # Put here "Illumina" or "ONT" according to used technology (only important for using rule prepare_fastqs)

# In case of Demultiplexing ONT data, provide information for this section
ONT:
3 changes: 1 addition & 2 deletions workflow/Snakefile
@@ -22,7 +22,7 @@ samples = [
]

# Validate sample input if not started from snakedeploy
if config["input_validation"] == "yes":
if config["input_validation"]:
sys.path.append("workflow/scripts")
from validate_inputs import validate_samples
validate_samples(samples, os.path.join(config["base_dir"], config["fastq_dir"]))
@@ -46,7 +46,6 @@ rule all:
local(f"{outdir}/{runname}_report.zip"),
local("results/single_sample_similarity_search_data.tar.gz"),


report: "../report/workflow.rst"

