update docs and citation

AroneyS · Dec 2, 2024 · 3338652
1 parent cdf57b8
commit 3338652

Showing 6 changed files with 76 additions and 27 deletions.
diff --git a/CITATION.cff b/CITATION.cff
@@ -16,7 +16,7 @@ authors:
   - family-names: Woodcroft
     given-names: Ben J.
     orcid: https://orcid.org/0000-0003-0670-7480
-title: "Bin Chicken: targeted recovery of low abundance metagenome assembled genomes through intelligent coassembly"
-version: 0.12.5
-doi: 10.5281/zenodo.10511708
-date-released: 2024-09-06
+title: "Bin Chicken: targeted metagenomic coassembly for the efficient recovery of novel genomes"
+version: 0.12.6
+doi: 10.1101/2024.11.24.625082
+date-released: 2024-12-02
diff --git a/README.md b/README.md
@@ -14,3 +14,9 @@ Bin Chicken - recovery of low abundance and taxonomically targeted metagenome as
 Documentation can be found at https://AroneyS.github.io/binchicken/
 
 Logo by Georgina H. Joyce | www.georginajoyce.com
+
+## Citation
+
+Samuel T. N. Aroney, Rhys J. P. Newell, Gene W. Tyson and Ben J. Woodcroft.
+Bin Chicken: targeted metagenomic coassembly for the efficient recovery of novel genomes.
+bioRxiv (2024): 2024-11. https://doi.org/10.1101/2024.11.24.625082
diff --git a/docs/tools/coassemble.md b/docs/tools/coassemble.md
@@ -41,10 +41,10 @@ Important options:
 - The taxa of the considered sequences can be filtered to target a specific taxon (e.g. `--taxa-of-interest "p__Planctomycetota"`).
 - Differential-abundance binning samples for single-assembly can also be found (`--single-assembly`)
 
-Paired end reads of form reads_1.1.fq, reads_1_1.fq and reads_1_R1.fq, where reads_1 is the sample name are automatically detected and matched to their basename.
+Paired end reads of form \*.1.fq, \*_1.fq and \*_R1.fq, where \* represents the sample name are automatically detected and matched to their basename.
 Most intermediate files can be provided to skip intermediate steps (e.g. SingleM otu tables, read sizes or genome transcripts; see `binchicken coassemble --full-help`).
 
-## Abundance weighting
+## Abundance weighting (experimental)
 
 By default, coassemblies are ranked by the number of feasibly-recovered target sequences they contain.
 Instead, `--abundance-weighted` can be used to weight target sequences by their average abundance across samples.
@@ -58,13 +58,6 @@ Kmer preclustering can be used (default if >1000 samples are provided, or use `-
 This greatly reduces memory usage and allows scaling up to at least 250k samples.
 Kmer preclustering can be disabled with `--kmer-precluster never`.
 
-## Cluster submission
-
-Snakemake profiles can be used to automatically submit jobs to HPC clusters (`--snakemake-profile`).
-Note that Aviary assemble commands are submitted to the cluster, while Aviary recover commands are run locally such that Aviary handles cluster submission.
-The `--cluster-submission` flag sets the local Aviary recover thread usage to 1, to enable multiple runs in parallel by setting `--local-cores` to greater than 1.
-This is required to prevent `--local-cores` from limiting the number of threads per submitted job.
-
 # OPTIONS
 
 # BASE INPUT ARGUMENTS
@@ -206,14 +199,15 @@ This is required to prevent `--local-cores` from limiting the number of threads
 **\--taxa-of-interest** *TAXA_OF_INTEREST*
 
   Only consider sequences from this GTDB taxa (e.g.
-    p\_\_Planctomycetota) [default: all]
+    p\_\_Planctomycetota, or
 
 <!-- -->
 
 **\--appraise-sequence-identity** *APPRAISE_SEQUENCE_IDENTITY*
 
   Minimum sequence identity for SingleM appraise against reference
-    database [default: 86%, Genus-level]
+    database. e.g. 96% for Species-level or 86% Genus-level [default:
+    0.96]
 
 <!-- -->
 
@@ -300,6 +294,13 @@ This is required to prevent `--local-cores` from limiting the number of threads
 
 <!-- -->
 
+**\--precluster-distances** *PRECLUSTER_DISTANCES*
+
+  Distance file in the format of \`sourmash scripts pairwise\`. If
+    provided, kmer sketching and clustering is skipped.
+
+<!-- -->
+
 **\--precluster-size** *PRECLUSTER_SIZE*
 
   \# of samples within each sample\'s precluster [default: 5 \*
@@ -353,6 +354,14 @@ This is required to prevent `--local-cores` from limiting the number of threads
 
 <!-- -->
 
+**\--prior-assemblies** *PRIOR_ASSEMBLIES*
+
+  Prior assemblies to use for Aviary recovery. tsv file with header:
+    name [tab] assembly. Only possible with single-sample or update.
+    [default: generate assemblies through Aviary assemble]
+
+<!-- -->
+
 **\--cluster-submission**
 
   Flag that cluster submission will occur through

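The paired-end naming rule documented in `docs/tools/coassemble.md` above (`*.1.fq`, `*_1.fq` and `*_R1.fq`, with `*` as the sample name) can be illustrated with a short Python sketch. This is not Bin Chicken's own detection code: the regular expression, the function names `sample_and_mate` and `pair_reads`, and the acceptance of `.fastq` and `.gz` extensions are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical pattern (an assumption, not taken from Bin Chicken):
# sample name, then ".", "_R" or "_", then mate number 1 or 2, then a FASTQ extension.
PAIR_SUFFIX = re.compile(
    r"^(?P<sample>.+?)(?:\.|_R|_)(?P<mate>[12])\.(?:fq|fastq)(?:\.gz)?$"
)


def sample_and_mate(path):
    """Return (sample_name, mate_number) for a read file, or None if it does not match."""
    match = PAIR_SUFFIX.match(Path(path).name)
    if match is None:
        return None
    return match.group("sample"), int(match.group("mate"))


def pair_reads(paths):
    """Group read files by sample name and keep only samples with both mates."""
    by_sample = {}
    for path in paths:
        parsed = sample_and_mate(path)
        if parsed is None:
            continue  # not a recognised paired-end file name
        sample, mate = parsed
        by_sample.setdefault(sample, {})[mate] = path
    return {
        sample: (mates[1], mates[2])
        for sample, mates in by_sample.items()
        if 1 in mates and 2 in mates
    }
```

Under this sketch, `reads_1.1.fq`, `reads_1_1.fq` and `reads_1_R1.fq` all resolve to the sample name `reads_1`, which matches the examples in the wording this commit replaces.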
diff --git a/docs/tools/iterate.md b/docs/tools/iterate.md
@@ -16,6 +16,7 @@ binchicken iterate --coassemble-output coassemble_dir \
 ```
 
 Defaults to using genomes (from the provided coassemble outputs) with at least 70% complete and at most 10% contamination as estimated by CheckM2.
+Alternatively, selected genomes can be provided directly with `--new-genomes`.
 Automatically excludes previous coassemblies.
 
 # OPTIONS
@@ -237,14 +238,15 @@ Automatically excludes previous coassemblies.
 **\--taxa-of-interest** *TAXA_OF_INTEREST*
 
   Only consider sequences from this GTDB taxa (e.g.
-    p\_\_Planctomycetota) [default: all]
+    p\_\_Planctomycetota, or
 
 <!-- -->
 
 **\--appraise-sequence-identity** *APPRAISE_SEQUENCE_IDENTITY*
 
   Minimum sequence identity for SingleM appraise against reference
-    database [default: 86%, Genus-level]
+    database. e.g. 96% for Species-level or 86% Genus-level [default:
+    0.96]
 
 <!-- -->
 
@@ -331,6 +333,13 @@ Automatically excludes previous coassemblies.
 
 <!-- -->
 
+**\--precluster-distances** *PRECLUSTER_DISTANCES*
+
+  Distance file in the format of \`sourmash scripts pairwise\`. If
+    provided, kmer sketching and clustering is skipped.
+
+<!-- -->
+
 **\--precluster-size** *PRECLUSTER_SIZE*
 
   \# of samples within each sample\'s precluster [default: 5 \*
@@ -384,6 +393,14 @@ Automatically excludes previous coassemblies.
 
 <!-- -->
 
+**\--prior-assemblies** *PRIOR_ASSEMBLIES*
+
+  Prior assemblies to use for Aviary recovery. tsv file with header:
+    name [tab] assembly. Only possible with single-sample or update.
+    [default: generate assemblies through Aviary assemble]
+
+<!-- -->
+
 **\--cluster-submission**
 
   Flag that cluster submission will occur through

diff --git a/docs/tools/single.md b/docs/tools/single.md
@@ -27,7 +27,7 @@ Important options:
   - Run assemblies with differential-abundance-binning samples with the tool of your choice (see `coassemble/target/elusive_clusters.tsv` in output)
 - The taxa of the considered sequences can be filtered to target a specific taxon (e.g. `--taxa-of-interest "p__Planctomycetota"`).
 
-Paired end reads of form reads_1.1.fq, reads_1_1.fq and reads_1_R1.fq, where reads_1 is the sample name are automatically detected and matched to their basename.
+Paired end reads of form \*.1.fq, \*_1.fq and \*_R1.fq, where \* represents the sample name are automatically detected and matched to their basename.
 Most intermediate files can be provided to skip intermediate steps (e.g. SingleM otu tables, read sizes or genome transcripts; see `binchicken coassemble --full-help`).
 
 ## Kmer preclustering
@@ -37,13 +37,6 @@ Kmer preclustering can be used (default if >1000 samples are provided, or use `-
 This greatly reduces memory usage and allows scaling up to at least 250k samples.
 Kmer preclustering can be disabled with `--kmer-precluster never`.
 
-## Cluster submission
-
-Snakemake profiles can be used to automatically submit jobs to HPC clusters (`--snakemake-profile`).
-Note that Aviary assemble commands are submitted to the cluster, while Aviary recover commands are run locally such that Aviary handles cluster submission.
-The `--cluster-submission` flag sets the local Aviary recover thread usage to 1, to enable multiple runs in parallel by setting `--local-cores` to greater than 1.
-This is required to prevent `--local-cores` from limiting the number of threads per submitted job.
-
 # OPTIONS
 
 # BASE INPUT ARGUMENTS
@@ -185,14 +178,15 @@ This is required to prevent `--local-cores` from limiting the number of threads
 **\--taxa-of-interest** *TAXA_OF_INTEREST*
 
   Only consider sequences from this GTDB taxa (e.g.
-    p\_\_Planctomycetota) [default: all]
+    p\_\_Planctomycetota, or
 
 <!-- -->
 
 **\--appraise-sequence-identity** *APPRAISE_SEQUENCE_IDENTITY*
 
   Minimum sequence identity for SingleM appraise against reference
-    database [default: 86%, Genus-level]
+    database. e.g. 96% for Species-level or 86% Genus-level [default:
+    0.96]
 
 <!-- -->
 
@@ -279,6 +273,13 @@ This is required to prevent `--local-cores` from limiting the number of threads
 
 <!-- -->
 
+**\--precluster-distances** *PRECLUSTER_DISTANCES*
+
+  Distance file in the format of \`sourmash scripts pairwise\`. If
+    provided, kmer sketching and clustering is skipped.
+
+<!-- -->
+
 **\--precluster-size** *PRECLUSTER_SIZE*
 
   \# of samples within each sample\'s precluster [default: 5 \*
@@ -332,6 +333,14 @@ This is required to prevent `--local-cores` from limiting the number of threads
 
 <!-- -->
 
+**\--prior-assemblies** *PRIOR_ASSEMBLIES*
+
+  Prior assemblies to use for Aviary recovery. tsv file with header:
+    name [tab] assembly. Only possible with single-sample or update.
+    [default: generate assemblies through Aviary assemble]
+
+<!-- -->
+
 **\--cluster-submission**
 
   Flag that cluster submission will occur through

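The `--prior-assemblies` option added above expects a tsv file with the header `name [tab] assembly`. A minimal Python sketch of producing such a file's contents follows. The function name `prior_assemblies_tsv`, the example sample names and the FASTA paths are hypothetical, and the assumption that `name` holds a sample name is not confirmed by the help text.

```python
import csv
import io


def prior_assemblies_tsv(assemblies):
    """Render a {name: assembly_path} mapping as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "assembly"])  # header described in the option's help text
    for name in sorted(assemblies):  # sorted for reproducible output
        writer.writerow([name, assemblies[name]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample names and paths, for illustration only.
    text = prior_assemblies_tsv({
        "sample_1": "assemblies/sample_1.fasta",
        "sample_2": "assemblies/sample_2.fasta",
    })
    with open("prior_assemblies.tsv", "w") as handle:
        handle.write(text)
```

The resulting `prior_assemblies.tsv` would then be passed as the value of `--prior-assemblies`, which the help text restricts to single-sample or update runs.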
diff --git a/docs/tools/update.md b/docs/tools/update.md
@@ -199,6 +199,14 @@ binchicken update --coassemble-output coassemble_dir --sra \
 
 <!-- -->
 
+**\--prior-assemblies** *PRIOR_ASSEMBLIES*
+
+  Prior assemblies to use for Aviary recovery. tsv file with header:
+    name [tab] assembly. Only possible with single-sample or update.
+    [default: generate assemblies through Aviary assemble]
+
+<!-- -->
+
 **\--cluster-submission**
 
   Flag that cluster submission will occur through