New --multi_cram option to produce a multi-query CRAM file combining all the alignments #114

Merged
charles-plessy merged 21 commits into dev from multi-cram-issue-60
Jun 2, 2026

@charles-plessy charles-plessy commented May 29, 2026

Collaborator

Main changes to the code:

  • Addition, configuration and patching of the samtools/merge module.
  • Streamlining of the output of a local subworkflow.
  • Implementation of the option in the main workflow of the pipeline.

Closes #60.

Details on the new feature:

The merged CRAM file is neither a pangenome nor a multiple sequence alignment, but I find it very useful.

Temporary CRAM files are produced but not exported. Their header indicates only the name of the query genomes in the read group fields.

The files are merged into a single CRAM file, where each read group represents one genome. Each target-query alignment is a one-to-one relationship, so a base in the target is aligned at most once to each query.
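
As a purely conceptual illustration of the merge described above (not the pipeline's code, and not how `samtools merge` works internally), the following minimal Python sketch combines per-query alignment intervals into one position-sorted list in which every record keeps the name of its query genome, much as a read group does in the merged CRAM file. The function name, the interval representation and the toy data are all hypothetical.

```python
import heapq

def merge_with_read_groups(per_query):
    """Merge per-query alignment intervals into one position-sorted list.

    `per_query` maps a query genome name to its (start, end) intervals on
    the target. Each merged record carries the query name, playing the
    role that the read group plays in the merged CRAM file.
    """
    tagged_streams = [
        [(start, end, name) for start, end in sorted(intervals)]
        for name, intervals in per_query.items()
    ]
    # heapq.merge lazily combines inputs that are already sorted.
    return list(heapq.merge(*tagged_streams))

# Hypothetical toy data: two query genomes aligned to one target.
merged = merge_with_read_groups({
    "query_A": [(150, 300), (0, 100)],
    "query_B": [(50, 200)],
})
```

Here `merged` interleaves the intervals of both queries by target position while keeping track of which genome each one came from.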

Care is taken to ensure that the path to the reference genome is relative to the current directory. The multi-query CRAM file is output in the same directory as its index and the BGZIP-compressed genome, which is also indexed.

Thus the multi-query CRAM file can be loaded and visualised in IGV. The coverage plot shows how many query genomes align to the target at a given location. The expanded track view shows all the sequence differences.
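
To make the meaning of the coverage value concrete, here is a minimal Python sketch, assuming each query genome is represented by a list of half-open `(start, end)` intervals on the target. Because each target base is aligned at most once per query, the coverage at a position is simply the number of queries with an interval covering it. The function name and the toy data are hypothetical and not part of the pipeline.

```python
def query_coverage(alignments, position):
    """Count how many query genomes align to the target at `position`.

    `alignments` maps a query name (one read group per genome) to a list
    of half-open (start, end) intervals on the target.
    """
    return sum(
        any(start <= position < end for start, end in intervals)
        for intervals in alignments.values()
    )

# Hypothetical toy data: three query genomes aligned to one target.
alignments = {
    "query_A": [(0, 100), (150, 300)],
    "query_B": [(50, 200)],
    "query_C": [(250, 400)],
}
```

For example, `query_coverage(alignments, 75)` counts `query_A` and `query_B` but not `query_C`.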

image

You can stabilise the order of the genomes, but IGV enforces alphanumeric sorting. You can work around this limitation by prefixing the sample IDs with numbers in the sample sheet.
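
The prefixing workaround can be sketched in a few lines of Python. This is only an illustration under the assumption that the desired display order is available as a list of sample IDs; the function name and the example IDs are hypothetical. Zero-padding the numbers keeps the order intact whether the sort is lexical or numeric-aware.

```python
def prefix_ids(sample_ids):
    """Prefix each sample ID with a zero-padded rank.

    The padding width follows the number of samples, so that for
    example "02_" sorts before "10_" in a plain lexical sort.
    """
    width = len(str(len(sample_ids)))
    return [
        f"{rank:0{width}d}_{sample_id}"
        for rank, sample_id in enumerate(sample_ids, start=1)
    ]

# Hypothetical IDs, listed in the order they should appear in IGV.
desired_order = ["Zebra", "apple", "Mouse"]
prefixed = prefix_ids(desired_order)
```

Sorting `prefixed` alphanumerically returns the IDs in the same order as `desired_order`, which is the point of the workaround.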

Custom scripts can be (and have been) written to slice pieces of the multi-query CRAM file and turn these pieces into real MSAs…

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/pairgenomealign branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

The merged CRAM file is neither a pangenome nor a multiple sequence
alignment, but I find it very useful.

Temporary CRAM files are produced but not exported.  Their header
indicates only the name of the query genomes in the read group fields.

The files are merged into a single CRAM file, where each read group
represents one genome.  Each target-query alignment is a one-to-one
relationship so a base in the target is aligned at most once to each
query.

Care is taken to ensure that the path to the reference genome is
relative to the current directory.  The multi-query CRAM file is output
in the same directory as its index and the BGZIP-compressed genome,
which is also indexed.

Thus the multi-query CRAM file can be loaded and visualised in IGV.
The coverage plot shows how many query genomes align to the target at a
given location.  The expanded track view shows all the
sequence differences.  You can stabilise the order of the genomes, but
IGV enforces alphanumeric sorting.  You can work around this limitation
by prefixing the sample IDs with numbers in the sample sheet.

Custom scripts can be (and have been) written to slice pieces of the
multi-query CRAM file and turn these pieces into real MSAs…
Will change to CRAM 3.1 in pairgenomealign 3.0.0.

@Joon-Klaps Joon-Klaps left a comment

Contributor

Very minor things, mostly on readability.

Comment thread subworkflows/local/fasta_bgzip_index_dict_samtools/main.nf Outdated
Comment thread workflows/pairgenomealign.nf
Comment thread workflows/pairgenomealign.nf
Comment thread workflows/pairgenomealign.nf Outdated
Comment thread workflows/pairgenomealign.nf Outdated
Comment thread workflows/pairgenomealign.nf Outdated
Comment thread workflows/pairgenomealign.nf
charles-plessy and others added 7 commits June 2, 2026 09:51
Co-authored-by: Joon Klaps <joon.klaps@kuleuven.be>
Co-authored-by: Joon Klaps <joon.klaps@kuleuven.be>
Co-authored-by: Joon Klaps <joon.klaps@kuleuven.be>
Co-authored-by: Joon Klaps <joon.klaps@kuleuven.be>
Co-authored-by: Joon Klaps <joon.klaps@kuleuven.be>
…which I submitted recently based on the local version.
@charles-plessy

Collaborator Author

@Joon-Klaps Thanks to your comments I made big changes. Can I ask you to have a look?

@Joon-Klaps Joon-Klaps left a comment

Contributor

LGTM

@charles-plessy charles-plessy merged commit 8fac8f7 into dev Jun 2, 2026
9 checks passed
@charles-plessy charles-plessy deleted the multi-cram-issue-60 branch June 2, 2026 11:41
2 participants