New --multi_cram option to produce a multi-query CRAM file combining all the alignments #114
Merged
Will change to CRAM 3.1 in pairgenomealign 3.0.0.
Joon-Klaps (Contributor) approved these changes on Jun 1, 2026 and left a comment:
Very minor things, mostly on readability.
Co-authored-by: Joon Klaps <joon.klaps@kuleuven.be>
…which I submitted recently based on the local version.
a5267b2 to 15df6b6
Author (Collaborator):
@Joon-Klaps Thanks to your comments I made big changes. Can I ask you to have a look?
Main changes to the code:
samtools/merge module.
Closes #60.
Details on the new feature:
The merged CRAM file is neither a pangenome nor a multiple sequence alignment, but I find it very useful.
Temporary CRAM files are produced but not exported. Their headers indicate only the name of the query genome, in the read group fields.
The files are merged into a single CRAM file, where each read group represents one genome. Each target-query alignment is a one-to-one relationship, so a base in the target is aligned at most once to each query.
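As a rough illustration of the merge step, here is a minimal Python sketch that assembles a `samtools merge` command line. The file names and the helper `build_merge_command` are hypothetical, and the options shown (`--reference`, `-O cram`, `-o`) are only one plausible way to call samtools, not necessarily the arguments that the pipeline passes.

```python
import shlex


def build_merge_command(output_cram, reference_fasta, per_query_crams):
    """Return an argv list that merges per-query CRAM files with samtools.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if not per_query_crams:
        raise ValueError("need at least one per-query CRAM file")
    return [
        "samtools", "merge",
        "--reference", reference_fasta,  # target genome used for CRAM encoding
        "-O", "cram",                    # write CRAM rather than BAM
        "-o", output_cram,
        *per_query_crams,                # each input carries its own read group
    ]


# Example file names are assumptions made for this sketch.
cmd = build_merge_command(
    "multi.cram", "target.fa.gz", ["queryA.cram", "queryB.cram"]
)
print(shlex.join(cmd))
```

The list could be handed to `subprocess.run(cmd, check=True)` on a machine where samtools is installed.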
Care is taken to ensure that the path to the reference genome is relative to the current directory. The multi-query CRAM file is output in the same directory as its index and the bgzip-compressed genome, which is indexed too.
Thus the multi-query CRAM file can be loaded and visualised in IGV. The coverage plot shows how many query genomes align to the target at a given location. The expanded track view lets you visualise all the sequence differences.
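To make the meaning of the coverage plot concrete, here is a small Python sketch that counts how many query genomes cover each target position. The function `query_coverage`, the tuple layout and the toy coordinates are assumptions made for illustration; this is not how IGV computes its track.

```python
def query_coverage(alignments, start, end):
    """Count the distinct query genomes covering each target position.

    `alignments` holds (genome, aln_start, aln_end) tuples using 0-based,
    half-open target coordinates. The window is [start, end).
    """
    genomes_at = [set() for _ in range(end - start)]
    for genome, aln_start, aln_end in alignments:
        # Clip each alignment to the requested window.
        for pos in range(max(aln_start, start), min(aln_end, end)):
            genomes_at[pos - start].add(genome)
    return [len(genomes) for genomes in genomes_at]


# Toy data: three query genomes aligned to the same target region.
alignments = [("queryA", 0, 6), ("queryB", 2, 8), ("queryC", 4, 5)]
depth = query_coverage(alignments, 0, 8)
print(depth)  # [1, 1, 2, 2, 3, 2, 1, 1]
```

Because each query aligns at most once to a given target base, the depth at any position never exceeds the number of query genomes.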
You can stabilise the order of the genomes, but IGV enforces alphanumeric sorting. You can work around this limitation by prefixing the sample IDs with numbers in the sample sheet.
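The prefixing workaround can be sketched in Python as follows. The helper `prefix_sample_ids`, the underscore separator and the example IDs are hypothetical; the only point is that zero-padded numeric prefixes make alphanumeric order match the order you want.

```python
def prefix_sample_ids(sample_ids):
    """Prefix IDs with zero-padded ranks so sorted order equals list order."""
    width = len(str(len(sample_ids)))  # pad so that "02" sorts before "10"
    return [
        f"{rank:0{width}d}_{sample_id}"
        for rank, sample_id in enumerate(sample_ids, start=1)
    ]


# Desired display order, which is not alphabetical.
desired_order = ["Zebrafish", "Medaka", "Fugu"]
prefixed = prefix_sample_ids(desired_order)
print(prefixed)  # ['1_Zebrafish', '2_Medaka', '3_Fugu']
```

The prefixed IDs would then replace the original sample IDs in the sample sheet.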
Custom scripts can be (and have been) written to slice pieces of the multi-query CRAM file and turn them into real MSAs…
PR checklist
nf-core pipelines lint
nextflow run . -profile test,docker --outdir <OUTDIR>
nextflow run . -profile debug,test,docker --outdir <OUTDIR>
docs/usage.md is updated.
docs/output.md is updated.
CHANGELOG.md is updated.
README.md is updated (including new tool citations and authors/contributors).