Input and Output files
The most common input to mumemto is a set of genome assemblies, typically comprising a pangenome. These should be in FASTA format and are passed as positional arguments to mumemto:
mumemto /path/to/fastas/*.fa
Alternatively, a file containing a list of paths to input FASTAs (one per line) can be supplied with -i:
mumemto -i filelist.txt
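As a minimal sketch, a file list like the one above could be generated with a few lines of Python. The helper name write_filelist and the "*.fa" pattern are illustrative assumptions, not part of mumemto; the only requirement taken from the text is one path per line.

```python
from pathlib import Path


def write_filelist(fasta_dir, out_path, pattern="*.fa"):
    """Write the matching FASTA paths to out_path, one absolute path per line.

    Hypothetical helper: paths are sorted by name so the order is reproducible.
    """
    paths = sorted(Path(fasta_dir).glob(pattern))
    Path(out_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return paths
```

Because offsets in the output are listed in the order of the input sequences, writing the list in a fixed, sorted order makes it easier to map output columns back to assemblies.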
We recommend that each FASTA file contain a single sequence. If multiple sequences are present, they will be concatenated together. If there are multiple chromosomes in each assembly, we recommend splitting each chromosome into a separate FASTA and running mumemto on each chromosome separately.
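One possible way to split a multi-sequence assembly into per-chromosome files is sketched below. The function split_fasta is a hypothetical helper, and it assumes that the first whitespace-delimited token of each header line is a usable file name.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA to its own file in out_dir.

    Assumption: the first token after '>' names the output file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                tokens = line[1:].split()
                # Fall back to a numbered name if the header is empty.
                name = tokens[0] if tokens else f"seq{len(written) + 1}"
                path = out_dir / f"{name.replace('/', '_')}.fa"
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The per-chromosome files from each assembly can then be grouped by chromosome name and passed to mumemto one chromosome at a time.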
Note
We highly recommend removing Ns from the input sequences. Runs of Ns could appear as multi-MUMs in certain cases. This is unlikely to affect results, but such a match may appear in visualizations as an unintended synteny block.
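The text does not prescribe a method for removing Ns. One simple approach, shown here only as a sketch, is to delete every N (upper or lower case) from the sequence lines. The helper strip_ns is hypothetical, and deleting bases shifts all downstream coordinates relative to the original assembly, which may or may not be acceptable for a given analysis.

```python
def strip_ns(in_path, out_path):
    """Copy a FASTA file, deleting N/n characters from sequence lines.

    Returns the number of characters removed. Header lines are kept as-is.
    Note: offsets in the cleaned file no longer match the original assembly.
    """
    removed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)
                continue
            seq = line.rstrip("\r\n")
            cleaned = seq.replace("N", "").replace("n", "")
            removed += len(seq) - len(cleaned)
            if cleaned:  # skip lines that consisted only of Ns
                dst.write(cleaned + "\n")
    return removed
```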
The main output of mumemto is the *.mums (or *.mems) file. A *.lengths file is also produced, defining the order of sequences in the outputs and including the length of each input sequence.
If the maximum number of occurrences per sequence (-f) is set to 1 (indicating MUMs), a *.mums file is generated [default]. Each line has the format:
[MUM length] [comma-delimited list of offsets in each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]
Each line in the *.mums file represents a multi-MUM. It appears exactly once in each sequence (or not at all in some sequences for partial MUMs, when -k is set). The offsets and strand information are listed in order of the sequences in the *.lengths file. If a MUM is not present in a sequence, the field is left blank. NOTE: multi-MUMs are sorted in the output file lexicographically based on the match sequence.
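To make the record layout concrete, here is a minimal parsing sketch for a single *.mums line. The names Mum and parse_mums_line are hypothetical, and the code assumes that the three fields are separated by whitespace (tab or space) and that a blank entry in a comma-delimited list marks a sequence where a partial MUM is absent.

```python
from typing import List, NamedTuple, Optional


class Mum(NamedTuple):
    length: int
    offsets: List[Optional[int]]   # one entry per input sequence, None if absent
    strands: List[Optional[str]]   # "+" or "-", None if absent


def parse_mums_line(line):
    """Parse one '[length] [offsets] [strands]' record (assumed whitespace-delimited)."""
    length, offsets, strands = line.split()
    return Mum(
        length=int(length),
        offsets=[int(x) if x else None for x in offsets.split(",")],
        strands=[s if s else None for s in strands.split(",")],
    )
```

For example, the illustrative line "25 100,,340 +,,-" would parse to a MUM of length 25 that is present in the first and third sequences and absent from the second. The position of each entry corresponds to the sequence order given in the *.lengths file.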
If more than one occurrence of a match is allowed per sequence (-f > 1), then a *.mems file is generated. It has a similar format:
[MEM length] [comma-delimited list of offsets for each occurrence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]
The order of offsets is no longer defined, but an extra list field indicates the input sequence ID of origin for each offset (again, index order defined by the *.lengths file). As with multi-MUMs, the multi-MEMs are ordered lexicographically.
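A *.mems record can be read in much the same way. The sketch below pairs each offset with its sequence ID and strand; parse_mems_line is a hypothetical helper, and it assumes whitespace-delimited fields and integer sequence IDs (indices into the order given by the *.lengths file).

```python
def parse_mems_line(line):
    """Parse one '[length] [offsets] [sequence IDs] [strands]' record.

    Returns (length, occurrences), where occurrences is a list of
    (sequence_id, offset, strand) tuples in the order they appear on the line.
    Assumptions: whitespace-delimited fields, integer sequence IDs.
    """
    length, offsets, seq_ids, strands = line.split()
    occurrences = [
        (int(sid), int(off), strand)
        for off, sid, strand in zip(
            offsets.split(","), seq_ids.split(","), strands.split(",")
        )
    ]
    return int(length), occurrences
```

Since the order of offsets is not defined, downstream code would typically group the returned tuples by sequence ID rather than rely on their position in the line.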
If you have any questions or suggestions, please submit a GitHub issue or contact me at vshivak1 [at] jhu.edu.