
Input and Output files

Vikram Shivakumar edited this page Jan 21, 2025 · 3 revisions

Input files

The most common input to mumemto is a set of genome assemblies, typically comprising a pangenome. These should be in FASTA format and are passed as positional arguments to mumemto:

mumemto /path/to/fastas/*.fa

Alternatively, a file containing a list of paths to the input FASTAs (one per line) can be supplied with -i:

mumemto -i filelist.txt
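Because offsets in the output are listed in the order of the input sequences, it can help to build the file list deterministically. The following is a minimal sketch, not part of mumemto itself: the helper name `write_filelist`, the `*.fa` pattern, and the choice to sort paths alphabetically are all assumptions made for illustration.

```python
from pathlib import Path


def write_filelist(fasta_dir, list_path, pattern="*.fa"):
    """Write one absolute FASTA path per line, sorted by file name.

    Hypothetical helper: the sort order is an arbitrary but reproducible
    choice, so the sequence order in the outputs is predictable.
    """
    paths = sorted(Path(fasta_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return paths
```

The resulting file can then be passed with -i as shown above.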

We recommend that each FASTA file contain a single sequence. If multiple sequences are present, they will be concatenated. If each assembly contains multiple chromosomes, we recommend splitting each chromosome into a separate FASTA and running mumemto on each chromosome separately.
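One way to split an assembly into one FASTA per sequence is sketched below in plain Python. This is an illustrative helper, not a mumemto feature: the name `split_fasta` is hypothetical, and it assumes the first word of each header is unique and safe to use as a file name.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-sequence FASTA to its own file.

    Returns the list of files written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Assumption: the first word of the header is unique and
                # contains no path separators.
                name = line[1:].split()[0]
                path = out_dir / f"{name}.fa"
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The per-chromosome files from each assembly can then be grouped by chromosome name and passed to separate mumemto runs.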

Note

We highly recommend removing Ns from the input sequences. In certain cases, runs of Ns can be reported as multi-MUMs. This is unlikely to affect results, but such a match may appear in visualizations as an unintended synteny block.
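Before deciding how to handle Ns, it may be useful to see where they occur. The sketch below only locates runs of N in a sequence string; the function name `find_n_runs` is hypothetical. How the runs are then removed (for example, deleting them, which shifts downstream coordinates) is a choice left to the user.

```python
import re


def find_n_runs(sequence):
    """Return (start, end) pairs, 0-based and half-open, for each run of N or n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]
```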

Output files

The main output of mumemto is the *.mums (or *.mems) file. A *.lengths file is also produced, which defines the order of sequences in the outputs and records the length of each input sequence.

*.mums file

If the maximum number of occurrences per sequence (-f) is set to 1 (indicating MUMs), a *.mums file is generated [default].

[MUM length] [comma-delimited list of offsets in each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]

Each line in the *.mums file represents a multi-MUM. A multi-MUM appears exactly once in each sequence (or not at all in some sequences for partial MUMs, when -k is set). The offsets and strand information are listed in the order of the sequences in the *.lengths file. If a MUM is not present in a sequence, the field is left blank. NOTE: multi-MUMs are sorted in the output file lexicographically based on the match sequence.
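The line format above can be read with a few lines of Python. This is a minimal sketch rather than an official parser: the name `parse_mum_line` is hypothetical, and it assumes the three fields are separated by whitespace (tab or space) and that a blank entry in a comma-delimited list marks a sequence in which the MUM is absent.

```python
def parse_mum_line(line):
    """Parse one *.mums line into (length, offsets, strands).

    Entries for sequences that lack the MUM (blank fields) become None.
    Assumes whitespace-separated fields.
    """
    length, offsets, strands = line.split()
    offsets = [int(o) if o else None for o in offsets.split(",")]
    strands = [s if s else None for s in strands.split(",")]
    return int(length), offsets, strands
```

Position i in each returned list corresponds to sequence i in the *.lengths file.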

*.mems file

If more than one occurrence of a match is allowed per sequence (-f > 1), then a *.mems file is generated. It has a similar format:

[MEM length] [comma-delimited list of offsets for each occurrence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]

The order of offsets is no longer defined, but an extra list field indicates the input sequence ID of origin for each offset (again, index order defined by the *.lengths file). As with multi-MUMs, the multi-MEMs are sorted lexicographically.
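Since the offsets in a *.mems line are unordered, grouping them by sequence ID is often the first step when reading the file. The sketch below illustrates this; the name `parse_mem_line` is hypothetical, and it assumes whitespace-separated fields and integer sequence IDs that index into the *.lengths file.

```python
from collections import defaultdict


def parse_mem_line(line):
    """Parse one *.mems line into (length, {seq_id: [(offset, strand), ...]}).

    Assumes whitespace-separated fields and integer sequence IDs.
    """
    length, offsets, seq_ids, strands = line.split()
    occurrences = defaultdict(list)
    # The three comma-delimited lists are parallel: entry i of each list
    # describes the same occurrence.
    for off, sid, strand in zip(offsets.split(","), seq_ids.split(","), strands.split(",")):
        occurrences[int(sid)].append((int(off), strand))
    return int(length), dict(occurrences)
```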