Mumemto is a tool for analyzing pangenome sequence collections. It identifies maximal unique matches (multi-MUMs) and maximal exact matches (multi-MEMs) present across a collection of sequences. Mumemto can visualize pangenome synteny, identify misassemblies, and provide a unifying structure for a pangenome.
This method uses the prefix-free parse (PFP) algorithm for suffix array construction on large, repetitive collections of text. The main workflow of mumemto is to compute the PFP over a collection of sequences and identify multi-MUMs while computing the SA/LCP/BWT of the input collection. Note that this works best with highly repetitive texts, such as a collection of closely related genomes (typically intra-species, as in a pangenome).
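To make the multi-MUM definition concrete, here is a brute-force sketch for tiny inputs. It is not how mumemto works (mumemto uses PFP and the SA/LCP/BWT); it only illustrates the property being searched for: a substring that occurs exactly once in every sequence and cannot be extended left or right in all sequences at once. The function names and the `min_len` parameter are hypothetical, and reverse-complement matches are ignored for simplicity.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_multi_mums(seqs, min_len=2):
    """Brute-force illustration only; quadratic in sequence length or worse."""
    first = seqs[0]
    seen = set()
    mums = []
    for i in range(len(first)):
        for j in range(i + min_len, len(first) + 1):
            w = first[i:j]
            if w in seen:
                continue
            seen.add(w)
            # Uniqueness: exactly one occurrence in every sequence.
            if any(count_occurrences(s, w) != 1 for s in seqs):
                continue
            pos = [s.find(w) for s in seqs]
            # Maximality: the flanking characters must not all agree.
            left = {s[p - 1] if p > 0 else None for s, p in zip(seqs, pos)}
            right = {
                s[p + len(w)] if p + len(w) < len(s) else None
                for s, p in zip(seqs, pos)
            }
            left_extendable = len(left) == 1 and None not in left
            right_extendable = len(right) == 1 and None not in right
            if not left_extendable and not right_extendable:
                mums.append((len(w), pos))
    return mums


print(naive_multi_mums(["AACGTG", "TACGTC"]))
```

In this toy input, `ACGT` is the only substring of length at least 2 that is unique in both sequences and maximal, so it is reported with its start offset in each sequence.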
Mumemto is available on bioconda, or can be installed with pip:
### conda ###
conda install -c bioconda mumemto
### pip ###
git clone https://github.com/vshiv18/mumemto
cd mumemto
pip install .
Mumemto is available on docker and singularity. Note: this will only install the main mumemto tool, not the Python scripts (which can be run separately from the mumemto/ directory).
### if using docker ###
docker pull vshiv123/mumemto:latest
docker run vshiv123/mumemto:latest mumemto -h
### if using singularity ###
singularity pull docker://vshiv123/mumemto:latest
./mumemto_latest.sif mumemto -h
To build from scratch, download the source code and use cmake/make. After running the make command below, the mumemto executable will be found in the build/ folder. Dependencies: cmake, g++, and gcc.
git clone https://github.com/vshiv18/mumemto
cd mumemto
mkdir build
cd build && cmake ..
make install
Note: when building from source, the downstream Python scripts will not be on your $PYTHONPATH. To use them, run the relevant Python script directly from the mumemto/ directory (you may need to install dependencies separately).
To visualize the synteny across the FASTA files in a directory assemblies/ (each sequence is a separate FASTA file):
mumemto assemblies/*.fa -o pangenome
mumemto viz -i pangenome
By default, mumemto computes multi-MUMs across a collection, without additional parameters.
mumemto -o <output_prefix> [input_fasta [...]]
Use the -h flag to list additional options and usage: mumemto -h.
Mumemto options enable the computation of several classes of exact matches.
The multi-MUM properties can be loosened to find different types of matches with three main flags:
-k determines the minimum number of sequences a match must occur in (e.g. for finding MUMs across smaller subsets)
-f controls the maximum number of occurrences in each sequence (e.g. finding duplication regions)
-F controls the total number of occurrences in the collection (e.g. filtering out matches that occur frequently due to low complexity)
-k is flexible in input format. A positive integer gives the minimum number of sequences a match must appear in. A negative integer gives a subset size relative to N, the number of sequences in the collection (i.e. N - |k|). For instance, to require that a match appear in all sequences except one, pass -k -1. Similarly, passing negative values to -F specifies limits relative to N. Note: when setting -F and -f together, the maximum total limit will be the smaller of F and N * f.
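The arithmetic above can be sketched as follows. This is an illustration of the rules as described here, not mumemto's actual option-parsing code, and the function names are hypothetical. It assumes that a value of 0 for -f or -F means "no limit" and that -k 0 means "all N sequences" (both inferred from the example use cases), and that a negative -F is resolved as N - |F| in the same way as -k.

```python
def resolve_min_sequences(k, n):
    """Minimum number of sequences a match must occur in.

    Assumption: k > 0 is an absolute count; k <= 0 is relative to N (N - |k|).
    """
    return k if k > 0 else n + k


def resolve_total_limit(big_f, f, n):
    """Effective cap on total occurrences, or None if unbounded.

    Assumptions: 0 means 'no limit' for both flags, and a negative
    -F value is interpreted as N - |F|.
    """
    limits = []
    if big_f > 0:
        limits.append(big_f)
    elif big_f < 0:
        limits.append(n + big_f)
    if f > 0:
        limits.append(n * f)  # at most f occurrences in each of N sequences
    return min(limits) if limits else None


n = 10  # hypothetical collection of 10 sequences
print(resolve_min_sequences(-1, n))    # all sequences but one
print(resolve_total_limit(100, 3, n))  # smaller of F and N * f
```

With 10 sequences, `-k -1` resolves to 9 sequences, and `-F 100 -f 3` resolves to a total limit of 30, since N * f is smaller than F.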
Here are some example use cases:
# Find all strict multi-MUMs across a collection (equivalent to -k 0 -f 1 -F 0)
mumemto [OPTIONS] [input_fasta [...]]
# Find partial multi-MUMs in all sequences but one
mumemto -k -1 [OPTIONS] [input_fasta [...]]
# Find multi-MEMs that appear at most 3 times in each sequence
mumemto -f 3 [OPTIONS] [input_fasta [...]]
# Find all MEMs that appear at most 100 times within a collection
mumemto -f 0 -k 2 -F 100 [OPTIONS] [input_fasta [...]]
The mumemto command takes a list of FASTA files as positional arguments and generates output files using the output prefix. Alternatively, you can provide a file-list, which specifies a list of FASTA files and the document/class each file belongs to. Passing FASTA files as positional arguments auto-generates a file-list that defines the order of the sequences in the output.
Example of file-list file:
/path/to/ecoli_1.fna 1
/path/to/salmonella_1.fna 2
/path/to/bacillus_1.fna 3
/path/to/staph_2.fna 4
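If you want to build a file-list yourself, a small helper along these lines may be useful. It is a sketch that mirrors the example above: one line per file, with the path and a 1-based ID separated by a single space. The function name is hypothetical, and giving every file its own ID is an assumption; adjust the IDs if several files should share a document/class.

```python
def write_filelist(fasta_paths, out_path):
    """Write one '<path> <id>' line per FASTA file, with IDs starting at 1."""
    with open(out_path, "w") as out:
        for doc_id, path in enumerate(fasta_paths, start=1):
            out.write(f"{path} {doc_id}\n")


# Hypothetical paths, in the order the sequences should appear in the output.
write_filelist(
    ["/path/to/ecoli_1.fna", "/path/to/salmonella_1.fna"],
    "filelist.txt",
)
```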
Format of the output *.mums file:
[MUM length] [comma-delimited list of offsets within each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]
If the maximum number of occurrences per sequence is set to 1 (indicating MUMs), a *.mums file is generated. Each MUM is written on a separate line, where the first value is the match length, the second is a comma-delimited list of positions where the match begins in each sequence, and the third is a comma-delimited list of strand indicators. An empty entry indicates that the MUM was not found in that sequence (only applicable with the -k flag). The MUMs are sorted in the output file lexicographically based on the match sequence.
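A minimal sketch of reading one *.mums line in Python, based only on the format described above. It assumes the three fields are separated by whitespace and that a missing sequence shows up as an empty entry in both the offset and strand lists; the function name and the sample line are hypothetical.

```python
def parse_mums_line(line):
    """Split a *.mums line into (length, offsets, strands).

    Empty entries (sequences without the MUM) become None.
    """
    length_field, offsets_field, strands_field = line.split()
    offsets = [int(x) if x else None for x in offsets_field.split(",")]
    strands = [s if s else None for s in strands_field.split(",")]
    return int(length_field), offsets, strands


# Hypothetical partial multi-MUM of length 25 that is absent from the second sequence.
length, offsets, strands = parse_mums_line("25\t100,,340\t+,,-")
print(length, offsets, strands)
```

The offsets and strands are in file-list order, so index i in each list refers to the i-th sequence in the file-list.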
Format of the output *.mems file:
[MEM length] [comma-delimited list of offsets for each occurence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]
If more than one occurrence is allowed per sequence, the output is written in *.mems format. Each MEM is written on a separate line with the following fields: (1) the match length, (2) a comma-delimited list of offsets, one per occurrence, (3) the corresponding sequence ID for each offset given in (2), and (4) the strand indicator for each occurrence. As above, MEMs are sorted in the output file lexicographically based on the match sequence.
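The *.mems layout can be read in a similar way. The sketch below assumes whitespace-separated fields and pairs each offset with its sequence ID and strand; the function name and the sample line are hypothetical, and sequence IDs are kept as strings so that no assumption is made about their type.

```python
def parse_mems_line(line):
    """Split a *.mems line into (length, [(seq_id, offset, strand), ...])."""
    length_field, offsets_field, ids_field, strands_field = line.split()
    offsets = [int(x) for x in offsets_field.split(",")]
    seq_ids = ids_field.split(",")
    strands = strands_field.split(",")
    if not (len(offsets) == len(seq_ids) == len(strands)):
        raise ValueError("offset, sequence ID, and strand lists differ in length")
    return int(length_field), list(zip(seq_ids, offsets, strands))


# Hypothetical MEM of length 30 with two occurrences in sequence 1 and one in sequence 2.
length, occurrences = parse_mems_line("30\t5,80,12\t1,1,2\t+,-,+")
print(length, occurrences)
```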
Potato pangenome (assemblies from [Tang et al., 2022])
Mumemto can visualize multi-MUMs in a synteny-like format, highlighting conservation and genomic structural diversity within a collection of sequences. After running mumemto on a collection of FASTAs, you can generate a visualization using:
mumemto viz (-i PREFIX | -m MUMFILE)
Use mumemto viz -h to see the customization options. As of now, only strict and partial multi-MUMs are supported (rare multi-MEM support coming soon), so a *.mums output is required.
An interactive plot (with plotly) can be generated with mumemto viz --interactive.
If you run into any issues or have any questions, please reach out either (1) through GitHub Issues or (2) by email at vshivak1 [at] jhu.edu.
Portions of code from this repo were adapted from pfp-thresholds, written by Massimiliano Rossi and cliffy, written by Omar Ahmed.
Preprint coming soon!