
Pan-genome inference and genotyping with long noisy or short accurate reads










Pandora is a tool for bacterial genome analysis using a pangenome reference graph (PanRG). It allows gene presence/absence detection and genotyping of SNPs, indels and longer variants in one or a number of samples. Pandora works with Illumina or Nanopore data. Core ideas behind the method are:

  • new genomes look like recombinants (plus mutations) of things seen before
  • we should be analysing nucleotide-level variation everywhere, not just in core genes
  • arbitrary single reference genomes are unnatural and limit comparisons of diverse sets of genomes

The pangenome reference graph (PanRG) is a collection of 'floating' local graphs, each representing some orthologous region of interest (e.g. genes, mobile elements or intergenic regions). These PRGs can be constructed from a set of aligned sequence files using a separate pipeline.

Pandora can do the following for a single sample (read dataset):

  • Output the inferred mosaic of reference sequences for loci (e.g. genes) from the PanRG which are present
  • Output a VCF showing the variation found within these loci, with respect to any reference path in the PRG.

Soon, in a galaxy not so far away, it will allow:

  • discovery of new variation not in the PRG

For a collection of samples, it can:

  • Output a matrix showing inferred copy-number of each locus in each sample genome
  • Output a multisample pangenome VCF including genotype calls for each sample in each of the loci
  • Output one VCF per orthologous-chunk, showing how samples which contained this chunk differed in their gene sequence. Variation is shown with respect to the most informative recombinant path in the PRG.

Warning - this code is still in development.

Quick Start

Index PanRG file:

pandora index -t 8 <panrg.fa>

Compare the first 30X of each Illumina sample to get a pangenome matrix and VCF:

pandora compare -p <panrg.fa> -r <> --genotype --illumina --max_covg 30

Map Nanopore reads from a single sample to get approximate sequences for the genes present:

pandora map -p <panrg.fa> -r <reads.fq>


Singularity Container

We highly recommend that you download and use the singularity container:

singularity pull shub://rmcolq/pandora:pandora

or download directly as you run:

singularity exec shub://rmcolq/pandora:pandora pandora <command>

Installation from source

This is not recommended because the required zlib and boost system installs do not always play nicely. If you want to take the risk:

  • Requires a Unix or Mac OS.
  • Requires a system install of zlib. If this is not already installed, try the following:
wget -O - | tar xzf -
cd zlib-1.2.11
./configure [--prefix=/prefix/path]
make install
  • Requires a system installation of boost containing the system, filesystem, log (which also depends on thread and date_time) and iostreams libraries. If not already installed, use the following:
wget -O - | tar xzf -
cd boost_1_62_0
./ [--prefix=/prefix/path] --with-libraries=system,filesystem,iostreams,log,thread,date_time
./b2 install
  • Download and install pandora as follows:
git clone --single-branch --recursive
cd pandora
mkdir -p build
cd build
cmake ..
make
ctest -VV


Population Reference Graphs

Pandora assumes you have already constructed a fasta-like file of graphs, one entry for each gene/genome region of interest. If you haven't, you will need a multiple sequence alignment for each graph. Precompiled collections of MSAs representing orthologous gene clusters for a number of species can be downloaded from here and converted to graphs using the pipeline from here.

Build index

Takes a fasta-like file of PanRG sequences and constructs an index, and a directory of gfa files to be used by pandora map or pandora compare. These are output in the same directory as the PanRG file.

  Usage: pandora index [options] <PanRG>
    -h,--help                       Show this help message
    -w W                            Window size for (w,k)-minimizers, default 14
    -k K                            K-mer size for (w,k)-minimizers, default 15
    -t T                            Number of concurrent threads, default 1

The index stores (w,k)-minimizers for each PanRG path found. These parameters can be specified, but default to w=14, k=15.
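
To make the (w,k)-minimizer idea concrete, here is a minimal Python sketch. It is an illustration only, not pandora's implementation: it assumes minimizers are chosen by plain lexicographic order on the forward strand, whereas a real tool would typically hash k-mers and may treat both strands. The function name `minimizers` is hypothetical.

```python
def minimizers(seq, w=14, k=15):
    """Return sorted (position, kmer) pairs chosen as (w,k)-minimizers of seq.

    Assumption for illustration: the "smallest" k-mer in a window is the
    lexicographically smallest one; ties go to the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    # Slide a window over w consecutive k-mers and keep the smallest in each.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        chosen.add((start + offset, window[offset]))
    return sorted(chosen)


if __name__ == "__main__":
    # Tiny parameters so the result is easy to inspect by hand.
    for position, kmer in minimizers("ACGTTGCAAGCT", w=3, k=4):
        print(position, kmer)
```

Because adjacent windows usually share their smallest k-mer, far fewer k-mers are stored than the sequence contains, which is what keeps such an index compact.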

Map reads to index

This takes a fasta/q file of Nanopore or Illumina reads and compares them to the index. It infers which of the PanRG genes/elements are present, and for those that are present it outputs the inferred sequence and a genotyped VCF.
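
As a rough illustration of how minimizer hits might be turned into a presence call, the Python sketch below groups hit positions into clusters and applies a size threshold. The defaults mirror the --max_diff (250) and --min_cluster_size (10) options in the usage text, but the function names and the logic are assumptions made for illustration; pandora's actual clustering is likely more involved.

```python
def cluster_hits(positions, max_diff=250):
    """Group hit positions so consecutive hits in a cluster are <= max_diff apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_diff:
            clusters[-1].append(pos)  # close enough: extend the current cluster
        else:
            clusters.append([pos])    # too far away: start a new cluster
    return clusters


def locus_supported(positions, max_diff=250, min_cluster_size=10):
    """Return True if any cluster holds at least min_cluster_size hits."""
    return any(len(c) >= min_cluster_size
               for c in cluster_hits(positions, max_diff))


if __name__ == "__main__":
    dense_hits = list(range(0, 1000, 50))   # 20 hits, 50 bp apart
    sparse_hits = [0, 300, 600]             # every gap exceeds 250 bp
    print(locus_supported(dense_hits), locus_supported(sparse_hits))
```

Under these assumptions, raising --min_cluster_size makes presence calls more conservative, while raising --max_diff lets sparser runs of hits merge into one cluster.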

  Usage: pandora map -p PanRG_FILE -r READ_FILE -o OUTDIR <option(s)>
   -h,--help                        Show this help message
   -p,--prg_file PanRG_FILE         Specify a fasta-style PanRG file
   -r,--read_file READ_FILE         Specify a file of reads in fasta/q format
   -o,--outdir OUTDIR               Specify directory of output
   -w W                             Window size for (w,k)-minimizers, must be <=k, default 14
   -k K                             K-mer size for (w,k)-minimizers, default 15
   -m,--max_diff INT                Maximum distance between consecutive hits within a cluster, default 250 bps
   -e,--error_rate FLOAT            Estimated error rate for reads, default 0.11/0.001 for Nanopore/Illumina
   -c,--min_cluster_size INT        Minimum number of hits in a cluster to consider a locus present, default 10
   --genome_size NUM_BP             Estimated length of genome, used for coverage estimation, default 5000000
   --vcf_refs REF_FASTA             A fasta file with an entry for each loci in the PanRG in order, giving 
                                    reference sequence to be used as VCF ref. Must have a perfect match to a 
                                    path in the graph and the same name as the locus in the graph.
   --illumina                       Data is from Illumina, not Nanopore, so is shorter with low error rate
   --bin                            Use binomial model for kmer coverages, default is negative binomial
   --max_covg INT                   Maximum average coverage from reads to accept, default first 300
   --genotype                       Output a genotyped VCF
   --discover                       Add denovo discovery
   --denovo_kmer_size INT           Kmer size to use for denovo discovery, default 11
   --log_level LEVEL                Verbosity for logging, use "debug" for more output
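
The interaction between --genome_size and --max_covg can be sketched in Python as follows. This assumes coverage is estimated as total read bases divided by genome size, and that reads are taken in file order until the cap would be exceeded; both are assumptions for illustration, and the function names are hypothetical.

```python
def estimated_coverage(read_lengths, genome_size=5_000_000):
    """Average coverage, assumed here to be total read bases / genome size."""
    return sum(read_lengths) / genome_size


def reads_within_cap(read_lengths, genome_size=5_000_000, max_covg=300):
    """Count how many leading reads fit within max_covg average coverage."""
    budget = max_covg * genome_size  # total bases allowed under the cap
    total = 0
    kept = 0
    for length in read_lengths:
        if total + length > budget:
            break
        total += length
        kept += 1
    return kept


if __name__ == "__main__":
    # One million 150 bp reads against the default 5 Mbp genome size.
    print(estimated_coverage([150] * 1_000_000))
```

With the default genome size of 5000000, a cap of 30 corresponds to 150 Mbp of reads under this model, so an inaccurate --genome_size shifts how many reads are actually used.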

Compare reads from several samples

This takes Nanopore or Illumina read fasta/q for a number of samples, mapping each to the index. It infers which of the PanRG genes/elements is present in each sample, and outputs a presence/absence pangenome matrix, the inferred sequences for each sample and a genotyped multisample pangenome VCF.

  Usage: pandora compare -p PanRG_FILE -r READ_INDEX -o OUTDIR <option(s)>
   -h,--help                        Show this help message
   -p,--prg_file PanRG_FILE         Specify a fasta-style PanRG file
   -r,--read_index READ_INDEX       Specify a tab delimited file with a line per sample, detailing sample id 
                                    and read fasta/q
   -o,--outdir OUTDIR               Specify directory of output
   -w W                             Window size for (w,k)-minimizers, must be <=k, default 14
   -k K                             K-mer size for (w,k)-minimizers, default 15
   -m,--max_diff INT                Maximum distance between consecutive hits within a cluster, default 250 bps
   -e,--error_rate FLOAT            Estimated error rate for reads, default 0.11/0.001 for Nanopore/Illumina
   -c,--min_cluster_size INT        Minimum number of hits in a cluster to consider a locus present, default 10
   --genome_size NUM_BP             Estimated length of genome, used for coverage estimation, default 5000000
   --vcf_refs REF_FASTA             A fasta file with an entry for each loci in the PanRG in order, giving 
                                    reference sequence to be used as VCF ref. Must have a perfect match to a 
                                    path in the graph and the same name as the locus in the graph.
   --illumina                       Data is from Illumina, not Nanopore, so is shorter with low error rate
   --bin                            Use binomial model for kmer coverages, default is negative binomial
   --max_covg INT                   Maximum average coverage from reads to accept, default first 300
   --genotype                       Output a genotyped VCF
   --log_level LEVEL                Verbosity for logging, use "debug" for more output
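
The READ_INDEX file is described above as tab-delimited with one line per sample, giving the sample id and the read fasta/q. A small Python sketch for generating such a file is shown below; the sample ids and paths are hypothetical placeholders, and the column order (id first, then read file) follows the order given in the option description.

```python
import csv
import sys


def write_read_index(samples, handle):
    """Write one tab-delimited line per sample: sample id, then read file path."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample_id, reads_path in samples:
        writer.writerow([sample_id, reads_path])


if __name__ == "__main__":
    # Hypothetical sample ids and read files, for illustration only.
    samples = [
        ("sample1", "reads/sample1.fq"),
        ("sample2", "reads/sample2.fq"),
    ]
    write_read_index(samples, sys.stdout)
```

Redirecting this output to a file gives something that can be passed to pandora compare with -r.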

