For Genspectrum we would like to be able to annotate all influenza A sequences with their segment and their subtype.
This is not trivial as NCBI no longer assigns new influenza sequences to a subtype taxon and NCBI annotations in file names and qualifiers are not standardized and sometimes have errors.
Here I take the approach suggested by @cornelius-roemer and use nextclade sort to perform fast local alignment of sequence k-mers to find the highest scoring sequence matches. To use nextclade sort I need to create a minimizer-index, this must contain references for all the sequences that I would like to annotate. In this case this means I need at least a reference for each segment, but also for all the subtypes. In influenza A subtypes are determined by the HA and NA segments, there are a total of 18 HA subtypes and 11 NA subtypes. To annotate the HA and NA subtypes I use reference sequences used previously in the literature.
For HA I relied solely on the excellently annotated: Abdulrahman DA, Meng X, Veit M. S-Acylation of Proteins of Coronavirus and Influenza Virus: Conservation of Acylation Sites in Animal Viruses and DHHC Acyltransferases in Their Animal Reservoirs. Pathogens. 2021 May 29;10(6):669. link
For NA I used a combination of Wohlbold TJ, Krammer F. In the shadow of hemagglutinin: a growing interest in influenza viral neuraminidase and its role as a vaccine antigen. Viruses. 2014 Jun 23;6(6):2465-94 link and Jang YH, Seong BL. The Quest for a Truly Universal Influenza Vaccine. Front Cell Infect Microbiol. 2019 Oct 10;9:344. link. Wohlbold et al. supply a long list of NA reference segments and I use the visualization aid in Jang et al.'s paper to condense the set. Specifically I use only two of the listed N2 references, choosing two N2 references from each subcluster found by Jang et al., and I chose the H5N1-N1 reference that was the center of the N1 cluster for the N1 reference.
I was initially uncertain which references to use for the other segments, so I first used the full reference assembly for the subtypes H5N1, H1N1, H2N2, H3N2, H7N9 and H9N2 to see the diversity of the other segments. Here I also see some visual clades, normally there are about three larger clades corresponding to the reference sequences from H5N1, H1N1 and H3N2 - which I chose to use in my minimizer index to improve hits (these trees can be viewed in the pre-work folder).
I then created auspice trees using a subsample of the annotated segments for a visual verification. I additionally annotated trees with the NCBI assigned subtype (ncbiSubTypeHA
and ncbiSubTypeNA
) if it was contained in the sequence description in a parsable format. From visual inspection my annotation appears to be significantly better than the available NCBI annotations for the NA segment and HA segment.
You can rerun the analysis using:
micromamba create -f environment.yml
micromamba activate influenza-db
snakemake all_trees
Alternatively you can run auspice view
in the auspice directory or drop the individual auspice.json files into https://auspice.us/ to visualize the results.