ohnosequences/db.taxonomy

db.taxonomy

Taxonomic trees derived from the NCBI taxonomy.

Installation

Just add

resolvers += "Era7 maven releases" at "https://s3-eu-west-1.amazonaws.com/releases.era7.com"
libraryDependencies += "ohnosequences" %% "db.taxonomy" % "x.y.z"

to your sbt dependencies, where x.y.z is the version of the latest release.

Data

Let's give a couple of preliminary definitions:

A taxonomic node is said to be:

  • unclassified if its scientific name contains the word unclassified.
  • environmental if its scientific name starts with the prefix environmental samples.
  • good if it is neither unclassified nor environmental.
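
The three definitions above can be read as simple predicates on a scientific name. The following is a minimal sketch, not the code used by db.taxonomy: the object and method names are hypothetical, and the matching is assumed to be case-sensitive, which the definitions do not specify.

```scala
object NodeKinds {

  // Assumption: case-sensitive substring match on the scientific name.
  def isUnclassified(scientificName: String): Boolean =
    scientificName.contains("unclassified")

  // Assumption: case-sensitive prefix match on the scientific name.
  def isEnvironmental(scientificName: String): Boolean =
    scientificName.startsWith("environmental samples")

  // A node is good when it is neither unclassified nor environmental.
  def isGood(scientificName: String): Boolean =
    !isUnclassified(scientificName) && !isEnvironmental(scientificName)
}
```

Keeping two independent predicates, instead of a single three-valued classification, leaves room for a name that satisfies both conditions.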

The covering tree of a set of nodes in a tree is defined as the subtree that includes the nodes in that set together with all of their ancestors and all of their descendants.
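
As an illustration of this definition, here is a small sketch that computes the node set of a covering tree. It assumes a hypothetical representation in which the tree is a map from each node id to its parent id and the root has no entry; this is not how db.taxonomy stores its trees.

```scala
object CoveringTree {

  // Hypothetical representation: node id -> parent id; the root has no entry.
  type Parents = Map[Int, Int]

  // Strict ancestors of a node, nearest first, following parent links up to the root.
  def ancestors(parents: Parents, node: Int): List[Int] =
    parents.get(node) match {
      case Some(parent) => parent :: ancestors(parents, parent)
      case None         => Nil
    }

  // Node ids of the covering tree of `selected`: the selected nodes,
  // their ancestors and their descendants.
  def coveringNodes(parents: Parents, root: Int, selected: Set[Int]): Set[Int] = {
    val allNodes = parents.keySet + root
    allNodes.filter { node =>
      // True when the node is selected or has a selected ancestor (is a descendant).
      val selectedOrDescendant = (node :: ancestors(parents, node)).exists(selected)
      // True when the node is an ancestor of some selected node.
      val isAncestor = selected.exists(s => ancestors(parents, s).contains(node))
      selectedOrDescendant || isAncestor
    }
  }
}
```

For example, in a tree where node 1 is the root, 2 and 3 are its children, 4 and 5 are children of 2, 6 is a child of 3 and 7 is a child of 4, the covering tree of the set {2} contains the nodes 1, 2, 4, 5 and 7.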

With these definitions in place, we take the full taxonomic tree made available through db.ncbitaxonomy and generate three other taxonomic trees from it:

  • A good taxonomic tree: given by the nodes whose lineage (the node itself and all of its ancestors up to the root) contains neither unclassified nor environmental nodes.
  • An environmental taxonomic tree: the covering tree of all the environmental nodes.
  • An unclassified taxonomic tree: the covering tree of all the unclassified nodes.
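
The good tree differs from the other two in that it filters on the whole lineage of a node. The sketch below shows one way to express that condition; the Node case class and the map-based tree are hypothetical simplifications, and the name matching is assumed to be case-sensitive.

```scala
object GoodTree {

  // Hypothetical node: id, scientific name and optional parent id (None for the root).
  final case class Node(id: Int, name: String, parent: Option[Int])

  // Assumption: case-sensitive matching on the scientific name.
  def isGood(name: String): Boolean =
    !name.contains("unclassified") && !name.startsWith("environmental samples")

  // Ids of the nodes whose whole lineage (the node itself and every ancestor
  // up to the root) is good. A node with a parent id missing from the map is rejected.
  def goodNodes(nodes: Map[Int, Node]): Set[Int] = {
    def lineageIsGood(id: Int): Boolean =
      nodes.get(id).exists { node =>
        isGood(node.name) && node.parent.forall(lineageIsGood)
      }
    nodes.keySet.filter(lineageIsGood)
  }
}
```

Note that a node with a good name is still excluded when any of its ancestors is unclassified or environmental.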

How to access the data

Versions

All the data in db.taxonomy is versioned, and each version is based on a version of our NCBI taxonomy package, db.ncbitaxonomy. A version includes an inputVersion, which points to an NCBI taxonomy version, and consists of a set of serialized trees in S3, each stored as a data.tree file and a shape.tree file. These files exist under s3://resources.ohnosequences.com/db/taxonomy/ for each of the aforementioned trees, except for the full taxonomy, which is a pointer to the files in db.ncbitaxonomy:

  • full tree: s3://resources.ohnosequences.com/db/ncbitaxonomy/<inputVersion>
  • good tree: s3://resources.ohnosequences.com/db/taxonomy/unstable/<version>/good
  • environmental tree: s3://resources.ohnosequences.com/db/taxonomy/unstable/<version>/environmental
  • unclassified tree: s3://resources.ohnosequences.com/db/taxonomy/unstable/<version>/unclassified
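
The layout above can be summarised as a function from a version and a tree kind to an S3 folder. This is only a sketch of the naming scheme as listed: TreeKind, treeFolder and the version strings in the usage note are hypothetical, and the library exposes its own Version and TreeType types instead.

```scala
object TreePaths {

  // Hypothetical stand-in for the kinds of tree listed above.
  sealed abstract class TreeKind(val name: String)
  case object Full          extends TreeKind("full")
  case object Good          extends TreeKind("good")
  case object Environmental extends TreeKind("environmental")
  case object Unclassified  extends TreeKind("unclassified")

  val ncbiPrefix     = "s3://resources.ohnosequences.com/db/ncbitaxonomy"
  val taxonomyPrefix = "s3://resources.ohnosequences.com/db/taxonomy/unstable"

  // Folder of a tree: the full tree lives under the db.ncbitaxonomy input version,
  // every other tree under the db.taxonomy version.
  def treeFolder(version: String, inputVersion: String, kind: TreeKind): String =
    kind match {
      case Full  => s"$ncbiPrefix/$inputVersion"
      case other => s"$taxonomyPrefix/$version/${other.name}"
    }
}
```

With made-up version strings, TreePaths.treeFolder("0.1.0", "0.0.1", TreePaths.Good) evaluates to s3://resources.ohnosequences.com/db/taxonomy/unstable/0.1.0/good.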

Each of these versions is encoded as an object that extends the sealed class Version in data.scala.

The Set Version.all includes all the supported releases. Version.latest is a pointer to the latest released version.

Files

The module db.taxonomy.data contains the paths of the S3 objects corresponding to the tree data and shape files for a given Version and TreeType. They can be obtained by applying the following methods to a Version and a TreeType:

treeData : (Version, TreeType) => S3Object
treeShape: (Version, TreeType) => S3Object

Tree

The taxonomic tree for a Version and a TreeType is made available through the db.taxonomy.tree method, provided it can be retrieved or read from local files.

The folder for downloaded data is given by data.localFolder.

Example

import ohnosequences.db.taxonomy
import taxonomy.{Version, TreeType}

val maybeTree = taxonomy.tree(Version.latest, TreeType.Good)

maybeTree.map { tree =>
  // do something with tree
}

License

  • The code which generates the database is licensed under the AGPLv3 license: license/code
  • The database itself is made available under the ODbLv1 license: license/db
  • The database contents are available under their respective licenses. As far as we can tell all data included in db.fragments16s could be considered free for any use; do note that sequences and annotations coming from SILVA, which has a restrictive license, are excluded from db.fragments16s.

See the Open Data Commons FAQ for more on this distinction between database and contents.
