Daniel Stoeckel edited this page Feb 20, 2015 · 1 revision

Redesign the FragmentDB

The Problem

The current implementation has various drawbacks:

  • It does not allow for advanced matching of structures, which effectively causes bug #96.
  • The NormalizeNamesProcessor is very slow because of the time it takes to prepare queries (see bug #155).
  • Its XML format is not well defined and thus prone to errors.
  • AFAIK, it does not support lazy loading.
  • Renaming rules are too simple: rules cannot be cascaded, which leads to a bloat of rules because every rule needs to be copied for every variant out there.
  • Variants cannot be combined. An interesting example is methylcytosine: incorporating this base would require copying every existing variant and adapting it to methylcytosine. Additionally, new renaming rules would be needed.
  • It is mostly undocumented.
  • It is hard to maintain.

Currently this almost completely halts my efforts to properly incorporate DNA. I therefore consider this a major issue that should be fixed ASAP.

How to solve this?

Pretty much a rewrite of the FragmentDB is needed to solve these issues. At the very least, a new XML file format that supports a trigger mechanism should be implemented. These triggers should allow checking conditions like:

  • Is this nucleic acid at the end of a chain?
  • Is this a DNA or an RNA molecule? etc.
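
The trigger idea can be sketched as a list of predicates attached to a variant. This is only an illustration of the concept, not the FragmentDB API: `Residue`, `Trigger`, `select_variant`, the variant names, and the rule that an O2' atom marks RNA are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Residue:
    """Hypothetical, minimal stand-in for a residue in a chain."""
    name: str
    atom_names: FrozenSet[str]
    index: int          # position of the residue within its chain
    chain_length: int


def is_chain_end(res: Residue) -> bool:
    # First or last residue of the chain.
    return res.index == 0 or res.index == res.chain_length - 1


def is_rna(res: Residue) -> bool:
    # Assumption for this sketch: RNA is recognised by a 2'-hydroxyl oxygen.
    return "O2'" in res.atom_names


@dataclass
class Trigger:
    """A variant is selected when all of its conditions hold."""
    variant: str
    conditions: List[Callable[[Residue], bool]]

    def matches(self, res: Residue) -> bool:
        return all(cond(res) for cond in self.conditions)


def select_variant(res: Residue, triggers: List[Trigger], default: str = "default") -> str:
    # The first matching trigger wins, so list the most specific ones first.
    for trigger in triggers:
        if trigger.matches(res):
            return trigger.variant
    return default


TRIGGERS = [
    Trigger("rna-terminal", [is_rna, is_chain_end]),
    Trigger("rna", [is_rna]),
    Trigger("dna-terminal", [is_chain_end]),
]
```

Because conditions are plain functions, new checks can be added without touching the matching logic, and one condition can be shared by several variants instead of being copied.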

To reduce the loading time of the FragmentDB, a lazy loading mechanism as well as a threading scheme should be considered. Alternatively, one might consider a binary file format or a database scheme that allows faster access to the data. Such a format could be modeled closely on the XML format, and the XML data could in fact be compiled into these optimized storage formats. Multiple storage backends for the FragmentDB would then allow these formats to be used interchangeably.
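
A minimal sketch of how lazy loading and exchangeable storage backends could fit together is shown below. All class names (`FragmentBackend`, `InMemoryJsonBackend`, `LazyFragmentDB`) are hypothetical, and JSON strings stand in for whatever XML, binary, or database format would actually be used.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class FragmentBackend(ABC):
    """Hypothetical storage interface; XML, binary, or database backends would implement it."""

    @abstractmethod
    def names(self) -> List[str]:
        """Return the names of all stored fragments without parsing them."""

    @abstractmethod
    def load(self, name: str) -> dict:
        """Parse and return a single fragment."""


class InMemoryJsonBackend(FragmentBackend):
    """Toy backend: one serialized JSON string per fragment."""

    def __init__(self, raw: Dict[str, str]):
        self._raw = raw
        self.parse_count = 0  # lets callers observe how often parsing happens

    def names(self) -> List[str]:
        return sorted(self._raw)

    def load(self, name: str) -> dict:
        self.parse_count += 1
        return json.loads(self._raw[name])


class LazyFragmentDB:
    """Parses a fragment on first access only and caches the result."""

    def __init__(self, backend: FragmentBackend):
        self._backend = backend
        self._cache: Dict[str, dict] = {}

    def get(self, name: str) -> dict:
        if name not in self._cache:
            self._cache[name] = self._backend.load(name)
        return self._cache[name]


backend = InMemoryJsonBackend({
    "ALA": '{"atoms": ["N", "CA", "C", "O", "CB"]}',
    "GLY": '{"atoms": ["N", "CA", "C", "O"]}',
})
db = LazyFragmentDB(backend)
```

Constructing the database parses nothing; the cost is paid per fragment on first use. Swapping the storage format only requires another `FragmentBackend` implementation.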

Brainstorming

CML is an XML storage format for chemical data. It incorporates a good deal of what the FragmentDB would need (multiple names for structures, InChI representations, arbitrary property storage). However, it does not allow for the ontology-like approach that seems to be desired for the FragmentDB (e.g. hierarchies and variants of nucleic acids), and it is very verbose with its type-oriented specifications, e.g. {{{ C 8.9703 -9.0918 }}} might be more readable if written as: {{{ C 8.9703 -9.0918 }}}
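
To make the verbosity point concrete, the sketch below serializes the same atom twice: once with one typed child element per value, and once with plain attributes. The element and attribute names (`string`, `float`, `builtin`, `elementType`, `x2`, `y2`) are assumptions modeled on CML conventions and are not taken from this page.

```python
import xml.etree.ElementTree as ET


def type_oriented_atom(element: str, x2: str, y2: str) -> str:
    """One typed child element per value (assumed type-oriented style)."""
    atom = ET.Element("atom", {"id": "a1"})
    ET.SubElement(atom, "string", {"builtin": "elementType"}).text = element
    ET.SubElement(atom, "float", {"builtin": "x2"}).text = x2
    ET.SubElement(atom, "float", {"builtin": "y2"}).text = y2
    return ET.tostring(atom, encoding="unicode")


def attribute_atom(element: str, x2: str, y2: str) -> str:
    """The same data stored as attributes on a single element."""
    atom = ET.Element(
        "atom", {"id": "a1", "elementType": element, "x2": x2, "y2": y2}
    )
    return ET.tostring(atom, encoding="unicode")


verbose = type_oriented_atom("C", "8.9703", "-9.0918")
compact = attribute_atom("C", "8.9703", "-9.0918")
print(verbose)
print(compact)
```

Both strings carry the same information; the attribute form is shorter and keeps each atom on a single line.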

Automatic detection of naming conventions can only be improved if we find a way to avoid running a DFS over the structures to obtain a majority vote on the naming convention (Monte Carlo sampling?). At the very least, some threading and progress feedback could be added (for example by replacing the recursive DFS with a queue-based traversal) to alleviate the perceived slowness.
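
The queue-based variant could look roughly like the sketch below. It visits atoms breadth-first with an explicit queue instead of recursing, counts one vote per atom whose name is known to a convention, and reports progress through an optional callback. The data layout (adjacency dict, name dict) and the two toy conventions are assumptions for illustration and do not reflect the real FragmentDB data structures.

```python
from collections import Counter, deque
from typing import Callable, Dict, List, Optional, Set


def vote_naming_convention(
    adjacency: Dict[int, List[int]],
    atom_names: Dict[int, str],
    start: int,
    conventions: Dict[str, Set[str]],
    progress: Optional[Callable[[int], None]] = None,
) -> Optional[str]:
    """Return the convention that recognises the most atom names, or None."""
    votes: Counter = Counter()
    seen = {start}
    queue = deque([start])
    visited = 0
    while queue:
        atom = queue.popleft()
        visited += 1
        for convention, known_names in conventions.items():
            if atom_names[atom] in known_names:
                votes[convention] += 1
        if progress is not None:
            progress(visited)  # hook for user feedback during long runs
        for neighbour in adjacency.get(atom, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return votes.most_common(1)[0][0] if votes else None


# Toy input: three bonded atoms, two named with primes and one with a star.
ADJACENCY = {0: [1], 1: [0, 2], 2: [1]}
NAMES = {0: "C1'", 1: "O4'", 2: "C4*"}
CONVENTIONS = {"prime": {"C1'", "O4'", "C4'"}, "star": {"C1*", "O4*", "C4*"}}
```

An explicit queue avoids deep recursion on large structures and gives a natural place to report progress or to stop early once one convention has a clear lead.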

So far, variant matching is strictly name-based (via the name_to_variants_ map). If the naming cannot be fixed by normalizing names, no proper matching can occur. Suggestions for molecular matching include comparing atom pairs (via hashes (2)) or doing string-based analysis on the flattened structure of a molecule; a good canonical flattening might be provided by InChI (3). An overview can be found in (1).
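
As a rough illustration of name-independent matching, the sketch below builds a simplified hashed atom-pair fingerprint: every pair of atoms contributes a feature made of the two element symbols and their bond distance, each feature is hashed to a bit position, and two fingerprints are compared with the Tanimoto coefficient. This is a deliberately reduced scheme for illustration, not the path-based algorithm described in (2); the function names and the molecule representation are assumptions.

```python
import zlib
from collections import deque
from typing import Dict, FrozenSet, List, Tuple


def topological_distances(n_atoms: int, bonds: List[Tuple[int, int]]) -> Dict[Tuple[int, int], int]:
    """Bond-count distance for every connected atom pair (i < j), via BFS from each atom."""
    adjacency: List[List[int]] = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    distances: Dict[Tuple[int, int], int] = {}
    for source in range(n_atoms):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if source < target:
                distances[(source, target)] = d
    return distances


def atom_pair_fingerprint(elements: List[str], bonds: List[Tuple[int, int]], n_bits: int = 1024) -> FrozenSet[int]:
    """Set of bit positions; independent of atom names and atom ordering."""
    bits = set()
    for (a, b), d in topological_distances(len(elements), bonds).items():
        first, second = sorted((elements[a], elements[b]))
        feature = f"{first}-{d}-{second}".encode()
        bits.add(zlib.crc32(feature) % n_bits)  # crc32 is stable across runs
    return frozenset(bits)


def tanimoto(fp1: FrozenSet[int], fp2: FrozenSet[int]) -> float:
    if not fp1 and not fp2:
        return 1.0
    return len(fp1 & fp2) / len(fp1 | fp2)


# Heavy atoms of ethanol in two different atom orders, and propane.
ethanol = atom_pair_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = atom_pair_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
propane = atom_pair_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
```

Because only elements and connectivity enter the features, two structures with inconsistent atom names but identical topology get identical fingerprints, which is exactly the case where name-based lookup fails.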

Fully fledged graph-based similarity search might be an option as well (5).

Literature

(1) Willett, Chemical Similarity Searching http://pubs.acs.org/doi/abs/10.1021/ci9800211

(2) Daylight Theory Manual: Fingerprinting http://www.daylight.com/dayhtml/doc/theory/theory.finger.html

(3) InChI http://www.iupac.org/inchi/

(4) Hagadone, Molecular substructure similarity searching http://pubs.acs.org/doi/abs/10.1021/ci00009a019

(5) Yan, Yu, Han, Substructure similarity search in graph databases http://portal.acm.org/citation.cfm?id=1066244&dl=
