
Molecular JSON Draft Spec

Aaron Virshup edited this page Aug 31, 2016 · 15 revisions

THIS IS AN INCOMPLETE WORK IN PROGRESS!

Aims

  1. Facilitate interchange between computational chemistry and computational materials programs
  2. Unambiguous, easy-to-parse data storage for geometry, dynamics, topology, and calculated properties
  3. Easy extensibility for specific workflows
  4. Human readable and editable data

Technical Specifications

  • Format: JSON (with 1-to-1 conversion to/from HDF5)
  • Encoding: UTF-8 (part of JSON spec; could limit to ASCII subset for maximum compatibility)
  • Large file limits: ???

Scope

The file format is designed to facilitate interchange between molecular software packages.

In particular, it should support common input and output for these applications:

  • Molecular dynamics: both simulation (OpenMM, DESMOND, etc.) and analysis (MDTraj, PyTraj, etc.)
  • Quantum chemistry (PySCF, Psi4, NWChem, etc.)
  • Docking (UCSF DOCK, AutoDock, Glide, etc.)
  • Informatics (OpenBabel, RDKit, OEChem, etc.)
  • Visualization (VMD, Chimera, etc.)

Object layout

The basic object is a "Molecule". Note that this document only specifies the content of a molecular JSON object; a JSON file could contain a molecule at any point in its hierarchy.

A molecule has these fields:

  • name (string): name of the molecule (no particular meaning)
  • type (string): "Molecule"
  • provenance (Provenance object): where this molecule came from
  • topology (Topology object): specifies atomic data, bonds and biomolecular (or materials) hierarchy
  • states (list of State objects): dynamical states with position, momentum, and calculated properties at each point
  • forcefield (optional; Forcefield object): forcefield specification
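The fields listed above can be sketched as a plain Python dictionary that round-trips through JSON. This is only an illustration under stated assumptions: the draft does not yet define the inner layout of the Provenance, Topology, or State objects, so the keys inside them (`source`, `atoms`, `positions`) and the helper `missing_fields` are hypothetical.

```python
import json

# Top-level fields named in this draft (forcefield is optional and omitted).
REQUIRED_FIELDS = ("name", "type", "provenance", "topology", "states")

# Hypothetical example molecule. The contents of provenance, topology,
# and each state are assumptions, not part of the draft.
molecule = {
    "name": "water",
    "type": "Molecule",
    "provenance": {"source": "hand-written example"},
    "topology": {
        "atoms": [
            {"name": "O", "atomic_number": 8},
            {"name": "H1", "atomic_number": 1},
            {"name": "H2", "atomic_number": 1},
        ],
        "properties": {},
    },
    "states": [
        {
            "positions": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
            "properties": {},
        }
    ],
}


def missing_fields(mol):
    """Return the required top-level fields that are absent from mol."""
    return [field for field in REQUIRED_FIELDS if field not in mol]


# Serialize to JSON text and parse it back.
text = json.dumps(molecule, indent=2)
roundtrip = json.loads(text)
```

Because the object uses only JSON-native types (strings, numbers, lists, dictionaries), `roundtrip` compares equal to `molecule` after parsing.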

Open questions and possible answers

Items here don't have a definite answer yet. Answers are currently ranked by AMV's preference.

How do we represent time-dependent topology? (grand canonical, ReaxFF, etc.)

Possible answers:

  1. multiple topology objects (explicit but expensive storage)
  2. states can store topology "patches" (saves memory, but confusing and hard to implement)
  3. single global topology with all possible states (i.e., all possible bonds, all possible residues); states can include flags to turn elements on/off (NP-complete in some cases)
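As a rough illustration of the third answer, the sketch below keeps one global bond list and lets each state carry on/off flags. The field names `bonds`, `atom_indices`, and `bond_active`, and the helper `active_bonds`, are hypothetical; the draft does not define how such flags would be stored.

```python
# Hypothetical global topology listing every bond that may ever exist.
topology = {
    "bonds": [
        {"atom_indices": [0, 1]},
        {"atom_indices": [1, 2]},
        {"atom_indices": [0, 2]},
    ]
}

# Each state carries one flag per bond in the global list (assumed layout).
states = [
    {"bond_active": [True, True, False]},
    {"bond_active": [True, False, True]},
]


def active_bonds(topology, state):
    """Return the atom index pairs of the bonds switched on in this state."""
    bonds = topology["bonds"]
    flags = state["bond_active"]
    if len(flags) != len(bonds):
        raise ValueError("expected one flag per bond in the global topology")
    return [bond["atom_indices"] for bond, on in zip(bonds, flags) if on]


bonds_per_state = [active_bonds(topology, state) for state in states]
```

Storage per state grows with the size of the global bond list, which is the trade-off against the "patch" approach in the second answer.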

How do we reference other objects?

JSON does not directly support object references. This makes it non-trivial to, say, maintain a list of bonds between atoms. Some solutions are:

  1. by array index (e.g., residue.atom_indices=[3,4,5,6])
  2. by JSON path reference (see, e.g., https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03)
  3. by a unique key (e.g., residue.id='a83nd83', residue.atoms=['a9n3d9', '31di3'])

Array indices are probably the best option: although they are a little fragile, they are no more fragile than path references, and they require far less overhead than unique keys.
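A minimal sketch of referencing by array index, assuming a topology that holds a flat `atoms` list with `residues` and `bonds` pointing into it. The `atom_indices` key follows the example above; every other key and the helper `resolve_atoms` are hypothetical.

```python
# Hypothetical topology: residues and bonds refer to atoms by list position.
topology = {
    "atoms": [
        {"name": "O", "atomic_number": 8},
        {"name": "H1", "atomic_number": 1},
        {"name": "H2", "atomic_number": 1},
    ],
    "bonds": [
        {"atom_indices": [0, 1]},
        {"atom_indices": [0, 2]},
    ],
    "residues": [
        {"name": "HOH", "atom_indices": [0, 1, 2]},
    ],
}


def resolve_atoms(topology, atom_indices):
    """Look up atom records by index, rejecting out-of-range references."""
    atoms = topology["atoms"]
    for index in atom_indices:
        # Reject negative indices explicitly; Python would otherwise wrap them.
        if not 0 <= index < len(atoms):
            raise IndexError(f"atom index {index} is out of range")
    return [atoms[index] for index in atom_indices]


residue_atoms = resolve_atoms(topology, topology["residues"][0]["atom_indices"])
residue_atom_names = [atom["name"] for atom in residue_atoms]
```

The fragility mentioned above shows up when an atom is inserted or deleted: every stored index after that position has to be renumbered.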

See also: http://stackoverflow.com/q/4001474/1958900

How do we uniquely specify physical units?

Units should be clear and unambiguous.

How to specify compound units?

For instance, velocity might be "angstrom/fs". Alternatives:

  1. Require units in the form {unit_name:exponent}, e.g. atom.velocity.units={'angstrom':1, 'fs':-1}
  2. Allow strings of the form atom.velocity.units="angstrom/fs", but require that units be chosen from a specific list of specifications
  3. Allow strings of the form atom.velocity.units="angstrom/fs", and require file parsers to parse the units according to a specified syntax
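To show how the first and third alternatives relate, the sketch below parses a unit string into the {unit_name: exponent} form. The syntax it accepts (`*` for products, `/` for division applied to the single unit that follows, `^` for integer powers) is an assumption made for illustration; the draft does not specify a unit grammar, and `parse_units` is a hypothetical helper.

```python
import re

# One token: optional operator, unit name, optional integer power (assumed syntax).
_TOKEN = re.compile(r"([*/]?)\s*([A-Za-z_]+)(?:\^(-?\d+))?\s*")


def parse_units(text):
    """Parse a string such as 'angstrom/fs' into {'angstrom': 1, 'fs': -1}."""
    text = text.strip()
    units = {}
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse units at position {pos}: {text!r}")
        operator, name, power = match.groups()
        # The first unit takes no operator; every later unit needs one.
        if (pos == 0) == bool(operator):
            raise ValueError(f"misplaced or missing operator in {text!r}")
        exponent = int(power) if power else 1
        if operator == "/":
            exponent = -exponent
        units[name] = units.get(name, 0) + exponent
        pos = match.end()
    # Drop units whose exponents cancelled out.
    return {name: exp for name, exp in units.items() if exp != 0}


velocity_units = parse_units("angstrom/fs")
energy_units = parse_units("kg*m^2/s^2")
```

Storing the dictionary form directly (the first alternative) spares every reader from implementing such a parser, at the cost of less readable files.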

Extensibility

Users should feel free to add metadata to these structures. However, a few notes of caution:

  1. Calculated quantities should go in "topology.properties" or "state.properties", unless they are so unambiguous as to be trivially calculable (such as atomic numbers).
  2. Method-dependent metadata should be prefixed with a unique string to avoid namespace clashes. For instance, a state coefficient for a surface hopping method should be expressed as:
