Molecular JSON Draft Spec
THIS IS AN INCOMPLETE WORK IN PROGRESS!
- Facilitate interchange between computational chemistry and computational materials programs
- Unambiguous, easy-to-parse data storage for geometry, dynamics, topology, and calculated properties
- Easy extensibility for specific workflows
- Human readable and editable data
- Format: JSON (with 1-to-1 conversion to/from HDF5)
- Encoding: UTF-8 (part of JSON spec; could limit to ASCII subset for maximum compatibility)
- Large file limits: ???
The file format is designed to facilitate interchange between molecular software packages.
In particular, it should support common input and output for these applications:
- Molecular dynamics: both simulation (OpenMM, DESMOND, etc.) and analysis (MDTraj, PyTraj, etc.)
- Quantum chemistry (PySCF, Psi4, NWChem, etc.)
- Docking (UCSF DOCK, AutoDock, Glide, etc.)
- Informatics (OpenBabel, RDKit, OEChem, etc.)
- Visualization (VMD, Chimera, etc.)
The basic object is a "Molecule". Note that this document only specifies the content of a molecular JSON object; a JSON file could contain a molecule at any point in its hierarchy.
A molecule has these fields:
- name (string): name of the molecule (no particular meaning)
- type (string): "Molecule"
- provenance (Provenance object): where this molecule came from
- topology (Topology object): specifies atomic data, bonds and biomolecular (or materials) hierarchy
- states (list of State objects): dynamical states with position, momentum, and calculated properties at each point
- forcefield (optional; Forcefield object): forcefield specification
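As a rough illustration of the field list above, the following Python sketch walks a parsed JSON document, collects every object whose type is "Molecule" wherever it sits in the hierarchy, and reports which required fields are missing. The function names (find_molecules, missing_fields), the surrounding "project"/"inputs" keys, and the empty placeholder values for provenance, topology, and states are assumptions made for this example; the draft does not define them.

```python
import json

# Required field names are taken from the list above; "forcefield" is optional.
REQUIRED_FIELDS = ("name", "type", "provenance", "topology", "states")


def find_molecules(node):
    """Yield every dict in a parsed JSON tree whose "type" is "Molecule"."""
    if isinstance(node, dict):
        if node.get("type") == "Molecule":
            yield node
        for value in node.values():
            yield from find_molecules(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_molecules(item)


def missing_fields(molecule):
    """Return the required fields that are absent from a molecule dict."""
    return [field for field in REQUIRED_FIELDS if field not in molecule]


# Hypothetical document: the molecule sits two levels down in the hierarchy.
# The contents of provenance, topology, and states are empty placeholders.
document = json.loads("""
{
  "project": {
    "inputs": [
      {
        "name": "water",
        "type": "Molecule",
        "provenance": {},
        "topology": {},
        "states": []
      }
    ]
  }
}
""")

molecules = list(find_molecules(document))
```

Here missing_fields(molecules[0]) returns an empty list, since the placeholder molecule carries all five required keys.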
Items here don't have a definite answer yet. Answers are currently ranked by AMV's preference.
Possible answers:
- multiple topology objects (explicit but expensive storage)
- states can store topology "patches" (saves memory, but confusing and hard to implement)
- Single global topology with all possible states (i.e., all possible bonds, all possible residues), states can include flags to turn elements on/off (NP-complete in some cases)
JSON does not directly support object references. This makes it non-trivial to, say, maintain a list of bonds between atoms. Some solutions are:
- by array index (e.g., residue.atom_indices=[3,4,5,6])
- by JSON path reference (see, e.g., https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03)
- by a unique key (e.g., residue.id='a83nd83', residue.atoms=['a9n3d9', '31di3'])
Array indices are probably the best option: although they are a little fragile, they are no more fragile than path references, and they require far less overhead than unique keys.
See also: http://stackoverflow.com/q/4001474/1958900
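A minimal sketch of the array-index approach, in Python, might look like the following. Only atom_indices comes from the text above; the "atoms", "residues", and "bonds" keys, the zero-based indexing, and the residue_atoms helper are assumptions made for illustration. The range check shows one way to catch the fragility mentioned above when an index no longer points at a valid atom.

```python
# Hypothetical topology fragment: "atoms", "residues", and "bonds" are
# illustrative key names; only "atom_indices" appears in the text above.
# Indices are assumed to be zero-based positions in the "atoms" array.
topology = {
    "atoms": [
        {"name": "O", "atomic_number": 8},
        {"name": "H1", "atomic_number": 1},
        {"name": "H2", "atomic_number": 1},
    ],
    "residues": [{"name": "HOH", "atom_indices": [0, 1, 2]}],
    "bonds": [[0, 1], [0, 2]],
}


def residue_atoms(topology, residue):
    """Resolve a residue's atom_indices into the referenced atom records."""
    atoms = topology["atoms"]
    for index in residue["atom_indices"]:
        if not 0 <= index < len(atoms):
            raise IndexError(f"atom index {index} is out of range")
    return [atoms[index] for index in residue["atom_indices"]]


names = [atom["name"] for atom in residue_atoms(topology, topology["residues"][0])]
```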
Units should be clear and unambiguous.
For instance, a velocity might be given in "angstrom/fs". Alternatives:
- Require units in the form {unit_name:exponent}, e.g. atom.velocity.units={'angstrom':1, 'fs':-1}
- Allow strings of the form atom.velocity.units="angstrom/fs", but require that units be chosen from a specific list of specifications
- Allow strings of the form atom.velocity.units="angstrom/fs", and require file parsers to parse the units according to a specified syntax
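To show how the string form and the {unit_name:exponent} form relate, here is a small Python sketch that converts one into the other. The unit syntax it accepts is an assumption, since the draft has not specified one: factors are joined by "*" or "/", each factor may carry an integer exponent after "^", and a "/" applies only to the factor immediately following it. The parse_units name is likewise hypothetical.

```python
import re

# Assumed syntax (not defined by the spec): factors joined by "*" or "/",
# each optionally followed by "^" and an integer exponent. A "/" applies
# only to the factor immediately after it.
_TOKEN = re.compile(r"\s*([*/])?\s*([A-Za-z_]+)(?:\^(-?\d+))?")


def parse_units(text):
    """Convert a unit string into a {unit_name: exponent} dict."""
    units = {}
    position = 0
    text = text.strip()
    while position < len(text):
        match = _TOKEN.match(text, position)
        # The first factor must have no operator; every later one must.
        if match is None or (position == 0) != (match.group(1) is None):
            raise ValueError(f"cannot parse units: {text!r}")
        operator, name, exponent = match.groups()
        power = int(exponent) if exponent else 1
        if operator == "/":
            power = -power
        units[name] = units.get(name, 0) + power
        position = match.end()
    # Drop units whose exponents cancelled out.
    return {name: power for name, power in units.items() if power != 0}
```

With this syntax, parse_units("angstrom/fs") gives {'angstrom': 1, 'fs': -1}, matching the dictionary form in the first alternative.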
Users should feel free to add metadata to these structures. However, a few notes of caution:
- Calculated quantities should go in "topology.properties" or "state.properties", unless they are so unambiguous as to be trivially calculable (such as atomic numbers).
- Method-dependent metadata should be prepended with a unique string to avoid namespace clashes. For instance, a state coefficient for a surface hopping method should be expressed as: