Molecular JSON Draft Spec
THIS IS AN INCOMPLETE WORK IN PROGRESS!
- Facilitate interchange between most computational chemistry/materials programs
- Store data in an unambiguous, easy-to-parse format
- Store data in a human-readable and -editable form
- Support flexible, self-describing, hierarchical storage
- Text format: JSON
- High-performance format: HDF5
- Encoding: UTF-8 (the JSON standard)
- File size limits: ???
The file format is designed specifically to facilitate interchange between molecular software packages.
In particular, it aims to support common input and output for these applications:
- Molecular dynamics: both simulation (OpenMM, DESMOND, etc.) and analysis (MDTraj, PyTraj, etc.)
- Quantum chemistry (PySCF, Psi4, NWChem, etc.)
- Docking (UCSF DOCK, AutoDock, Glide, etc.)
- Informatics (OpenBabel, RDKit, OEChem, etc.)
- Visualization (VMD, Chimera, etc.)
Note that this specification is intended to produce JSON files that are human-readable (and, to an extent, human-writable), especially by humans who have not read this specification.
Molecular structures will necessarily need to be self-referencing (e.g., a Bond object will need to reference two Atom objects). The specific method for doing so is one of the outstanding design decisions (see below).
A molecule has these fields:
- name (string): name of the molecule (no particular meaning)
- type (string): "Molecule"
- provenance (Provenance object): where this molecule came from
- topology (Topology object): specifies atomic data, bonds and biomolecular (or materials) hierarchy
- states (list of State objects): dynamical states with position, momentum, and calculated properties at each point
- forcefield (optional) (Forcefield object): forcefield specification
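As a rough illustration of the top-level layout, the sketch below assembles a minimal molecule record in Python and round-trips it through the standard json module. The contents of provenance, topology, and states are empty placeholders, since those objects are still TBD in this draft, and the helper names make_molecule and missing_fields are hypothetical.

```python
import json

# Field names taken from the list above; "forcefield" is optional.
REQUIRED_FIELDS = ("name", "type", "provenance", "topology", "states")


def make_molecule(name, provenance, topology, states, forcefield=None):
    """Assemble a top-level molecule record (hypothetical helper)."""
    mol = {
        "name": name,
        "type": "Molecule",
        "provenance": provenance,
        "topology": topology,
        "states": states,
    }
    if forcefield is not None:
        mol["forcefield"] = forcefield
    return mol


def missing_fields(mol):
    """Return the required top-level fields absent from a record."""
    return [key for key in REQUIRED_FIELDS if key not in mol]


# Placeholder contents: the Provenance, Topology, and State layouts are TBD.
molecule = make_molecule("water", provenance={}, topology={}, states=[])
text = json.dumps(molecule, indent=2)
roundtrip = json.loads(text)
```

A reader could call missing_fields on a parsed file to get a quick list of absent top-level keys before looking at the nested objects.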
TBD
TBD (see design decisions below)
TBD
TBD
All physical quantities must have associated units. The units are defined in TBD (design decision)
- Simple units such as "angstrom", "nm", "femtosecond", "kilogram", etc. may be written as a string.
- Compound units such as "angstrom/fs" or "kcal/mol" should be specified as TBD (design decision)
Note that JavaScript has only one general-purpose numeric data type (all such numbers are double-precision floating point), so a parser may not distinguish 1 from 1.0.
- unitless scalars: 1.0 OR {val:1.0, units:null}
- unitless arrays: [1,2,3.0,4] OR {val:[1,2,3.0,4], units:null}
- scalar with units: scalar = {val:2.0, units:'fs'}
- array with units: array = {val:[1,2,3.0], units:'angstrom'}
- complex numbers: {val: {real:0.0, imag:-1.0}, units: null} (Complex numbers should always be written with units, even if they are null.)
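The forms above can all be reduced to a (value, units) pair by a reader. The following is a minimal sketch of one way to do that in Python, assuming the val/units keys shown above and strictly valid JSON input (quoted keys and double-quoted strings); the function name normalize_quantity is hypothetical.

```python
import json


def normalize_quantity(raw):
    """Return a (value, units) pair for any of the forms listed above.

    Complex values become Python complex numbers; units may be None.
    """
    if isinstance(raw, dict) and "val" in raw:
        value, units = raw["val"], raw.get("units")
    else:
        # Bare numbers and bare arrays are treated as unitless.
        value, units = raw, None
    if isinstance(value, dict) and "real" in value and "imag" in value:
        value = complex(value["real"], value["imag"])
    return value, units


timestep = normalize_quantity(json.loads('{"val": 2.0, "units": "fs"}'))
phase = normalize_quantity(
    json.loads('{"val": {"real": 0.0, "imag": -1.0}, "units": null}')
)
```

Accepting both the bare and the wrapped unitless forms keeps hand-written files short, at the cost of one extra branch in every reader.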
TBD (see design decisions below)
Users should feel free to add metadata to these structures. However, a few notes of caution:
- Calculated quantities should go in "topology.properties" or "state.properties"; UNLESS they are so unambiguous as to be trivially calculable (such as atomic numbers).
- Method-dependent metadata should be prepended with a unique string to avoid namespace clashes. For instance, a state coefficient for a surface hopping method should be expressed as:
mdt_surface_hopping_coeffs = [{val: {real:0.5, imag:-0.1}, units: null}, ...]
Items here don't have a definite answer yet - there are multiple answers for each. Answers are currently ranked by AMV's capricious preferences.
JSON does not directly support object references. This makes it non-trivial to, say, maintain a list of bonds between atoms. Some solutions are:
- by array index (e.g., residue.atom_indices=[3,4,5,6])
- by JSON path reference (see, e.g., https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03)
- by a unique key (e.g., residue.id='a83nd83', residue.atoms=['a9n3d9', '31di3'])
Array indices are probably the best option: although they are a little fragile, they are no more fragile than path references, and they require far less overhead than unique keys.
See also: http://stackoverflow.com/q/4001474/1958900
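To make the array-index option concrete, here is a small Python sketch that resolves bonds stored as pairs of atom indices. The key names "atoms" and "bonds" and the function resolve_bonds are assumptions for illustration only; the actual Topology layout is still TBD.

```python
def resolve_bonds(topology):
    """Yield (atom, atom) pairs for bonds stored as pairs of atom indices."""
    atoms = topology["atoms"]
    for i, j in topology["bonds"]:
        for index in (i, j):
            # Reject dangling references instead of silently wrapping around.
            if not 0 <= index < len(atoms):
                raise IndexError(f"bond references missing atom {index}")
        yield atoms[i], atoms[j]


# Hypothetical layout: each bond is a two-element list of atom indices.
topology = {
    "atoms": [{"name": "O"}, {"name": "H1"}, {"name": "H2"}],
    "bonds": [[0, 1], [0, 2]],
}
bonded_names = [(a["name"], b["name"]) for a, b in resolve_bonds(topology)]
```

Removing or reordering entries in the atom list invalidates every stored index, which is the fragility mentioned above; the bounds check only catches references that fall off the end of the list.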
- Publicly-available JSON file with supported units and conversions
- Standardize to some externally-chosen database or web service
For instance, velocity might be "angstrom/fs". Alternatives:
- Require units in the form {unit_name:exponent}, e.g. atom.velocity.units={'angstrom':1, 'fs':-1}
- Allow strings of the form atom.velocity.units="angstrom/fs", but require that units be chosen from a specific list of specifications
- Allow strings of the form atom.velocity.units="angstrom/fs", and require file parsers to parse the units according to a specified syntax
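The first and third alternatives are closely related: a parser for the string form can emit the {unit_name:exponent} form. The sketch below shows one possible way to do this in Python. The syntax it accepts is an assumption, not something this draft has settled: factors joined by * or /, an optional integer exponent written with ^, no parentheses, and / applying only to the factor that immediately follows it. The name parse_units is hypothetical.

```python
import re

# One factor: a unit name with an optional integer exponent, e.g. "fs^-1".
_FACTOR = re.compile(r"^([A-Za-z_]+)(?:\^(-?\d+))?$")


def parse_units(text):
    """Parse e.g. 'angstrom*amu*fs^-1' into {'angstrom': 1, 'amu': 1, 'fs': -1}."""
    exponents = {}
    # Split while keeping the '*' and '/' operators as separate tokens.
    tokens = re.split(r"([*/])", text.replace(" ", ""))
    sign = 1
    for token in tokens:
        if token == "*":
            sign = 1
        elif token == "/":
            sign = -1
        else:
            match = _FACTOR.match(token)
            if match is None:
                raise ValueError(f"cannot parse unit factor {token!r}")
            name, power = match.group(1), int(match.group(2) or 1)
            exponents[name] = exponents.get(name, 0) + sign * power
    # Drop units that cancel out entirely.
    return {name: p for name, p in exponents.items() if p != 0}


velocity_units = parse_units("angstrom/fs")
momentum_units = parse_units("angstrom*amu*fs^-1")
```

A grammar this small is easy to reimplement in other languages, which matters if every file parser is required to understand it.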
Possible answers:
- multiple topology objects (explicit but storage-intensive)
- states can store topology "patches" (saves memory, but confusing and hard to implement)
- Single global topology with all possible states (i.e., all possible bonds, all possible residues), states can include flags to turn elements on/off (NP-complete in some cases)
- As a table of values
- As a set of arrays
- As a list of objects
Examples:
// 1) Storing fields as tables: creates an mmCIF/PDB-like layout
{atoms: {type: 'table[atom]',
         fields: ['name', 'atomic_number', 'mass/Dalton', 'residue_index', 'position/angstrom', 'momentum/angstrom*amu*fs^-1'],
         entries: [
           ['CA', 6, 12.0, 0, [0.214,12.124,1.12], [0,0,0]],
           ['N', 7, 14.007, 0, [0.214,12.124,1.12], [0,0,0]],
           ...
         ]}
}
// 2) Storing the fieldnames for each atom: readable, but makes the file huge
{atoms: [
   {name:'CA', atnum:6, residue_index:0,
    mass:{val:12.00, units:'dalton'},
    position:{val:[0.214,12.124,1.12], units:'angstrom'},
    momentum:{val:[0.0, 0.0, 0.0], units:'angstrom*dalton*fs^-1'}
   },
   {name:'N', atnum:7, residue_index:0,
    mass:{val:14.007, units:'dalton'},
    position:{val:[0.214,12.124,1.12], units:'angstrom'},
    momentum:{val:[0.0, 0.0, 0.0], units:'angstrom*dalton*fs^-1'}
   },
   ...
 ]
}
// 3) Storing fields as arrays: much more compact, but harder to read and edit
{num_atoms: 1234,
 atoms: {names:['CA','CB','OP', ...],
         atomic_numbers:[6,6,8, ...],
         masses:{val:[12.0, 12.0, 15.999, ...], units:'amu'},
         residue_indices:[0,0,0,1,1, ...],
         positions:{val:[[0.214,12.124,1.12], [0.214,12.124,1.12], ...], units:'angstrom'},
         momenta:{val:[[0,0,0], [1,2,3], ...], units:'angstrom*amu*fs^-1'}
 }
}
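Layouts 2 and 3 carry the same information, so a tool could read whichever is chosen and convert to the other in memory. The Python sketch below illustrates that conversion under simplifying assumptions: field names are kept unchanged rather than pluralized, unit wrappers are treated as opaque values, and the function names are hypothetical.

```python
def objects_to_arrays(atoms):
    """Convert a list of per-atom objects (layout 2) into one array per field."""
    fields = list(atoms[0]) if atoms else []
    arrays = {field: [] for field in fields}
    for atom in atoms:
        if set(atom) != set(fields):
            raise ValueError("all atoms must carry the same fields")
        for field in fields:
            arrays[field].append(atom[field])
    return arrays


def arrays_to_objects(arrays):
    """Convert per-field arrays (layout 3) back into a list of atom objects."""
    lengths = {len(column) for column in arrays.values()}
    if len(lengths) > 1:
        raise ValueError("all arrays must have the same length")
    fields = list(arrays)
    return [dict(zip(fields, row)) for row in zip(*arrays.values())]


atoms = [
    {"name": "CA", "atnum": 6, "residue_index": 0},
    {"name": "N", "atnum": 7, "residue_index": 0},
]
columns = objects_to_arrays(atoms)
restored = arrays_to_objects(columns)
```

Because the round trip is lossless for uniform records, the choice between the two layouts is mostly about file size and hand-editability rather than expressiveness.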