Chemical JSON

From wiki.openchemistry.org

Revision as of 19:54, 30 December 2012

This is my first attempt to outline a schema for encoding molecule objects in JSON. The intent is to be expressive, and to allow many properties to be optional and encoded as arrays where appropriate. I took a CML file for ethane (present in the Avogadro source tree) and attempted to translate it to a JSON representation. This would form the basis of our storage in MongoDB, as well as a possible on-disk format. Looking at some recent work on memory-mapped binary JSON data structures in C++, I am encouraged by the flexibility and efficiency of this representation, along with its strong programming language coverage.

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elementType":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
    "elementNumber": [  1,   6,   1,   1,   6,   1,   1,   1 ],
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connectionIds": [ "a1", "a2",
                       "a2", "a3",
                       "a2", "a4",
                       "a2", "a5",
                       "a5", "a6",
                       "a5", "a7",
                       "a5", "a8" ],
    "connectionIndex": [ 1, 2,
                         2, 3,
                         2, 4,
                         2, 5,
                         5, 6,
                         5, 7,
                         5, 8 ],
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
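Since the 3D coordinates are stored as one flat array, a reader has to regroup them into one triple per atom. A minimal sketch of that step in Python follows; the helper name `atom_positions` is hypothetical, and the embedded document is a trimmed copy of the ethane example (first two atoms only).

```python
import json

# Trimmed copy of the ethane example above: first two atoms only.
doc = json.loads("""
{
  "atoms": {
    "elementNumber": [1, 6],
    "coords": { "3d": [ 1.185080, -0.003838,  0.987524,
                        0.751621, -0.022441, -0.020839 ] }
  }
}
""")

def atom_positions(document):
    """Regroup the flat "3d" array into (x, y, z) tuples, one per atom.

    Hypothetical helper, not part of any published Chemical JSON API.
    """
    flat = document["atoms"]["coords"]["3d"]
    if len(flat) % 3 != 0:
        raise ValueError("3d coordinate array length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

positions = atom_positions(doc)
```

Here `positions[1]` is the coordinate triple of the first carbon atom. A reader would probably also want to check that the number of triples matches the length of the element arrays.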

The JSON above validates, and it has a few extra arrays that I think are only really necessary for human readers (atom string IDs, and element type rather than number). An alternative, with perhaps a little more standardization in the naming and some extra nesting, might be as follows. Code can check whether the expected fields are there and act accordingly.

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elements": {
      "type":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "ids": [ "a1", "a2",
               "a2", "a3",
               "a2", "a4",
               "a2", "a5",
               "a5", "a6",
               "a5", "a7",
               "a5", "a8" ],
      "index": [ 1, 2,
                 2, 3,
                 2, 4,
                 2, 5,
                 5, 6,
                 5, 7,
                 5, 8 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
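As an illustration of code checking which fields are present and acting accordingly, the sketch below prefers the numeric `number` array and falls back to the `type` symbols when only those are given. The function name `atomic_numbers` and the partial symbol table are assumptions made for this example.

```python
# Partial symbol table for illustration; a real reader would cover every element.
SYMBOL_TO_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8}

def atomic_numbers(document):
    """Return atomic numbers, using whichever element array is present.

    Hypothetical helper: "number" is preferred, "type" is the fallback.
    """
    elements = document.get("atoms", {}).get("elements", {})
    if "number" in elements:
        return list(elements["number"])
    if "type" in elements:
        return [SYMBOL_TO_NUMBER[symbol] for symbol in elements["type"]]
    raise KeyError("no element information found under atoms.elements")

# A document that carries only the human-readable symbols still works.
symbols_only = {"atoms": {"elements": {"type": ["H", "C", "H", "H"]}}}
numbers = atomic_numbers(symbols_only)
```

The same pattern would apply to the `ids` and `index` arrays under `bonds.connections`.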

A major challenge will be establishing a list of accepted names, and a convention for adding new names. I think adding a version number, and having a discoverable structure like JSON, will make this approachable.
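One way a reader might use the version number together with a list of accepted names is sketched below. The accepted set is simply the top-level keys used in the examples so far, and both `ACCEPTED_TOP_LEVEL` and `unknown_names` are hypothetical names for this illustration, not an agreed convention.

```python
# Assumption: the accepted names are the top-level keys seen in the examples above.
ACCEPTED_TOP_LEVEL = {"version", "name", "inchi", "formula",
                      "atoms", "bonds", "properties"}
SUPPORTED_VERSIONS = {0}

def unknown_names(document):
    """Check the version, then report top-level keys outside the accepted list."""
    version = document.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError("unsupported schema version: %r" % (version,))
    # JSON objects are discoverable: the keys can simply be enumerated.
    return sorted(set(document) - ACCEPTED_TOP_LEVEL)

extras = unknown_names({"version": 0, "name": "ethane", "colour": "none"})
```

Unknown names are reported rather than rejected here, so that new names can be introduced without breaking older readers.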

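So, a more minimal example, with just what a computer needs, is outlined below. A C++ implementation of a reader and writer is [https://github.com/OpenChemistry/avogadrolibs/blob/master/avogadro/io/cjsonformat.cpp available in Avogadro here]. Note that the bond indices in this version count from zero, whereas the earlier examples count from one.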

{
  "chemical json": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": "C 2 H 6",
  "atoms": {
    "elements": {
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "index": [ 0, 1,
                 1, 2,
                 1, 3,
                 1, 4,
                 4, 5,
                 4, 6,
                 4, 7 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
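To show how the flat `index` array pairs up with `order`, here is a small Python sketch that turns the minimal document into a list of bonds. It assumes the indices are zero-based positions into the atom arrays, as in the example above; `bond_list` is a hypothetical helper and is not taken from the Avogadro C++ reader.

```python
import json

# The minimal ethane example, reduced to the parts needed for bonds.
doc = json.loads("""
{
  "chemical json": 0,
  "atoms": { "elements": { "number": [1, 6, 1, 1, 6, 1, 1, 1] } },
  "bonds": {
    "connections": { "index": [0, 1, 1, 2, 1, 3, 1, 4, 4, 5, 4, 6, 4, 7] },
    "order": [1, 1, 1, 1, 1, 1, 1]
  }
}
""")

def bond_list(document):
    """Return (begin, end, order) tuples, validating indices against the atoms."""
    atom_count = len(document["atoms"]["elements"]["number"])
    index = document["bonds"]["connections"]["index"]
    order = document["bonds"]["order"]
    if len(index) != 2 * len(order):
        raise ValueError("expected two atom indices per bond order entry")
    bonds = []
    for i, bond_order in enumerate(order):
        begin, end = index[2 * i], index[2 * i + 1]
        # Assumption: indices are zero-based, so valid values are 0..atom_count-1.
        if not (0 <= begin < atom_count and 0 <= end < atom_count):
            raise ValueError("bond %d refers to a missing atom" % i)
        bonds.append((begin, end, bond_order))
    return bonds

ethane_bonds = bond_list(doc)
```

The fourth entry, `ethane_bonds[3]`, is the carbon-carbon single bond between atoms 1 and 4.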