Chemical JSON: Difference between revisions

From wiki.openchemistry.org
Jump to navigation Jump to search
No edit summary
(molecular weight -> mass)
Line 42: Line 42:
   },
   },
   "properties": {
   "properties": {
     "molecular weight": 30.0690,
     "mass": 30.0690,
     "melting point": -172,
     "melting point": -172,
     "boiling point": -88
     "boiling point": -88
Line 96: Line 96:
   },
   },
   "properties": {
   "properties": {
     "molecular weight": 30.0690,
     "mass": 30.0690,
     "melting point": -172,
     "melting point": -172,
     "boiling point": -88
     "boiling point": -88
Line 141: Line 141:
   },
   },
   "properties": {
   "properties": {
     "molecular weight": 30.0690,
     "mass": 30.0690,
     "melting point": -172,
     "melting point": -172,
     "boiling point": -88
     "boiling point": -88

Revision as of 19:03, 19 December 2012

This is my first attempt to outline a schema for encoding molecule objects in JSON. The intent is be be expressive, and allow for many properties to be optional and encoded as arrays where appropriate. I took a CML file for ethane (present in the Avogadro source tree) and attempted to translate it to a JSON representation. This would form the basis of our storage in Mongo DB, as well as a possible on disk format. Looking at some recent work on memory mapped binary JSON data structures in C++, I am encouraged by the flexibility and efficiency of this representation along with its strong programming language coverage.

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elementType":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
    "elementNumber": [  1,   6,   1,   1,   6,   1,   1,   1 ],
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connectionIds": [ "a1", "a2",
                       "a2", "a3",
                       "a2", "a4",
                       "a2", "a5",
                       "a5", "a6",
                       "a5", "a7",
                       "a5", "a8" ],
    "connectionIndex": [ 1, 2,
                         2, 3,
                         2, 4,
                         2, 5,
                         5, 6,
                         5, 7,
                         5, 8 ],
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}

The JSON above validates here, and has a few extra arrays that I think are only really necessary for human readers (atom string IDs, element type rather than number). An alternative, with perhaps a little more standardization in the naming and some extra nesting might be as follows. Code can check whether expected fields are there, and act accordingly.

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elements": {
      "type":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "ids": [ "a1", "a2",
               "a2", "a3",
               "a2", "a4",
               "a2", "a5",
               "a5", "a6",
               "a5", "a7",
               "a5", "a8" ],
      "index": [ 1, 2,
                 2, 3,
                 2, 4,
                 2, 5,
                 5, 6,
                 5, 7,
                 5, 8 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}

A major challenge will be in establishing a list of accepted names, and a convention for adding new names. I think adding in a version number, and having a structure like JSON that is discoverable will make this approachable.

So, a more minimal example, with just what a computer needs...

{
  "chemical json": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": "C 2 H 6",
  "atoms": {
    "elements": {
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "index": [ 0, 1,
                 1, 2,
                 1, 3,
                 1, 4,
                 4, 5,
                 4, 6,
                 4, 7 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}