Chemical JSON: Difference between revisions

From wiki.openchemistry.org
Jump to navigation Jump to search
(Added a link to the corresponding C++ implementation of Chemical JSON in the Avogadro source tree.)
(Updated the page a little to reflect what we are using and present this at the start of the page.)
Line 1: Line 1:
This is my first attempt to outline a schema for encoding molecule objects in [http://json.org/ JSON]. The intent is be be expressive, and allow for many properties to be optional and encoded as arrays where appropriate. I took a [http://www.xml-cml.org/ CML] file for ethane (present in the Avogadro source tree) and attempted to translate it to a JSON representation. This would form the basis of our storage in Mongo DB, as well as a possible on disk format. Looking at some recent work on memory mapped binary JSON data structures in C++, I am encouraged by the flexibility and efficiency of this representation along with its strong programming language coverage.
The example presented below encodes a chemical molecule in [http://json.org/ JSON]. The intent is be be expressive, and allow for most properties to be optional and encoded as arrays where appropriate. I took a [http://www.xml-cml.org/ CML] file for [https://github.com/OpenChemistry/avogadrodata/blob/master/data/ethane.cml ethane] and attempted to translate it to a JSON representation. This forms the basis of our storage in Mongo DB, as well as an on disk format. Looking at some recent work on memory mapped binary JSON data structures in C++, I am encouraged by the flexibility and efficiency of this representation along with its strong programming language coverage.
 
A major challenge will be in establishing a list of accepted names, and a convention for adding new names. I think adding in a version number, and having a structure like JSON that is discoverable will make this approachable.
 
A minimal [https://github.com/OpenChemistry/avogadrodata/blob/master/data/ethane.cjson example of ethane], with just what a computer needs is outlined below, with a C++ implementation of a reader and writer [https://github.com/OpenChemistry/avogadrolibs/blob/master/avogadro/io/cjsonformat.cpp available in Avogadro here].
 
<source lang="JavaScript">
{
  "chemical json": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": "C 2 H 6",
  "atoms": {
    "elements": {
      "number": [  1,  6,  1,  1,  6,  1,  1,  1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
              0.751621, -0.022441, -0.020839,
              1.166929,  0.833015, -0.569312,
              1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "index": [ 0, 1,
                1, 2,
                1, 3,
                1, 4,
                4, 5,
                4, 6,
                4, 7 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
</source>
 
==Earlier revisions==
 
Some more verbose representations, with additional fields that are not necessarily required are presented below (and formed part of the initial design). The first iteration was not as nested for example,


<source lang="JavaScript">
<source lang="JavaScript">
Line 49: Line 98:
</source>
</source>


The JSON above validates [http://jsonlint.com/ here], and has a few extra arrays that I think are only really necessary for human readers (atom string IDs, element type rather than number). An alternative, with perhaps a little more standardization in the naming and some extra nesting might be as follows. Code can check whether expected fields are there, and act accordingly.
The JSON above validates [http://jsonlint.com/ here], and has a few extra arrays that I think are only really necessary for human readers (atom string IDs, element type rather than number). An alternative, with perhaps a little more standardization in the naming and some extra nesting might be as follows. Code can check whether expected fields are there, and act accordingly. This is close to the minimal example presented at the top of the page, but has arrays such as atoms.elements.type in addition to atoms.elements.number.


<source lang="JavaScript">
<source lang="JavaScript">
Line 92: Line 141:
                 5, 7,
                 5, 7,
                 5, 8 ]
                 5, 8 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
</source>
A major challenge will be in establishing a list of accepted names, and a convention for adding new names. I think adding in a version number, and having a structure like JSON that is discoverable will make this approachable.
So, a more minimal example, with just what a computer needs is outlined below, with a C++ implementation of a reader and writer [https://github.com/OpenChemistry/avogadrolibs/blob/master/avogadro/io/cjsonformat.cpp available in Avogadro here].
<source lang="JavaScript">
{
  "chemical json": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": "C 2 H 6",
  "atoms": {
    "elements": {
      "number": [  1,  6,  1,  1,  6,  1,  1,  1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
              0.751621, -0.022441, -0.020839,
              1.166929,  0.833015, -0.569312,
              1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "index": [ 0, 1,
                1, 2,
                1, 3,
                1, 4,
                4, 5,
                4, 6,
                4, 7 ]
     },
     },
     "order": [ 1, 1, 1, 1, 1, 1, 1 ]
     "order": [ 1, 1, 1, 1, 1, 1, 1 ]

Revision as of 11:37, 19 February 2013

The example presented below encodes a chemical molecule in JSON. The intent is be be expressive, and allow for most properties to be optional and encoded as arrays where appropriate. I took a CML file for ethane and attempted to translate it to a JSON representation. This forms the basis of our storage in Mongo DB, as well as an on disk format. Looking at some recent work on memory mapped binary JSON data structures in C++, I am encouraged by the flexibility and efficiency of this representation along with its strong programming language coverage.

A major challenge will be in establishing a list of accepted names, and a convention for adding new names. I think adding in a version number, and having a structure like JSON that is discoverable will make this approachable.

A minimal example of ethane, with just what a computer needs is outlined below, with a C++ implementation of a reader and writer available in Avogadro here.

{
  "chemical json": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": "C 2 H 6",
  "atoms": {
    "elements": {
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "index": [ 0, 1,
                 1, 2,
                 1, 3,
                 1, 4,
                 4, 5,
                 4, 6,
                 4, 7 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}

Earlier revisions

Some more verbose representations, with additional fields that are not necessarily required are presented below (and formed part of the initial design). The first iteration was not as nested for example,

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elementType":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
    "elementNumber": [  1,   6,   1,   1,   6,   1,   1,   1 ],
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connectionIds": [ "a1", "a2",
                       "a2", "a3",
                       "a2", "a4",
                       "a2", "a5",
                       "a5", "a6",
                       "a5", "a7",
                       "a5", "a8" ],
    "connectionIndex": [ 1, 2,
                         2, 3,
                         2, 4,
                         2, 5,
                         5, 6,
                         5, 7,
                         5, 8 ],
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}

The JSON above validates here, and has a few extra arrays that I think are only really necessary for human readers (atom string IDs, element type rather than number). An alternative, with perhaps a little more standardization in the naming and some extra nesting might be as follows. Code can check whether expected fields are there, and act accordingly. This is close to the minimal example presented at the top of the page, but has arrays such as atoms.elements.type in addition to atoms.elements.number.

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elements": {
      "type":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "ids": [ "a1", "a2",
               "a2", "a3",
               "a2", "a4",
               "a2", "a5",
               "a5", "a6",
               "a5", "a7",
               "a5", "a8" ],
      "index": [ 1, 2,
                 2, 3,
                 2, 4,
                 2, 5,
                 5, 6,
                 5, 7,
                 5, 8 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}