Chemical JSON: Difference between revisions

From wiki.openchemistry.org
Latest revision as of 14:18, 3 May 2017

We are moving development of the Chemical JSON format to a GitHub repository (https://github.com/OpenChemistry/chemicaljson) for better coordination.

The example presented below encodes a molecule in JSON. The intent is to be expressive, and to allow most properties to be optional and encoded as arrays where appropriate. I took a CML file for ethane and translated it to a JSON representation. This forms the basis of our storage in MongoDB, as well as an on-disk format. Looking at some recent work on memory-mapped binary JSON data structures in C++, I am encouraged by the flexibility and efficiency of this representation, along with its strong programming language coverage.

A major challenge will be establishing a list of accepted names and a convention for adding new ones. I think that adding a version number, and using a discoverable structure like JSON, will make this approachable.

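A minimal example of ethane (https://github.com/OpenChemistry/avogadrodata/blob/master/data/ethane.cjson), with just what a computer needs, is outlined below. A C++ implementation of a reader and writer is available in Avogadro (https://github.com/OpenChemistry/avogadrolibs/blob/master/avogadro/io/cjsonformat.cpp).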

{
  "chemical json": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": "C 2 H 6",
  "atoms": {
    "elements": {
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "index": [ 0, 1,
                 1, 2,
                 1, 3,
                 1, 4,
                 4, 5,
                 4, 6,
                 4, 7 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
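
As a rough illustration of how a reader might consume this layout, the Python sketch below parses an abridged copy of the ethane example and regroups the flat arrays into per-atom positions and per-bond index pairs. The names `chunk` and `read_molecule` are hypothetical, and treating a missing "order" array as all single bonds is an assumption made for this sketch, not something the format prescribes. It is not the Avogadro implementation.

```python
import json

# Abridged copy of the ethane example above (the "properties" block is omitted).
CJSON_TEXT = """
{
  "chemical json": 0,
  "name": "ethane",
  "atoms": {
    "elements": { "number": [ 1, 6, 1, 1, 6, 1, 1, 1 ] },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": { "index": [ 0, 1, 1, 2, 1, 3, 1, 4, 4, 5, 4, 6, 4, 7 ] },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  }
}
"""


def chunk(flat, size):
    """Split a flat list into tuples of `size` consecutive items."""
    if len(flat) % size != 0:
        raise ValueError(f"array length {len(flat)} is not a multiple of {size}")
    return [tuple(flat[i:i + size]) for i in range(0, len(flat), size)]


def read_molecule(text):
    """Return (atomic numbers, xyz triples, [((i, j), order), ...])."""
    doc = json.loads(text)
    numbers = doc["atoms"]["elements"]["number"]
    positions = chunk(doc["atoms"]["coords"]["3d"], 3)
    if len(positions) != len(numbers):
        raise ValueError("coordinate count does not match atom count")

    # Bonds are treated as optional here; a missing "order" array is assumed
    # to mean single bonds (an assumption of this sketch).
    bonds = doc.get("bonds", {})
    pairs = chunk(bonds.get("connections", {}).get("index", []), 2)
    orders = bonds.get("order", [1] * len(pairs))
    if len(orders) != len(pairs):
        raise ValueError("bond order count does not match bond count")
    return numbers, positions, list(zip(pairs, orders))


numbers, positions, bonds = read_molecule(CJSON_TEXT)
print(len(numbers), "atoms,", len(bonds), "bonds")
```

Keeping coordinates and bond indices as flat arrays means a reader only needs the stride (3 for coordinates, 2 for bond pairs) to recover the per-atom and per-bond records.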

Crystal structures are specified using cell parameters and fractional (lattice) coordinates:

{
  "chemical json": 0,
  "name": "TiO2 rutile",
  "formula": "Ti 2 O 4",
  "unit cell": {
    "a": 2.95812,
    "b": 4.59373,
    "c": 4.59373,
    "alpha": 90.0,
    "beta":  90.0,
    "gamma": 90.0
  },
  "atoms": {
    "elements": {
      "number": [ 22, 22, 8, 8, 8, 8 ]
    },
    "coords": {
      "3d fractional": [ 0.00000, 0.00000, 0.00000,
                         0.50000, 0.50000, 0.50000,
                         0.00000, 0.30530, 0.30530,
                         0.00000, 0.69470, 0.69470,
                         0.50000, 0.19470, 0.80530,
                         0.50000, 0.80530, 0.19470 ]
    }
  }
}
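
To show how the cell parameters and fractional coordinates relate, here is a minimal Python sketch that converts the "3d fractional" array to Cartesian coordinates. It assumes the angles are in degrees and uses one common orientation convention (a along x, b in the xy plane); the page does not specify either, and the function names are hypothetical.

```python
import math

# The rutile example above, written as a Python dict.
RUTILE = {
    "unit cell": {"a": 2.95812, "b": 4.59373, "c": 4.59373,
                  "alpha": 90.0, "beta": 90.0, "gamma": 90.0},
    "atoms": {
        "elements": {"number": [22, 22, 8, 8, 8, 8]},
        "coords": {"3d fractional": [0.00000, 0.00000, 0.00000,
                                     0.50000, 0.50000, 0.50000,
                                     0.00000, 0.30530, 0.30530,
                                     0.00000, 0.69470, 0.69470,
                                     0.50000, 0.19470, 0.80530,
                                     0.50000, 0.80530, 0.19470]},
    },
}


def cell_vectors(cell):
    """Build the three lattice vectors from lengths and angles.

    Assumed convention: a lies along x, b lies in the xy plane,
    and the angles are given in degrees.
    """
    a, b, c = cell["a"], cell["b"], cell["c"]
    alpha, beta, gamma = (math.radians(cell[k]) for k in ("alpha", "beta", "gamma"))
    cx = c * math.cos(beta)
    cy = c * (math.cos(alpha) - math.cos(beta) * math.cos(gamma)) / math.sin(gamma)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return ((a, 0.0, 0.0),
            (b * math.cos(gamma), b * math.sin(gamma), 0.0),
            (cx, cy, cz))


def fractional_to_cartesian(cell, fractional):
    """Convert a flat [fa, fb, fc, ...] array into a list of (x, y, z) tuples."""
    va, vb, vc = cell_vectors(cell)
    cartesian = []
    for i in range(0, len(fractional), 3):
        fa, fb, fc = fractional[i:i + 3]
        cartesian.append(tuple(fa * va[k] + fb * vb[k] + fc * vc[k]
                               for k in range(3)))
    return cartesian


for xyz in fractional_to_cartesian(RUTILE["unit cell"],
                                   RUTILE["atoms"]["coords"]["3d fractional"]):
    print("%10.5f %10.5f %10.5f" % xyz)
```

For a cell with all angles at 90 degrees, as in this example, the conversion reduces to scaling each fractional component by the matching cell length.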

Earlier revisions

Some more verbose representations, with additional fields that are not necessarily required, are presented below (they formed part of the initial design). The first iteration was not as nested, for example:

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "inchikey": "WETWJCDKMRHUPV-UHFFFAOYSA-N",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elementType":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
    "elementNumber": [  1,   6,   1,   1,   6,   1,   1,   1 ],
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connectionIds": [ "a1", "a2",
                       "a2", "a3",
                       "a2", "a4",
                       "a2", "a5",
                       "a5", "a6",
                       "a5", "a7",
                       "a5", "a8" ],
    "connectionIndex": [ 1, 2,
                         2, 3,
                         2, 4,
                         2, 5,
                         5, 6,
                         5, 7,
                         5, 8 ],
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
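
For readers comparing this first iteration with the more nested layouts, the following Python sketch shows one way the flat keys could be regrouped. The function name `upgrade_flat_layout` is hypothetical and the mapping is inferred from the examples on this page; it is not an official migration tool.

```python
def upgrade_flat_layout(old):
    """Regroup the flat first-iteration keys into the nested layout.

    Key mapping (inferred from the examples on this page):
      atoms.elementType     -> atoms.elements.type
      atoms.elementNumber   -> atoms.elements.number
      bonds.connectionIds   -> bonds.connections.ids
      bonds.connectionIndex -> bonds.connections.index
    Top-level keys other than "atoms" and "bonds" are copied unchanged.
    """
    atoms = old["atoms"]
    bonds = old["bonds"]
    new = {key: value for key, value in old.items()
           if key not in ("atoms", "bonds")}
    new["atoms"] = {
        "ids": atoms["ids"],
        "elements": {
            "type": atoms["elementType"],
            "number": atoms["elementNumber"],
        },
        "coords": atoms["coords"],
    }
    new["bonds"] = {
        "connections": {
            "ids": bonds["connectionIds"],
            "index": bonds["connectionIndex"],
        },
        "order": bonds["order"],
    }
    return new


# Abridged two-atom fragment of the example above, for illustration only.
flat = {
    "version": 0,
    "name": "ethane",
    "atoms": {
        "ids": ["a1", "a2"],
        "elementType": ["H", "C"],
        "elementNumber": [1, 6],
        "coords": {"3d": [1.185080, -0.003838, 0.987524,
                          0.751621, -0.022441, -0.020839]},
    },
    "bonds": {
        "connectionIds": ["a1", "a2"],
        "connectionIndex": [1, 2],
        "order": [1],
    },
}
nested = upgrade_flat_layout(flat)
print(sorted(nested["atoms"]["elements"]))
```

Note that the index arrays in these earlier revisions count atoms from 1, whereas the minimal example at the top of the page counts from 0; the sketch copies the indices without shifting them.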

The JSON above validates with JSONLint (http://jsonlint.com/), and has a few extra arrays that I think are only really necessary for human readers (atom string IDs, element type rather than number). An alternative, with perhaps a little more standardization in the naming and some extra nesting, might be as follows. Code can check whether expected fields are there, and act accordingly. This is close to the minimal example presented at the top of the page, but has arrays such as atoms.elements.type in addition to atoms.elements.number.

{
  "version": 0,
  "name": "ethane",
  "inchi": "1/C2H6/c1-2/h1-2H3",
  "formula": {
    "concise": "C 2 H 6"
  },
  "atoms": {
    "ids": [ "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8" ],
    "elements": {
      "type":   [ "H", "C", "H", "H", "C", "H", "H", "H" ],
      "number": [  1,   6,   1,   1,   6,   1,   1,   1 ]
    },
    "coords": {
      "3d": [  1.185080, -0.003838,  0.987524,
               0.751621, -0.022441, -0.020839,
               1.166929,  0.833015, -0.569312,
               1.115519, -0.932892, -0.514525,
              -0.751587,  0.022496,  0.020891,
              -1.166882, -0.833372,  0.568699,
              -1.115691,  0.932608,  0.515082,
              -1.184988,  0.004424, -0.987522 ]
    }
  },
  "bonds": {
    "connections": {
      "ids": [ "a1", "a2",
               "a2", "a3",
               "a2", "a4",
               "a2", "a5",
               "a5", "a6",
               "a5", "a7",
               "a5", "a8" ],
      "index": [ 1, 2,
                 2, 3,
                 2, 4,
                 2, 5,
                 5, 6,
                 5, 7,
                 5, 8 ]
    },
    "order": [ 1, 1, 1, 1, 1, 1, 1 ]
  },
  "properties": {
    "molecular mass": 30.0690,
    "melting point": -172,
    "boiling point": -88
  }
}
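
The idea that code can check which fields are present and act accordingly could look roughly like the Python sketch below. The helper names and the small `SYMBOL_TO_NUMBER` table are hypothetical and cover only the elements that appear on this page; a real reader would need a full periodic table.

```python
# Hypothetical lookup table, limited to the elements used on this page.
SYMBOL_TO_NUMBER = {"H": 1, "C": 6, "O": 8, "Ti": 22}


def format_version(doc):
    """Return the version number, or None if no version key is present.

    The examples on this page use either "chemical json" or "version"
    as the key, so both are checked.
    """
    for key in ("chemical json", "version"):
        if key in doc:
            return doc[key]
    return None


def atomic_numbers(doc):
    """Prefer atoms.elements.number; fall back to atoms.elements.type."""
    elements = doc.get("atoms", {}).get("elements", {})
    if "number" in elements:
        return list(elements["number"])
    if "type" in elements:
        return [SYMBOL_TO_NUMBER[symbol] for symbol in elements["type"]]
    raise KeyError("no atoms.elements.number or atoms.elements.type array")


# A document that carries only element symbols still yields atomic numbers.
symbols_only = {"version": 0,
                "atoms": {"elements": {"type": ["H", "C", "H", "H"]}}}
print(format_version(symbols_only), atomic_numbers(symbols_only))
```

Because every array is optional, a reader written this way can accept both the minimal and the more verbose documents without separate code paths.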