GSoC Ideas 2018: Difference between revisions
Greg.landrum (talk | contribs)
(31 intermediate revisions by 8 users not shown)
Line 31: | Line 31:
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).
=== Project: Molecular Dynamics ===
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2.
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS), including extracting parameters and atom types from Open Babel. Initial support already exists for reading basic trajectories from XYZ files and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics: read trajectory files, ideally loading time steps on demand for large files rather than loading the entire file up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions) and rare events, and for visualizing these in addition to simple trajectory animations.
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).
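As a rough illustration of the delta compression idea mentioned in the Molecular Dynamics project, the sketch below stores the first trajectory frame in full and every later frame as integer differences from the previous one. It is written in Python for brevity (the project itself targets C++), and the function names, the flattened coordinate layout, and the fixed-point scale of 1000 are all assumptions made for this example, not part of Avogadro 2.

```python
from typing import List, Sequence


def quantize(frame: Sequence[float], scale: int = 1000) -> List[int]:
    """Convert coordinates to fixed-point integers (hypothetical precision: 1/scale)."""
    return [round(c * scale) for c in frame]


def delta_encode(frames: Sequence[Sequence[float]], scale: int = 1000) -> List[List[int]]:
    """Store the first frame in full and each later frame as differences from the previous one."""
    encoded: List[List[int]] = []
    previous = None
    for frame in frames:
        q = quantize(frame, scale)
        if previous is None:
            encoded.append(q)  # key frame: absolute quantized coordinates
        else:
            encoded.append([c - p for c, p in zip(q, previous)])  # small integers if atoms move little
        previous = q
    return encoded


def delta_decode(encoded: Sequence[Sequence[int]], scale: int = 1000) -> List[List[float]]:
    """Rebuild the (quantized) frames by accumulating the stored differences."""
    frames: List[List[float]] = []
    current = None
    for block in encoded:
        if current is None:
            current = list(block)
        else:
            current = [c + d for c, d in zip(current, block)]
        frames.append([c / scale for c in current])
    return frames
```

Because atoms usually move only slightly between time steps, the differences are small integers that a general-purpose compressor can pack tightly. The trade-off is that decoding frame N requires all earlier frames back to the last key frame, so a real implementation would probably insert key frames at intervals to keep on-demand loading cheap.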
Line 40: | Line 50:
'''Brief explanation:''' Support for biological data, representations, and visualization
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (e.g., ribbons, cartoons), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3Dmol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments, would be ideal. Extending that to porting builders for fragment-based building blocks would be a big plus once basic support for rendering is in place.
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)
=== Project: Scripting Bindings ===
Line 75: | Line 75:
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)
==Open Babel Project Ideas==
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
=== Project: Implement MMTF format ===
Line 122: | Line 113:
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)
=== Project: Test Framework Overhaul ===
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the Open Babel development community.
=== Project: Develop a JavaScript version of Open Babel ===
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js).
'''Prerequisites''': Some experience in C++, and also with JavaScript.
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)
=== Project: Develop a validation and standardization filter ===
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.
Such a model could be used as a filter, or as a warning to flag up problematic structures.
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which uses RDKit.
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)
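One very simple way to picture the "how normal/unusual" model from the validation project is a frequency table of atom environments seen in the reference set. The Python sketch below is only an illustration under strong assumptions: molecules are represented as plain lists of (element, number of neighbors) tuples, and <code>build_model</code> and <code>unusual_atoms</code> are hypothetical names. It does not use the Open Babel or RDKit APIs, and a real model would need far richer descriptors.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Atom = Tuple[str, int]  # (element symbol, number of bonded neighbors); a toy representation


def build_model(reference_molecules: Iterable[Iterable[Atom]]) -> Counter:
    """Count how often each (element, coordination) pair occurs in the reference set."""
    counts: Counter = Counter()
    for molecule in reference_molecules:
        counts.update(molecule)
    return counts


def unusual_atoms(model: Counter, molecule: Iterable[Atom], min_count: int = 1) -> List[Atom]:
    """Return atoms whose environment was seen fewer than min_count times in the reference set."""
    return [atom for atom in molecule if model[atom] < min_count]


# Toy reference set: only ordinary carbon, nitrogen, and oxygen environments.
reference = [
    [("C", 4), ("C", 4), ("O", 2)],
    [("C", 3), ("N", 3)],
]
model = build_model(reference)

# A query containing a 5-coordinate carbon and a ruthenium atom is flagged.
flagged = unusual_atoms(model, [("C", 4), ("C", 5), ("Ru", 6)])
```

With this toy reference set, the 5-coordinate carbon and the ruthenium atom are flagged because neither environment occurs in the reference data, which mirrors the two examples given in the project description.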
=== Project: Add color to Open Babel output ===
'''Brief explanation''': A general framework for specifying color at the terminal would allow us to add color to output, e.g. nitrogen in blue, improve the display of warning messages (red), and highlight substructures.
'''Expected results''': When writing results to a terminal, it's possible to use a whole range of colors (and styles) to enhance the display. Putting a general framework in place that works cross-platform could be very helpful in a wide range of scenarios. For example, when substructure searching, the matched atoms could be highlighted, which would greatly aid in visualizing the match.
'''Prerequisites''': Experience in C++.
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)
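To make the terminal color idea concrete, here is a minimal sketch using ANSI SGR escape sequences, which many (but not all) terminals understand. It is in Python purely for illustration, since the project itself calls for C++. The helper names <code>colorize</code> and <code>highlight_atoms</code> are hypothetical, and the <code>enabled</code> flag stands in for whatever cross-platform capability detection the real framework would need.

```python
from typing import Iterable, Sequence

RESET = "\033[0m"  # ANSI SGR code that restores the default style
COLORS = {
    "red": "\033[31m",
    "green": "\033[32m",
    "blue": "\033[34m",
}


def colorize(text: str, color: str, enabled: bool = True) -> str:
    """Wrap text in an ANSI color sequence, or return it unchanged when color is disabled."""
    if not enabled:
        return text
    return f"{COLORS[color]}{text}{RESET}"


def highlight_atoms(symbols: Sequence[str], matched: Iterable[int],
                    color: str = "red", enabled: bool = True) -> str:
    """Join atom symbols into one string, coloring those whose index is in matched."""
    matched_set = set(matched)
    return "".join(
        colorize(symbol, color, enabled) if index in matched_set else symbol
        for index, symbol in enumerate(symbols)
    )


if __name__ == "__main__":
    # Highlight the nitrogen (index 2) of a three-atom chain in blue.
    print(highlight_atoms(["C", "C", "N"], [2], color="blue"))
```

Keeping the color decision behind a single switch means output redirected to a file, or sent to a console without ANSI support, can fall back to plain text without touching the calling code.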
==cclib Project Ideas==
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.
=== Project: Implement QC JSON schema in cclib ===
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)
=== Project: Advanced Analysis of Quantum Chemistry Data ===
Line 131: | Line 178:
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.
'''Suggested Readings:'''
Line 138: | Line 185:
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)
=== Project: Refactor parsers ===
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability.
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with the cclib team, and implement the best proposal. Ideally, new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.
'''Prerequisites:''' Experience with Python.
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
=== Project: Implement new parsers ===
'''Brief explanation:''' There are outstanding issues on GitHub for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).
'''Expected results:''' Generate test data and unit tests, and implement new parsers.
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
=== Project: Discovering computational chemistry content online ===
Line 168: | Line 215:
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)
=== Project: Machine learning applied to parsing computational chemistry output ===
Line 178: | Line 225:
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)
==3Dmol.js Project Ideas==
Line 187: | Line 234:
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]
'''Brief explanation:''' [http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.
Line 217: | Line 264:
'''Prerequisites:''' C++
'''Mentor:''' Nadine Schneider (nadine-1.schneider at novartis dot com)
=== Project: RDKit - MMTF Integration ===
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.
Line 228: | Line 275:
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)
=== Project: RDKit - MolVS ===
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser).
'''Prerequisites:''' Python; C++ would be an advantage but is not required
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)
=== Project: neo4j integration ===
'''Brief explanation:'''
The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.
'''Expected results:'''
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and for efficiently filtering paths using chemical knowledge while traversing the graph.
'''Prerequisites:''' Java
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)
=== Project: RDKit - OpenMM Integration ===
Line 233: | Line 302:
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.
'''Prerequisites:''' C++ and some Python
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others
=== Project: MongoDB integration ===
'''Brief explanation:'''
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionality on a document database.
'''Prerequisites:''' Python
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)
=== Project: Implement the Analog Series-Based Scaffold method ===
'''Brief explanation:'''
The concept of the chemical scaffold is central to our understanding and analysis of many medicinal chemistry datasets. There are multiple ways to define the scaffold of a set of molecules. Of these, the "Murcko scaffold" is probably the most common, but it is arguably also one of the worst (though that may be a bit harsh, since the idea of a scaffold is not very well defined). A more data-driven approach is described in these two open-access articles and the references therein:
https://www.future-science.com/doi/10.4155/fsoa-2017-0102
https://www.future-science.com/doi/10.4155/fsoa-2017-0135
It would be quite useful to have an RDKit implementation of this method.
'''Expected results:'''
A stable and well-tested RDKit implementation, Python or C++ based, of the Analog Series-Based Scaffold method.
'''Prerequisites:''' Python or C++
'''Mentor:''' Nik Stiefl (nikolaus.stiefl at novartis dot com)
== MSDK / MZmine Project Ideas ==
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms.
=== Project: MSDK - Feature Detection ===
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of the ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu>
=== Project: MSDK - Spectral Database Search ===
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).
'''Prerequisites:''' Java
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com>
=== Project: MSDK - New IO Modules ===
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.
'''Prerequisites:''' Java
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com>
=== Project: MSDK - KNIME integration ===
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].
'''Prerequisites:''' Java
'''Mentor:''' Tomas Pluskal <plusik@gmail.com>
Line 267: | Line 382:
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu>
=== Project: MSDK / MZmine - Correlation Analysis ===
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.
'''Prerequisites:''' Java
'''Mentor:''' Tomas Pluskal <plusik@gmail.com>
=== Project: MZmine - New Visualization Modules ===
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)
'''Mentor:''' Tomas Pluskal <plusik@gmail.com>
== NWChem Project Ideas ==
Line 310: | Line 433:
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)
== DeepChem Project Ideas ==
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.
=== Project: Transfer Learning Framework ===
'''Brief explanation:''' Create easy-to-use tools for common transfer learning scenarios.
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model-independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.
'''Prerequisites:''' Python, TensorFlow
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)
=== Project: Data Interfaces ===
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts on how to use the new, improved interfaces are also expected.
'''Prerequisites:''' Python, some TensorFlow
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)
=== Project: Model Visualization ===
'''Brief explanation:''' Node importance visualizations from graph models
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] in DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom-level visualizations.
'''Prerequisites:''' Python, TensorFlow, RDKit
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)
=== Project: Imaging Tools ===
'''Brief explanation:''' Enable chemical image segmentation and property prediction.
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net] and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside the DeepChem framework. We want both pre-trained networks for problems of chemical importance and the image data augmentation techniques used to create the models.
'''Prerequisites:''' Python, TensorFlow
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)
== Miscellaneous Project Ideas ==
Latest revision as of 23:08, 27 February 2018
Guidelines
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve Avogadro 2, cclib, 3DMol, and Open Babel. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!
Adding Ideas
When adding a new idea to this page, please try to include the following information:
- A brief explanation of the idea.
- Expected results/feature additions.
- Any prerequisites for working on the project.
- Links to any further information, discussions, bug reports etc.
- Any special mailing lists, if not the standard mailing list for the project.
- Your name and email address for contact (if willing to mentor, or nominated mentor).
Proposal Guidelines
Students need to write and submit a proposal. We have added the "Applying to GSoC" page to guide students on what we would like to see in those proposals.
Avogadro 2 Project Ideas
Avogadro 2 is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++, designed to be modular for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.
Project: Integrate with VTK: Volume Rendering and Charts
Brief explanation: Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data
Expected results: The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.
Prerequisites: Experience in C++, some experience with OpenGL ideal, but not necessary.
Mentor: Marcus D. Hanwell (marcus dot hanwell at kitware dot com).
Project: Molecular Dynamics
Brief explanation: Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2
Expected results: Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.
Prerequisites: Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.
Mentor: Marcus D. Hanwell (marcus dot hanwell at kitware dot com).
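The on-demand loading of time steps mentioned in the Molecular Dynamics idea can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not Avogadro 2 code: the function names `index_xyz_frames` and `read_frame` are hypothetical, and the input is assumed to be a well-formed multi-frame XYZ file opened in text mode. The idea is to scan the file once to record where each frame starts, then parse a frame only when it is requested.

```python
import io


def index_xyz_frames(stream):
    """Return the offset of each frame in a multi-frame XYZ stream.

    Only the atom-count line of each frame is parsed; coordinates are skipped.
    """
    offsets = []
    while True:
        start = stream.tell()
        header = stream.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        n_atoms = int(header)
        offsets.append(start)
        # Skip the comment line plus one line per atom.
        for _ in range(n_atoms + 1):
            stream.readline()
    return offsets


def read_frame(stream, offset):
    """Seek to one frame and parse it into (symbol, x, y, z) tuples."""
    stream.seek(offset)
    n_atoms = int(stream.readline())
    stream.readline()  # comment line, ignored here
    atoms = []
    for _ in range(n_atoms):
        symbol, x, y, z = stream.readline().split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return atoms


# Two-frame toy trajectory held in memory; a real file object works the same way.
trajectory = io.StringIO(
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
offsets = index_xyz_frames(trajectory)
last_frame = read_frame(trajectory, offsets[-1])
```

Only the list of offsets stays in memory, so the cost of opening a large trajectory is one pass over the file rather than a full parse of every frame.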
Project: Biological Data Visualization
Brief explanation: Support for biological data, representations, and visualization
Expected results: Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (e.g., ribbons, cartoons), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3Dmol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.
Prerequisites: Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.
Mentor: Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)
Project: Scripting Bindings
Brief explanation: Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2
Expected results: Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.
Prerequisites: Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)
Project: Integrate with RDKit
Brief explanation: Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization
Expected results: RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.
Prerequisites: Experience in C++, some experience with Python will be helpful.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Open Babel Project Ideas
Open Babel is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
Project: Implement MMTF format
Brief explanation: Implementation of MMTF file format in OpenBabel.
Expected results: Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)
Project: Fast, Efficient Fragment-Based Coordinate Generation
Brief explanation: A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.
Expected results: Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or only 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel relies mainly on a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom; fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.
Importantly, a fragment-based approach is highly efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.
A recent research paper describing one such approach can be found here
Prerequisites: Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)
Project: Bayesian Optimization of Conformer Geometries
Brief explanation: Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be thousands or millions of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.
Expected results: An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a completely black-box optimization - we know some of the physics involved).
Prerequisites: Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project: Test Framework Overhaul
Brief explanation: Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.
Expected results: A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.
Prerequisites: Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.
Mentor: Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.
Project: Develop a JavaScript version of Open Babel
Brief explanation: Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.
Expected results: Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)
Prerequisites: Some experience in C++, and also with JavaScript.
Mentor: Noel O'Boyle (baoilleach at gmail dot com)
Project: Develop a validation and standardization filter
Brief explanation: Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?
Expected results: Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.
Such a model could be used as a filter, or as a warning to flag up problematic structures.
Code could be modeled on MolVS using RDKit [[1]]
Prerequisites: Experience in C++ or Python, and an interest in data science or statistics.
Mentor: Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)
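As a rough illustration of the "how normal/unusual is a query structure" idea for the validation filter, the following Python sketch builds a very simple frequency model from a reference set and flags rare elements. It is a toy under stated assumptions: molecules are represented only as lists of element symbols, the helper names and the `min_fraction` threshold are hypothetical, and a real filter would use far richer descriptors (coordination, charge, stereo) via a cheminformatics toolkit.

```python
from collections import Counter


def build_element_model(reference_molecules):
    """Fraction of reference molecules that contain each element."""
    counts = Counter()
    for elements in reference_molecules:
        counts.update(set(elements))  # count each element once per molecule
    total = len(reference_molecules)
    return {element: n / total for element, n in counts.items()}


def unusual_elements(molecule, model, min_fraction=0.01):
    """Elements of the query that are rare or absent in the reference set."""
    return sorted(e for e in set(molecule) if model.get(e, 0.0) < min_fraction)


# Tiny stand-in for a drug-like reference set such as ChEMBL.
reference = [["C", "H", "O"], ["C", "H", "N"], ["C", "H", "N", "O"]]
model = build_element_model(reference)
flags = unusual_elements(["C", "H", "Ru"], model)
```

With this reference set, the ruthenium atom is the only element reported, mirroring the example in the project description.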
Project: Add color to Open Babel output
Brief explanation: A general framework for specifying color at the terminal would allow us to add color to output, e.g. nitrogen in blue, improve the display of warning messages (red), and highlight substructures.
Expected results: When writing results to a terminal, it's possible to use a whole range of colors (and styles) to enhance the display. By putting a general framework in place, that works cross-platform, this could be very helpful in a wide range of scenarios. For example, when substructure searching, the matched atoms could be highlighted, which would greatly aid in visualizing the match.
Prerequisites: Experience in C++.
Mentor: Noel O'Boyle (baoilleach at gmail dot com)
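The terminal color idea can be sketched in a few lines of Python using standard ANSI escape sequences. This is an assumption-laden illustration rather than a proposed Open Babel API: the names `colorize` and `highlight_atoms` are hypothetical, the highlighting works on character positions in a SMILES string rather than on real atom indices, and cross-platform support (e.g., older Windows consoles) would need extra handling.

```python
import sys

# Standard ANSI SGR escape sequences for a few foreground colors.
RESET = "\033[0m"
COLORS = {"red": "\033[31m", "green": "\033[32m", "blue": "\033[34m"}


def colorize(text, color, enabled=True):
    """Wrap text in a color escape sequence, or return it unchanged."""
    if not enabled:
        return text
    return COLORS[color] + text + RESET


def highlight_atoms(smiles, matched_positions, color="red", enabled=True):
    """Color the characters of a SMILES string at the given indices."""
    matched = set(matched_positions)
    return "".join(
        colorize(ch, color, enabled) if i in matched else ch
        for i, ch in enumerate(smiles)
    )


# Only emit escape codes when writing to an interactive terminal.
print(highlight_atoms("CC(=O)N", [6], color="blue", enabled=sys.stdout.isatty()))
```

Keeping the `enabled` switch in one place means piped or redirected output stays free of escape codes.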
cclib Project Ideas
cclib is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.
Project: Implement QC JSON schema in cclib
Brief explanation: Incorporate the MolSSI JSON schema, which is currently in the design stage.
Expected results: Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.
Prerequisites: Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.
Mentor: Karol Langner (karol.langner at gmail dot com)
Project: Advanced Analysis of Quantum Chemistry Data
Brief explanation: Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.
Expected results: The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.
Suggested Readings:
Prerequisites: Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)
Project: Refactor parsers
Brief explanation: The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability.
Expected results: Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.
Prerequisites: Experience with Python.
Mentor: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
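One possible shape for the parser refactoring is a dispatch table of small handler functions, sketched below in Python. This is not cclib's actual code: the log lines, trigger strings, and handler names are hypothetical, and the sketch only shows how a long extract() body could be split into units that can be tested and documented individually.

```python
def parse_energy(line, data):
    """Handle a hypothetical 'SCF energy = <value>' line."""
    data.setdefault("scfenergies", []).append(float(line.split("=")[1]))


def parse_charge(line, data):
    """Handle a hypothetical 'Charge = <value>' line."""
    data["charge"] = int(line.split("=")[1])


# Each trigger string maps to one small, individually testable function.
HANDLERS = [
    ("SCF energy", parse_energy),
    ("Charge", parse_charge),
]


def extract(lines):
    """Dispatch every line to the first handler whose trigger it contains."""
    data = {}
    for line in lines:
        for trigger, handler in HANDLERS:
            if trigger in line:
                handler(line, data)
                break
    return data


log = ["Charge = 0", "SCF energy = -76.02", "SCF energy = -76.03"]
result = extract(log)
```

Because each handler has its own docstring, the documentation for a parsed attribute can live next to the code that extracts it.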
Project: Implement new parsers
Brief explanation: There are outstanding issues on GitHub for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).
Expected results: Generate test data and unit tests, and implement new parsers.
Prerequisites: Experience with Python, and ideally familiarity with computational chemistry programs.
Mentor: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
Project: Discovering computational chemistry content online
Brief explanation: There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!
Expected results: Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.
Prerequisites: Experience with Python, and ideally familiarity with computational chemistry and web indexing.
Mentor: Karol Langner (karol.langner at gmail dot com)
Project: Machine learning applied to parsing computational chemistry output
Brief explanation: Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? Which machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?
Expected results: Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.
Prerequisites: Experience with Python, machine learning, and ideally familiarity with computational chemistry.
Mentor: Karol Langner (karol.langner at gmail dot com)
3Dmol.js Project Ideas
3Dmol.js is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.
Project: Implement volumetric rendering in 3Dmol.js
Brief explanation: Volumetric rendering provides a way to visualize volumetric data in more detail than simple isosurfaces.
Expected results: A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.
Prerequisites: Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.
Mentor: David Koes (dkoes@pitt.edu)
Project: Google Cardboard for 3Dmol.js
Brief explanation: Implement low cost virtual reality visualization using Google Cardboard
Expected results: Google Cardboard (https://en.wikipedia.org/wiki/Google_Cardboard) is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).
Prerequisites: Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.
Mentor: David Koes (dkoes@pitt.edu)
RDKit Project Ideas
The RDKit is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.
Project: Create a generalized fingerprinting function
Brief explanation: The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008
Expected results: A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.
Prerequisites: C++
Mentor: Nadine Schneider (nadine-1.schneider at novartis dot com )
Project: RDKit - MMTF Integration
Brief explanation: Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.
Expected results: A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.
Prerequisites: C++
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
Project: RDKit - MolVS
Brief explanation: MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.
Expected results: A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser)
Prerequisites: Python, C++ would be an advantage but is not required
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
Project: neo4j integration
Brief explanation: The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.
Expected results: An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.
Prerequisites: Java
Mentor: Christian Pilger (christian.pilger at basf.com)
Project: RDKit - OpenMM Integration
Brief explanation: OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.
Expected results: C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.
Prerequisites: C++ and some Python
Mentor: TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others
Project: MongoDB integration
Brief explanation: MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).
Expected results: A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.
Prerequisites: Python
Mentor: Marco Stenta (marco.stenta at syngenta.com)
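The similarity search mentioned in the MongoDB idea is commonly based on the Tanimoto coefficient between fingerprints. The Python sketch below shows that calculation over an in-memory list of documents. It is only a stand-in under stated assumptions: the document layout (`name`, `bits`), the function names, and the threshold are hypothetical, fingerprints are plain lists of "on" bit indices, and neither MongoDB nor the RDKit is actually called.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_bits, documents, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = []
    for doc in documents:
        score = tanimoto(query_bits, doc["bits"])
        if score >= threshold:
            hits.append((doc["name"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Each document stores one representation of a molecule: its on-bit indices.
documents = [
    {"name": "mol_a", "bits": [1, 4, 9, 12]},
    {"name": "mol_b", "bits": [1, 4, 7]},
    {"name": "mol_c", "bits": [20, 21]},
]
hits = similarity_search([1, 4, 9], documents)
```

A database-backed version would aim to avoid scoring every document, for example by first filtering on the number of on-bits, which is where indexing and sharding become relevant.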
Project: Implement the Analog Series-Based Scaffold method
Brief explanation: The concept of the chemical scaffold is central to our understanding and analysis of many medicinal chemistry datasets. There are multiple ways to define the scaffold of a set of molecules. Of these, the "Murcko scaffold" is probably the most common, but it's probably also one of the worst (though that's probably a bit harsh, since the idea of a scaffold is not very well defined). A more data-driven approach is described in these two open-access articles and the references therein: https://www.future-science.com/doi/10.4155/fsoa-2017-0102 https://www.future-science.com/doi/10.4155/fsoa-2017-0135 It would be quite useful to have an RDKit implementation of this method.
Expected results: A stable and well tested RDKit implementation, Python or C++ based, of the Analog Series-Based Scaffold method.
Prerequisites: Python or C++
Mentor: Nik Stiefl (nikolaus.stiefl at novartis dot com )
MSDK / MZmine Project Ideas
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [2]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [3] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms.
Project: MSDK - Feature Detection
Brief explanation: Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [4]. Further development of ADAP-3D module [5] for intelligent, parameter-less feature detection.
Prerequisites: Java, preferably some knowledge about mass spectrometry
Mentor: Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu>
Project: MSDK - Spectral Database Search
Brief explanation: Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [6]).
Prerequisites: Java
Mentor: Gert Wohlgemuth <berlinguyinca@gmail.com>
Project: MSDK - New IO Modules
Brief explanation: Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [7], mz5 [8], or imzML [9], and improve the existing support for reading native vendor formats. Update mzTab [10] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.
Prerequisites: Java
Mentor: Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com>
Project: MSDK - KNIME integration
Brief explanation: Develop an integration layer for MSDK algorithms into the workflow platform KNIME [11].
Prerequisites: Java
Mentor: Tomas Pluskal <plusik@gmail.com>
Project: MSDK / MZmine - Statistical Analysis
Brief explanation: Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.
Prerequisites: Java, preferably basic knowledge about statistics
Mentor: Xiuxia Du <Xiuxia.Du@uncc.edu>
Project: MSDK / MZmine - Correlation Analysis
Brief explanation: Develop new modules for correlation-based identification of related mass spectrometry signals.
Prerequisites: Java
Mentor: Tomas Pluskal <plusik@gmail.com>
Project: MZmine - New Visualization Modules
Brief explanation: Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [12] or spectral similarity tree imaging.
Prerequisites: Java (experience with JavaFX is helpful but not required)
Mentor: Tomas Pluskal <plusik@gmail.com>
NWChem Project Ideas
NWChem is widely used open-source computational chemistry software ([13]) that tackles a wide variety of scientific problems.
Project: NWChem-JSON
Brief explanation: Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics.
Expected results: Expand the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. The question of how to handle large data structures in conjunction with JSON also needs to be addressed.
Prerequisites: Experience with Fortran90 and Python
Mentor: Bert de Jong (wadejong at lbl dot gov)
Project: NWChem-Python-Jupyter Interface
Brief explanation: Expose and bind NWChem data structures and computational APIs to Python, and use them in Jupyter notebooks
Expected results: NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite.
Prerequisites: Experience with Fortran and Python
Mentor: Bert de Jong (wadejong at lbl dot gov)
Project: JSON-LD for Chemical Data
Brief explanation: Transforming NWChem and Chemical JSON formats to JSON-LD
Expected results: Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.
Prerequisites: Experience with Python and JSON-LD
Mentor: Bert de Jong (wadejong at lbl dot gov)
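The difference between a single dense object and linked objects in the JSON-LD idea can be shown with a short Python sketch. It is a sketch under stated assumptions: the record keys are illustrative and are not the actual NWChem JSON or Chemical JSON schema, the `http://example.org/` namespace is a placeholder, and `to_json_ld` is a hypothetical helper. Only the JSON-LD keywords themselves (`@context`, `@vocab`, `@graph`, `@id`, `@type`) come from the JSON-LD specification.

```python
import json

# A dense, self-contained record (illustrative keys only).
dense = {
    "molecule": {"name": "water", "formula": "H2O"},
    "calculation": {"method": "DFT", "energy": -76.4},
}

BASE = "http://example.org/"  # placeholder namespace


def to_json_ld(record, record_id):
    """Split a dense record into nodes that reference each other by @id."""
    molecule_id = BASE + "molecule/" + record_id
    calculation_id = BASE + "calculation/" + record_id
    return {
        "@context": {"@vocab": BASE + "vocab#"},
        "@graph": [
            {"@id": molecule_id, "@type": "Molecule", **record["molecule"]},
            {
                "@id": calculation_id,
                "@type": "Calculation",
                "molecule": {"@id": molecule_id},  # a link instead of nesting
                **record["calculation"],
            },
        ],
    }


linked = to_json_ld(dense, "1")
print(json.dumps(linked, indent=2))
```

Because the calculation refers to the molecule by identifier, other documents (for example, experimental measurements) can point at the same molecule node, which is what makes the data map naturally onto triple-stores and knowledge graphs.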
DeepChem Project Ideas
DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.
Project: Transfer Learning Framework
Brief explanation: Create easy to use tools for common transfer learning scenarios.
Expected results: ChemNet discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.
Prerequisites: Python, Tensorflow
Mentor: Karl Leswing (karl dot leswing at schrodinger dot com)
Project: Data Interfaces
Brief explanation: Transition deepchem.data.Dataset to tf.data.
Expected results: DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.
Prerequisites: Python, some Tensorflow
Mentor: Karl Leswing (karl dot leswing at schrodinger dot com)
Project: Model Visualization
Brief explanation: Node Importance Visualizations from Graph Models
Expected results: An argument often used against deep learning methods is that they are not understandable. This project would be to implement visual neural graph fingerprints (https://github.com/debbiemarkslab/neural-fingerprint-theano) into DeepChem. Stretch goals would be to implement DeepLift (https://arxiv.org/abs/1605.01713) or masking techniques for atom level visualizations.
Prerequisites: Python, Tensorflow, rdkit
Mentor: Karl Leswing (karl dot leswing at schrodinger dot com)
Project: Imaging Tools
Brief explanation: Enable chemical image segmentation and property prediction.
Expected results: We want an implementation of U-Net (https://arxiv.org/pdf/1505.04597.pdf) and ResNet (https://arxiv.org/pdf/1512.03385.pdf) inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.
Prerequisites: Python, Tensorflow
Mentor: Bharath Ramsundar (bharath at datamined dot io)
Miscellaneous Project Ideas
These ideas would likely benefit two or more projects.
Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling
Brief explanation: Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them.
Expected results: A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.
Prerequisites: Python and C++; VR SDK experience would be nice.
Mentor: Bert de Jong (wadejong at lbl dot gov)
Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data
Brief explanation: Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, including molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).
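As a rough illustration of the "Gaussian spheres" idea, here is a minimal CPU-only sketch in Python. It is not taken from Avogadro, VMD, or Lumo. The density form (a sum of per-atom Gaussians scaled so that an isolated atom has the value 1.0 exactly at its radius), the `decay` parameter, and all function names are assumptions chosen for illustration. The point is that every grid point is computed independently, which is the property that lets the loop body map onto one OpenCL or CUDA work item per point.

```python
import math

def gaussian_density(point, atoms, decay=2.0):
    """Sum of per-atom Gaussians at one point (assumed form, for illustration).

    atoms is a list of (x, y, z, radius) tuples. For a single isolated atom
    the value is exactly 1.0 at distance == radius, so the isosurface at 1.0
    approximates the atom's sphere.
    """
    total = 0.0
    for cx, cy, cz, radius in atoms:
        d2 = (point[0] - cx) ** 2 + (point[1] - cy) ** 2 + (point[2] - cz) ** 2
        total += math.exp(-decay * (d2 / radius ** 2 - 1.0))
    return total

def density_grid(atoms, origin, spacing, shape):
    """Evaluate the density on a regular grid.

    Each grid point depends only on its own coordinates and the atom list,
    so the three loops could be replaced by a 3D kernel launch.
    """
    nx, ny, nz = shape
    return [
        [
            [
                gaussian_density(
                    (origin[0] + i * spacing,
                     origin[1] + j * spacing,
                     origin[2] + k * spacing),
                    atoms,
                )
                for k in range(nz)
            ]
            for j in range(ny)
        ]
        for i in range(nx)
    ]

# Hypothetical two-atom input: coordinates and radii are made-up values.
atoms = [(0.0, 0.0, 0.0, 1.5), (1.2, 0.0, 0.0, 1.2)]
grid = density_grid(atoms, origin=(-3.0, -3.0, -3.0), spacing=0.5, shape=(13, 13, 13))
```

A mesh would then be extracted from the grid at the chosen isovalue (for example with marching cubes), which is a separate step not shown here.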
Expected results: Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.
Prerequisites: General programming experience, and ideally experience in chemistry and matrix manipulations.
Suggested Readings:
- http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009
- http://zhanglab.ccmb.med.umich.edu/EDTSurf/
- https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf
Mentor: Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)
Project: OneMol: Google Docs & YouTube for Molecules
Brief explanation: There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.
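The contrast between sharing state changes and sharing screen updates can be sketched in a few lines of Python. This is only an in-process toy under stated assumptions: the `Facilitator` class, its `connect` and `submit` methods, the JSON message shape, and the `rotation`/`zoom` keys are all hypothetical and do not describe an existing OneMol API. A real facilitator would sit behind a network transport such as WebSockets.

```python
import json

class Facilitator:
    """Keeps one authoritative viewer state and relays small deltas (hypothetical API)."""

    def __init__(self):
        self.state = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0}
        self.clients = {}  # client id -> callback that receives JSON messages

    def connect(self, client_id, on_update):
        """Register a client and send it the full state once, on join."""
        self.clients[client_id] = on_update
        on_update(json.dumps(self.state))

    def submit(self, sender_id, delta_json):
        """Apply a delta to the shared state and forward it to every other client."""
        delta = json.loads(delta_json)
        self.state.update(delta)
        for client_id, on_update in self.clients.items():
            if client_id != sender_id:
                on_update(delta_json)

# Two clients whose "network" is just a Python list acting as an inbox.
hub = Facilitator()
alice_inbox, bob_inbox = [], []
hub.connect("alice", alice_inbox.append)
hub.connect("bob", bob_inbox.append)

# Alice rotates the view; only the new angles travel, not a rendered frame.
delta = json.dumps({"rotation": [0.0, 90.0, 0.0]})
hub.submit("alice", delta)
print(len(delta), "bytes relayed to each other client")
```

Each receiving viewer re-renders locally from the new angles, so the cost of a rotation is a few dozen bytes per client regardless of window size.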
Expected results: Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).
Prerequisites: Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace would be very helpful.
Mentor: David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt dot edu)
Project: YAeHMOP as a library
Brief explanation: YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.
Expected results: A library that is callable from C/C++, allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran 77 code: the eigenvalue solver can likely be replaced with Eigen without too much effort; the functionality to calculate STO integrals will need to be rewritten based on the original Fortran 77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.
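For readers unfamiliar with the method, the two pieces named above (an STO overlap integral and an eigenvalue solve) can be seen in the smallest possible extended Hückel problem, H2 with one 1s orbital per atom. The Python sketch below is not YAeHMOP code and does not reflect its API. The function names are hypothetical, and the parameter values (H_ii = -13.6 eV, Slater exponent 1.3, Wolfsberg-Helmholz constant K = 1.75) are commonly used choices assumed here for illustration. The overlap uses the closed form for two 1s Slater orbitals with equal exponents, and the 2x2 generalized eigenproblem is solved analytically by symmetry.

```python
import math

BOHR_PER_ANGSTROM = 1.8897259886

def sto_1s_overlap(zeta, distance_bohr):
    """Overlap of two 1s Slater orbitals with the same exponent zeta."""
    rho = zeta * distance_bohr
    return math.exp(-rho) * (1.0 + rho + rho * rho / 3.0)

def h2_extended_hueckel(distance_angstrom, h_ii=-13.6, zeta=1.3, k=1.75):
    """Return (overlap, bonding energy, antibonding energy) in eV for H2.

    Parameter defaults are assumptions for illustration.
    """
    s = sto_1s_overlap(zeta, distance_angstrom * BOHR_PER_ANGSTROM)
    # Wolfsberg-Helmholz off-diagonal element: K * S * (H_ii + H_jj) / 2,
    # which reduces to K * S * H_ii when both atoms are identical.
    h_ij = k * s * h_ii
    # For a symmetric 2x2 problem H c = E S c, the eigenvectors are the
    # in-phase and out-of-phase combinations, giving closed-form energies.
    bonding = (h_ii + h_ij) / (1.0 + s)
    antibonding = (h_ii - h_ij) / (1.0 - s)
    return s, bonding, antibonding

overlap, e_bonding, e_antibonding = h2_extended_hueckel(0.74)
print(f"S = {overlap:.3f}, E(bonding) = {e_bonding:.2f} eV, "
      f"E(antibonding) = {e_antibonding:.2f} eV")
```

Larger systems need general STO overlaps between orbitals of different type and exponent and a numerical generalized eigensolver, which is where the rewritten integral code and a solver such as Eigen would come in.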
Prerequisites: C/C++
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)