https://wiki.openchemistry.org/api.php?action=feedcontributions&user=Greg.landrum&feedformat=atomwiki.openchemistry.org - User contributions [en]2024-03-29T12:25:38ZUser contributionsMediaWiki 1.39.3https://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2023&diff=736GSoC Ideas 20232023-03-03T04:12:56Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2023, there is '''no guarantee''' that we will be selected again for GSoC in 2023.<br />
<br />
One important factor is that GSoC in 2023 will include both shorter projects (~175 hours) and longer projects (~350 hours). You should consider the appropriate timeline for your project proposal. We have indicated in the project titles where we suggest particular lengths.<br />
<br />
Contributors can also decide on the number of weeks (e.g., spreading the project time over multiple weeks).<br />
<br />
If you are unsure of the scope of a project, please reach out and discuss BEFORE the proposal deadline.<br />
<br />
When possible, please submit drafts a week or more in advance of the proposal deadline so that we can make suggestions on your proposal.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* Size of the project (~175 hours of work) or (~350 hours of work)<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://two.avogadro.cc/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project [350 hours]: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Python bindings exist, using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (mhanwell at bnl dot gov)<br />
<br />
=== Project [175 hours]: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project [175 or 350 hours]: Tools for Interactive Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Build solvent boxes and implement standard molecular dynamics using the in-progress optimization framework. The scope could be 175 or 350 hours - please discuss what scale of project you have in mind.<br />
<br />
'''Expected results:''' Avogadro (v1) has interactive force field optimization allowing building and manipulation (e.g., push-pull atoms into position). Some users call this 'video game mode' ;-) A new optimization framework is in progress, including calling external programs for energies and forces. The project would enable building out MD simulations, including tools to add water or solvent boxes, build larger systems (e.g., via PackMol integration) and implement simple MD integration and thermostats.<br />
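<br />
To make "simple MD integration and thermostats" concrete, here is a minimal sketch of a velocity Verlet integrator and a crude velocity-rescaling thermostat in plain Python. It is not Avogadro code: the function names, the 1D harmonic test system, and the reduced units (mass, force constant, and Boltzmann constant all set to 1 by default) are assumptions chosen for illustration only.<br />
<br />
```python
"""Velocity Verlet and velocity rescaling in 1D (illustrative sketch only)."""


def harmonic_force(x, k=1.0):
    """Force of a harmonic spring centered at the origin: F = -k x."""
    return -k * x


def velocity_verlet(x, v, force, mass=1.0, dt=0.01, steps=1000):
    """Propagate one particle and return its final position and velocity."""
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
    return x, v


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus harmonic potential energy."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


def rescale_velocities(velocities, masses, target_temperature, k_b=1.0):
    """Scale 1D velocities so the kinetic temperature equals the target.

    Assumes one degree of freedom per particle and nonzero kinetic energy.
    """
    kinetic = sum(0.5 * m * v * v for m, v in zip(masses, velocities))
    current = 2.0 * kinetic / (len(velocities) * k_b)
    scale = (target_temperature / current) ** 0.5
    return [v * scale for v in velocities]
```
<br />
Without a thermostat the total energy of the harmonic test system should stay close to its initial value, which is a convenient sanity check for any integrator written for this project.<br />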
<br />
'''Prerequisites:''' Experience in C++, ideally with knowledge of molecular dynamics methods and tools. Some Python would be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project [350 hours]: Improved Rendering with Geometry Shaders ===<br />
<br />
'''Brief explanation:''' Our current rendering code needs updating to the OpenGL Core Profile and optimizing with geometry shaders.<br />
<br />
'''Expected results:''' An efficient GPU-enabled surface generation and rendering framework using geometry shaders to provide dynamic level of detail, improved depth-of-focus and rendering quality.<br />
<br />
'''Prerequisites:''' Experience in C++, ideally with knowledge of OpenGL shaders. Some understanding of quantum chemistry would be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project [175 hours]: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project [175 hours]: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project [175 hours]: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project [350 hours]: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project [350 hours]: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
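<br />
As a rough illustration of the idea (not a proposed design), the sketch below scores a query by how rarely its elements occur in a reference set. Representing a molecule as a plain list of element symbols, the function names, and the add-one smoothing are all assumptions made for this example; a real model would use richer features such as atom environments or valence states.<br />
<br />
```python
import math
from collections import Counter


def fit_element_model(reference_molecules):
    """Count how many reference molecules contain each element.

    Each molecule is given as a list of element symbols (a simplification).
    """
    counts = Counter()
    for elements in reference_molecules:
        counts.update(set(elements))
    return counts, len(reference_molecules)


def unusualness(elements, model):
    """Return the largest surprisal (-log p) over the elements of a query.

    Add-one smoothing keeps the score finite for elements never seen in
    the reference set. The query must contain at least one element.
    """
    counts, total = model
    return max(
        -math.log((counts[el] + 1) / (total + 2)) for el in set(elements)
    )
```
<br />
A query containing ruthenium would score much higher than one built only from carbon and hydrogen when the reference set is organic, so a threshold on this score could act as the filter or warning described above.<br />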
<br />
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which uses RDKit.<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
===Project [175 or 350 hours]: Implement new parsers===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for supporting more programs (e.g. CFOUR, xtb, NBO, GAMESS dat, MRCC, DIRAC), and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA). There may also be more programs missing that haven't been considered.<br />
<br />
'''Expected results''': Implement parsers for one or more new programs/formats, generate test data, and write unit and regression tests for each parser.<br />
<br />
'''Prerequisites''': Experience with Python, basic familiarity with computational chemistry programs, and access to the program(s) needed to generate the test data.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
===Project [175 or 350 hours]: Implement new bridges===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for more integrations with external programs (e.g. chemfiles, RDKit) via their Python bindings. There may also be more programs missing that haven't been considered.<br />
<br />
'''Expected results''': Implement bridges for one or more new programs, along with writing unit tests and documentation for each bridge.<br />
<br />
'''Prerequisites''': Experience with Python and ideally familiarity with the program that is being bridged.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
===Project [350 hours]: Implement new methods===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for more analysis methods being added directly to cclib (e.g. calculating geometric parameters). There may also be other methods that are desirable to include which haven't been considered.<br />
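<br />
For a sense of scale, the geometric parameters mentioned above (distances, angles, dihedrals) can be computed from Cartesian coordinates in a few lines. The sketch below uses only the Python standard library; the function names are hypothetical and do not correspond to existing cclib methods.<br />
<br />
```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def distance(p, q):
    """Euclidean distance between two points."""
    return math.dist(p, q)


def angle(p, q, r):
    """Angle p-q-r in degrees, with q at the vertex."""
    u, v = _sub(p, q), _sub(r, q)
    cos_theta = _dot(u, v) / (math.hypot(*u) * math.hypot(*v))
    # Clamp to [-1, 1] to guard against rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def dihedral(p, q, r, s):
    """Signed dihedral angle p-q-r-s in degrees, via the atan2 formula."""
    b1, b2, b3 = _sub(q, p), _sub(r, q), _sub(s, r)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = math.hypot(*b2) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```
<br />
In cclib such functions would more likely operate on NumPy arrays of atomic coordinates; the plain-tuple version here only keeps the example dependency-free.<br />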
<br />
'''Expected results''': Implement one or more new methods, along with writing unit tests and documentation for each method.<br />
<br />
'''Prerequisites''': Experience with Python and familiarity with the method(s) being added, depending on the complexity of the method.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
===Project [350 hours]: Julia bindings===<br />
<br />
'''Brief explanation''': The Julia programming language (https://julialang.org/) is growing in popularity for computational chemistry as a language in which both production-level computation and analysis can be performed seamlessly. To analyze computational chemistry outputs from traditional programs in Julia, rather than reimplement all cclib functionality in Julia, we should be able to call cclib from Julia directly and reuse its core functionality.<br />
<br />
'''Expected results''': Julia bindings to cclib IO functionality and a Julia-native representation of cclib data objects, with each cclib attribute accessible as a native Julia type. The bindings should be available on the default Julia package registry. The remainder of the project is more open-ended, but an example application of using the bindings would be ideal.<br />
<br />
'''Prerequisites''': Experience with Python and/or Julia, and ideally some familiarity with important quantities from computational chemistry outputs.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com)<br />
<br />
===Project [350 hours]: Additional visualization for OpenChemVault===<br />
<br />
'''Brief explanation''': OpenChemVault (https://github.com/cclib/openchemvault) is capable of parsing output files, storing them, and displaying geometries, but any sort of additional visualization (such as plotting molecular orbitals or spectra) is missing. The capabilities of GaussSum (http://gausssum.sourceforge.net/) are a possible starting point.<br />
<br />
'''Expected results''': Implement one or more new visualizations for the OpenChemVault web interface.<br />
<br />
'''Prerequisites''': Experience with Python and familiarity with common visualizations that are desirable for computational chemistry outputs. No previous experience with JavaScript is necessary.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, C#, and JavaScript. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project [350 hours]: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
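<br />
The core idea of a molecular interaction field is to evaluate the interaction energy of a probe at every point of a regular grid around the molecule. The sketch below shows only a Coulomb term for a point-charge probe in reduced units. It is not the code from the PR: the function names, the distance clamp, and the unit convention are assumptions, and a real GRID-style calculator would add further terms (for example van der Waals and hydrogen bonding).<br />
<br />
```python
import math


def make_grid(origin, spacing, shape):
    """Yield the (x, y, z) points of a regular grid."""
    ox, oy, oz = origin
    nx, ny, nz = shape
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                yield (ox + i * spacing, oy + j * spacing, oz + k * spacing)


def coulomb_field(atoms, grid_points, probe_charge=1.0, min_distance=0.5):
    """Coulomb energy of a probe charge at each grid point.

    atoms is a list of ((x, y, z), partial_charge) pairs. Distances below
    min_distance are clamped to avoid the singularity at the nucleus.
    """
    energies = []
    for point in grid_points:
        energy = 0.0
        for position, charge in atoms:
            r = max(math.dist(point, position), min_distance)
            energy += probe_charge * charge / r
        energies.append(energy)
    return energies
```
<br />
The C++ implementation would follow the same loop structure, with the grid stored in a contiguous array so that it can be written out to volumetric file formats and exposed through the wrappers.<br />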
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project [175 or 350 hours]: Implement additional fingerprints in the RDKit ===<br />
<br />
'''Brief explanation:''' There are a number of chemical fingerprint types which it would be useful to have natively available in the RDKit; in this project you will implement one or more of them. Some ideas for fingerprints to be included are:<br />
<br />
# Pubchem fingerprint: https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf<br />
# CSFP: https://doi.org/10.1021/acs.jcim.9b00571<br />
# Physicochemical property fingerprints: porting the existing Python implementation (https://github.com/rdkit/rdkit/blob/7153918af4dff37c768577441c5286b425e6bf3d/rdkit/Chem/AtomPairs/Sheridan.py) to C++<br />
<br />
The number to be implemented depends on whether you are doing this as a 175 or 350 hour project. <br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprints along with a robust set of test cases. Wrappers for the calculators so that they are accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
== QC-Devs Project Ideas ==<br />
<br />
QC-Devs (https://qcdevs.org/) develops various free, open-source, and cross-platform libraries for scientific computing, especially theoretical and computational chemistry. Our goal is to make programming accessible to chemists and promote precepts of sustainable software development. The two main pieces of the QC-Devs ecosystem are:<br />
<br />
* HORTON (electronic structure theory): https://quantumelephant.org/<br />
* ChemTools (molecular structure and reactivity): https://chemtools.org/<br />
<br />
All our repositories are hosted under the Theochem organization ([https://github.com/theochem <u>https://github.com/theochem</u>]) on GitHub.<br />
<br />
=== Project [175 hours or 350 hours]: Visualization of Molecular Structure and Reactivity ===<br />
<br />
'''Brief Explanation:''' ChemTools ([https://github.com/theochem/chemtools <u>https://github.com/theochem/chemtools</u>]) is a post-processing library for extracting chemical insight from quantum chemistry calculations. Currently, ChemTools relies on Visual Molecular Dynamics (VMD) and Matplotlib for visualization. ChemTools has the functionality to generate visualization scripts for VMD, so the user can easily generate informative plots like iso-surface of electron density colored by electrostatic potential. Visualization of (annotated) molecular structures and molecular structure changes along reaction pathways are also of interest, but the implementations are unpolished.<br />
<br />
'''Expected Results:'''<br />
<br />
<blockquote>'''175 hours:''' Add functionality to ChemTools to generate visualization scripts for ChimeraX ([https://www.cgl.ucsf.edu/chimerax/ <u>https://www.cgl.ucsf.edu/chimerax/</u>]). The current functionality for VMD can be used as a template.<br />
<br />
'''350 hours:''' Add ChemTools as a back-end for SEQCROW ([https://cxtoolshed.rbvi.ucsf.edu/apps/seqcrow <u>https://cxtoolshed.rbvi.ucsf.edu/apps/seqcrow</u>]), a free and open-source bundle ([https://github.com/QChASM/SEQCROW <u>https://github.com/QChASM/SEQCROW</u>]) in the ChimeraX toolshed ([https://cxtoolshed.rbvi.ucsf.edu/ <u>https://cxtoolshed.rbvi.ucsf.edu/</u>]) for building molecules and interacting with the output of quantum chemistry calculations.<br />
<br />
'''Difficulty Level:''' Intermediate (175) to High-Intermediate (350)<br />
</blockquote><br />
'''Relevant Skills:''' Experience with Python, visualization, and software interfacing (350)<br />
<br />
'''Mentors:''' Ali Tehrani (alirezatehrani24 at gmail dot com), Gabriela Sanchez Diaz (sanchezg at mcmaster dot ca), Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), and Esteban Vohringer-Martinez (estebanvohringer at qcmmlab dot com).<br />
<br />
=== Project [175 or 350 hours]: Extended interoperability of GOpt and Quantum Chemistry Software ===<br />
<br />
'''Brief Explanation:''' ChemTools ([https://github.com/theochem/chemtools <u>https://github.com/theochem/chemtools</u>]) is a post-processing library for extracting chemical insight from quantum chemistry calculations. Currently, ChemTools relies on modules of the HORTON library to compute the basic quantities required for its analysis. The goal of this project is to extend the interoperability of ChemTools, so that it can use the [https://github.com/psi4 <u>Psi4</u>] ([https://github.com/psi4 <u>https://github.com/psi4</u>]) and [https://github.com/pyscf/pyscf <u>PySCF</u>] ([https://github.com/pyscf/pyscf <u>https://github.com/pyscf/pyscf</u>]) packages and take advantage of their features.<br />
<br />
'''Expected Results:'''<br />
<br />
<blockquote>'''175 hours:''' Writing wrappers for Psi4 or PySCF to compute various quantum mechanical properties and provide those properties to ChemTools for further analysis. The current wrappers for HORTON can be used as a template. Both Psi4 and PySCF have Python interfaces.<br />
<br />
'''350 hours:''' Writing wrappers for Psi4 and PySCF.<br />
</blockquote><br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, object-oriented programming, and knowledge of quantum chemistry software<br />
<br />
'''Mentors:''' Gabriela Sanchez Diaz (sanchezg at mcmaster dot ca), Ali Tehrani (alirezatehrani24 at gmail dot com), and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca).<br />
<br />
=== Project [175 hours]: Molecule Alignment with Procrustes Algorithm ===<br />
<br />
'''Brief Explanation:''' Procrustes ([https://github.com/theochem/procrustes <u>https://github.com/theochem/procrustes</u>]) is a library for finding the optimal transformation that makes two matrices as close as possible to each other. Permutation Procrustes methods can be used for molecular alignment ([https://link.springer.com/article/10.1007/s10910-012-0119-2 ''<u>J Math Chem</u>'' <u>(2013) 51:927–936</u>]). The goal of this project is to develop a utility that uses the Procrustes package to perform molecular alignment.<br />
<br />
'''Expected Results:''' An open-source Python software.<br />
<br />
<blockquote>'''175 hours:''' Using the Procrustes package, write a utility that takes two molecular structures and optimizes their alignment. In addition to simply optimizing the structural alignment, provide atom-atom mapping and extensions to more general problems (e.g., multi-molecule alignment).<br />
</blockquote><br />
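<br />
To illustrate what "atom-atom mapping" means here, the sketch below finds the permutation of atoms in one structure that minimizes the summed squared distance to another structure by exhaustive search. It deliberately does not use the Procrustes package, whose API is not described on this page; the function names are hypothetical, both structures are assumed to have the same number of atoms and to be already superimposed (no rotation or translation is optimized), and the factorial cost limits it to very small molecules.<br />
<br />
```python
from itertools import permutations


def squared_error(coords_a, coords_b):
    """Sum of squared coordinate differences between paired atoms."""
    return sum(
        (x - y) ** 2
        for p, q in zip(coords_a, coords_b)
        for x, y in zip(p, q)
    )


def best_atom_mapping(coords_a, coords_b):
    """Return (mapping, error) for the best atom permutation.

    mapping[i] is the index of the atom in coords_b matched to atom i of
    coords_a. Exhaustive search over all permutations.
    """
    best = None
    for perm in permutations(range(len(coords_b))):
        err = squared_error(coords_a, [coords_b[j] for j in perm])
        if best is None or err < best[1]:
            best = (perm, err)
    return best
```
<br />
A practical utility would replace the exhaustive search with the permutation Procrustes solver from the package and alternate it with a rotational alignment step.<br />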
'''Difficulty Level:''' Advanced<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming.<br />
<br />
'''Mentors:''' Fanwang Meng (fwmeng88 at gmail dot com) and Paul Ayers (ayers at mcmaster dot ca).<br />
<br />
=== Project [350 hours]: Faster Molecular Integrals with Density-Fitting ===<br />
<br />
'''Brief Explanation:''' [https://github.com/theochem/gbasis <u>GBasis</u>] ([https://github.com/theochem/gbasis <u>https://github.com/theochem/gbasis</u>]) is a library for evaluating and analytically integrating Gaussian-type orbitals and their related quantities, especially molecular integrals. In many applications, the computational bottleneck is the evaluation of two-electron integrals, as the number of two-electron integrals grows as the fourth power of the basis-set size. By introducing an auxiliary, density-fitting, basis, this power is reduced to the third power of the basis-set size, which in many cases eliminates the computational bottleneck, since there are often other facets of the computation that scale more severely than this. The goal of this project is to implement density-fitting methods into GBasis.<br />
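<br />
For orientation, the density-fitting approximation referred to above is usually written as follows, where Greek indices label orbital basis functions and P, Q label auxiliary basis functions. The notation is the conventional one from the literature, not taken from GBasis.<br />
<br />
```latex
(\mu\nu|\lambda\sigma) \;\approx\; \sum_{P,Q} (\mu\nu|P)\,\bigl[\mathbf{J}^{-1}\bigr]_{PQ}\,(Q|\lambda\sigma),
\qquad J_{PQ} = (P|Q)
```
<br />
Only the three-index integrals (μν|P) and the two-index matrix J need to be computed and stored, which is the origin of the reduction from fourth-power to third-power growth when the auxiliary basis size is proportional to the orbital basis size.<br />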
<br />
'''Expected Results:'''<br />
<br />
<blockquote>'''350 hours:''' Extension of GBasis to support density fitting. This involves expanding products of basis functions in the auxiliary basis, evaluating 2-electron integrals in the auxiliary basis, and using these two entities to construct molecular integrals more efficiently.<br />
</blockquote><br />
'''Difficulty Level:''' Intermediate to Advanced<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming.<br />
<br />
'''Mentors:''' Ali Tehrani (alirezatehrani24 at gmail dot com), Gabriela Sanchez Diaz (sanchezg at mcmaster dot ca), and Paul Ayers (ayers at mcmaster dot ca).<br />
<br />
=== Project [350 hours]: Computing The Pair Density From Wave-function ===<br />
<br />
'''Brief Explanation:''' The electron pair density represents the probability of observing two electrons at two points in space. It provides key quantitative and qualitative information about electron correlation, as well as qualitative information about chemical bonding and, in particular, about how Lewis structures emerge from quantum mechanics.<br />
<br />
'''Expected Results:'''<br />
<br />
<blockquote>'''350 hours:''' To provide a Python function to compute the pair-density using GBasis ([https://github.com/theochem/gbasis <u>https://github.com/theochem/gbasis</u>]) as a Python function, starting from wave-function information that is read with IOData ([https://github.com/theochem/iodata <u>https://github.com/theochem/iodata</u>]). Key indicators like the intracule and extracule should be supported.<br />
</blockquote><br />
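<br />
For reference, the intracule and extracule are commonly defined from the pair density as the distributions of the interelectronic vector and of the pair center of mass. The expressions below use the conventional textbook form; sign and normalization conventions vary between authors and are an assumption here.<br />
<br />
```latex
\begin{aligned}
I(\mathbf{u}) &= \iint \rho_2(\mathbf{r}_1, \mathbf{r}_2)\,
  \delta\bigl(\mathbf{u} - (\mathbf{r}_1 - \mathbf{r}_2)\bigr)\, d\mathbf{r}_1\, d\mathbf{r}_2 \\
E(\mathbf{R}) &= \iint \rho_2(\mathbf{r}_1, \mathbf{r}_2)\,
  \delta\Bigl(\mathbf{R} - \tfrac{1}{2}(\mathbf{r}_1 + \mathbf{r}_2)\Bigr)\, d\mathbf{r}_1\, d\mathbf{r}_2
\end{aligned}
```
<br />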
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming.<br />
<br />
'''Mentors:''' Ali Tehrani (alirezatehrani24 at gmail dot com), Gabriela Sanchez Diaz (sanchezg at mcmaster dot ca), and Paul Ayers (ayers at mcmaster dot ca).<br />
<br />
<br />
=== Project [350 hours]: Extended Interoperability of Denspart to Open Force Field Software ===<br />
<br />
'''Brief Explanation:''' Partitioning a chemical system and its various properties into atomic contributions is not only qualitatively useful for chemical analysis but also important for quantitative computational chemistry (e.g., charges in molecular mechanics force fields). This can be done by partitioning the molecular electron density into fuzzy atomic densities using Denspart (a package and class within the ChemTools package), but there are alternative approaches based on topological partitioning of the molecular density or partitioning of molecular orbitals. The goal of this project is to develop a unified framework for determining the molecular contributions of specified atoms and/or functional groups.<br />
<br />
'''Expected Results:'''<br />
<br />
# Design a unified DensPart API with interfaces to the two aforementioned packages.<br />
# Develop a module for computing various non-covalent force-field energy terms, especially those that are useful for molecular mechanics force field parameterization/development.<br />
# Design a framework for enhancing molecular modeling using Denspart (based on quantum chemistry calculations)<br />
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming, familiarity with quantum chemistry, molecular mechanics, and force-fields.<br />
<br />
'''Mentors:''' Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), Toon Verstraelen (Toon.Verstraelen at ugent dot be), and Esteban Vohringer-Martinez (estebanvohringer at qcmmlab dot com).<br />
<br />
=== Project [350 hours]: Extended Interoperability of IOData to Open Force Field Software ===<br />
<br />
'''Brief Explanation:''' IOData () is a free and open-source Python library for parsing, storing, and converting various file formats commonly used by quantum chemistry, molecular dynamics, and plane-wave density-functional-theory software programs. It also supports a flexible framework for generating input files for various software packages. The goal of this project is to extend the IOData framework to describe a chemical system for molecular mechanics (MM) modeling, specifically in the context of non-covalent interactions in open force fields.<br />
<br />
'''Expected Results:'''<br />
<br />
# Embed a compatible data structure for non-bonded parameters derived from electron density partitioning methods into the IOData framework.<br />
# Generate an Open Force Field Toolkit compatible output ()<br />
# Extend IOData framework to accommodate bonded parameters to have a full interface for molecular modeling.<br />
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming, familiarity with molecular mechanics and force-fields.<br />
<br />
'''Mentors:''' Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), Toon Verstraelen (Toon.Verstraelen at ugent dot be), Esteban Vohringer-Martinez (estebanvohringer at qcmmlab dot com).<br />
<br />
=== Project [350 hours]: Deriving Atomic and Bond Properties from Molecule Orbital Partitioning ===<br />
<br />
'''Brief Explanation: '''In traditional orbital-based atomic partitioning methods for molecular wavefunctions, one observes that the results are strongly dependent on the molecular basis set, and tend to work poorly for large atom-centered basis sets and be inapplicable to delocalized (e.g., plane wave) basis functions. One way to avoid these issues is using various sorts of quasi-atomic orbitals (QUAOs), including quasi-atomic molecular basis orbitals (QUAMBO), intrinsic atomic orbitals (IAOs), and other choices. Another alternative is to use (approximate) orbital-based equivalencies to spatial decomposition methods like the Hirshfeld or Bader partitioning. The goal of this project is to implement these approaches in the OrbsTools module of ChemTools ().<br />
<br />
'''Expected Results:'''<br />
<br />
# Design an API for various quasi-atomic orbital methods and implement QUAMBO, IAO, QUAO and possibly other quasi-atomic orbital methods.<br />
# Reenvision the Mulliken and Lowdin atom/bond analysis based thereupon.<br />
# Implement the orbital-based partitioning based on an arbitrary spatial partitioning method, and the Mulliken and Lowdin atom/bond analysis based thereupon.<br />
<br />
'''Difficulty Level:''' Intermediate to Hard<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming, familiarity with quantum chemistry methods.<br />
<br />
'''Mentors:''' Gabriela Sánchez Díaz (sanchezg at mcmaster dot ca), Marco Martinez-Gonzalez (mmg870630 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca), and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca).<br />
<br />
'''Project: [350 hours] '''Model Hamiltonian (interface to FCIDump and other solid-state codes)<br />
<br />
'''Brief Explanation: '''In many cases, the true many-electron Hamiltonian is intractable to solve, so one instead uses model Hamiltonians that capture key features of the physicochemical system qualitatively. Classic examples include the (extended) Hubbard, Ising, and Heisenberg model Hamiltonians. The goal of the package is to build a framework for constructing model Hamiltonians and outputting them in a format that is conducive to traditional packages for solving the quantum many-body problem.<br />
<br />
'''Expected Results:'''<br />
<br />
1. Provide support for Huckel/Hubbard/PPP parameters for various atom and connectivity types.<br />
<br />
2. Write user-friendly APIs for constructing model Hamiltonians of various types: Heisenberg, Ising, generalized Richardson-Gaudin, t-J, and t-J-U-V models.<br />
<br />
3. Write utilities to output model Hamiltonians into formats conducive to external programs, including FCIDump and Triqs.<br />
<br />
4. Create and write example tutorials.<br />
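<br />
To make the first three items concrete, here is a minimal sketch, assuming a one-dimensional Hubbard chain with nearest-neighbour hopping <code>t</code> and on-site repulsion <code>U</code>. The function names <code>hubbard_chain</code> and <code>to_fcidump</code> are hypothetical, and the writer emits only a simple, symmetry-free FCIDUMP-style text (two-electron terms in chemist notation, then one-electron terms, then the core energy); a real implementation would need to handle symmetry labels and the conventions of each target program.<br />

```python
def hubbard_chain(n_sites, t=1.0, U=4.0, periodic=False):
    """Return (h, eri) for a 1D Hubbard model (illustrative, hypothetical API).

    h[i][j] = -t for neighbouring sites; eri maps (i, j, k, l) in chemist
    notation (ij|kl) to a value, with only the on-site term (ii|ii) = U.
    """
    h = [[0.0] * n_sites for _ in range(n_sites)]
    bonds = [(i, i + 1) for i in range(n_sites - 1)]
    if periodic and n_sites > 2:
        bonds.append((n_sites - 1, 0))
    for i, j in bonds:
        h[i][j] = h[j][i] = -t
    eri = {(i, i, i, i): U for i in range(n_sites)}
    return h, eri


def to_fcidump(h, eri, n_elec, ms2=0, e_core=0.0):
    """Serialize the integrals as FCIDUMP-style text with 1-based indices."""
    n = len(h)
    out = [
        f" &FCI NORB={n},NELEC={n_elec},MS2={ms2},",
        "  ORBSYM=" + ",".join(["1"] * n) + ",",
        "  ISYM=1,",
        " &END",
    ]
    fmt = "{:23.16E} {:4d} {:4d} {:4d} {:4d}"
    # Two-electron integrals (ij|kl).
    for (i, j, k, l), value in sorted(eri.items()):
        out.append(fmt.format(value, i + 1, j + 1, k + 1, l + 1))
    # Non-zero one-electron integrals, lower triangle only.
    for i in range(n):
        for j in range(i + 1):
            if h[i][j] != 0.0:
                out.append(fmt.format(h[i][j], i + 1, j + 1, 0, 0))
    # Core (constant) energy is written last with all indices zero.
    out.append(fmt.format(e_core, 0, 0, 0, 0))
    return "\n".join(out) + "\n"


h, eri = hubbard_chain(4, t=1.0, U=4.0)
text = to_fcidump(h, eri, n_elec=4)
```

Keeping the model construction separate from the serializer is one possible design: the same <code>(h, eri)</code> pair could then be handed to writers for other formats.<br />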
<br />
'''Difficulty Level:''' Intermediate to Hard<br />
<br />
'''Relevant Skills:''' Experience with scientific Python, advanced Numpy, and object-oriented programming, and familiarity with quantum mechanics.<br />
<br />
'''Mentors:''' Gabriela Sánchez Díaz (sanchezg at mcmaster dot ca), Marco Martinez-Gonzalez (mmg870630 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca), and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca).<br />
<br />
== CalcUS Project Ideas ==<br />
<br />
[https://github.com/cyllab/CalcUS CalcUS] is a platform aiming to democratize access to quantum chemistry by providing a user-friendly web-based interface to simplify running and analyzing quantum mechanical calculations.<br />
<br />
=== Project [175 hours]: Develop large-scale calculation management tools ===<br />
<br />
'''Brief Explanation:''' Quantum chemistry projects can involve performing calculations on a large number of structures (10-100) with different parameters. CalcUS should have features to make this process seamless and highly automated, from launching the calculations to reporting the results.<br />
<br />
'''Expected Results:''' Create a variation of the calculation web UI, aimed specifically at batch calculations with variable parameters, design and implement the workflow to handle these batch calculations, implement results gathering and reporting in a convenient format, write relevant unit and/or integration tests.<br />
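<br />
The batch workflow described above amounts to expanding a set of structures and variable parameters into individual jobs and then gathering the results. The sketch below illustrates that idea with the standard library only; <code>expand_batch</code> and <code>results_table</code> are hypothetical names, not part of the CalcUS code base, and the energies are placeholder values rather than computed results.<br />

```python
from itertools import product


def expand_batch(structures, parameter_grid):
    """Return one job dictionary per structure and parameter combination."""
    names = sorted(parameter_grid)
    jobs = []
    for structure in structures:
        for values in product(*(parameter_grid[name] for name in names)):
            job = {"structure": structure}
            job.update(zip(names, values))
            jobs.append(job)
    return jobs


def results_table(jobs, energies):
    """Format jobs and their energies as CSV text for reporting."""
    columns = ["structure"] + sorted(k for k in jobs[0] if k != "structure")
    rows = [",".join(columns + ["energy"])]
    for job, energy in zip(jobs, energies):
        rows.append(",".join([str(job[c]) for c in columns] + [f"{energy:.6f}"]))
    return "\n".join(rows)


jobs = expand_batch(
    ["ethanol", "propanol"],
    {"functional": ["B3LYP", "PBE0"], "basis": ["def2-SVP", "def2-TZVP", "6-31G*"]},
)
# Placeholder energies, used only to show the report layout.
table = results_table(jobs, [0.0] * len(jobs))
```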
<br />
'''Prerequisites:''' Knowledge of HTML, Javascript and Python. Familiarity with JQuery, Django and PostgreSQL is helpful.<br />
<br />
'''Mentor:''' Raphaël Robidas (raphael dot robidas at usherbrooke dot ca), Claude Legault (claude dot legault at usherbrooke dot ca)<br />
<br />
== ccinput Project Ideas ==<br />
<br />
[https://github.com/cyllab/ccinput ccinput] is a library and standalone tool to create computational chemistry input files.<br />
<br />
=== Project [350 hours]: Add support for NWChem ===<br />
<br />
'''Brief Explanation:''' Implementing the creation of NWChem input files for most of its features.<br />
<br />
'''Expected Results:''' Implementing the creation of NWChem input files which follow the correct structures, implementing support of core keywords and modifiers, adding the necessary static data about NWChem (supported methods, solvents, etc.), creation of extensive unit tests for all new features, writing any necessary documentation about these new features.<br />
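<br />
For orientation, the sketch below renders a small NWChem input for a DFT calculation from a handful of parameters. It is an illustration only: <code>render_nwchem_input</code> is a hypothetical helper and does not reflect the ccinput API, and it covers just the <code>start</code>, <code>title</code>, <code>charge</code>, <code>geometry</code>, <code>basis</code>, <code>dft</code> and <code>task</code> directives.<br />

```python
def render_nwchem_input(name, atoms, charge=0, multiplicity=1,
                        functional="b3lyp", basis="6-31G*", task="energy"):
    """Return NWChem input text for a DFT job (hypothetical helper).

    atoms: iterable of (symbol, x, y, z) with coordinates in angstroms.
    task: NWChem operation such as "energy" or "optimize".
    """
    lines = [
        f"start {name}",
        f'title "{name}"',
        f"charge {charge}",
        "",
        "geometry units angstroms",
    ]
    for symbol, x, y, z in atoms:
        lines.append(f"  {symbol:<2s} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines += [
        "end",
        "",
        "basis",
        f"  * library {basis}",
        "end",
        "",
        "dft",
        f"  xc {functional}",
        f"  mult {multiplicity}",
        "end",
        "",
        f"task dft {task}",
    ]
    return "\n".join(lines) + "\n"


water = [
    ("O", 0.000000, 0.000000, 0.117300),
    ("H", 0.000000, 0.757200, -0.469200),
    ("H", 0.000000, -0.757200, -0.469200),
]
nwchem_text = render_nwchem_input("water", water, task="optimize")
```

A full implementation would also need the static data mentioned above (supported methods, solvents, and so on) to validate the requested keywords before rendering.<br />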
<br />
'''Prerequisites:''' Knowledge of Python. Familiarity with quantum chemistry is helpful, but not required.<br />
<br />
'''Mentor:''' Raphaël Robidas (raphael dot robidas at usherbrooke dot ca), Claude Legault (claude dot legault at usherbrooke dot ca)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project [175 hours]: More cartoon options for nucleic acids. ===<br />
<br />
'''Brief explanation:''' Implement additional visualizations of nucleic acids.<br />
<br />
'''Expected results:''' See https://github.com/3dmol/3Dmol.js/issues/559 <br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
<br />
=== Project [175 or 350 hours]: Improve 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Make significant improvements to 3Dmol.js functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request. <br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project [175 or 350 hours]: Improve gnina ===<br />
<br />
'''Brief explanation:''' Make significant improvements to gnina functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request. <br />
<br />
'''Prerequisites:''' Experience with CUDA/C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology. Additional project ideas are discussed at https://forum.deepchem.io/<br />
<br />
=== Project [350 hours]: Layer Documentation ===<br />
<br />
'''Brief explanation:''' DeepChem is moving towards a concept of first class layers. Improving the documentation for existing layers will help us make our current collection of layers more useful for the community. <br />
<br />
'''Expected results:''' This project should also add a tutorial for using the layers to the DeepChem tutorial series, and should plan to add a few new layers as well.<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project [350 hours]: PyTorch Porting ===<br />
<br />
'''Brief explanation:''' DeepChem is shifting towards using PyTorch as its primary backend, but many models are still implemented in TensorFlow. A good project could be to pick a TensorFlow model or two, then port its layers and model into PyTorch along with suitable unit tests. <br />
<br />
'''Expected results:''' At least one model should be ported from TensorFlow to PyTorch successfully with associated unit tests. See https://github.com/deepchem/deepchem/issues/2863<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project [350 hours]: HuggingFace Integration ===<br />
<br />
'''Brief explanation:''' HuggingFace Integration: Last year, we had a few student projects explore HuggingFace/DeepChem integration, but these projects were not able to merge HuggingFace models into DeepChem. <br />
<br />
'''Expected results:''' This project would create a working HuggingFace model in DeepChem along with tutorials on how to use HuggingFace with DeepChem. <br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project [350 hours]: Improved PINNs Support ===<br />
<br />
'''Brief explanation:''' Improving our PINNs Support: One of the exciting new features in DeepChem 2.6.0 is support for PINNs, a class of techniques to solve PDEs with neural networks. The API for this class is still rudimentary and supports only a limited class of models and requires handcoding the loss. <br />
<br />
'''Expected results:''' Extend the API to allow for a broader class of PDEs to be implemented. I’d suggest using Schrodinger’s equation as a test since Schrodinger can be solved in 1D as a toy and extended to arbitrarily high dimensions for larger molecules.<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project [350 hours]: Improved Equivariance Support ===<br />
<br />
'''Brief explanation:''' Improve Equivariant Support: DeepChem has no support for equivariant models. Given the increasing importance of equivariance for scientific machine learning this is a major oversight. <br />
<br />
'''Expected results:''' This project would aim to add a tutorial about equivariant modeling and add an equivariant model to DeepChem. You may want to use e3nn or another library to facilitate implementation.<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project [350 hours]: Improved Antibody Support ===<br />
<br />
'''Brief explanation:''' Improving Antibody Support: DeepChem at present doesn’t have much tooling or support for working with antibodies. <br />
<br />
'''Expected results:''' This project would add suitable antibody datasets to MoleculeNet and create a tutorial walking users through antibody design and modeling with DeepChem. If necessary, students may add antibody-specific models as well.<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project [350 hours]: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace is very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2022&diff=715GSoC Ideas 20222022-02-16T07:10:22Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2022, there is '''no guarantee''' that we will be selected again for GSoC in 2022.<br />
<br />
One important factor is that GSoC in 2022 will include both shorter projects and longer projects. You should consider the appropriate timeline for your project proposal.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* Size of the project (Medium = ~175 hours of work, ~6 weeks) or (Large = ~350 hours of work, ~12 weeks)<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://two.avogadro.cc/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project [Large]: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Python bindings exist, using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts, documentation, are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (mhanwell at bnl dot gov)<br />
<br />
=== Project [Medium]: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project [Medium or Large]: Tools for Interactive Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Building solvent boxes, implementing standard molecular dynamics using in-progress optimization framework.<br />
<br />
'''Expected results:''' Avogadro (v1) has interactive force field optimization allowing building and manipulation (e.g., push-pull atoms into position). Some users call this 'video game mode' ;-) A new optimization framework is in progress, including calling external programs for energies and forces. The project would enable building out MD simulations, including tools to add water or solvent boxes, build larger systems (e.g., via PackMol integration) and implement simple MD integration and thermostats.<br />
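<br />
The "simple MD integration" mentioned above is commonly done with the velocity Verlet scheme. The following is a minimal sketch in pure Python for illustration; the project itself targets C++ inside Avogadro, and the function name, argument layout, and one-dimensional harmonic test force are assumptions. Thermostats and periodic solvent boxes are deliberately left out.<br />

```python
import math


def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Propagate coordinates with the velocity Verlet integrator.

    positions, velocities, masses: flat lists, one entry per degree of freedom.
    force_fn(positions) -> list of forces in the same layout.
    """
    x = list(positions)
    v = list(velocities)
    f = force_fn(x)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
        # Full-step position update.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # Recompute forces at the new positions, then finish the velocity update.
        f = force_fn(x)
        v = [vi + 0.5 * dt * fi / m for vi, fi, m in zip(v, f, masses)]
    return x, v


def harmonic_force(x, k=1.0):
    """Force of a harmonic spring, F = -k x, for each coordinate."""
    return [-k * xi for xi in x]


# One particle on a spring: start at x = 1 with zero velocity, integrate to t = 10.
x_final, v_final = velocity_verlet([1.0], [0.0], [1.0], harmonic_force, dt=0.01, n_steps=1000)
energy = 0.5 * v_final[0] ** 2 + 0.5 * x_final[0] ** 2
```

Because the integrator is time-reversible and symplectic, the total energy of this test system stays close to its initial value of 0.5, which makes the harmonic oscillator a useful regression test before coupling to a real force field.<br />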
<br />
'''Prerequisites:''' Experience in C++, ideally with knowledge of molecular dynamics methods and tools. Some Python would be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project [Large]: Efficient Molecular Surfaces / Orbitals ===<br />
<br />
'''Brief explanation:''' Generating and rendering molecular surfaces is a common task, from solvent-accessible and solvent-excluded surfaces to molecular orbitals, electron density, spin density, etc.<br />
<br />
'''Expected results:''' An efficient multi-threaded or GPU-enabled surface generation and rendering framework for Avogadro, including mapping properties as color maps onto the surface. Ideally, this would include integration with features of QC-Devs and other packages for calculating various properties or surfaces and/or rendering them for animations.<br />
<br />
'''Prerequisites:''' Experience in C++, ideally with knowledge of OpenGL shaders. Some understanding of quantum chemistry would be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project [Medium]: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project [Medium]: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project [Medium]: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project [Large]: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project [Large]: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
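<br />
One very simple form such a model could take is a frequency table of atom environments observed in the reference set. The sketch below is an assumption-laden illustration, not Open Babel or MolVS code: molecules are plain <code>(elements, bonds)</code> tuples with explicit hydrogens, an "environment" is just the element plus its number of neighbours, and the function names and the 1% threshold are hypothetical.<br />

```python
from collections import Counter


def atom_environments(molecule):
    """Return (element, coordination number) for every atom.

    molecule: (elements, bonds) with elements a list of symbols and
    bonds a list of (i, j) atom-index pairs.
    """
    elements, bonds = molecule
    degree = [0] * len(elements)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return list(zip(elements, degree))


def fit_reference(reference_molecules):
    """Count how often each environment occurs in the reference set."""
    counts = Counter(
        env for mol in reference_molecules for env in atom_environments(mol)
    )
    return counts, sum(counts.values())


def unusual_atoms(molecule, counts, total, min_frequency=0.01):
    """Return (atom index, environment) pairs rarer than min_frequency."""
    flagged = []
    for index, env in enumerate(atom_environments(molecule)):
        if counts[env] / total < min_frequency:
            flagged.append((index, env))
    return flagged


methane = (["C", "H", "H", "H", "H"], [(0, 1), (0, 2), (0, 3), (0, 4)])
ethane = (
    ["C", "C", "H", "H", "H", "H", "H", "H"],
    [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7)],
)
counts, total = fit_reference([methane, ethane])

# A carbon with five neighbours never occurs in the reference set.
query = (["C", "H", "H", "H", "H", "H"], [(0, k) for k in range(1, 6)])
flagged = unusual_atoms(query, counts, total)
```

A realistic model would use richer environments (bond orders, charges, ring membership) and a large reference set such as ChEMBL, but the filter-or-warn logic stays the same.<br />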
<br />
Code could be modeled on MolVS using RDKit (https://molvs.readthedocs.io/en/latest/)<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
===Project: Implement new parsers===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for supporting more programs (e.g. CFOUR, xtb, NBO, GAMESS dat, MRCC, DIRAC), and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA). There may also be more programs missing that haven't been considered.<br />
<br />
'''Expected results''': Implement parsers for one or more new programs/formats, generate test data, and write unit and regression tests for each parser.<br />
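<br />
The general pattern of a parser with an accompanying test can be sketched as follows. Everything here is hypothetical: the log format belongs to an invented "toy" program, the function <code>parse_toy_output</code> is not part of cclib, and the dictionary keys only loosely mirror the kind of attributes a parser would populate. A real contribution would follow cclib's existing parser classes and test layout.<br />

```python
import re

# Matches lines such as " SCF energy = -76.0267" in the invented toy format.
ENERGY_LINE = re.compile(r"^\s*SCF energy\s*=\s*(-?\d+\.\d+)\s*$")


def parse_toy_output(lines):
    """Extract the atom count and all SCF energies from toy-format lines."""
    data = {"natom": None, "scfenergies": []}
    for line in lines:
        match = ENERGY_LINE.match(line)
        if match:
            data["scfenergies"].append(float(match.group(1)))
        elif line.strip().startswith("Number of atoms:"):
            data["natom"] = int(line.split(":")[1])
    return data


# Made-up sample output used as test data.
SAMPLE = """\
Toy program version 0.1
Number of atoms: 3
 SCF energy = -76.0267
 SCF energy = -76.0271
Normal termination
"""

parsed = parse_toy_output(SAMPLE.splitlines())
```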
<br />
'''Prerequisites''': Experience with Python, basic familiarity with computational chemistry programs, and access to the program(s) needed to generate the test data.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
===Project: Implement new bridges===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for more integrations with external programs (e.g. chemfiles, RDKit) via their Python bindings. There may also be more programs missing that haven't been considered.<br />
<br />
'''Expected results''': Implement bridges for one or more new programs, along with writing unit tests and documentation for each bridge.<br />
<br />
'''Prerequisites''': Experience with Python and ideally familiarity with the program that is being bridged.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
===Project: Implement new methods===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for more analysis methods being added directly to cclib (e.g. calculating geometric parameters). There may also be other methods that are desirable to include which haven't been considered.<br />
<br />
'''Expected results''': Implement one or more new methods, along with writing unit tests and documentation for each method.<br />
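<br />
As an example of the kind of method in question, the sketch below computes distances, angles, and dihedral angles from Cartesian coordinates using only the standard library. The function names are illustrative and are not existing cclib methods; inside cclib such code would more likely operate on the parsed coordinate arrays.<br />

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def distance(p, q):
    """Distance between points p and q."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))


def angle(p, q, r):
    """Angle p-q-r in degrees, with q at the vertex."""
    u, v = _sub(p, q), _sub(r, q)
    cos_theta = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def dihedral(p, q, r, s):
    """Signed dihedral angle p-q-r-s in degrees, from the atan2 form."""
    b1, b2, b3 = _sub(q, p), _sub(r, q), _sub(s, r)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


bond = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
right_angle = angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
gauche = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
trans = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
```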
<br />
'''Prerequisites''': Experience with Python and familiarity with the method(s) being added, depending on the complexity of the method.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
===Project: Julia bindings===<br />
<br />
'''Brief explanation''': The Julia programming language (https://julialang.org/) is growing in popularity for computational chemistry as a language that both production-level computation and analysis can be performed in seamlessly. In order to analyze computational chemistry outputs from traditional programs in Julia, rather than reimplement all cclib functionality in Julia, we should be able to call cclib from Julia directly and reuse its core functionality.<br />
<br />
'''Expected results''': Julia bindings to cclib IO functionality and a Julia-native representation of cclib data objects, with each cclib attribute accessible as a native Julia type. The bindings should be available on the default Julia package registry. The remainder of the project is more open-ended, but an example application of using the bindings would be ideal.<br />
<br />
'''Prerequisites''': Experience with Python and/or Julia, and ideally some familiarity with important quantities from computational chemistry outputs.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com)<br />
<br />
===Project: Additional visualization for OpenChemVault===<br />
<br />
'''Brief explanation''': OpenChemVault (https://github.com/cclib/openchemvault) is capable of parsing output files, storing them, and displaying geometries, but any sort of additional visualization (such as plotting molecular orbitals or spectra) is missing. The capabilities of GaussSum (http://gausssum.sourceforge.net/) are a possible starting point.<br />
<br />
'''Expected results''': Implement one or more new visualizations for the OpenChemVault web interface.<br />
<br />
'''Prerequisites''': Experience with Python and familiarity with common visualizations that are desirable for computational chemistry outputs. No previous experience with JavaScript is necessary.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, C#, and JavaScript. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Port xyz2mol to the RDKit core ===<br />
<br />
'''Brief explanation:''' Assignment of bond orders to molecules where only atomic coordinates are available is a challenging problem. The xyz2mol package from Prof. Jan H. Jensen's research group in Denmark, https://github.com/jensengroup/xyz2mol, is a robust and well-tested solution to the problem. The goal of this project is to port the xyz2mol code from Python to C++ and integrate it into the core RDKit. Jan Jensen will help us on this project by answering questions and providing advice on the re-implementation. <br />
<br />
'''Expected results:''' A C++ implementation of the xyz2mol code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Joey Storer (JWStorer at dow.com)<br />
<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project [Large]: RDKit+OpenMM GPU Molecular Force Fields ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. Needed is C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF. Another approach is the small-molecule support used by OpenFF: https://github.com/openmm/openmmforcefields<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
== QC-Devs Project Ideas ==<br />
<br />
QC-Devs (https://qcdevs.org/) develops various free, open-source, and cross-platform libraries for scientific computing, especially theoretical and computational chemistry. Our goal is to make programming accessible to chemists and promote precepts of sustainable software development. The two main pieces of the QC-Devs ecosystem are:<br />
* HORTON (electronic structure theory): https://quantumelephant.org/<br />
* ChemTools (molecular structure and reactivity): https://chemtools.org/<br />
All our repositories are hosted on Theochem organization (https://github.com/theochem) on GitHub.<br />
<br />
=== Project: Visualization of Molecular Structure and Reactivity ===<br />
'''Brief Explanation:''' ChemTools (https://github.com/theochem/chemtools) is a post-processing library for extracting chemical insight from quantum chemistry calculations. Currently, ChemTools relies on Visual Molecular Dynamics (VMD) and Matplotlib for visualization. ChemTools can generate visualization scripts for VMD, so the user can easily produce informative plots such as an isosurface of the electron density colored by the electrostatic potential.<br />
<br />
'''Expected Results:''' Add functionality to ChemTools to generate visualization scripts for PyMol, IQMol, and Avogadro. The current functionality for VMD can be used as a template.<br />
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with Python and visualization<br />
<br />
'''Mentor:''' Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), Gabriela Sánchez Díaz (sanchezg at mcmaster dot ca), and Esteban Vohringer-Martinez (estebanvohringer at qcmmlab dot com)<br />
<br />
=== Project: Visualize Chemical Reactions ===<br />
'''Brief Explanation:''' GOpt (https://github.com/theochem/gopt) is a Python library for optimizing molecular structures and determining chemical reaction pathways. Currently, GOpt can output a series of chemically relevant numerical structures (e.g., structures along the intrinsic reaction coordinate; optimization trajectories), but there is no interface to visualize these structures or perform structural or chemical analysis of them. The goal of this project is to generate visualization scripts for Avogadro, PyMol and/or IQMol, all of which can provide animations of reaction pathways and optimization trajectories. A stretch goal is to provide a workflow linking GOpt to ChemTools (https://github.com/theochem/chemtools), so that structural and reactivity indicators can be computed and visualized along reaction pathways.<br />
<br />
'''Expected Results:''' Add functionality to GOpt to generate visualization scripts for Avogadro, PyMol and/or IQMol. (Stretch goal: Interface GOpt and ChemTools to facilitate chemical reaction path analysis.)<br />
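<br />
As a rough illustration of the kind of output such an interface could produce, the sketch below writes a series of structures as one multi-frame XYZ string, a plain-text format that many molecular viewers can play back as an animation. It assumes structures are available as simple (element, x, y, z) tuples; the function name <code>trajectory_to_xyz</code> and the sample coordinates are hypothetical and are not part of GOpt.<br />
<br />
```python
from typing import List, Sequence, Tuple

# Hypothetical in-memory representation: (element symbol, x, y, z) in angstroms.
Atom = Tuple[str, float, float, float]


def trajectory_to_xyz(frames: Sequence[Sequence[Atom]], label: str = "frame") -> str:
    """Serialize a sequence of structures as one multi-frame XYZ string."""
    blocks: List[str] = []
    for index, atoms in enumerate(frames):
        # Each XYZ frame: atom count, a free-form comment line, then one atom per line.
        lines = [str(len(atoms)), f"{label} {index}"]
        for symbol, x, y, z in atoms:
            lines.append(f"{symbol:<2s} {x:12.6f} {y:12.6f} {z:12.6f}")
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


# Two made-up snapshots of H2 with a stretching bond.
frames = [
    [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.74)],
    [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.90)],
]
xyz_text = trajectory_to_xyz(frames, label="IRC point")
```
<br />
A viewer-specific script would then only need to load this file and step through its frames.<br />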
<br />
'''Difficulty Level:''' Easy<br />
<br />
'''Relevant Skills:''' Experience with Python<br />
<br />
'''Mentor:''' Derrick Yang (yxt1991 at gmail dot com) and Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
=== Project: Extended interoperability of GOpt and Quantum Chemistry Software ===<br />
'''Brief Explanation:''' GOpt (https://github.com/theochem/gopt) is a Python library for optimizing molecular structures and determining chemical reaction pathways. Currently, it obtains the required information (e.g. atomic forces and Hessian matrix) for optimization from the Gaussian quantum chemistry package. The goal of this project is to make it possible for GOpt to use Psi4, PySCF, ORCA, and NWChem at every step of the optimization. <br />
<br />
'''Expected Results:''' Expanding the scope of the GOpt library by increasing the number of quantum chemistry packages it can use for studying chemical reactions. You are expected to use IOData (https://github.com/theochem/iodata) which is a Python library for parsing, storing, and writing various quantum chemistry file formats and generating input files for quantum chemistry packages. This involves:<br />
* GOpt using IOData to write an appropriate input file for each of the above-mentioned quantum chemistry packages.<br />
* GOpt using IOData to parse the (formatted) output files from these quantum chemistry packages to extract the necessary information (energy, gradient, Hessian, etc.).<br />
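<br />
One possible way to organize the write-input / parse-output cycle is a small per-program interface that the optimizer calls at every step. The sketch below is only an illustration under that assumption: the <code>Backend</code> interface, <code>ToyBackend</code>, its made-up text format, and <code>fake_run</code> are all hypothetical, and no GOpt or IOData calls are used.<br />
<br />
```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


@dataclass
class SinglePoint:
    """Quantities an optimizer needs at one geometry."""
    energy: float
    gradient: List[float]


class Backend(Protocol):
    """Hypothetical interface; not part of GOpt or IOData."""

    def write_input(self, coordinates: List[float]) -> str: ...

    def parse_output(self, text: str) -> SinglePoint: ...


class ToyBackend:
    """Stand-in 'program' with a made-up text format, for illustration only."""

    def write_input(self, coordinates: List[float]) -> str:
        return "coords " + " ".join(f"{c:.6f}" for c in coordinates)

    def parse_output(self, text: str) -> SinglePoint:
        # Output lines look like "key=value".
        fields = dict(line.split("=", 1) for line in text.splitlines() if "=" in line)
        gradient = [float(g) for g in fields["gradient"].split()]
        return SinglePoint(energy=float(fields["energy"]), gradient=gradient)


# Registry mapping a program name to its backend; real packages would be added here.
BACKENDS: Dict[str, Backend] = {"toy": ToyBackend()}


def evaluate(program: str, coordinates: List[float], run: Callable[[str], str]) -> SinglePoint:
    """Write the input, run the program through `run`, and parse the result."""
    backend = BACKENDS[program]
    return backend.parse_output(run(backend.write_input(coordinates)))


def fake_run(input_text: str) -> str:
    # Pretends to be an external program: E = sum(x^2), gradient = 2x.
    xs = [float(tok) for tok in input_text.split()[1:]]
    grad = " ".join(str(2 * x) for x in xs)
    return f"energy={sum(x * x for x in xs)}\ngradient={grad}"


result = evaluate("toy", [1.0, -2.0], fake_run)
```
<br />
With this layout, supporting another package would mean adding one backend whose two methods delegate to the corresponding IOData reader and input writer.<br />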
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with Python.<br />
<br />
'''Mentor:''' Derrick Yang (yxt1991 at gmail dot com), Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), and Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
=== Project: Implement Workflows for Calculation and Usage of Databases of Isolated Atom Densities ===<br />
'''Brief Explanation:''' A database of atomic electron densities is often used to analyze electron densities of gas-phase molecules or condensed phases. In practice, there are many ways to calculate the electron density, using different theoretical models and computational tools. As a consequence, such a database is not a one-time effort, but rather a procedure that is regularly repeated with different computational settings and theoretical models. Setting up and processing such calculations by hand (for different elements, ions, spin states, ...) is extremely tedious and error-prone. An easy-to-use workflow would greatly reduce the burden on researchers who make use of such databases. This project also aims to facilitate the exchange and archival of atomic density databases.<br />
<br />
'''Expected Results:'''<br />
* Extension of Denspart (https://github.com/theochem/denspart) with a database that can store (spherical) atomic electron densities together with atomic metadata. This program currently uses a hard-coded database.<br />
* Development and implementation of a JSON specification for archival and exchange of atomic density databases.<br />
* Implementation of a workflow for setting up new databases. This involves (i) the generation of input files for existing quantum chemistry codes together with a suitable job script to execute the calculations on an HPC and (ii) processing the outputs of these calculations. This workflow will be implemented using other packages in the HORTON project, such as IOData, Grid, and GBasis. (See https://github.com/theochem)<br />
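<br />
To make the idea of a JSON exchange format concrete, here is a minimal sketch of what one database record might contain. Every field name (<code>schema_version</code>, <code>radial_grid</code>, and so on) is an assumption made for illustration; the actual specification is part of the project and is not defined on this page. The example density is the exact hydrogen 1s density in atomic units, exp(-2r)/pi.<br />
<br />
```python
import json
import math


def make_record(element, charge, multiplicity, method, radii, densities):
    """Build one hypothetical database entry for a spherical atomic density."""
    if len(radii) != len(densities):
        raise ValueError("radial grid and density must have the same length")
    return {
        "element": element,
        "charge": charge,
        "multiplicity": multiplicity,
        "method": method,             # free-text label for the level of theory
        "length_unit": "bohr",
        "radial_grid": list(radii),
        "density": list(densities),   # spherically averaged density on the grid
    }


# Exact ground-state hydrogen density in atomic units: rho(r) = exp(-2 r) / pi.
radii = [0.0, 0.5, 1.0, 2.0]
densities = [math.exp(-2.0 * r) / math.pi for r in radii]

database = {
    "schema_version": 1,
    "records": [make_record("H", 0, 2, "exact", radii, densities)],
}

text = json.dumps(database, indent=2, sort_keys=True)   # archive / exchange form
restored = json.loads(text)                             # read it back
```
<br />
Keeping the metadata next to the numerical data in one document is what allows a database to be regenerated or compared across computational settings.<br />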
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with Python, NumPy <br />
<br />
'''Mentor:''' Toon Verstraelen (Toon.Verstraelen at ugent dot be) and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca)<br />
<br />
=== Project: Orthogonal Procrustes for Rectangular Matrices ===<br />
'''Brief Explanation:''' Procrustes (https://github.com/theochem/procrustes) is a library for finding the optimal transformation that makes two matrices as close as possible to each other. Procrustes analysis has numerous applications in object recognition, though our primary interest pertains to its utility for quantifying chemical and physical (dis)similarity of molecular structures. Currently, when two input matrices have different numbers of columns, the smaller matrix is augmented by columns of zeros (zero-padding). An alternative to this artificial approach was recently proposed for the special case of orthogonal transformations [SIAM Journal on Matrix Analysis and Applications vol. 41, pp. 957-983 (2020)]. The goal of this project is to implement the SCFRTR method (algorithm 5.1) from this reference into the Procrustes library.<br />
<br />
'''Expected Results:''' Extension of Procrustes to include the SCFRTR algorithm as an alternative to zero-padding for unbalanced orthogonal Procrustes problems.<br />
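<br />
For orientation, the zero-padding baseline can be sketched in a few lines for the smallest non-trivial case. The code below pads a one-column matrix to two columns and then solves the two-column orthogonal Procrustes problem, min ||A Q - B||_F, in closed form by comparing the best rotation with the best reflection. It is a toy restricted to two columns, it is not the SCFRTR algorithm, and it does not use the Procrustes library API; all function names are hypothetical.<br />
<br />
```python
import math
from typing import List

Matrix = List[List[float]]


def zero_pad(a: Matrix, n_cols: int) -> Matrix:
    """Append zero columns so that every row of `a` has `n_cols` entries."""
    return [row + [0.0] * (n_cols - len(row)) for row in a]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def orthogonal_procrustes_2d(a: Matrix, b: Matrix) -> Matrix:
    """Return the 2x2 orthogonal Q minimizing ||A Q - B||_F for two-column inputs."""
    m = matmul([list(col) for col in zip(*a)], b)  # M = A^T B
    # Minimizing the error is equivalent to maximizing sum_ij Q_ij * M_ij.
    # Rotation [[c, -s], [s, c]] gives c*(M00 + M11) + s*(M10 - M01).
    rot = (m[1][0] - m[0][1], m[0][0] + m[1][1])
    # Reflection [[c, s], [s, -c]] gives c*(M00 - M11) + s*(M01 + M10).
    ref = (m[0][1] + m[1][0], m[0][0] - m[1][1])
    if math.hypot(*rot) >= math.hypot(*ref):
        t = math.atan2(*rot)
        return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
    t = math.atan2(*ref)
    return [[math.cos(t), math.sin(t)], [math.sin(t), -math.cos(t)]]


# A has one column, B has two: pad A with a zero column before fitting.
a = zero_pad([[1.0], [2.0], [3.0]], 2)
b = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
q = orthogonal_procrustes_2d(a, b)
residual = sum((x - y) ** 2 for ra, rb in zip(matmul(a, q), b) for x, y in zip(ra, rb))
```
<br />
In this example the padded column carries no information, so the fit is exact; the unbalanced formulation targeted by the project avoids introducing such artificial columns in the first place.<br />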
<br />
'''Difficulty Level:''' Advanced<br />
<br />
'''Relevant Skills:''' Experience with Python, NumPy, and numerical analysis<br />
<br />
'''Mentor:''' Ali Tehrani (19at27 at queensu dot ca), David Kim (david.kim.91 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
=== Project: Faster Molecular Integrals with Density-Fitting ===<br />
'''Brief Explanation:''' GBasis (https://github.com/theochem/gbasis) is a library for evaluating and analytically integrating Gaussian-type orbitals and their related quantities, especially molecular integrals. In many applications, the computational bottleneck is the evaluation of two-electron integrals, as the number of two-electron integrals grows as the fourth power of the basis-set size. By introducing an auxiliary, density-fitting, basis, this power is reduced to the third power of the basis-set size, which in many cases eliminates the computational bottleneck, since there are often other facets of the computation that scale more severely than this. The goal of this project is to implement density-fitting methods into GBasis. <br />
<br />
'''Expected Results:''' Extension of GBasis to support density fitting. This involves expanding products of basis functions in the auxiliary basis, evaluating 2-electron integrals in the auxiliary basis, and using these two entities to construct molecular integrals more efficiently.<br />
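<br />
The scaling argument can be checked with simple arithmetic. The sketch below counts stored quantities for the conventional four-index integrals and for the three-index intermediates used in density fitting, ignoring permutational symmetry. The assumption that the auxiliary basis is three times the size of the orbital basis is only an illustrative choice, and the function names are hypothetical.<br />
<br />
```python
def four_index_count(n_basis: int) -> int:
    """Number of two-electron integrals (ab|cd), ignoring permutational symmetry."""
    return n_basis ** 4


def three_index_count(n_basis: int, n_aux: int) -> int:
    """Number of three-index quantities (ab|P) needed for density fitting.

    With an auxiliary basis {P}, the standard approximation is
    (ab|cd) ~ sum_PQ (ab|P) [J^-1]_PQ (Q|cd), where J_PQ = (P|Q).
    """
    return n_basis ** 2 * n_aux


AUX_RATIO = 3  # assumed ratio of auxiliary to orbital basis size (illustrative only)

table = []
for n_basis in (50, 100, 200):
    conventional = four_index_count(n_basis)
    fitted = three_index_count(n_basis, AUX_RATIO * n_basis)
    table.append((n_basis, conventional, fitted, conventional / fitted))
```
<br />
Under this assumption the ratio of the two counts is n_basis / 3, so the saving grows linearly with the basis-set size.<br />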
<br />
'''Difficulty Level:''' Intermediate to Advanced<br />
<br />
'''Relevant Skills:''' Experience with Python, NumPy<br />
<br />
'''Mentor:''' Ali Tehrani (19at27 at queensu dot ca), David Kim (david.kim.91 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
== CalcUS Project Ideas ==<br />
<br />
[https://github.com/cyllab/CalcUS CalcUS] is a platform aiming to democratize access to quantum chemistry by providing a user-friendly web-based interface to simplify running and analyzing quantum mechanical calculations.<br />
<br />
=== Project [Medium]: Improving the web frontend ===<br />
<br />
'''Brief Explanation:''' CalcUS aims to provide all the relevant information from the calculations directly in the web interface, as well as tools to analyze those results. However, some useful elements of the interface are missing or suboptimal. In particular, [https://github.com/jspreadsheet/ce Jspreadsheet] should be implemented to allow data analysis in the browser. Multiple other aspects of the interface could be improved, related either to style or to functionality.<br />
<br />
'''Expected Results:''' Replace the current spreadsheet with Jspreadsheet and configure it; implement data loading from the database (PostgreSQL) and saving/download of the spreadsheet; customize elements of the UI such as alerts and error pages; keep the web pages as responsive as possible; generally improve the code and fix encountered bugs.<br />
<br />
'''Prerequisites:''' Knowledge of HTML and JavaScript and at least some knowledge of Python. Familiarity with jQuery, Django and PostgreSQL is helpful.<br />
<br />
'''Mentor:''' Raphaël Robidas (raphael dot robidas at usherbrooke dot com)<br />
<br />
<br />
=== Project [Medium]: Develop large-scale calculation management tools ===<br />
<br />
'''Brief Explanation:''' Quantum chemistry projects can involve performing calculations on a large number of structures (10-100) with different parameters. CalcUS should have features to make this process seamless and highly automated, from launching the calculations to reporting the results.<br />
<br />
'''Expected Results:''' Create a variation of the calculation web UI, aimed specifically at batch calculations with variable parameters, design and implement the workflow to handle these batch calculations, implement results gathering and reporting in a convenient format, write relevant unit and/or integration tests.<br />
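<br />
The core of a batch workflow with variable parameters is expanding a set of structures and parameter choices into individual jobs. A minimal sketch of that expansion is shown below; the function <code>expand_batch</code>, the dictionary layout, and the file names are hypothetical and do not reflect CalcUS internals.<br />
<br />
```python
from itertools import product
from typing import Dict, List, Sequence


def expand_batch(structures: Sequence[str], parameter_grid: Dict[str, Sequence[str]]) -> List[dict]:
    """Create one job description per structure and parameter combination."""
    names = sorted(parameter_grid)  # fixed order so the job list is reproducible
    jobs = []
    for structure in structures:
        for values in product(*(parameter_grid[name] for name in names)):
            jobs.append({"structure": structure, **dict(zip(names, values))})
    return jobs


jobs = expand_batch(
    ["mol_a.xyz", "mol_b.xyz"],
    {"functional": ["B3LYP", "PBE0"], "basis": ["def2-SVP", "def2-TZVP"]},
)
```
<br />
Here two structures, two functionals, and two basis sets expand into eight jobs, each of which could be queued and later collected into a single results table.<br />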
<br />
'''Prerequisites:''' Knowledge of HTML, JavaScript and Python. Familiarity with jQuery, Django and PostgreSQL is helpful.<br />
<br />
'''Mentor:''' Raphaël Robidas (raphael dot robidas at usherbrooke dot com)<br />
<br />
<br />
=== Project [Large]: Implement multi-step calculation protocols ===<br />
<br />
'''Brief Explanation:''' Quantum chemistry projects often involve the same series of sequential calculations. Currently, each calculation has to be launched manually, which is often not necessary. This project aims to add the feature to create custom multi-step calculation protocols as well as the underlying mechanics which make the protocols run smoothly.<br />
<br />
'''Expected Results:''' Add an interface to create multi-step protocols, create the data structures to store these protocols and their progress, integrate the automated launch of subsequent steps using the current calculation handling code, add simple verifications after each step completion, write relevant unit and/or integration tests.<br />
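<br />
As one way to picture the data structures involved, the sketch below stores a protocol as an ordered list of steps with a status each, launches the next pending step automatically, and runs a simple verification after each step. The classes <code>Step</code> and <code>CalcProtocol</code> and the toy step functions are hypothetical; real steps would submit calculations through the existing CalcUS handling code.<br />
<br />
```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]                      # receives the previous step's result
    check: Optional[Callable[[dict], bool]] = None   # optional verification of the result
    status: str = "pending"                          # "pending", "done", or "failed"


@dataclass
class CalcProtocol:
    steps: List[Step]
    results: List[dict] = field(default_factory=list)

    def advance(self) -> bool:
        """Run the next pending step; return False if none is left or a step failed."""
        if any(step.status == "failed" for step in self.steps):
            return False
        pending = next((s for s in self.steps if s.status == "pending"), None)
        if pending is None:
            return False
        previous = self.results[-1] if self.results else {}
        result = pending.run(previous)
        ok = pending.check is None or pending.check(result)
        pending.status = "done" if ok else "failed"
        if ok:
            self.results.append(result)
        return ok


# Toy two-step protocol: an "optimization" followed by a "frequency" step.
protocol = CalcProtocol(steps=[
    Step("optimize", lambda prev: {"converged": True, "energy": -1.0},
         check=lambda result: result["converged"]),
    Step("frequencies", lambda prev: {"n_imaginary": 0, "from_energy": prev["energy"]}),
])
while protocol.advance():
    pass
```
<br />
Because the status of each step is stored with the protocol, progress can be persisted and shown to the user, and a failed verification stops later steps from launching.<br />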
<br />
'''Prerequisites:''' Knowledge of HTML, JavaScript and Python. Familiarity with jQuery, Django and PostgreSQL is helpful.<br />
<br />
'''Mentor:''' Raphaël Robidas (raphael dot robidas at usherbrooke dot com)<br />
<br />
== ccinput Project Ideas ==<br />
<br />
[https://github.com/cyllab/ccinput ccinput] is a library and standalone tool to create computational chemistry input files.<br />
<br />
=== Project [Large]: Add support for NWChem ===<br />
<br />
'''Brief Explanation:''' Implementing the creation of NWChem input files for most of its features.<br />
<br />
'''Expected Results:''' Implementing the creation of NWChem input files which follow the correct structures, implementing support of various keywords and modifiers, allowing the use of the Basis Set Exchange data, adding the relevant static data about NWChem (supported methods, solvents, etc.), creation of extensive unit tests for all features, writing the documentation.<br />
<br />
'''Prerequisites:''' Knowledge of Python. Familiarity with quantum chemistry is helpful, but not required.<br />
<br />
'''Mentor:''' Raphaël Robidas (raphael dot robidas at usherbrooke dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Improve 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Make significant improvements to 3Dmol.js functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming. Some experience with OpenGL/WebGL is ideal but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project: Improve gnina ===<br />
<br />
'''Brief explanation:''' Make significant improvements to gnina functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.<br />
<br />
'''Prerequisites:''' Experience with CUDA/C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology. Additional project ideas are discussed at https://forum.deepchem.io/t/google-summer-of-code-ideas/356.<br />
<br />
=== Project: PyTorch Lightning Implementation ===<br />
<br />
'''Brief explanation:''' Allow for implementation of DeepChem models in PyTorch Lightning.<br />
<br />
'''Expected results:''' PyTorch Lightning is a popular framework for PyTorch. This project would look into enabling the easy construction of PyTorch Lightning based models for DeepChem. Completion of this project should require the implementation of a good test suite and a Jupyter notebook tutorial for implementing PyTorch Lightning models in DeepChem.<br />
<br />
'''Prerequisites:''' PyTorch Lightning, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project: Semiconductor Modeling Support ===<br />
<br />
'''Brief explanation:''' Add support for semiconductor modeling deep learning tools.<br />
<br />
'''Expected results:''' This project would involve implementing semiconductor models from https://arxiv.org/ftp/arxiv/papers/2101/2101.04383.pdf. These models should be added to DeepChem along with suitable tests and a suitable Jupyter notebook usage tutorial.<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol-compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
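<br />
The difference between sharing pixels and sharing state can be sketched with a toy facilitator that keeps one shared viewer state and forwards only the changed values to the other clients. Everything here is hypothetical: the <code>Facilitator</code> class, the message layout, and the <code>rotation</code> key are illustrations, not the OneMol API, and clients are modeled as plain callables instead of network connections.<br />
<br />
```python
import json
from typing import Callable, Dict, List


class Facilitator:
    """Keeps a consistent viewer state and relays small state deltas to clients."""

    def __init__(self) -> None:
        self.state: Dict[str, object] = {}
        self.clients: Dict[str, Callable[[str], None]] = {}

    def join(self, client_id: str, deliver: Callable[[str], None]) -> None:
        # New clients receive the full current state once.
        self.clients[client_id] = deliver
        deliver(json.dumps({"type": "snapshot", "state": self.state}))

    def update(self, sender_id: str, delta: Dict[str, object]) -> None:
        # Apply the change, then forward only the delta to everyone else.
        self.state.update(delta)
        message = json.dumps({"type": "delta", "changes": delta})
        for client_id, deliver in self.clients.items():
            if client_id != sender_id:
                deliver(message)


inbox_a: List[str] = []
inbox_b: List[str] = []
hub = Facilitator()
hub.join("a", inbox_a.append)
hub.join("b", inbox_b.append)

# Client "a" rotates the view; only three numbers travel to client "b".
hub.update("a", {"rotation": [0.0, 90.0, 0.0]})
```
<br />
A rotation therefore costs a message of a few dozen bytes per client, independent of screen resolution, and a client that joins later still receives the current orientation through its initial snapshot.<br />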
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace is very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2022&diff=710GSoC Ideas 20222022-02-08T14:32:34Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2022, there is '''no guarantee''' that we will be selected again for GSoC in 2022.<br />
<br />
One important factor is that GSoC in 2022 will include both shorter projects and longer projects. You should consider the appropriate timeline for your project proposal.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://two.avogadro.cc/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Python-based Compute and Data Server ===<br />
<br />
'''Brief explanation:''' Avogadro would be more powerful with a local compute and data server.<br />
<br />
'''Expected results:''' A number of projects have built servers for larger projects that can also do compute, Jupyter, etc. Python has a number of lightweight data server frameworks such as FastAPI where RESTful APIs can be developed rapidly. Using this as a basis along with PostgreSQL, EdgeDB, or other database technologies, the project would build a lightweight data layer for storing, searching, and visualizing data. Ideally this would be packaged in a container, and deployable to the cloud or run locally via pip or conda. A stretch goal would be to implement simple queuing and execution of jobs within the server API, reusing Python projects to handle queuing, execution, etc.<br />
<br />
'''Prerequisites:''' Experience in Python, some experience with C++/Qt and RESTful APIs.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (mhanwell at bnl.gov)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Python bindings exist, using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (mhanwell at bnl dot gov)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Tools for Interactive Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Building solvent boxes, implementing standard molecular dynamics using in-progress optimization framework.<br />
<br />
'''Expected results:''' Avogadro (v1) has interactive force field optimization allowing building and manipulation (e.g., push-pull atoms into position). Some users call this 'video game mode' ;-) A new optimization framework is in progress, including calling external programs for energies and forces. The project would enable building out MD simulations, including tools to add water or solvent boxes, build larger systems (e.g., via PackMol integration) and implement simple MD integration and thermostats.<br />
<br />
'''Prerequisites:''' Experience in C++, ideally with knowledge of molecular dynamics methods and tools. Some Python would be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats, to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
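<br />
A very simple version of such a model counts how often each atom environment occurs in the reference set and flags query atoms whose environment is rare or unseen. The sketch below uses (element, number of neighbors) as the environment and a toy molecule representation of atom symbols plus bonded index pairs with explicit hydrogens. Both choices, as well as the function names, are assumptions for illustration; a real implementation would use the toolkit's own molecule objects and richer atom typing.<br />
<br />
```python
from collections import Counter
from typing import List, Tuple

# Toy representation: (atom symbols, list of bonded index pairs), explicit hydrogens.
Molecule = Tuple[List[str], List[Tuple[int, int]]]


def environments(mol: Molecule) -> List[Tuple[str, int]]:
    """Describe each atom by its element and its number of bonded neighbors."""
    atoms, bonds = mol
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return list(zip(atoms, degree))


def build_model(reference: List[Molecule]) -> Counter:
    """Count how often each environment appears in the reference structures."""
    counts: Counter = Counter()
    for mol in reference:
        counts.update(environments(mol))
    return counts


def unusual_atoms(mol: Molecule, model: Counter, min_count: int = 1):
    """Return (atom index, environment) for environments seen fewer than min_count times."""
    return [(idx, env) for idx, env in enumerate(environments(mol)) if model[env] < min_count]


methane: Molecule = (["C", "H", "H", "H", "H"], [(0, 1), (0, 2), (0, 3), (0, 4)])
water: Molecule = (["O", "H", "H"], [(0, 1), (0, 2)])
model = build_model([methane, water])

# A carbon with five neighbors never occurs in the reference set.
odd: Molecule = (["C", "H", "H", "H", "H", "H"], [(0, i) for i in range(1, 6)])
flags = unusual_atoms(odd, model)
```
<br />
Raising <code>min_count</code> turns the same model from a hard filter for unseen environments into a softer warning for rare ones.<br />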
<br />
Code could be modeled on MolVS, which uses RDKit (https://molvs.readthedocs.io/en/latest/).<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
===Project: Implement new parsers===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for supporting more programs (e.g. CFOUR, xtb, NBO, GAMESS dat, MRCC, DIRAC), and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA). There may also be more programs missing that haven't been considered.<br />
<br />
'''Expected results''': Implement parsers for one or more new programs/formats, generate test data, and write unit and regression tests for each parser.<br />
<br />
'''Prerequisites''': Experience with Python, basic familiarity with computational chemistry programs, and access to the program(s) needed to generate the test data.<br />
<br />
'''Mentors''': Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)<br />
<br />
===Project: Implement new bridges===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for more integrations with external programs (e.g. chemfiles, RDKit) via their Python bindings. There may also be more programs missing that haven't been considered.<br />
<br />
'''Expected results''': Implement bridges for one or more new programs, along with writing unit tests and documentation for each bridge.<br />
<br />
'''Prerequisites''': Experience with Python and ideally familiarity with the program that is being bridged.<br />
<br />
'''Mentors''': Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)<br />
<br />
===Project: Implement new methods===<br />
<br />
'''Brief explanation''': There are outstanding issues on GitHub for more analysis methods being added directly to cclib (e.g. calculating geometric parameters). There may also be other methods that are desirable to include which haven't been considered.<br />
<br />
'''Expected results''': Implement one or more new methods, along with writing unit tests and documentation for each method.<br />
<br />
'''Prerequisites''': Experience with Python and familiarity with the method(s) being added, depending on the complexity of the method.<br />
<br />
'''Mentors''': Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)<br />
<br />
===Project: Julia bindings===<br />
<br />
'''Brief explanation''': The Julia programming language (https://julialang.org/) is growing in popularity for computational chemistry as a language that both production-level computation and analysis can be performed in seamlessly. In order to analyze computational chemistry outputs from traditional programs in Julia, rather than reimplement all cclib functionality in Julia, we should be able to call cclib from Julia directly and reuse its core functionality.<br />
<br />
'''Expected results''': Julia bindings to cclib IO functionality and a Julia-native representation of cclib data objects, with each cclib attribute accessible as a native Julia type. The bindings should be available on the default Julia package registry. The remainder of the project is more open-ended, but an example application of using the bindings would be ideal.<br />
<br />
'''Prerequisites''': Experience with Python and/or Julia, and ideally some familiarity with important quantities from computational chemistry outputs.<br />
<br />
'''Mentors''': Eric Berquist (eric.john.berquist at gmail dot com)<br />
<br />
===Project: Additional visualization for OpenChemVault===<br />
<br />
'''Brief explanation''': OpenChemVault (https://github.com/cclib/openchemvault) is capable of parsing output files, storing them, and displaying geometries, but any sort of additional visualization (such as plotting molecular orbitals or spectra) is missing. The capabilities of GaussSum (http://gausssum.sourceforge.net/) are a possible starting point.<br />
<br />
'''Expected results''': Implement one or more new visualizations for the OpenChemVault web interface.<br />
<br />
'''Prerequisites''': Experience with Python and with common visualizations that are desirable for computational chemistry outputs. No previous experience with JavaScript is necessary.<br />
<br />
'''Mentors''': Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit+OpenMM GPU Molecular Force Fields ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. The project should deliver C++ functionality that allows RDKit molecules to be sent to OpenMM for minimization and/or molecular dynamics, a robust set of regression tests for this functionality, and Python wrappers around it. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
== QC-Devs Project Ideas ==<br />
<br />
QC-Devs (https://qcdevs.org/) develops various free, open-source, and cross-platform libraries for scientific computing, especially theoretical and computational chemistry. Our goal is to make programming accessible to chemists and promote precepts of sustainable software development. The two main pieces of the QC-Devs ecosystem are:<br />
* HORTON (electronic structure theory): https://quantumelephant.org/<br />
* ChemTools (molecular structure and reactivity): https://chemtools.org/<br />
All our repositories are hosted under the Theochem organization (https://github.com/theochem) on GitHub.<br />
<br />
=== Project: Visualization of Molecular Structure and Reactivity ===<br />
'''Brief Explanation:''' ChemTools (https://github.com/theochem/chemtools) is a post-processing library for extracting chemical insight from quantum chemistry calculations. Currently, ChemTools relies on Visual Molecular Dynamics (VMD) and Matplotlib for visualization. ChemTools has the functionality to generate visualization scripts for VMD, so the user can easily generate informative plots like iso-surface of electron density colored by electrostatic potential. <br />
<br />
'''Expected Results:''' Add functionality to ChemTools to generate visualization scripts for PyMol, IQMol, and Avogadro. The current functionality for VMD can be used as a template.<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with Python and visualization<br />
<br />
'''Mentor:''' Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), Gabriela Sánchez Díaz (sanchezg at mcmaster dot ca), and Esteban Vohringer-Martinez (estebanvohringer at qcmmlab dot com)<br />
<br />
=== Project: Visualize Chemical Reactions ===<br />
'''Brief Explanation:''' GOpt (https://github.com/theochem/gopt) is a Python library for optimizing molecular structures and determining chemical reaction pathways. Currently, GOpt can output a series of chemically relevant numerical structures (e.g., structures along the intrinsic reaction coordinate; optimization trajectories), but there is no interface to visualize these structures or perform structural or chemical analysis of them. The goal of this project is to generate visualization scripts for Avogadro, PyMol and/or IQMol, all of which can provide animations of reaction pathways and optimization trajectories. A stretch goal is to provide a workflow linking GOpt to ChemTools (https://github.com/theochem/chemtools), so that structural and reactivity indicators can be computed and visualized along reaction pathways.<br />
<br />
'''Expected Results:''' Add functionality to GOpt to generate visualization scripts for Avogadro, PyMol and/or IQMol. (Stretch goal: Interface GOpt and ChemTools to facilitate chemical reaction path analysis.)<br />
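<br />
One simple way to hand a series of structures to a viewer is a multi-frame XYZ file, which molecular viewers can typically load as an animation. The sketch below is only an illustration under that assumption: the function name and the input layout (a list of frames, each a list of <code>(symbol, x, y, z)</code> tuples in Ångström) are hypothetical and do not reflect how GOpt stores its trajectories.<br />

```python
def xyz_trajectory(frames, comment_prefix="frame"):
    """Return a multi-frame XYZ string.

    frames: list of frames; each frame is a list of (symbol, x, y, z)
    tuples with coordinates in Angstrom.
    """
    lines = []
    for index, atoms in enumerate(frames):
        lines.append(str(len(atoms)))             # atom count line
        lines.append(f"{comment_prefix} {index}")  # free-form comment line
        for symbol, x, y, z in atoms:
            lines.append(f"{symbol:<2s} {x:14.8f} {y:14.8f} {z:14.8f}")
    return "\n".join(lines) + "\n"


# Two frames of H2 with a stretching bond.
frames = [
    [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.74)],
    [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.80)],
]
trajectory_text = xyz_trajectory(frames)
```

Viewer-specific scripts (for example, to set up a movie or color scheme) would then only need to reference the written file.<br />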
<br />
'''Difficulty Level:''' Easy<br />
<br />
'''Relevant Skills:''' Experience with Python<br />
<br />
'''Mentor:''' Derrick Yang (yxt1991 at gmail dot com) and Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
=== Project: Extended interoperability of GOpt and Quantum Chemistry Software ===<br />
'''Brief Explanation:''' GOpt (https://github.com/theochem/gopt) is a Python library for optimizing molecular structures and determining chemical reaction pathways. Currently, it obtains the required information (e.g. atomic forces and Hessian matrix) for optimization from the Gaussian quantum chemistry package. The goal of this project is to make it possible for GOpt to use Psi4, PySCF, ORCA, and NWChem at every step of the optimization. <br />
<br />
'''Expected Results:''' Expanding the scope of the GOpt library by increasing the number of quantum chemistry packages it can use for studying chemical reactions. You are expected to use IOData (https://github.com/theochem/iodata) which is a Python library for parsing, storing, and writing various quantum chemistry file formats and generating input files for quantum chemistry packages. This involves:<br />
* GOpt using IOData to write an appropriate input file for each of the above-mentioned quantum chemistry packages.<br />
* GOpt using IOData to parse the (formatted) output files from these quantum chemistry packages to extract the required information (energy, gradient, Hessian, etc.).<br />
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with Python.<br />
<br />
'''Mentor:''' Derrick Yang (yxt1991 at gmail dot com), Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), and Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
=== Project: Implement Workflows for Calculation and Usage of Databases of Isolated Atom Densities ===<br />
'''Brief Explanation:''' A database of atomic electron densities is often used to analyze electron densities of gas-phase molecules or condensed phases. In practice, there are many ways to calculate the electron density, using different theoretical models and computational tools. As a consequence, such a database is not a one-time effort, but rather a procedure that is regularly repeated with different computational settings and theoretical models. Setting up and processing such calculations by hand (for different elements, ions, spin states, ...) is extremely tedious and error-prone. The implementation of an easy-to-use workflow would heavily reduce the burden of researchers who make use of such databases. This project also aims to facilitate the exchange and archival of atomic density databases.<br />
<br />
'''Expected Results:'''<br />
* Extension of Denspart (https://github.com/theochem/denspart) with a database that can store (spherical) atomic electron densities together with atomic metadata. This program currently uses a hard-coded database.<br />
* Development and implementation of a JSON specification for archival and exchange of atomic density databases.<br />
* Implementation of a workflow for setting up new databases. This involves (i) the generation of input files for existing quantum chemistry codes together with a suitable job script to execute the calculations on an HPC and (ii) processing the outputs of these calculations. This workflow will be implemented using other packages in the HORTON project, such as IOData, Grid, and GBasis. (See https://github.com/theochem)<br />
<br />
'''Difficulty Level:''' Intermediate<br />
<br />
'''Relevant Skills:''' Experience with Python, NumPy <br />
<br />
'''Mentor:''' Toon Verstraelen (Toon.Verstraelen at ugent dot be) and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca)<br />
<br />
=== Project: Orthogonal Procrustes for Rectangular Matrices ===<br />
'''Brief Explanation:''' Procrustes (https://github.com/theochem/procrustes) is a library for finding the optimal transformation that makes two matrices as close as possible to each other. Procrustes analysis has numerous applications in object recognition, though our primary interest pertains to its utility for quantifying chemical and physical (dis)similarity of molecular structures. Currently, when two input matrices have different numbers of columns, the smaller matrix is augmented by columns of zeros (zero-padding). An alternative to this artificial approach was recently proposed for the special case of orthogonal transformations [SIAM Journal on Matrix Analysis and Applications vol. 41, pp. 957-983 (2020)]. The goal of this project is to implement the SCFRTR method (algorithm 5.1) from this reference into the Procrustes library. <br />
<br />
'''Expected Results:''' Extension of Procrustes to include the SCFRTR algorithm as an alternative to zero-padding for unbalanced orthogonal Procrustes problems. <br />
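<br />
For orientation, the balanced and unbalanced problems can be written out as below. The notation (matrices A and B, transformation Q) is chosen here for illustration and is not taken from the Procrustes library or the cited paper; the closed-form solution shown for the balanced case is the classical SVD result.<br />

```latex
% Balanced case: A, B \in \mathbb{R}^{m \times n}
\min_{Q^{\mathsf T} Q = I_n} \lVert A Q - B \rVert_F ,
\qquad
A^{\mathsf T} B = U \Sigma V^{\mathsf T}
\;\Longrightarrow\;
Q^{\star} = U V^{\mathsf T}

% Unbalanced case: A \in \mathbb{R}^{m \times n},\; B \in \mathbb{R}^{m \times k},\; k < n,
% with a rectangular Q \in \mathbb{R}^{n \times k} that has orthonormal columns
\min_{Q^{\mathsf T} Q = I_k} \lVert A Q - B \rVert_F

% Zero-padding workaround: replace B by the padded matrix
\tilde{B} = \begin{bmatrix} B & 0 \end{bmatrix} \in \mathbb{R}^{m \times n}
% and solve the balanced problem for A and \tilde{B}.
```

The unbalanced problem has no closed-form SVD solution in general, which is why an iterative method is needed.<br />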
<br />
'''Difficulty Level:''' Advanced<br />
<br />
'''Relevant Skills:''' Experience with Python, NumPy, and numerical analysis<br />
<br />
'''Mentor:''' Ali Tehrani (19at27 at queensu dot ca), David Kim (david.kim.91 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
=== Project: Faster Molecular Integrals with Density-Fitting ===<br />
'''Brief Explanation:''' GBasis (https://github.com/theochem/gbasis) is a library for evaluating and analytically integrating Gaussian-type orbitals and their related quantities, especially molecular integrals. In many applications, the computational bottleneck is the evaluation of two-electron integrals, as the number of two-electron integrals grows as the fourth power of the basis-set size. By introducing an auxiliary, density-fitting, basis, this power is reduced to the third power of the basis-set size, which in many cases eliminates the computational bottleneck, since there are often other facets of the computation that scale more severely than this. The goal of this project is to implement density-fitting methods into GBasis. <br />
<br />
'''Expected Results:''' Extension of GBasis to support density fitting. This involves expanding products of basis functions in the auxiliary basis, evaluating 2-electron integrals in the auxiliary basis, and using these two entities to construct molecular integrals more efficiently.<br />
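<br />
The three ingredients listed above correspond to the standard density-fitting (resolution-of-the-identity) approximation with a Coulomb metric. The symbols below are chosen for illustration: Greek indices label the N orbital basis functions and P, Q label the N_aux auxiliary functions.<br />

```latex
% Four-index integrals approximated from three- and two-index quantities
(\mu\nu|\lambda\sigma) \;\approx\; \sum_{P,Q} (\mu\nu|P)\,\bigl[J^{-1}\bigr]_{PQ}\,(Q|\lambda\sigma),
\qquad
J_{PQ} = (P|Q)

% Number of stored integrals
N^{4} \;\longrightarrow\; N^{2} N_{\mathrm{aux}} + N_{\mathrm{aux}}^{2}
```

Because N_aux typically grows in proportion to N, the number of integrals that must be evaluated and stored grows as the third power of the basis-set size rather than the fourth.<br />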
<br />
'''Difficulty Level:''' Intermediate to Advanced<br />
<br />
'''Relevant Skills:''' Experience with Python, NumPy<br />
<br />
'''Mentor:''' Ali Tehrani (19at27 at queensu dot ca), David Kim (david.kim.91 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Improve 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Make significant improvements to 3Dmol.js functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request. <br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project: Improve gnina ===<br />
<br />
'''Brief explanation:''' Make significant improvements to gnina functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request. <br />
<br />
'''Prerequisites:''' Experience with CUDA/C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions about how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing them in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
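<br />
To make the difference between a single dense object and linked objects concrete, here is a minimal Python sketch. The input layout, the <code>ex:</code> vocabulary (pointing at a placeholder example.org URL), and the function name are all hypothetical; they do not reflect the actual NWChem JSON or Chemical JSON formats. Only the JSON-LD keywords <code>@context</code>, <code>@graph</code>, <code>@id</code>, and <code>@type</code> are standard.<br />

```python
import json

# Hypothetical vocabulary prefix, used only for this illustration.
CONTEXT = {"ex": "https://example.org/chem#"}


def to_linked(dense):
    """Split a dense calculation record into two linked JSON-LD nodes."""
    molecule_id = "ex:molecule/" + dense["molecule"]["name"]
    calculation_id = "ex:calculation/" + dense["id"]
    molecule = {
        "@id": molecule_id,
        "@type": "ex:Molecule",
        "ex:formula": dense["molecule"]["formula"],
    }
    calculation = {
        "@id": calculation_id,
        "@type": "ex:Calculation",
        "ex:energy": dense["energy"],
        # The molecule is referenced by identifier instead of being nested.
        "ex:molecule": {"@id": molecule_id},
    }
    return {"@context": CONTEXT, "@graph": [molecule, calculation]}


dense_record = {
    "id": "run-001",
    "energy": -76.0,
    "molecule": {"name": "water", "formula": "H2O"},
}
linked_record = to_linked(dense_record)
linked_text = json.dumps(linked_record, indent=2)
```

Once the molecule has its own identifier, other calculations or experimental records can point at the same node, which is what makes the data fit naturally into a triple-store.<br />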
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology. Additional project ideas are discussed at https://forum.deepchem.io/t/google-summer-of-code-ideas/356.<br />
<br />
=== Project: PyTorch Lightning Implementation ===<br />
<br />
'''Brief explanation:''' Allow for implementation of DeepChem models in PyTorch Lightning.<br />
<br />
'''Expected results:''' PyTorch Lightning is a popular framework for PyTorch. This project would look into enabling the easy construction of PyTorch Lightning based models for DeepChem. Completion of this project should require the implementation of a good test suite and a Jupyter notebook tutorial for implementing PyTorch Lightning models in DeepChem.<br />
<br />
'''Prerequisites:''' PyTorch Lightning, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
=== Project: Semiconductor Modeling Support ===<br />
<br />
'''Brief explanation:''' Add support for semiconductor modeling deep learning tools.<br />
<br />
'''Expected results:''' This project would involve implementing semiconductor models from https://arxiv.org/ftp/arxiv/papers/2101/2101.04383.pdf. These models should be added to DeepChem along with suitable tests and a Jupyter notebook usage tutorial.<br />
<br />
'''Prerequisites:''' PyTorch/TensorFlow, Python<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at deepforestsci dot com)<br />
<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
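<br />
The rotation example can be illustrated with a small Python sketch that sends only the change in view state. The message layout, field names, and function names are hypothetical and are not part of any OneMol specification; the frame size assumes an uncompressed 1920x1080 RGB image for comparison.<br />

```python
import json


def rotation_message(client_id, quaternion):
    """Serialize a view rotation as a small JSON message (hypothetical format)."""
    return json.dumps(
        {"type": "view.rotate", "client": client_id, "quaternion": list(quaternion)}
    )


def apply_message(view_state, message):
    """Return a new view state with the received change applied."""
    event = json.loads(message)
    if event["type"] == "view.rotate":
        view_state = dict(view_state, quaternion=event["quaternion"])
    return view_state


message = rotation_message("client-1", (0.7071, 0.0, 0.7071, 0.0))
message_bytes = len(message.encode("utf-8"))

# Size of one uncompressed full-HD RGB frame, as screen sharing would send.
frame_bytes = 1920 * 1080 * 3
```

The state-change message is a few dozen bytes, while a single uncompressed frame is several megabytes, and every client can re-render the rotation locally.<br />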
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace would be very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2020&diff=682GSoC Ideas 20202020-03-08T06:37:06Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2020, there is '''no guarantee''' that we will be selected again for GSoC in 2020.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. The [[Applying_to_GSoC|applying to GSoC page]] explains what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Rudimentary support for residues, and reading secondary structure (e.g., PDB format) is now present. Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels are desired. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideally, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Initial Python bindings have been re-implemented using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
<br />
=== Project: Improve Avogadro Quantum Processing & Analysis ===<br />
<br />
'''Brief explanation:''' Visualizing quantum mechanical data like orbitals, electron density, etc. is slow. Replace Avogadro's current orbital rendering to use the efficient Gau2Grid library [[https://gau2grid.readthedocs.io/en/latest/]] and add analysis tools.<br />
<br />
'''Expected results:''' Very fast real-time rendering of volumetric quantum chemical data within Avogadro, ideally including processing and analysis of surfaces / volumes, orbitals, etc. For example, sometimes the gradient or the Laplacian of a surface is useful. Add tools to add/subtract or join / intersect surfaces and map properties (e.g., electrostatic potential mapped onto the electron density).<br />
<br />
'''Prerequisites:''' Experience in C++, an understanding of vectorization and intrinsics would be helpful.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
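<br />
As a very small illustration of the idea, the sketch below counts how often each (element, coordination number) environment occurs in a reference set and flags atoms in a query whose environment was never seen. The molecule representation (a list of element symbols plus a list of bonded index pairs, with explicit hydrogens) and all function names are hypothetical; a real implementation would work on Open Babel or RDKit molecule objects and use richer atom environments.<br />

```python
from collections import Counter


def atom_environments(molecule):
    """Return one (element, coordination number) pair per atom.

    molecule: (symbols, bonds), where bonds is a list of index pairs.
    """
    symbols, bonds = molecule
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return [(symbol, degree[index]) for index, symbol in enumerate(symbols)]


def build_model(reference_molecules):
    """Count how often each atom environment occurs in the reference set."""
    counts = Counter()
    for molecule in reference_molecules:
        counts.update(atom_environments(molecule))
    return counts


def unusual_atoms(molecule, model, min_count=1):
    """List (atom index, environment) pairs seen fewer than min_count times."""
    return [
        (index, env)
        for index, env in enumerate(atom_environments(molecule))
        if model[env] < min_count
    ]


methane = (["C", "H", "H", "H", "H"], [(0, 1), (0, 2), (0, 3), (0, 4)])
model = build_model([methane])

# A carbon with five bonded hydrogens: the hydrogens look normal, the carbon does not.
five_coordinate = (["C"] + ["H"] * 5, [(0, k) for k in range(1, 6)])
flags = unusual_atoms(five_coordinate, model)
```

Raising <code>min_count</code> turns the hard "never seen" test into a rarity threshold, which is closer to the "how normal/unusual" scoring described above.<br />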
<br />
Code could be modeled on MolVS using RDKit [[https://molvs.readthedocs.io/en/latest/]]<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Support for QCSchema JSON output ===<br />
<br />
'''Brief explanation:''' The library already allows importing and exporting data between several formats. The QCSchema is a new JSON format that tries to standardize the way computational chemistry data is written and shared, so supporting the effort can be useful.<br />
<br />
'''Expected results:''' Implement JSON output that conforms to the conventions of the [https://github.com/MolSSI/QCSchema MolSSI QCSchema].<br />
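<br />
A minimal sketch of what such an export might look like for the molecule part of the schema is shown below. It assumes the input arrives as atomic numbers and Cartesian coordinates in Angstrom, and that a QCSchema molecule carries <code>symbols</code> and a flat <code>geometry</code> array in Bohr; the function name and the truncated element table are hypothetical, and the field names should be checked against the current QCSchema before use.<br />

```python
import json

BOHR_PER_ANGSTROM = 1.8897261246  # approximate conversion factor

# Truncated element table, for illustration only.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}


def to_qcschema_molecule(atomnos, coords_angstrom):
    """Build a QCSchema-style molecule dict (field names are assumptions)."""
    symbols = [SYMBOLS[z] for z in atomnos]
    # Flatten [[x, y, z], ...] into [x, y, z, x, y, z, ...] and convert to Bohr.
    geometry = [x * BOHR_PER_ANGSTROM for xyz in coords_angstrom for x in xyz]
    return {
        "schema_name": "qcschema_molecule",
        "schema_version": 2,
        "symbols": symbols,
        "geometry": geometry,
    }


water = to_qcschema_molecule(
    [8, 1, 1],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.96], [0.93, 0.0, -0.24]],
)
print(json.dumps(water, indent=2))
```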
<br />
'''Suggested readings:'''<br />
* This [https://github.com/cclib/cclib/issues/643 cclib issue] and the references there.<br />
<br />
'''Prerequisites:''' Experience with Python, some experience with physics and chemistry also recommended.<br />
<br />
'''Mentor:''' Eric Berquist (eric.john.berquist at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Expected results:''' Implement additional analysis and quantum calculation methods, such as ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges, with examples and tests.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for supporting more programs, and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry logfiles online, and provides the ability to extract the data they contain with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Improve 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Make significant improvements to 3Dmol.js functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project: Improve gnina ===<br />
<br />
'''Brief explanation:''' Make significant improvements to gnina functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.<br />
<br />
'''Prerequisites:''' Experience with CUDA/C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Integrate trained neural networks into the RDKit ===<br />
<br />
'''Brief explanation:''' There's a lot of work going on to train and use neural networks that use the RDKit. It would be great to be able to use some of those trained networks from inside the RDKit itself. A couple of examples that immediately come to mind here are ANI-2X (https://chemrxiv.org/articles/Extending_the_Applicability_of_the_ANI_Deep_Learning_Molecular_Potential_to_Sulfur_and_Halogens/11819268) and CDDD (https://github.com/jrwnter/cddd). The idea in this project would be to create the required Python and C++ infrastructure to translate a trained neural network into a form that can be used from C++ and then integrate ANI-2X using that infrastructure. As a stretch goal, the trained network for CDDD would be integrated.<br />
<br />
'''Expected results:''' Code (probably in Python) to translate a neural network trained with one of the standard NN libraries into a form that can be used from C++. Code to actually execute the network in C++. A port of the ANI-2X network to the RDKit's ForceField library using this new code. Wrappers for the new functionality so that it is accessible from within the Python and SWIG (Java and C#) wrappers. A comprehensive set of tests for the new functionality.<br />
<br />
'''Prerequisites:''' C++, Python<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
<br />
=== Project: Implement a generalized file reader ===<br />
<br />
'''Brief explanation:''' Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.<br />
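<br />
The format-recognition step could start from the file name, as in the sketch below. It is written in Python for brevity, although the project itself targets C++, and it only inspects extensions; the format table and the function names are hypothetical and not part of the RDKit API. A fuller reader would probably also sniff the file contents when the extension is missing or misleading.<br />

```python
import gzip
from pathlib import Path

# Hypothetical mapping from file extension to format name.
KNOWN_FORMATS = {".smi": "smiles", ".sdf": "sdf", ".mol": "mol"}


def detect_format(filename):
    """Return (format_name, is_gzipped) based on the file extension(s)."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    compressed = bool(suffixes) and suffixes[-1] == ".gz"
    if compressed:
        suffixes = suffixes[:-1]  # look at the extension under ".gz"
    if not suffixes or suffixes[-1] not in KNOWN_FORMATS:
        raise ValueError(f"unrecognized molecular file format: {filename}")
    return KNOWN_FORMATS[suffixes[-1]], compressed


def open_text(filename):
    """Open a plain or gzip-compressed file in text mode."""
    _, compressed = detect_format(filename)
    opener = gzip.open if compressed else open
    return opener(filename, "rt")


print(detect_format("ligands.sdf.gz"))
print(detect_format("library.SMI"))
```

With this split, the caller only ever passes a file name; the choice of parser and of decompression is made internally.<br />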
<br />
'''Expected results:''' A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit+OpenMM GPU Molecular Force Fields ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. Needed is C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionalities and developing existing capabilities we will keep an eye on performance, to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
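<br />
To make the similarity-search part concrete, here is a small in-memory sketch of Tanimoto screening over fingerprint documents. It assumes each document stores the set bits of a fingerprint plus their count; the field names <code>bits</code> and <code>count</code> are hypothetical, and no MongoDB or RDKit calls are made. The bit-count bounds rely on the fact that the Tanimoto score can never exceed the ratio of the smaller to the larger bit count, so documents outside the bounds can be skipped, which is the kind of filter an indexed database query could apply.<br />

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two collections of set-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_bits, documents, threshold=0.7):
    """Return (id, score) pairs with score >= threshold, best first."""
    n = len(set(query_bits))
    # A hit must have between n*threshold and n/threshold bits set.
    # The small epsilon only loosens the pruning, so no hit is lost.
    lo = n * threshold - 1e-9
    hi = n / threshold + 1e-9
    hits = []
    for doc in documents:
        if not lo <= doc["count"] <= hi:
            continue  # cannot reach the threshold, skip the full comparison
        score = tanimoto(query_bits, doc["bits"])
        if score >= threshold:
            hits.append((doc["_id"], score))
    return sorted(hits, key=lambda h: -h[1])


docs = [
    {"_id": "a", "bits": [1, 2, 3, 4], "count": 4},
    {"_id": "b", "bits": [1, 2, 3, 5], "count": 4},
    {"_id": "c", "bits": [1], "count": 1},
    {"_id": "d", "bits": [7, 8, 9, 10], "count": 4},
]
hits = similarity_search([1, 2, 3, 4], docs, threshold=0.5)
print(hits)
```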
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Dynamic DeepChem ===<br />
<br />
'''Brief explanation:''' Lay the groundwork for a version of DeepChem based on Jax<br />
<br />
'''Expected results:''' DeepChem was originally built on Theano, then later ported to TensorFlow's graph mode. We are currently working on porting it to use eager mode by default. It seems sensible that the next big transition will be to more powerful automatic differentiation frameworks like Jax. This project would require students to implement core DeepChem models such as graph convolutions in Jax and demonstrate that they can be saved and loaded. This work would likely find its way into the next main version of DeepChem.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath dot ramsundar at gmail dot com) <br />
<br />
<br />
=== Project: Improvements to Transfer Learning ===<br />
<br />
'''Brief explanation:''' Expand out DeepChem's transfer learning framework and machinery.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We had a GSoC student expand out this framework over last summer ([https://forum.deepchem.io/t/transfer-learning-for-molecular-property-prediction/44 post]). We'd like to see work expanding this framework out further and adding in new ideas, perhaps borrowing from recent research on transformers.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath dot ramsundar at gmail dot com) <br />
<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
<br />
== Psi4 Project Ideas ==<br />
<br />
[http://psicode.org psi4] is an open-source hybrid Python/C++ suite of ab initio quantum chemistry programs designed for efficient, high-accuracy simulations of a variety of molecular properties.<br />
<br />
=== Project: Quantum Chemistry with Deep Learning Toolkits ===<br />
<br />
'''Brief explanation:''' Integrate GPU tensor tools like TensorFlow and PyTorch with Psi4NumPy (https://github.com/psi4/psi4numpy) to explore the performance of these high-level tools with quantum chemistry.<br />
<br />
'''Expected results:''' A small module that can evaluate quantum chemistry on GPUs.<br />
<br />
'''Prerequisites:''' Tensorflow or PyTorch knowledge, linear algebra, and an understanding of general tensor contraction. No quantum chemistry knowledge required.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) <br />
<br />
=== Project: Parallelization of Task Graph Computations ===<br />
<br />
'''Brief explanation:''' Improve Psi4's task graph computation integration with the MolSSI QCFractal (https://github.com/MolSSI/QCFractal) project for massively parallel quantum chemistry.<br />
<br />
'''Expected results:''' Massively parallel implementations of crystal computations, n-body interactions<br />
<br />
'''Prerequisites:''' Python experience and task-graph experience (such as Dask); an understanding of quantum chemistry would be helpful.<br />
<br />
'''Mentor:''' Lori Burns (lori.burns at gmail.com) or Roberto Di Remigio (roberto.diremigio at gmail.com)<br />
<br />
=== Project: Avogadro visualization integration ===<br />
<br />
'''Brief explanation:''' Integration of Psi4 volumetric data with Avogadro's rendering tools.<br />
<br />
'''Expected results:''' Automatic integration of Psi4's volumetric data such as cube files, F-SAPT energy decomposition analysis routines, and vibrational frequencies.<br />
<br />
'''Prerequisites:''' Python experience and Avogadro integration, a small amount of quantum chemistry understanding would be helpful.<br />
<br />
'''Mentor:''' Justin Turney (justin.turney at gmail.com) or Andrew James (amjames2 at vt.edu)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
=== Project: New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine [https://mzmine.github.io] such as 3D plot and Cloud Plot.<br />
<br />
'''Expected results:''' A replacement module for the aging and barely functional 3D visualizer [https://mzmine.github.io/img/screenshots/3D.png], as well as new visualization tools for data analysis.<br />
<br />
'''Prerequisites:''' Java, JavaFX (preferred), experience with 3D graphics helpful but not required.<br />
<br />
'''Mentor:''' Tomas Pluskal (plusik at gmail.com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2020&diff=678GSoC Ideas 20202020-01-28T13:29:32Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2020, there is '''no guarantee''' that we will be selected again for GSoC in 2020.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Rudimentary support for residues, and reading secondary structure (e.g., PDB format) is now present. Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels are desired. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideally, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Initial Python bindings have been re-implemented using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
<br />
=== Project: Improve Avogadro Quantum Processing & Analysis ===<br />
<br />
'''Brief explanation:''' Visualizing quantum mechanical data like orbitals, electron density, etc. is slow. Replace Avogadro's current orbital rendering with the efficient Gau2Grid library [https://gau2grid.readthedocs.io/en/latest/] and add analysis tools.<br />
<br />
'''Expected results:''' Very fast real-time rendering of volumetric quantum chemical data within Avogadro, ideally including processing and analysis of surfaces / volumes, orbitals, etc. For example, sometimes the gradient or the Laplacian of a surface is useful. Add tools to add/subtract or join / intersect surfaces and map properties (e.g., electrostatic potential mapped onto the electron density).<br />
<br />
'''Prerequisites:''' Experience in C++, an understanding of vectorization and intrinsics would be helpful.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience with C++ and JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
<br />
Code could be modeled on MolVS using RDKit [https://molvs.readthedocs.io/en/latest/]<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Support for QCSchema JSON output ===<br />
<br />
'''Brief explanation:''' The library already allows importing and exporting data between several formats. The QCSchema is a new JSON format that tries to standardize the way computational chemistry data is written and shared, so supporting the effort can be useful.<br />
<br />
'''Expected results:''' Implement JSON output that conforms to the conventions of the [https://github.com/MolSSI/QCSchema MolSSI QCSchema].<br />
<br />
'''Suggested readings:'''<br />
* This [https://github.com/cclib/cclib/issues/643 cclib issue] and the references there.<br />
<br />
'''Prerequisites:''' Experience with Python, some experience with physics and chemistry also recommended.<br />
<br />
'''Mentor:''' Eric Berquist (eric.john.berquist at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Expected results:''' Implement additional analysis and quantum calculation methods, such as ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges, with examples and tests.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for supporting more programs, and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry logfiles online, and provides the ability to extract the data they contain with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Improve 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Make significant improvements to 3Dmol.js functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project: Improve gnina ===<br />
<br />
'''Brief explanation:''' Make significant improvements to gnina functionality or performance.<br />
<br />
'''Expected results:''' This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.<br />
<br />
'''Prerequisites:''' Experience with CUDA/C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Make RDKit 2D molecular sketches beautiful ===<br />
<br />
'''Brief explanation:''' The RDKit's current molecular rendering works, but the results are far from beautiful. We'd like to change that as well as add a bunch of new functionality. The focus will be on the SVG and PNG (using Cairo) renderers, but if things go well we can also add one that uses the JS canvas.<br />
<br />
'''Expected results:''' Improvements to/a rewrite of the RDKit MolDraw2D C++ class to make the drawings look better and add features like arbitrary atom labels and annotations. Wrappers for the new functionality so that it is accessible from within the Python and SWIG (Java and C#) wrappers. A comprehensive set of tests for the new functionality.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
<br />
=== Project: Implement a generalized file reader ===<br />
<br />
'''Brief explanation:''' Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.<br />
<br />
'''Expected results:''' A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
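<br />
The format-recognition step can be sketched as follows. This is a minimal Python illustration of the idea only (the project itself targets C++); the extension table and function names are hypothetical and are not RDKit API. It guesses the format from the file name and treats a trailing .gz as a compression layer.<br />
<br />
```python
import gzip
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
KNOWN_FORMATS = {".smi": "smiles", ".sdf": "sdf", ".mol": "mol"}


def detect_format(filename):
    """Return (format_label, is_gzipped) guessed from the file name."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    compressed = bool(suffixes) and suffixes[-1] == ".gz"
    if compressed:
        suffixes = suffixes[:-1]  # look at the extension under the .gz
    if not suffixes or suffixes[-1] not in KNOWN_FORMATS:
        raise ValueError(f"unrecognized molecular file format: {filename}")
    return KNOWN_FORMATS[suffixes[-1]], compressed


def open_text(filename):
    """Open a plain or gzip-compressed molecular file for reading as text."""
    _, compressed = detect_format(filename)
    if compressed:
        return gzip.open(filename, "rt")
    return open(filename, "r")
```
<br />
A more robust reader would probably also inspect the file contents, since extensions can be missing or wrong; that part is left out here.<br />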
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and SWIG (Java and C#) wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit+OpenMM GPU Molecular Force Fields ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. Needed is C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionalities and developing existing capabilities we will keep an eye on performance, to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is widely used open-source computational chemistry software ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
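<br />
To make the dense-versus-linked distinction concrete, the sketch below splits a single nested molecule document into separate JSON-LD nodes that refer to each other by @id. The input layout, the property names (hasAtom, partOf, element) and the example.org vocabulary are assumptions made for illustration; they are not the actual NWChem JSON or Chemical JSON layout.<br />
<br />
```python
import json


def to_linked(dense, base_iri):
    """Split a dense molecule document into linked JSON-LD nodes.

    `dense` is assumed to look like {"name": ..., "atoms": [{"element": ...}]}.
    """
    mol_id = f"{base_iri}/molecule/{dense['name']}"
    atom_nodes = []
    for index, atom in enumerate(dense["atoms"]):
        atom_nodes.append({
            "@id": f"{mol_id}/atom/{index}",
            "@type": "Atom",
            "element": atom["element"],
            "partOf": {"@id": mol_id},  # link back to the molecule node
        })
    molecule_node = {
        "@id": mol_id,
        "@type": "Molecule",
        "name": dense["name"],
        "hasAtom": [{"@id": node["@id"]} for node in atom_nodes],
    }
    return {
        # Hypothetical vocabulary; a real project would use a shared ontology.
        "@context": {"@vocab": "https://example.org/chem#"},
        "@graph": [molecule_node] + atom_nodes,
    }


water = {"name": "water", "atoms": [{"element": "O"}, {"element": "H"}, {"element": "H"}]}
linked = to_linked(water, "https://example.org")
serialized = json.dumps(linked, indent=2)
```
<br />
Because every atom now has its own identifier, a triple store can attach experimental data to an individual node without rewriting the whole document.<br />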
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
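<br />
The facilitator role described above can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the class name, message layout and state keys are hypothetical and no OneMol API is implied. It shows the key point from the text: a rotation is relayed as a small state change rather than as a screen update.<br />
<br />
```python
import json


class Facilitator:
    """Keeps one authoritative viewer state and relays changes to clients."""

    def __init__(self):
        self.state = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0}
        self.clients = {}  # client id -> list of pending JSON messages

    def join(self, client_id):
        # A new client first receives the full current state.
        snapshot = json.dumps({"type": "full", "state": self.state})
        self.clients[client_id] = [snapshot]

    def update(self, sender_id, delta):
        """Apply a small state change and forward it to every other client."""
        unknown = set(delta) - set(self.state)
        if unknown:
            raise KeyError(f"unknown state keys: {sorted(unknown)}")
        self.state.update(delta)
        message = json.dumps({"type": "delta", "from": sender_id, "delta": delta})
        for client_id, inbox in self.clients.items():
            if client_id != sender_id:
                inbox.append(message)


hub = Facilitator()
hub.join("alice")
hub.join("bob")
hub.update("alice", {"rotation": [0.0, 90.0, 0.0]})
```
<br />
In a deployed system the per-client lists would be replaced by network connections, and the storage module would supply the molecular data that the state refers to.<br />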
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
<br />
== Psi4 Project Ideas ==<br />
<br />
[http://psicode.org psi4] is an open-source hybrid Python/C++ suite of ab initio quantum chemistry programs designed for efficient, high-accuracy simulations of a variety of molecular properties.<br />
<br />
=== Project: Quantum Chemistry with Deep Learning Toolkits ===<br />
<br />
'''Brief explanation:''' Integrate GPU tensor tools like TensorFlow and PyTorch with Psi4NumPy (https://github.com/psi4/psi4numpy) to explore the performance of these high-level tools with quantum chemistry.<br />
<br />
'''Expected results:''' A small module that can evaluate quantum chemistry on GPUs.<br />
<br />
'''Prerequisites:''' Tensorflow or PyTorch knowledge, linear algebra, and an understanding of general tensor contraction. No quantum chemistry knowledge required.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) <br />
<br />
=== Project: Parallelization of Task Graph Computations ===<br />
<br />
'''Brief explanation:''' Improve Psi4's task graph computation integration with the MolSSI QCFractal (https://github.com/MolSSI/QCFractal) project for massively parallel quantum chemistry.<br />
<br />
'''Expected results:''' Massively parallel implementations of crystal computations, n-body interactions<br />
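<br />
For orientation, n-body interactions are commonly computed with a many-body expansion, in which every fragment subset is an independent job. The sketch below is a generic illustration of that idea, not Psi4 or QCFractal code: the function names are hypothetical and a thread pool stands in for a real task-graph scheduler.<br />
<br />
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def many_body_energy(fragments, energy_fn, max_order=2, workers=4):
    """Sum a many-body expansion truncated at max_order.

    energy_fn takes a tuple of fragments and returns its total energy; in a
    real workflow each call would be an independent quantum chemistry job.
    """
    indices = range(len(fragments))
    # Subsets are generated in order of increasing size.
    subsets = [s for n in range(1, max_order + 1) for s in combinations(indices, n)]

    def run(subset):
        return energy_fn(tuple(fragments[i] for i in subset))

    # The subset energies do not depend on each other, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        energies = dict(zip(subsets, pool.map(run, subsets)))

    # Interaction term of a subset = its energy minus all lower-order terms.
    delta = {}
    for subset in subsets:
        lower = sum(
            delta[sub]
            for n in range(1, len(subset))
            for sub in combinations(subset, n)
        )
        delta[subset] = energies[subset] - lower
    return sum(delta.values())
```
<br />
The number of subsets grows combinatorially with the expansion order, which is why distributing these independent jobs matters.<br />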
<br />
'''Prerequisites:''' Python experience and task-graph experience (such as Dask); an understanding of quantum chemistry would be helpful.<br />
<br />
'''Mentor:''' Lori Burns (lori.burns at gmail.com) or Roberto Di Remigio (roberto.diremigio at gmail.com)<br />
<br />
=== Project: Avogadro visualization integration ===<br />
<br />
'''Brief explanation:''' Integration of Psi4 volumetric data with Avogadro's rendering tools.<br />
<br />
'''Expected results:''' Automatic integration of Psi4's volumetric data such as cube files, F-SAPT energy decomposition analysis routines, and vibrational frequencies.<br />
<br />
'''Prerequisites:''' Python experience and Avogadro integration, a small amount of quantum chemistry understanding would be helpful.<br />
<br />
'''Mentor:''' Justin Turney (justin.turney at gmail.com) or Andrew James (amjames2 at vt.edu)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
=== Project: New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine [https://mzmine.github.io] such as 3D plot and Cloud Plot.<br />
<br />
'''Expected results:''' A replacement module for the aging and barely functional 3D visualizer [https://mzmine.github.io/img/screenshots/3D.png], as well as new visualization tools for data analysis.<br />
<br />
'''Prerequisites:''' Java, JavaFX (preferred), experience with 3D graphics helpful but not required.<br />
<br />
'''Mentor:''' Tomas Pluskal (plusik at gmail.com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2020&diff=676GSoC Ideas 20202020-01-14T14:49:31Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2020, there is '''no guarantee''' that we will be selected again for GSoC in 2020.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Rudimentary support for residues, and reading secondary structure (e.g., PDB format) is now present. Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels are desired. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideally, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Initial Python bindings have been re-implemented using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
<br />
=== Project: Improve Avogadro Quantum Processing & Analysis ===<br />
<br />
'''Brief explanation:''' Visualizing quantum mechanical data like orbitals, electron density, etc. is slow. Replace Avogadro's current orbital rendering with the efficient [https://gau2grid.readthedocs.io/en/latest/ Gau2Grid library] and add analysis tools.<br />
<br />
'''Expected results:''' A very fast real-time rendering of volumetric quantum chemical data within Avogadro, ideally including processing and analysis of surfaces / volumes, orbitals, etc. For example, sometimes the gradient or the Laplacian of a surface is useful. Add tools to add/subtract or join / intersect surfaces and map properties (e.g., electrostatic potential mapped onto the electron density).<br />
<br />
'''Prerequisites:''' Experience in C++, an understanding of vectorization and intrinsics would be helpful.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
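<br />
One very simple way to picture such a model is a frequency table of atom environments built from the reference set. The sketch below is a toy illustration only: the molecule representation (a list of element and neighbor-index pairs), the environment definition and the threshold are all assumptions, and a real implementation would use a cheminformatics toolkit and a richer description of each atom.<br />
<br />
```python
from collections import Counter


def environment(element, neighbors):
    """Describe an atom by its element and its number of bonded neighbors."""
    return (element, len(neighbors))


def build_model(reference_molecules):
    """Count how often each atom environment occurs in the reference set.

    Each molecule is a list of (element, neighbor_indices) pairs; this toy
    representation is an assumption made for the example.
    """
    counts = Counter()
    for molecule in reference_molecules:
        for element, neighbors in molecule:
            counts[environment(element, neighbors)] += 1
    return counts


def unusual_atoms(molecule, model, min_count=1):
    """Return indices of atoms whose environment is rarer than min_count."""
    return [
        index
        for index, (element, neighbors) in enumerate(molecule)
        if model[environment(element, neighbors)] < min_count
    ]


# Methane as the only reference molecule; a 5-coordinate carbon as the query.
methane = [("C", [1, 2, 3, 4])] + [("H", [0])] * 4
query = [("C", [1, 2, 3, 4, 5])] + [("H", [0])] * 5
model = build_model([methane])
flagged = unusual_atoms(query, model)
```
<br />
Here only the carbon atom of the query is flagged, because a carbon with five neighbors never occurs in the reference set while a hydrogen with one neighbor does.<br />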
<br />
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which is built on RDKit.<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Support for QCSchema JSON output ===<br />
<br />
'''Brief explanation:''' The library already allows importing and exporting data between several formats. The QCSchema is a new JSON format that tries to standardize the way computational chemistry data is written and shared, so supporting the effort can be useful.<br />
<br />
'''Expected results:''' Implement JSON output that conforms to the conventions of the [https://github.com/MolSSI/QCSchema MolSSI QCSchema].<br />
<br />
'''Suggested readings:'''<br />
* This [https://github.com/cclib/cclib/issues/643 cclib issue] and the references there.<br />
<br />
'''Prerequisites:''' Experience with Python, some experience with physics and chemistry also recommended.<br />
<br />
'''Mentor:''' Eric Berquist (eric.john.berquist at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Expected results:''' Implement additional analysis and quantum calculation methods, such as ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges, with examples and tests.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for supporting more programs, and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry logfiles online, and provides the ability to extract the data they contain with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project: Implement flexible docking with CNN scoring ===<br />
<br />
'''Brief explanation:''' Train convolutional neural networks to properly evaluate protein side chain positions and ligand binding<br />
<br />
'''Expected results:''' High-throughput docking typically makes a rigid receptor assumption for performance reasons, but in reality the receptor is flexible and changes conformation upon ligand binding. The project will include generating an appropriate training set of protein and ligand structures to teach a neural network to properly score flexibly docked structures. The trained networks will then be used to guide the docking process and iteratively refine the training set.<br />
<br />
'''Prerequisites:''' Experience with C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Implement a generalized file reader ===<br />
<br />
'''Brief explanation:''' Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.<br />
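<br />
As a rough illustration of the automatic format recognition described above, the sketch below guesses the format and compression from the file name alone. It is written in Python for brevity rather than in the C++ the project calls for, and the function name <code>detect_format</code> and both lookup tables are hypothetical; a real reader would likely also inspect file contents.<br />

```python
import os

# Hypothetical lookup tables for illustration; a real reader would cover more formats.
COMPRESSION_SUFFIXES = {".gz": "gzip", ".bz2": "bzip2"}
FORMAT_SUFFIXES = {".smi": "smiles", ".sdf": "sdf", ".mol": "mol"}


def detect_format(filename):
    """Return (format, compression) guessed from the file name alone."""
    base, ext = os.path.splitext(filename.lower())
    compression = COMPRESSION_SUFFIXES.get(ext)
    if compression is not None:
        # Strip the compression suffix and look at the inner extension.
        base, ext = os.path.splitext(base)
    try:
        return FORMAT_SUFFIXES[ext], compression
    except KeyError:
        raise ValueError(f"unrecognized molecular file format: {filename!r}")
```

With this sketch, a name such as <code>library.sdf.gz</code> resolves to the SDF format with gzip compression, and an unknown extension raises an error instead of guessing.<br />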
<br />
'''Expected results:''' A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases, plus wrappers that make the reader accessible from Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases, plus wrappers that make the calculator accessible from Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit+OpenMM GPU Molecular Force Fields ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. Needed is C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
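<br />
To make the similarity-search idea concrete, here is a minimal sketch in plain Python that does not touch MongoDB or RDKit at all. It assumes each stored document carries a fingerprint as a list of on-bit indices under the hypothetical keys <code>_id</code> and <code>bits</code>, and it uses the bit-count bound on the Tanimoto coefficient as a cheap pre-screen; in a real database that pre-screen would presumably become an indexed range query.<br />

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_bits, documents, threshold):
    """Return (id, score) pairs with score >= threshold, best first.

    `documents` mimics database records: dicts with the hypothetical keys
    "_id" and "bits" (a list of on-bit indices). `threshold` must be > 0.
    """
    query = set(query_bits)
    # T(A, B) <= min(|A|, |B|) / max(|A|, |B|), so only fingerprints whose
    # bit count lies in [t * |A|, |A| / t] can reach the threshold t.
    low, high = threshold * len(query), len(query) / threshold
    hits = []
    for doc in documents:
        bits = set(doc["bits"])
        if not low <= len(bits) <= high:
            continue  # cheap pre-screen on the bit count alone
        score = tanimoto(query, bits)
        if score >= threshold:
            hits.append((doc["_id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Storing the bit count next to the fingerprint is one possible design choice: it lets most candidates be rejected before any set intersection is computed.<br />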
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expand the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
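<br />
The difference between a single dense object and linked objects can be sketched as follows. The vocabulary IRI, the <code>chem:</code> prefix, the type names, and the input layout are all hypothetical and are not taken from the NWChem JSON or Chemical JSON formats; only the <code>@context</code>, <code>@graph</code>, <code>@id</code>, and <code>@type</code> keywords come from JSON-LD itself. The energy is a placeholder number, not a computed result.<br />

```python
import json

# Hypothetical vocabulary IRI, used only for illustration.
CONTEXT = {"chem": "https://example.org/chem#"}


def to_linked(dense, base_iri):
    """Split one nested result into two nodes that reference each other by @id."""
    molecule_id = f"{base_iri}/molecule"
    molecule = {"@id": molecule_id, "@type": "chem:Molecule"}
    molecule.update({f"chem:{k}": v for k, v in dense["molecule"].items()})
    calculation = {
        "@id": f"{base_iri}/calculation",
        "@type": "chem:Calculation",
        # A link to the molecule node replaces the embedded copy.
        "chem:molecule": {"@id": molecule_id},
    }
    calculation.update(
        {f"chem:{k}": v for k, v in dense.items() if k != "molecule"}
    )
    return {"@context": CONTEXT, "@graph": [molecule, calculation]}


# Placeholder input: the values are made up for the example.
dense = {"molecule": {"formula": "H2O"}, "method": "DFT", "energy": -76.4}
linked = to_linked(dense, "https://example.org/run/1")
print(json.dumps(linked, indent=2))
```

Because the molecule now has its own identifier, several calculations could point at the same node, which is the property that maps naturally onto triple-stores.<br />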
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
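<br />
A toy version of the facilitator idea is sketched below, assuming an in-memory "network" in which each client is just a list of received messages. The class name, the state keys, and the message layout are hypothetical and are not part of any OneMol specification. The point is only that a rotation travels as a few numbers rather than as a full screen update.<br />

```python
import json


class Facilitator:
    """Keeps one shared viewer state and relays small updates to all clients."""

    def __init__(self):
        self.state = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0}
        self.clients = {}  # client name -> list of messages delivered to it

    def join(self, name):
        # A new client starts from a full snapshot of the current state.
        self.clients[name] = [json.dumps(self.state)]

    def update(self, sender, delta):
        """Apply a partial state change and forward it to every other client."""
        self.state.update(delta)
        message = json.dumps(delta)
        for name, inbox in self.clients.items():
            if name != sender:
                inbox.append(message)
        return len(message)  # number of characters that actually travel
```

A "guide" client calling <code>update("guide", {"rotation": [0.0, 90.0, 0.0]})</code> would leave every other client with one short JSON message to apply, while the raw molecular data stays in the storage module.<br />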
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
<br />
== Psi4 Project Ideas ==<br />
<br />
[http://psicode.org psi4] is an open-source hybrid Python/C++ suite of ab initio quantum chemistry programs designed for efficient, high-accuracy simulations of a variety of molecular properties.<br />
<br />
=== Project: Quantum Chemistry with Deep Learning Toolkits ===<br />
<br />
'''Brief explanation:''' Integrate GPU tensor tools like TensorFlow and PyTorch with Psi4NumPy (https://github.com/psi4/psi4numpy) to explore the performance of these high-level tools with quantum chemistry.<br />
<br />
'''Expected results:''' A small module that can evaluate quantum chemistry on GPUs.<br />
<br />
'''Prerequisites:''' Tensorflow or PyTorch knowledge, linear algebra, and an understanding of general tensor contraction. No quantum chemistry knowledge required.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) <br />
<br />
=== Project: Parallelization of Task Graph Computations ===<br />
<br />
'''Brief explanation:''' Improve Psi4's task graph computation integration with the MolSSI QCFractal (https://github.com/MolSSI/QCFractal) project for massively parallel quantum chemistry.<br />
<br />
'''Expected results:''' Massively parallel implementations of crystal computations, n-body interactions<br />
<br />
'''Prerequisites:''' Python experience and task-graph experience (such as Dask); an understanding of quantum chemistry would be helpful.<br />
<br />
'''Mentor:''' Lori Burns (lori.burns at gmail.com) or Roberto Di Remigio (roberto.diremigio at gmail.com)<br />
<br />
=== Project: Avogadro visualization integration ===<br />
<br />
'''Brief explanation:''' Integration of Psi4 volumetric data with Avogadro's rendering tools.<br />
<br />
'''Expected results:''' Automatic integration of Psi4's volumetric data such as cube files, F-SAPT energy decomposition analysis routines, and vibrational frequencies.<br />
<br />
'''Prerequisites:''' Python experience and Avogadro integration, a small amount of quantum chemistry understanding would be helpful.<br />
<br />
'''Mentor:''' Justin Turney (justin.turney at gmail.com) or Andrew James (amjames2 at vt.edu)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
=== Project: New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine [https://mzmine.github.io] such as 3D plot and Cloud Plot.<br />
<br />
'''Expected results:''' A replacement module for the aging and barely functional 3D visualizer [https://mzmine.github.io/img/screenshots/3D.png], as well as new visualization tools for data analysis.<br />
<br />
'''Prerequisites:''' Java, JavaFX (preferred), experience with 3D graphics helpful but not required.<br />
<br />
'''Mentor:''' Tomas Pluskal (plusik at gmail.com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2020&diff=675GSoC Ideas 20202020-01-14T14:47:59Z<p>Greg.landrum: Remove completed project from 2019</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2020, there is '''no guarantee''' that we will be selected again for GSoC in 2020.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Rudimentary support for residues, and reading secondary structure (e.g., PDB format) is now present. Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels are desired. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideally, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (i.e., Python) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Initial Python bindings have been re-implemented using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
<br />
=== Project: Improve Avogadro Quantum Processing & Analysis ===<br />
<br />
'''Brief explanation:''' Visualizing quantum mechanical data like orbitals, electron density, etc. is slow. Replace Avogadro's current orbital rendering with the efficient Gau2Grid library [https://gau2grid.readthedocs.io/en/latest/] and add analysis tools.<br />
<br />
'''Expected results:''' Very fast real-time rendering of volumetric quantum chemical data within Avogadro, ideally including processing and analysis of surfaces / volumes, orbitals, etc. For example, sometimes the gradient or the Laplacian of a surface is useful. Add tools to add/subtract or join / intersect surfaces and map properties (e.g., electrostatic potential mapped onto the electron density).<br />
<br />
'''Prerequisites:''' Experience in C++, an understanding of vectorization and intrinsics would be helpful.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Integrate CoordGen library ===<br />
<br />
'''Expected results:''' Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
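<br />
One very simple form such a model could take is a frequency table of atom environments. The sketch below assumes each molecule has already been reduced to a list of (element, neighbor count) pairs, which is a hypothetical simplification; a real implementation would derive richer environments from a toolkit and would probably use a statistical cutoff instead of a raw count.<br />

```python
from collections import Counter


def build_model(reference_molecules):
    """Count how often each (element, neighbor count) pair occurs."""
    counts = Counter()
    for molecule in reference_molecules:
        counts.update(molecule)
    return counts


def unusual_atoms(model, molecule, min_count=1):
    """Return the atom environments seen fewer than `min_count` times."""
    return [env for env in molecule if model[env] < min_count]
```

Trained on ordinary organic molecules, this table would flag both a ruthenium atom and a 5-coordinate carbon, because neither environment appears in the reference set.<br />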
<br />
Code could be modeled on MolVS using RDKit [https://molvs.readthedocs.io/en/latest/]<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Support for QCSchema JSON output ===<br />
<br />
'''Brief explanation:''' The library already allows importing and exporting data between several formats. The QCSchema is a new JSON format that tries to standardize the way computational chemistry data is written and shared, so supporting the effort can be useful.<br />
<br />
'''Expected results:''' Implement JSON output that conforms to the conventions of the [https://github.com/MolSSI/QCSchema MolSSI QCSchema].<br />
<br />
'''Suggested readings:'''<br />
* This [https://github.com/cclib/cclib/issues/643 cclib issue] and the references there.<br />
<br />
'''Prerequisites:''' Experience with Python, some experience with physics and chemistry also recommended.<br />
<br />
'''Mentor:''' Eric Berquist (eric.john.berquist at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Expected results:''' Implement additional analysis and quantum calculation methods, such as ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges, with examples and tests.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for supporting more programs, and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry logfiles online, and provides the ability to extract the data they contain with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
==gnina Project Ideas==<br />
<br />
[https://github.com/gnina gnina] is a C/C++ framework for applying deep learning to molecular docking.<br />
<br />
=== Project: Implement flexible docking with CNN scoring ===<br />
<br />
'''Brief explanation:''' Train convolutional neural networks to properly evaluate protein side chain positions and ligand binding<br />
<br />
'''Expected results:''' High-throughput docking typically makes a rigid receptor assumption for performance reasons, but in reality the receptor is flexible and changes conformation upon ligand binding. The project will include generating an appropriate training set of protein and ligand structures to teach a neural network to properly score flexibly docked structures. The trained networks will then be used to guide the docking process and iteratively refine the training set.<br />
<br />
'''Prerequisites:''' Experience with C/C++ programming and the basics of deep learning.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Implement a generalized file reader ===<br />
<br />
'''Brief explanation:''' Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.<br />
<br />
'''Expected results:''' A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
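<br />
As a rough illustration of the automatic format recognition described above, here is a minimal Python sketch that guesses the format from the file name alone. It is not RDKit code: the <code>KNOWN_FORMATS</code> table and the <code>detect_format</code> and <code>open_text</code> helpers are hypothetical names invented for this example, and a real reader would probably also inspect the file contents.<br />
<br />
```python
import gzip
from pathlib import Path

# Hypothetical mapping from file suffix to format name (an assumption for
# this sketch, not an RDKit table).
KNOWN_FORMATS = {".smi": "smiles", ".sdf": "sdf", ".mol": "mol"}


def detect_format(filename):
    """Return (format_name, is_gzipped) guessed from the file name alone."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    is_gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if is_gzipped:
        # Look at the suffix underneath the compression suffix.
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in KNOWN_FORMATS:
        raise ValueError(f"unrecognized molecular file format: {filename}")
    return KNOWN_FORMATS[suffixes[-1]], is_gzipped


def open_text(filename):
    """Open a plain or gzip-compressed molecular file for reading as text."""
    _, is_gzipped = detect_format(filename)
    if is_gzipped:
        return gzip.open(filename, "rt")
    return open(filename, "r")
```
<br />
With these assumptions, <code>detect_format("mols.sdf.gz")</code> reports an SDF file that needs decompression, so the caller never has to name the format explicitly.<br />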
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit+OpenMM GPU Molecular Force Fields ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. Needed is C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
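<br />
To make the similarity-search idea concrete, the following is a minimal Python sketch of a Tanimoto search over fingerprint bits stored alongside each molecule document. It makes several assumptions: a plain list of dictionaries stands in for a MongoDB collection, the <code>fingerprint</code> field name is hypothetical, and no RDKit or MongoDB calls are used. A real extension would generate the fingerprints with RDKit and push the filtering into database queries.<br />
<br />
```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(documents, query_bits, threshold=0.5):
    """Return (document id, score) pairs at or above threshold, best first.

    `documents` is an in-memory stand-in for a collection; each document is
    assumed to carry a hypothetical "fingerprint" list of on-bit indices.
    """
    hits = []
    for doc in documents:
        score = tanimoto(doc["fingerprint"], query_bits)
        if score >= threshold:
            hits.append((doc["_id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```
<br />
Storing the on-bit indices as a list in each document is one of the "multiple representations" a document database makes convenient; other layouts (for example, bit counts for pre-screening) are equally plausible.<br />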
<br />
=== Project: GPU Implementation of the Distance-Geometry Forcefield ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit uses a simple force field internally as part of its distance-geometry driven conformation generation (2D->3D conversion) process. Minimization using this force-field consumes a large part of the runtime of the conformation generation process. The goal of this project is to port the distance geometry forcefield and minimizer to run on a GPU with the goal of speeding up conformation generation.<br />
<br />
'''Expected results:''' <br />
A stable and well tested C++ implementation of the RDKit's distance-geometry forcefield.<br />
<br />
'''Prerequisites:''' C++ and GPU (Cuda, OpenCL, etc.)<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing them in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
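<br />
The difference between a single dense object and linked objects can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input field names (<code>id</code>, <code>method</code>, <code>energy</code>, <code>molecule</code>) and the <code>https://example.org/chem#</code> vocabulary are hypothetical and do not reflect the actual NWChem JSON or Chemical JSON schemas. Only the <code>@context</code>, <code>@graph</code>, <code>@id</code>, and <code>@type</code> keywords come from JSON-LD itself.<br />
<br />
```python
import json

# Hypothetical vocabulary URL, used only for illustration.
CONTEXT = {"@vocab": "https://example.org/chem#"}


def to_linked(calculation, base_iri):
    """Split a dense calculation record into two linked JSON-LD nodes."""
    calc_id = f"{base_iri}/calculation/{calculation['id']}"
    mol_id = f"{base_iri}/molecule/{calculation['molecule']['name']}"
    molecule_node = {"@id": mol_id, "@type": "Molecule", **calculation["molecule"]}
    calculation_node = {
        "@id": calc_id,
        "@type": "Calculation",
        "method": calculation["method"],
        "energy": calculation["energy"],
        "molecule": {"@id": mol_id},  # a link, not a nested copy
    }
    return {"@context": CONTEXT, "@graph": [molecule_node, calculation_node]}


if __name__ == "__main__":
    dense = {
        "id": "1",
        "method": "dft",
        "energy": -76.4,
        "molecule": {"name": "water", "formula": "H2O"},
    }
    print(json.dumps(to_linked(dense, "https://example.org"), indent=2))
```
<br />
Because the molecule now has its own identifier, several calculations (or experimental records) can point at the same node, which is the property that makes the data map naturally onto triple-stores.<br />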
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
<br />
== Psi4 Project Ideas ==<br />
<br />
[http://psicode.org psi4] is an open-source hybrid Python/C++ suite of ab initio quantum chemistry programs designed for efficient, high-accuracy simulations of a variety of molecular properties.<br />
<br />
=== Project: Quantum Chemistry with Deep Learning Toolkits ===<br />
<br />
'''Brief explanation:''' Integrate GPU tensor tools like TensorFlow and PyTorch with Psi4NumPy (https://github.com/psi4/psi4numpy) to explore the performance of these high-level tools with quantum chemistry.<br />
<br />
'''Expected results:''' A small module that can evaluate quantum chemistry on GPUs.<br />
<br />
'''Prerequisites:''' Tensorflow or PyTorch knowledge, linear algebra, and an understanding of general tensor contraction. No quantum chemistry knowledge required.<br />
<br />
'''Mentor:''' Daniel G. A. Smith (dgasmith at vt.edu) <br />
<br />
=== Project: Parallelization of Task Graph Computations ===<br />
<br />
'''Brief explanation:''' Improve Psi4's task graph computation integration with the MolSSI QCFractal (https://github.com/MolSSI/QCFractal) project for massively parallel quantum chemistry.<br />
<br />
'''Expected results:''' Massively parallel implementations of crystal computations, n-body interactions<br />
<br />
'''Prerequisites:''' Python experience and task-graph experience (such as Dask); an understanding of quantum chemistry would be helpful.<br />
<br />
'''Mentor:''' Lori Burns (lori.burns at gmail.com) or Roberto Di Remigio (roberto.diremigio at gmail.com)<br />
<br />
=== Project: Avogadro visualization integration ===<br />
<br />
'''Brief explanation:''' Integration of Psi4 volumetric data with Avogadro's rendering tools.<br />
<br />
'''Expected results:''' Automatic integration of Psi4's volumetric data such as cube files, F-SAPT energy decomposition analysis routines, and vibrational frequencies.<br />
<br />
'''Prerequisites:''' Python experience and Avogadro integration, a small amount of quantum chemistry understanding would be helpful.<br />
<br />
'''Mentor:''' Justin Turney (justin.turney at gmail.com) or Andrew James (amjames2 at vt.edu)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
=== Project: New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine [https://mzmine.github.io] such as 3D plot and Cloud Plot.<br />
<br />
'''Expected results:''' A replacement module for the aging and barely functional 3D visualizer [https://mzmine.github.io/img/screenshots/3D.png], as well as new visualization tools for data analysis.<br />
<br />
'''Prerequisites:''' Java, JavaFX (preferred), experience with 3D graphics helpful but not required.<br />
<br />
'''Mentor:''' Tomas Pluskal (plusik at gmail.com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2019&diff=650GSoC Ideas 20192019-01-23T10:48:13Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2019, there is '''no guarantee''' that we will be selected again for GSoC 2019.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. Initial Python bindings have been re-implemented using PyBind11 with the new codebase. An ideal solution would connect to Qt for Python, to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
<br />
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which uses RDKit.<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
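<br />
One very simple way to approach the "how normal is this structure" question is a frequency model over atom environments, sketched below in Python. The representation is an assumption made for this example: each molecule is a list of hypothetical <code>(element, neighbor_count)</code> pairs rather than an Open Babel or RDKit object, and <code>build_model</code> and <code>unusual_atoms</code> are illustrative names, not existing toolkit functions. A real model would likely use richer environments and a statistical score instead of a hard count cutoff.<br />
<br />
```python
from collections import Counter


def build_model(reference_molecules):
    """Count how often each (element, neighbor_count) pair occurs.

    Each molecule is assumed to be a list of (element, neighbor_count)
    tuples, a simplified stand-in for a real toolkit molecule.
    """
    counts = Counter()
    for molecule in reference_molecules:
        counts.update(molecule)
    return counts


def unusual_atoms(model, molecule, min_count=1):
    """Return atom environments seen fewer than min_count times in the reference set."""
    return [env for env in molecule if model[env] < min_count]
```
<br />
For instance, if the reference set contains only 4-coordinate carbons, a query containing <code>("C", 5)</code> would be flagged, matching the 5-coordinate carbon example above. The same list of flagged environments could drive either a hard filter or a warning.<br />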
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
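<br />
One possible shape for such a refactoring is sketched below in Python: each piece of parsing logic becomes a small, documented handler, and the main loop only dispatches lines to the handlers. This is not cclib's actual code or API. The log line formats are invented for the example, and the handler names and the <code>HANDLERS</code> list are hypothetical.<br />
<br />
```python
import re


def parse_scf_energy(line, data):
    """Append an SCF energy if the (invented) log line reports one."""
    match = re.search(r"SCF energy\s*=\s*(-?\d+\.\d+)", line)
    if match:
        data.setdefault("scfenergies", []).append(float(match.group(1)))


def parse_natoms(line, data):
    """Record the atom count if the (invented) log line reports one."""
    match = re.search(r"Number of atoms\s*=\s*(\d+)", line)
    if match:
        data["natom"] = int(match.group(1))


# Hypothetical registry: adding a parsed quantity means adding one function.
HANDLERS = [parse_scf_energy, parse_natoms]


def extract(lines):
    """Run every handler on every line and collect the parsed data."""
    data = {}
    for line in lines:
        for handler in HANDLERS:
            handler(line, data)
    return data
```
<br />
Small handlers like these can be unit tested one at a time, and their docstrings give a natural place to keep per-attribute documentation up to date.<br />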
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? Which machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Implement a generalized file reader ===<br />
<br />
'''Brief explanation:''' Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.<br />
<br />
'''Expected results:''' A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure good scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
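<br />
To make the similarity search concrete, here is a minimal Python sketch of Tanimoto scoring over fingerprint on-bits stored in each document. It runs over plain dictionaries rather than a live database; the field names smiles and fp_bits and the function names are assumptions for illustration, not an existing RDKit or MongoDB schema.<br />
<br />
```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of fingerprint on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_bits, documents, threshold=0.7):
    """Return (smiles, score) pairs scoring at or above threshold, best first.

    Each document is a dict with hypothetical keys "smiles" and "fp_bits",
    mimicking how a molecule might be stored in a document database.
    """
    hits = []
    for doc in documents:
        score = tanimoto(query_bits, doc["fp_bits"])
        if score >= threshold:
            hits.append((doc["smiles"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```
<br />
In a real integration the linear scan would presumably be replaced by an indexed query that first screens on bit counts, which is where the indexing and sharding concerns mentioned above come in.<br />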
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
=== Project: GPU Implementation of the Distance-Geometry Forcefield ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit uses a simple force field internally as part of its distance-geometry driven conformation generation (2D->3D conversion) process. Minimization using this force-field consumes a large part of the runtime of the conformation generation process. The goal of this project is to port the distance geometry forcefield and minimizer to run on a GPU with the goal of speeding up conformation generation.<br />
<br />
'''Expected results:''' <br />
A stable and well-tested GPU-enabled C++ implementation of the RDKit's distance-geometry forcefield and minimizer.<br />
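<br />
For orientation, the sketch below shows the kind of distance-violation energy that a distance-geometry forcefield minimizes. The penalty form used here is a common textbook choice and is an assumption; it is not taken from this page and may differ from the terms the RDKit actually uses. The function name and data layout are hypothetical.<br />
<br />
```python
def distance_violation_energy(coords, bounds):
    """Sum of penalties for atom pairs whose distance falls outside its bounds.

    coords: list of (x, y, z) tuples.
    bounds: dict mapping an (i, j) index pair to (lower, upper) distance bounds.
    The penalty form is an illustrative assumption, not RDKit's confirmed form.
    """
    energy = 0.0
    for (i, j), (lower, upper) in bounds.items():
        d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        if d2 > upper * upper:
            # Pair is too far apart: penalize relative to the upper bound.
            energy += (d2 / (upper * upper) - 1.0) ** 2
        elif d2 < lower * lower:
            # Pair is too close: penalize relative to the lower bound.
            energy += (2.0 * lower * lower / (lower * lower + d2) - 1.0) ** 2
    return energy
```
<br />
Each pair contributes independently to the sum, which is the property that makes this kind of energy evaluation a natural candidate for one-thread-per-pair GPU execution.<br />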
<br />
'''Prerequisites:''' C++ and GPU (Cuda, OpenCL, etc.)<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
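<br />
The difference between a single dense object and linked objects can be sketched as follows. The input layout, the property names, and the example.org vocabulary are placeholders chosen for illustration; they are not the actual NWChem JSON or Chemical JSON schemas.<br />
<br />
```python
import json

# Placeholder vocabulary; a real conversion would point at a published ontology.
CONTEXT = {"@vocab": "https://example.org/chem#"}


def to_json_ld(dense, base_id):
    """Split a dense, nested record into linked JSON-LD nodes.

    dense is a hypothetical layout such as
    {"molecule": {"formula": ...}, "calculation": {"method": ...}}.
    """
    mol_id = f"{base_id}/molecule"
    calc_id = f"{base_id}/calculation"
    return {
        "@context": CONTEXT,
        "@graph": [
            {"@id": mol_id, "@type": "Molecule",
             "formula": dense["molecule"]["formula"]},
            {"@id": calc_id, "@type": "Calculation",
             "method": dense["calculation"]["method"],
             # A link by identifier instead of nesting the molecule inline.
             "onMolecule": {"@id": mol_id}},
        ],
    }


if __name__ == "__main__":
    record = {"molecule": {"formula": "H2O"}, "calculation": {"method": "DFT"}}
    print(json.dumps(to_json_ld(record, "urn:example:run1"), indent=2))
```
<br />
Because the calculation refers to the molecule by identifier, each node maps directly onto triples, which is what makes the result easier to load into a triple-store.<br />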
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
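<br />
The rotation example above can be sketched in a few lines of Python: a client sends only the changed view state and the facilitator applies it to the shared state. The message fields and function names are hypothetical and do not describe an existing OneMol protocol.<br />
<br />
```python
import json


def make_view_update(client_id, rotation_quaternion):
    """Serialize a view change as a small message rather than a screen image."""
    return json.dumps({
        "type": "view_update",
        "client": client_id,
        "rotation": list(rotation_quaternion),
    })


def apply_update(view_state, message):
    """Facilitator side: apply a received update to the shared view state."""
    update = json.loads(message)
    if update["type"] == "view_update":
        view_state["rotation"] = update["rotation"]
    return view_state
```
<br />
A message like this is a few dozen bytes, whereas screen-sharing would retransmit the whole rendered frame for the same rotation.<br />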
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2019&diff=649GSoC Ideas 20192019-01-21T07:36:37Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2019, there is '''no guarantee''' that we will be selected again for GSoC 2019.<br />
<br />
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++. Some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. Initial Python bindings have been re-implemented using PyBind11 with the new codebase. An ideal solution would connect to Qt for Python, to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
Example scripts and documentation are highly encouraged.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
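<br />
One very simple version of such a model, sketched here in Python, records how often each element occurs in the reference set and flags query elements that are rare or absent. Molecules are represented as plain lists of element symbols to keep the sketch self-contained; the function names and the frequency threshold are illustrative assumptions, and a realistic model would also consider valence, coordination, and stereochemistry.<br />
<br />
```python
from collections import Counter


def build_element_model(reference_molecules):
    """Fraction of reference molecules that contain each element."""
    total = len(reference_molecules)
    if total == 0:
        return {}
    counts = Counter()
    for atoms in reference_molecules:
        counts.update(set(atoms))  # count each element once per molecule
    return {element: n / total for element, n in counts.items()}


def unusual_elements(atoms, model, min_frequency=0.01):
    """Elements in the query that are rarer than min_frequency in the model."""
    return sorted(e for e in set(atoms) if model.get(e, 0.0) < min_frequency)
```
<br />
An empty result means nothing was flagged, so the same function can serve either as a hard filter or as a source of warnings.<br />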
<br />
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which uses RDKit.<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
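<br />
One possible shape for such a refactoring is a table of small, single-purpose handlers driven by a short dispatch loop. The Python sketch below uses made-up log lines and hypothetical function names; it does not reflect cclib's actual parser code or the output of any real program.<br />
<br />
```python
def parse_charge(line, data):
    """Store the integer at the end of a (made-up) charge line."""
    data["charge"] = int(line.split()[-1])


def parse_natom(line, data):
    """Store the integer at the end of a (made-up) atom-count line."""
    data["natom"] = int(line.split()[-1])


# Trigger text paired with the small function that handles it.
HANDLERS = [
    ("Total charge", parse_charge),
    ("Number of atoms", parse_natom),
]


def extract(lines):
    """Dispatch each line to the first handler whose trigger it contains."""
    data = {}
    for line in lines:
        for trigger, handler in HANDLERS:
            if trigger in line:
                handler(line, data)
                break
    return data
```
<br />
Each handler can then be unit tested and documented on its own, and the docstrings give a natural source for per-attribute documentation.<br />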
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on Github for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? What machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Implement a generalized file reader ===<br />
<br />
'''Brief explanation:''' Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.<br />
<br />
'''Expected results:''' A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: Implement Molecular Interaction Fields calculations in the RDKit ===<br />
<br />
'''Brief explanation:''' There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.<br />
<br />
'''Expected results:''' A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure good scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
=== Project: Implement the Analog Series-Based Scaffold method ===<br />
<br />
'''Brief explanation:''' <br />
The concept of the chemical scaffold is central to our understanding and analysis of many medicinal chemistry datasets. There are multiple ways to define the scaffold of a set of molecules. Of these, the "Murcko scaffold" is probably the most common, but it is probably also one of the worst (though that may be a bit harsh, since the idea of a scaffold is not very well defined). A more data-driven approach is described in these two open-access articles and the references therein:<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0102<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0135<br />
It would be quite useful to have an RDKit implementation of this method.<br />
<br />
'''Expected results:''' <br />
A stable and well tested RDKit implementation, Python or C++ based, of the Analog Series-Based Scaffold method.<br />
<br />
'''Prerequisites:''' Python or C++<br />
<br />
'''Mentor:''' Nik Stiefl (nikolaus.stiefl at novartis dot com )<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
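<br />
To make the "dense object versus linked objects" distinction concrete, here is a minimal sketch that turns one dense record into a JSON-LD document whose molecule and atoms are separate nodes linked by <code>@id</code>. The input is only loosely shaped like Chemical JSON, and the vocabulary IRI (<code>https://example.org/chem#</code>), the term names, and the function <code>to_json_ld</code> are assumptions for illustration, not an agreed schema.<br />
<br />
```python
import json
from typing import Any, Dict

# Hypothetical vocabulary; a real project would define or reuse a published one.
CONTEXT: Dict[str, Any] = {
    "chem": "https://example.org/chem#",
    "name": "chem:name",
    # Values of "atoms" are references (IRIs) to other nodes, not literals.
    "atoms": {"@id": "chem:hasAtom", "@type": "@id"},
    "element": "chem:atomicNumber",
    # "@list" keeps x, y, z in order; plain JSON-LD arrays are unordered sets.
    "coords": {"@id": "chem:coordinates3d", "@container": "@list"},
}


def to_json_ld(dense: Dict[str, Any], base_iri: str) -> Dict[str, Any]:
    """Split a dense molecule record into linked molecule and atom nodes."""
    numbers = dense["atoms"]["elements"]["number"]
    flat = dense["atoms"]["coords"]["3d"]  # x0, y0, z0, x1, y1, z1, ...
    atom_nodes = [
        {
            "@id": f"{base_iri}/atom/{i}",
            "@type": "chem:Atom",
            "element": z,
            "coords": flat[3 * i : 3 * i + 3],
        }
        for i, z in enumerate(numbers)
    ]
    molecule = {
        "@id": base_iri,
        "@type": "chem:Molecule",
        "name": dense.get("name", ""),
        "atoms": [node["@id"] for node in atom_nodes],
    }
    return {"@context": CONTEXT, "@graph": [molecule] + atom_nodes}


# Invented example record (approximate water geometry, in angstroms).
WATER = {
    "name": "water",
    "atoms": {
        "elements": {"number": [8, 1, 1]},
        "coords": {"3d": [0.0, 0.0, 0.0, 0.76, 0.59, 0.0, -0.76, 0.59, 0.0]},
    },
}
DOC = to_json_ld(WATER, "https://example.org/mol/water")
SERIALIZED = json.dumps(DOC, indent=2)
```
<br />
Because every atom now has its own identifier, other documents (for example experimental measurements) can point at the molecule or at a single atom, which is what makes the data easier to load into a triple-store.<br />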
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
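<br />
The facilitator idea can be sketched in a few lines: clients send small state changes (such as a new viewing angle) instead of screen updates, and the facilitator keeps every client's view consistent. The class names, the fields of the view state, and the in-process method calls below are assumptions for illustration; a real implementation would exchange these deltas over a network protocol, and the storage module is omitted.<br />
<br />
```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical view state shared by all viewers.
DEFAULT_VIEW: Dict[str, float] = {"rot_x": 0.0, "rot_y": 0.0, "zoom": 1.0}


@dataclass
class Client:
    """Client module: holds the local copy of the viewer state."""

    name: str
    view: Dict[str, float] = field(default_factory=lambda: dict(DEFAULT_VIEW))

    def apply(self, delta: Dict[str, float]) -> None:
        # Only the changed keys are sent and applied, not the whole scene.
        self.view.update(delta)


@dataclass
class Facilitator:
    """Facilitator module: enforces one consistent view across clients."""

    state: Dict[str, float] = field(default_factory=lambda: dict(DEFAULT_VIEW))
    clients: List[Client] = field(default_factory=list)

    def join(self, client: Client) -> None:
        client.apply(self.state)  # bring a late joiner up to date
        self.clients.append(client)

    def publish(self, delta: Dict[str, float]) -> int:
        """Record a state change and propagate it; returns clients reached."""
        self.state.update(delta)
        for client in self.clients:
            client.apply(delta)
        return len(self.clients)
```
<br />
A rotation then costs one small message such as <code>{"rot_y": 30.0}</code> per client, in contrast to the full screen update that screen-sharing would need.<br />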
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=630GSoC Ideas 20182018-02-28T04:08:43Z<p>Greg.landrum: /* Project: Implement the Analog Series-Based Scaffold method */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
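<br />
The "load time steps on demand" idea can be illustrated with a generator that reads one frame of a multi-frame XYZ trajectory at a time, so memory use stays at one frame regardless of file size. Avogadro 2 itself is C++; Python is used here only to keep the sketch short, and the function name <code>iter_xyz_frames</code> and the sample trajectory are assumptions made for this example.<br />
<br />
```python
import io
from typing import Iterator, List, TextIO, Tuple

Atom = Tuple[str, float, float, float]


def iter_xyz_frames(handle: TextIO) -> Iterator[List[Atom]]:
    """Yield one frame at a time from a multi-frame XYZ file.

    Each frame is: a line with the atom count, a comment line, then one
    "symbol x y z" line per atom. Nothing beyond the current frame is read.
    """
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file, or a trailing blank line
        n_atoms = int(header)
        handle.readline()  # skip the comment line
        frame: List[Atom] = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            frame.append((symbol, float(x), float(y), float(z)))
        yield frame


# Invented two-frame trajectory of a stretching H2 molecule.
TRAJECTORY = (
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
FRAMES = iter_xyz_frames(io.StringIO(TRAJECTORY))
FIRST = next(FRAMES)   # only the first frame has been parsed at this point
SECOND = next(FRAMES)
```
<br />
Random access to an arbitrary time step would additionally need an index of frame offsets; that is not shown here.<br />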
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
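<br />
One very simple way to picture such a model is a frequency table built from the reference set, with a query scored by its rarest feature. The sketch below does this for element symbols only; the function names, the add-one smoothing, and the toy reference data are assumptions for illustration, and a real model would use richer features (coordination numbers, ring systems, functional groups) extracted with a cheminformatics toolkit.<br />
<br />
```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def build_element_model(reference: List[List[str]]) -> Tuple[Counter, int]:
    """Count in how many reference structures each element occurs."""
    counts: Counter = Counter()
    for elements in reference:
        counts.update(set(elements))  # each element counted once per structure
    return counts, len(reference)


def unusualness(elements: Iterable[str], counts: Counter, total: int) -> float:
    """Surprisal of the rarest element in the query (higher = more unusual).

    Add-one smoothing keeps the score finite for elements never seen in
    the reference set.
    """
    return max(
        (-math.log((counts[e] + 1) / (total + 2)) for e in set(elements)),
        default=0.0,
    )


# Invented reference set: element lists standing in for drug-like molecules.
REFERENCE = [
    ["C", "H", "O"],
    ["C", "H", "N"],
    ["C", "H"],
    ["C", "H", "O", "N"],
]
COUNTS, TOTAL = build_element_model(REFERENCE)
COMMON_SCORE = unusualness(["C", "H", "O"], COUNTS, TOTAL)
RARE_SCORE = unusualness(["C", "H", "Ru"], COUNTS, TOTAL)
```
<br />
A filter would then flag any structure whose score exceeds a cutoff chosen from the distribution of scores in the reference set itself.<br />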
<br />
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which is built using RDKit.<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Add color to Open Babel output ===<br />
<br />
'''Brief explanation''': A general framework for specifying color at the terminal would allow us to add color to output, e.g. nitrogen in blue, improve the display of warning messages (red), and highlight substructures.<br />
<br />
'''Expected results''': When writing results to a terminal, it's possible to use a whole range of colors (and styles) to enhance the display. By putting a general framework in place, that works cross-platform, this could be very helpful in a wide range of scenarios. For example, when substructure searching, the matched atoms could be highlighted, which would greatly aid in visualizing the match.<br />
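<br />
The mechanism behind terminal color is ANSI escape sequences wrapped around the text to be styled. The sketch below colors atoms by element and highlights matched atoms; Open Babel itself is C++, so this Python version only illustrates the idea, and the function name, the color table, and the choice of bold plus underline for matches are assumptions. It does not address the cross-platform part of the project (for example terminals that do not interpret these sequences, or output redirected to a file).<br />
<br />
```python
from typing import AbstractSet, Sequence

RESET = "\033[0m"        # return to the terminal's default style
HIGHLIGHT = "\033[1;4m"  # bold + underline, used for matched atoms
# Hypothetical element palette: nitrogen in blue, oxygen in red.
ELEMENT_COLORS = {"N": "\033[34m", "O": "\033[31m"}


def colorize_atoms(
    symbols: Sequence[str], matched: AbstractSet[int] = frozenset()
) -> str:
    """Join atom symbols into one string with ANSI styling.

    `matched` holds the indices of atoms to highlight, e.g. the atoms hit
    by a substructure search.
    """
    parts = []
    for index, symbol in enumerate(symbols):
        style = ELEMENT_COLORS.get(symbol, "")
        if index in matched:
            style += HIGHLIGHT
        # Unstyled atoms are emitted as-is so plain output stays plain.
        parts.append(f"{style}{symbol}{RESET}" if style else symbol)
    return "".join(parts)


PLAIN = colorize_atoms(["C", "C", "N"])
MATCHED = colorize_atoms(["C", "O"], matched={0})
```
<br />
Printing <code>PLAIN</code> in a terminal that supports ANSI sequences shows "CCN" with the nitrogen in blue.<br />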
<br />
'''Prerequisites''': Experience in C++.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? What machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that they are accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Nadine Schneider (nadine-1.schneider at novartis dot com )<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that they are accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and for efficiently filtering paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with the RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities we will keep an eye on performance, to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
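To make the similarity-search idea concrete, here is a minimal sketch of a Tanimoto search with bit-count pruning. It assumes each stored document carries its fingerprint as a list of on-bit indices plus a precomputed bit count. The list of dicts stands in for a database collection, and the field names (<code>_id</code>, <code>bits</code>, <code>count</code>) are illustrative rather than an agreed schema.<br />

```python
def tanimoto(a, b):
    """Tanimoto similarity between two collections of on-bit indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(docs, query_bits, threshold):
    """Return (id, similarity) pairs with similarity >= threshold, best first.

    threshold must be greater than zero.
    """
    qn = len(set(query_bits))
    # A fingerprint with n on bits can only reach the threshold if
    # threshold * qn <= n <= qn / threshold, so the rest is skipped cheaply.
    lo, hi = threshold * qn, qn / threshold
    hits = []
    for doc in docs:
        if not lo <= doc["count"] <= hi:
            continue
        sim = tanimoto(query_bits, doc["bits"])
        if sim >= threshold:
            hits.append((doc["_id"], sim))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: hypothetical documents, not real fingerprints.
docs = [
    {"_id": "mol-a", "bits": [1, 2, 3, 4], "count": 4},
    {"_id": "mol-b", "bits": [1, 2, 3, 9], "count": 4},
    {"_id": "mol-c", "bits": [7], "count": 1},
]
results = similarity_search(docs, [1, 2, 3, 4], 0.5)
```

In a real database the count bounds would become an indexed range query, so most documents are never fetched. This in-memory version only shows the logic.<br />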
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
=== Project: Implement the Analog Series-Based Scaffold method ===<br />
<br />
'''Brief explanation:''' <br />
The concept of the chemical scaffold is central to our understanding and analysis of many medicinal chemistry datasets. There are multiple ways to define the scaffold of a set of molecules. Of these, the "Murcko scaffold" is probably the most common, but it's probably also one of the worst (though that may be a bit harsh, since the idea of a scaffold is not very well defined). A more data-driven approach is described in these two open-access articles and the references therein:<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0102<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0135<br />
It would be quite useful to have an RDKit implementation of this method.<br />
<br />
'''Expected results:''' <br />
A stable and well tested RDKit implementation, Python or C++ based, of the Analog Series-Based Scaffold method.<br />
<br />
'''Prerequisites:''' Python or C++<br />
<br />
'''Mentor:''' Nik Stiefl (nikolaus.stiefl at novartis dot com )<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
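A small sketch of what "dense object to linked objects" could look like is given below. The record layout, the <code>example.org</code> base URL, the vocabulary terms, and the energy value are all made up for illustration. Only the <code>@context</code>, <code>@vocab</code>, <code>@graph</code>, <code>@id</code> and <code>@type</code> keywords come from JSON-LD itself.<br />

```python
import json

# A dense, single-object record, loosely modelled on a Chemical JSON document.
# Keys and values are illustrative assumptions, not a real NWChem output.
dense = {
    "name": "water",
    "formula": "H2O",
    "calculation": {"theory": "DFT", "totalEnergy": -76.4},
}


def to_json_ld(record, base="https://example.org/"):
    """Split a dense record into two linked JSON-LD nodes."""
    mol_id = base + "molecule/" + record["name"]
    calc_id = base + "calculation/" + record["name"] + "-1"
    context = {
        "@vocab": base + "vocab#",
        # Declare that 'molecule' holds a link (an IRI), not a plain string.
        "molecule": {"@type": "@id"},
    }
    molecule = {
        "@id": mol_id,
        "@type": "Molecule",
        "name": record["name"],
        "formula": record["formula"],
    }
    calculation = {
        "@id": calc_id,
        "@type": "Calculation",
        "molecule": mol_id,
        **record["calculation"],
    }
    return {"@context": context, "@graph": [molecule, calculation]}


linked = to_json_ld(dense)
print(json.dumps(linked, indent=2))
```

Because the calculation node refers to the molecule by identifier rather than embedding it, both nodes map directly onto triples in a triple-store.<br />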
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them. <br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
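The client/facilitator split described above can be sketched in a few lines. This is only an in-process toy: the class names, the shape of the view state, and the use of JSON for messages are assumptions, and no networking or storage module is included.<br />

```python
import json


class Facilitator:
    """Keeps one shared view state and relays small deltas to all clients."""

    def __init__(self):
        self.state = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0}
        self.clients = []

    def join(self, client):
        self.clients.append(client)
        client.receive(dict(self.state))  # newcomers get the full state once

    def publish(self, sender, delta):
        """Apply a change and forward it; return the size of the message sent."""
        self.state.update(delta)
        message = json.dumps(delta)
        for client in self.clients:
            if client is not sender:
                client.receive(json.loads(message))
        return len(message)


class Client:
    """Stand-in for a molecular viewer holding its own copy of the view state."""

    def __init__(self):
        self.view = {}

    def receive(self, delta):
        self.view.update(delta)


hub = Facilitator()
alice, bob = Client(), Client()
hub.join(alice)
hub.join(bob)

rotation_delta = {"rotation": [0.0, 90.0, 0.0]}
alice.receive(rotation_delta)           # the sender applies its change locally
sent = hub.publish(alice, rotation_delta)
```

Here a rotation travels as a message of a few dozen characters, and the other client's zoom is left untouched, which is the contrast with a full screen update.<br />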
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=629GSoC Ideas 20182018-02-26T07:43:37Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
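The "load time steps on demand" idea can be illustrated with a generator over a multi-frame XYZ file, sketched here in Python for brevity (the project itself targets C++). The function name and the toy coordinates are made up for this example. The XYZ layout assumed is the usual one: an atom count line, a comment line, then one line per atom.<br />

```python
import io


def iter_xyz_frames(handle):
    """Yield (comment, atoms) one frame at a time from a multi-frame XYZ stream.

    Only the current frame is held in memory, so a large trajectory never
    has to be loaded up front.
    """
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        n_atoms = int(header)
        comment = handle.readline().rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        yield comment, atoms


# Two-frame toy trajectory; the coordinates are invented for illustration.
trajectory = io.StringIO(
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
frames = list(iter_xyz_frames(trajectory))
```

Calling <code>list()</code> here is only for the demonstration. A viewer would instead pull one frame per animation step and discard it afterwards.<br />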
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the Open Babel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
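<br />
The simplest form of such a model can be sketched as a frequency table over the reference set. The example below is only an illustration under strong assumptions: molecules are reduced to lists of element symbols, the reference set is a toy stand-in for something like ChEMBL, and the function names and the 5% threshold are hypothetical. A real model would also look at valence, coordination, and stereochemistry.<br />

```python
from collections import Counter


def build_element_model(reference_molecules):
    """Return the fraction of reference molecules that contain each element."""
    counts = Counter()
    for atoms in reference_molecules:
        counts.update(set(atoms))
    total = len(reference_molecules)
    return {element: n / total for element, n in counts.items()}


def unusual_elements(model, query_atoms, threshold=0.05):
    """List elements in the query seen in fewer than `threshold` of references."""
    return sorted(e for e in set(query_atoms) if model.get(e, 0.0) < threshold)


# Toy reference set (hypothetical): each molecule is just a list of element symbols.
reference = [
    ["C", "C", "O"],
    ["C", "N", "O"],
    ["C", "C", "N", "S"],
    ["C", "O", "O"],
]
model = build_element_model(reference)
flags = unusual_elements(model, ["C", "N", "Ru"])
print(flags)
```

The same table can drive either behaviour described above: reject the structure when <code>flags</code> is non-empty (filter), or pass it through with the flagged elements attached (warning).<br />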
<br />
Code could be modeled on [https://molvs.readthedocs.io/en/latest/ MolVS], which is built using RDKit.<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Add color to Open Babel output ===<br />
<br />
'''Brief explanation''': A general framework for specifying color at the terminal would allow us to add color to output, e.g. nitrogen in blue, improve the display of warning messages (red), and highlight substructures.<br />
<br />
'''Expected results''': When writing results to a terminal, it's possible to use a whole range of colors (and styles) to enhance the display. By putting a general framework in place, that works cross-platform, this could be very helpful in a wide range of scenarios. For example, when substructure searching, the matched atoms could be highlighted, which would greatly aid in visualizing the match.<br />
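<br />
The idea can be sketched with standard ANSI escape sequences. The project itself targets C++, so the Python below is only meant to show the shape of such a framework; the color table, the function name, and the choice of bold for matched atoms are all assumptions made for illustration.<br />

```python
# Standard ANSI SGR escape sequences; the mapping to elements is an assumption.
RESET = "\033[0m"
BOLD = "\033[1m"
ELEMENT_COLORS = {"N": "\033[34m", "O": "\033[31m", "S": "\033[33m"}  # blue, red, yellow


def render_atoms(symbols, matched=(), enabled=True):
    """Join atom symbols, coloring by element and bolding matched atom indices."""
    parts = []
    for index, symbol in enumerate(symbols):
        codes = []
        if index in matched:
            codes.append(BOLD)
        if symbol in ELEMENT_COLORS:
            codes.append(ELEMENT_COLORS[symbol])
        if enabled and codes:
            parts.append("".join(codes) + symbol + RESET)
        else:
            # Plain output, e.g. when writing to a file or a terminal without color.
            parts.append(symbol)
    return " ".join(parts)


print(render_atoms(["C", "C", "N", "O"], matched={0, 1}))
```

Keeping an <code>enabled</code> switch in one place matters for the cross-platform goal: the caller can turn color off when the output is not a terminal (for example by checking <code>sys.stdout.isatty()</code>) without touching the code that produces the text.<br />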
<br />
'''Prerequisites''': Experience in C++.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
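<br />
A reader/writer pair can be sketched as a JSON round trip. Because the schema is still being designed, the field names used here (<code>symbols</code> and a flattened <code>geometry</code> list) are assumptions for illustration and should be checked against the schema itself; the function names are hypothetical and are not part of cclib.<br />

```python
import json


def molecule_to_json(symbols, coordinates):
    """Serialize element symbols and Cartesian coordinates to a JSON string."""
    if len(symbols) != len(coordinates):
        raise ValueError("one coordinate triple is required per atom")
    record = {
        "symbols": list(symbols),
        # Flattened x, y, z values, one triple per atom (assumed layout).
        "geometry": [value for xyz in coordinates for value in xyz],
    }
    return json.dumps(record, sort_keys=True)


def molecule_from_json(text):
    """Parse a JSON string written by molecule_to_json back into lists."""
    record = json.loads(text)
    flat = record["geometry"]
    if len(flat) != 3 * len(record["symbols"]):
        raise ValueError("geometry length does not match the number of atoms")
    coordinates = [flat[i:i + 3] for i in range(0, len(flat), 3)]
    return record["symbols"], coordinates


water = (["O", "H", "H"], [[0.0, 0.0, 0.0], [0.0, 0.757, 0.587], [0.0, -0.757, 0.587]])
text = molecule_to_json(*water)
symbols, coordinates = molecule_from_json(text)
```

A round-trip check like this is also a natural first unit test for the reader and writer classes.<br />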
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
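<br />
One possible shape for the refactoring, among the several that should be proposed, is a list of small handler functions called from a short <code>extract()</code> loop. The sketch below is not cclib code: the log lines, regular expressions, and handler names are invented for illustration.<br />

```python
import re


def parse_scf_energy(line, data):
    """Append an SCF energy found on this line to data['scfenergies']."""
    match = re.search(r"SCF energy\s*=\s*(-?\d+\.\d+)", line)
    if match:
        data.setdefault("scfenergies", []).append(float(match.group(1)))


def parse_natom(line, data):
    """Store the atom count found on this line in data['natom']."""
    match = re.search(r"Number of atoms\s*=\s*(\d+)", line)
    if match:
        data["natom"] = int(match.group(1))


# Each handler owns one piece of the (hypothetical) output format and can be
# tested and documented on its own.
HANDLERS = [parse_scf_energy, parse_natom]


def extract(lines):
    """Run every handler over every line and collect the parsed attributes."""
    data = {}
    for line in lines:
        for handler in HANDLERS:
            handler(line, data)
    return data


log = ["Number of atoms = 3", "SCF energy = -76.026", "noise", "SCF energy = -76.027"]
parsed = extract(log)
```

With this layout the docstring of each handler describes exactly one parsed attribute, which is what makes generating documentation from docstrings practical.<br />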
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? Which machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases, plus wrappers that make the functionality accessible from Python and Java as well as from the PostgreSQL cartridge.<br />
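<br />
To make "simpler, yet more flexible" concrete, here is a rough sketch of a single entry point where the atom description is a pluggable function and the environment size is a parameter. It is a Morgan-style toy operating on a plain atom list and bond list, not the RDKit implementation or the algorithm from the cited papers; the function name, its parameters, and the use of CRC32 for hashing are all assumptions.<br />

```python
import zlib


def _hash(value):
    """Deterministic hash of a Python value (CRC32 of its repr)."""
    return zlib.crc32(repr(value).encode("utf-8"))


def generic_fingerprint(atoms, bonds, invariant=lambda atom: atom, radius=2, n_bits=64):
    """Return the set of bits that are on for a molecule graph.

    atoms: list of atom labels; bonds: list of (i, j) index pairs.
    invariant: maps an atom label to whatever should distinguish atoms.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    identifiers = [_hash(invariant(atom)) for atom in atoms]
    bits = {ident % n_bits for ident in identifiers}
    for _ in range(radius):
        # Grow each environment by one bond: combine an atom's identifier with
        # the sorted identifiers of its neighbours, so atom order does not matter.
        identifiers = [
            _hash((identifiers[i], tuple(sorted(identifiers[j] for j in neighbours[i]))))
            for i in range(len(atoms))
        ]
        bits.update(ident % n_bits for ident in identifiers)
    return bits


ethanol = generic_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Swapping <code>invariant</code> changes the kind of fingerprint without adding a new function: passing <code>lambda atom: "X"</code>, for instance, ignores element types and encodes topology only.<br />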
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Nadine Schneider (nadine-1.schneider at novartis dot com )<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar Open Babel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases, plus wrappers that make the functionality accessible from Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities we will keep an eye on performance, to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
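<br />
The similarity-search part can be sketched without a database. In the example below, plain dictionaries stand in for MongoDB documents, the bit lists are made-up numbers rather than real fingerprints, and the field names (<code>bits</code>, <code>bit_count</code>, <code>smiles</code>) are hypothetical. The pre-filter relies on the fact that a Tanimoto score can never exceed the ratio of the smaller to the larger bit count, which is the kind of condition an indexed query could evaluate before any fingerprint is compared.<br />

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(documents, query_bits, threshold=0.5):
    """Return (smiles, score) pairs scoring at least `threshold`, best first."""
    query = set(query_bits)
    hits = []
    for doc in documents:
        # Cheap pre-filter on stored bit counts: skip documents that cannot
        # reach the threshold, without looking at their fingerprints.
        n, q = doc["bit_count"], len(query)
        if min(n, q) < threshold * max(n, q):
            continue
        score = tanimoto(doc["bits"], query)
        if score >= threshold:
            hits.append((doc["smiles"], score))
    return sorted(hits, key=lambda hit: -hit[1])


# Stand-ins for database documents; bit values are invented for illustration.
documents = [
    {"smiles": "CCO", "bits": [1, 2, 3, 4], "bit_count": 4},
    {"smiles": "CCN", "bits": [1, 2, 3, 9], "bit_count": 4},
    {"smiles": "c1ccccc1", "bits": [20, 21], "bit_count": 2},
]
hits = similarity_search(documents, [1, 2, 3, 4])
```

Storing <code>bit_count</code> alongside the fingerprint is one example of using the document structure to keep several representations of the same molecule together.<br />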
<br />
'''Expected results:''' A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
=== Project: Implement the Analog Series-Based Scaffold method ===<br />
<br />
'''Brief explanation:''' <br />
The concept of the chemical scaffold is central to our understanding and analysis of many medicinal chemistry datasets. There are multiple ways to define the scaffold of a set of molecules. Of these, the "Murcko scaffold" is probably the most common, but it's probably also one of the worst (though that's probably a bit harsh since the idea of a scaffold is not very well defined). A more data-driven approach is described in these two open-access articles and the references therein:<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0102<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0135<br />
It would be quite useful to have an RDKit implementation of this method.<br />
<br />
'''Expected results:''' <br />
A stable and well tested RDKit implementation, Python or C++ based, of the Analog Series-Based Scaffold method.<br />
<br />
'''Prerequisites:''' Python or C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics.com)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
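<br />
The core calculation can be sketched as a pairwise Pearson correlation of intensity profiles across samples. The project itself targets Java; the Python below only illustrates the idea, and the feature names, intensity values, and the 0.9 threshold are invented for the example.<br />

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # A constant profile carries no correlation information.
    return cov / math.sqrt(var_x * var_y)


def related_features(profiles, threshold=0.9):
    """Return pairs of feature names whose profiles correlate above threshold."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(profiles[first], profiles[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical intensities of three features measured in four samples.
profiles = {
    "mz_180.06": [1.0, 2.0, 3.0, 4.0],
    "mz_181.07": [2.0, 4.0, 6.0, 8.1],
    "mz_255.23": [4.0, 1.0, 3.0, 2.0],
}
pairs = related_features(profiles)
```

Signals that rise and fall together across samples, such as an ion and its isotope peak, end up in the same pair, while unrelated signals do not.<br />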
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks.<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
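<br />
The difference between a single dense object and linked objects can be sketched as follows. The <code>@context</code>, <code>@id</code>, <code>@type</code>, and <code>@graph</code> keywords are standard JSON-LD; everything else (the <code>ex:</code> vocabulary under example.org, the property names, and the shape of the dense input) is a placeholder assumption and not the NWChem or Chemical JSON format.<br />

```python
import json


def to_json_ld(dense, base_id):
    """Split one dense molecule record into linked JSON-LD nodes."""
    atom_nodes = []
    for index, element in enumerate(dense["elements"]):
        atom_nodes.append({
            "@id": f"{base_id}/atom/{index}",
            "@type": "ex:Atom",
            "ex:element": element,
            # Link back to the molecule instead of nesting inside it.
            "ex:partOf": {"@id": base_id},
        })
    molecule = {
        "@id": base_id,
        "@type": "ex:Molecule",
        "ex:name": dense["name"],
        "ex:hasAtom": [{"@id": node["@id"]} for node in atom_nodes],
    }
    return {
        # Placeholder vocabulary; a real context would point at a published one.
        "@context": {"ex": "https://example.org/chem#"},
        "@graph": [molecule] + atom_nodes,
    }


dense = {"name": "water", "elements": ["O", "H", "H"]}
document = to_json_ld(dense, "https://example.org/mol/1")
print(json.dumps(document, indent=2))
```

Because every atom now has its own identifier, other data (an experimental measurement, say) can refer to it directly, which is what makes the result line up with triple-stores and knowledge graphs.<br />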
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, RDKit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them. <br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with Eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=627GSoC Ideas 20182018-02-23T15:27:45Z<p>Greg.landrum: Add another project idea</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
<br />
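The on-demand loading of time steps mentioned above can be illustrated with a small Python sketch. This is not Avogadro code: the function name <code>iter_xyz_frames</code> and the in-memory demo data are assumptions made for illustration, and the sketch assumes a well-formed multi-frame XYZ file (atom count, comment line, then one atom per line).<br />
<br />
```python
import io
from typing import Iterator, List, TextIO, Tuple

Atom = Tuple[str, float, float, float]


def iter_xyz_frames(handle: TextIO) -> Iterator[List[Atom]]:
    """Yield one frame at a time from a multi-frame XYZ stream.

    Only the frame being requested is parsed, so a large trajectory
    never has to be held in memory all at once.
    """
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        n_atoms = int(header)
        handle.readline()  # comment line, ignored in this sketch
        frame: List[Atom] = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            frame.append((symbol, float(x), float(y), float(z)))
        yield frame


# Hypothetical two-frame trajectory standing in for a file on disk.
demo = io.StringIO(
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
frames = iter_xyz_frames(demo)
first = next(frames)   # only the first frame has been read so far
second = next(frames)  # the second frame is read on request
```
<br />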
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
<br />
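As a rough illustration of the idea above, the following Python sketch flags elements that are rare in a reference set. It is a deliberately simple stand-in for a real model: molecules are represented as plain lists of element symbols, and the function names and the <code>min_fraction</code> threshold are assumptions made for this example rather than part of Open Babel or MolVS.<br />
<br />
```python
from collections import Counter
from typing import Iterable, List


def element_frequencies(reference: Iterable[List[str]]) -> Counter:
    """Count in how many reference molecules each element appears."""
    counts: Counter = Counter()
    for elements in reference:
        counts.update(set(elements))  # count each element once per molecule
    return counts


def unusual_elements(query: List[str], counts: Counter,
                     n_reference: int, min_fraction: float = 0.01) -> List[str]:
    """Return elements of the query seen in fewer than min_fraction of references."""
    return sorted(
        element for element in set(query)
        if counts[element] / n_reference < min_fraction
    )


# Hypothetical drug-like reference set, given as element lists.
reference = [
    ["C", "H", "O"],
    ["C", "H", "N"],
    ["C", "H", "O", "N"],
    ["C", "H", "Cl"],
]
counts = element_frequencies(reference)

# A ruthenium-containing query is flagged; carbon and hydrogen are not.
flags = unusual_elements(["C", "H", "Ru"], counts, len(reference), min_fraction=0.2)
```
<br />
The same counting scheme could in principle be applied to richer descriptors (for example, atom valence or coordination number) to catch cases such as a 5-coordinate carbon.<br />
<br />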
Code could be modeled on MolVS, which is built using RDKit [https://molvs.readthedocs.io/en/latest/]<br />
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Add color to Open Babel output ===<br />
<br />
'''Brief explanation''': A general framework for specifying color at the terminal would allow us to add color to output, e.g. nitrogen in blue, improve the display of warning messages (red), and highlight substructures.<br />
<br />
'''Expected results''': When writing results to a terminal, it's possible to use a whole range of colors (and styles) to enhance the display. By putting a general framework in place, that works cross-platform, this could be very helpful in a wide range of scenarios. For example, when substructure searching, the matched atoms could be highlighted, which would greatly aid in visualizing the match.<br />
<br />
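A minimal sketch of the terminal coloring idea, written in Python for brevity even though the project itself targets C++. It uses standard ANSI SGR escape sequences (34 for blue, 1 for bold, 0 for reset); the function name <code>colorize_atoms</code> and the color table are assumptions for illustration, and the sketch ignores the cross-platform question of terminals that do not interpret these sequences.<br />
<br />
```python
from typing import Iterable, List

RESET = "\033[0m"
BOLD = "\033[1m"
# Hypothetical per-element color table; only nitrogen (blue) is defined here.
ELEMENT_COLORS = {"N": "\033[34m"}


def colorize_atoms(symbols: List[str], highlight: Iterable[int] = ()) -> str:
    """Join atom symbols, coloring known elements and bolding matched atoms.

    `highlight` holds the indices of atoms matched by a substructure search.
    """
    matched = set(highlight)
    parts = []
    for index, symbol in enumerate(symbols):
        prefix = ELEMENT_COLORS.get(symbol, "")
        if index in matched:
            prefix += BOLD
        # Only emit escape codes when there is something to style.
        parts.append(f"{prefix}{symbol}{RESET}" if prefix else symbol)
    return "".join(parts)


plain = colorize_atoms(["C", "N", "C"])
with_match = colorize_atoms(["C", "N", "C"], highlight=[0])
print(with_match)
```
<br />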
'''Prerequisites''': Experience in C++.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? What machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers exposing the new functionality to Python and Java as well as to the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers exposing the new functionality to Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL. In this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source cross-platform document oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionalities and developing existing capabilities we will keep an eye on performance, to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
<br />
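To make the similarity search idea concrete, here is a small Python sketch that ranks documents by Tanimoto similarity of their fingerprint bits. It does not use MongoDB or RDKit: plain dictionaries stand in for database documents, and the field names <code>name</code> and <code>bits</code>, the function names, and the threshold are all assumptions chosen for illustration. A real integration would compute the fingerprints with RDKit and push as much of the filtering as possible into the database.<br />
<br />
```python
from typing import Dict, Iterable, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_search(query_bits: Set[int], documents: Iterable[Dict],
                      threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = []
    for doc in documents:
        score = tanimoto(query_bits, set(doc["bits"]))
        if score >= threshold:
            hits.append((doc["name"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical documents, each storing the indices of its set bits.
documents = [
    {"name": "a", "bits": [1, 2, 3, 4]},
    {"name": "b", "bits": [1, 2, 9]},
    {"name": "c", "bits": [7, 8]},
]
hits = similarity_search({1, 2, 3}, documents, threshold=0.5)
```
<br />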
'''Expected results:''' <br />
A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
=== Project: Implement the Analog Series-Based Scaffold method ===<br />
<br />
'''Brief explanation:''' <br />
The concept of the chemical scaffold is central to our understanding and analysis of many medicinal chemistry datasets. There are multiple ways to define the scaffold of a set of molecules. Of these, the "Murcko scaffold" is probably the most common, but it is arguably also one of the worst (though that may be a bit harsh, since the idea of a scaffold is not very well defined). A more data-driven approach is described in these two open-access articles and the references therein:<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0102<br />
https://www.future-science.com/doi/10.4155/fsoa-2017-0135<br />
It would be quite useful to have an RDKit implementation of this method.<br />
<br />
'''Expected results:''' <br />
A stable and well tested RDKit implementation, Python or C++ based, of the Analog Series-Based Scaffold method.<br />
<br />
'''Prerequisites:''' Python or C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics.com)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
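<br />
As a rough illustration of what correlation-based grouping could look like, the sketch below computes the Pearson correlation between the intensity profiles of signals measured across the same samples. It is written in Python for brevity, although the project itself targets Java. The function names, the 0.9 threshold, and the intensities are illustrative assumptions and are not part of MSDK or MZmine.<br />
<br />
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def related(profile_a, profile_b, threshold=0.9):
    """Hypothetical rule: two signals are 'related' if their profiles correlate strongly."""
    return pearson(profile_a, profile_b) >= threshold


# Made-up intensities of three signals measured in the same five samples.
signal_1 = [100.0, 200.0, 300.0, 400.0, 500.0]
signal_2 = [11.0, 19.0, 32.0, 41.0, 49.0]        # roughly proportional to signal_1
signal_3 = [500.0, 100.0, 400.0, 200.0, 300.0]   # unrelated pattern

print(related(signal_1, signal_2), related(signal_1, signal_3))
```
<br />
A real module would likely also have to consider retention-time alignment and peak shape, which this sketch ignores.<br />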
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions about how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Expose and bind NWChem data structures and computational APIs to Python, and use them in Jupyter notebooks.<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
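<br />
The difference between a single dense object and linked objects can be sketched as follows. The input keys, the <code>to_json_ld</code> helper, and the <code>https://example.org/</code> identifiers and vocabulary are hypothetical placeholders chosen for illustration. Only the <code>@context</code>, <code>@vocab</code>, <code>@graph</code>, <code>@id</code>, and <code>@type</code> keywords come from JSON-LD itself.<br />
<br />
```python
import json

# A dense, nested document loosely in the spirit of Chemical JSON.
# The keys are illustrative, not the actual schema.
dense = {
    "name": "water",
    "atoms": [
        {"element": "O"},
        {"element": "H"},
        {"element": "H"},
    ],
}


def to_json_ld(doc, base_iri):
    """Split a dense document into linked nodes, each identified by an @id."""
    mol_id = f"{base_iri}/{doc['name']}"
    atom_nodes = [
        {"@id": f"{mol_id}/atom/{i}", "@type": "Atom", "element": atom["element"]}
        for i, atom in enumerate(doc["atoms"])
    ]
    molecule = {
        "@id": mol_id,
        "@type": "Molecule",
        "name": doc["name"],
        # Reference the atoms by identifier instead of embedding them.
        "hasAtom": [{"@id": node["@id"]} for node in atom_nodes],
    }
    return {
        "@context": {"@vocab": "https://example.org/chem#"},  # hypothetical vocabulary
        "@graph": [molecule] + atom_nodes,
    }


linked = to_json_ld(dense, "https://example.org/molecule")
print(json.dumps(linked, indent=2))
```
<br />
Because every atom now has its own identifier, other documents (for example, experimental records) could point at it directly, which is what makes the data easier to load into a triple-store.<br />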
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them. <br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=626GSoC Ideas 20182018-02-21T08:45:25Z<p>Greg.landrum: Add MongoDB project</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
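<br />
The delta compression and on-demand loading mentioned above can be sketched in a few lines. This is a Python illustration only (Avogadro 2 itself is C++). The function names, the fixed-point scale of 1000, and the flattened coordinate layout are assumptions made for the example, not an existing Avogadro or file-format API.<br />
<br />
```python
def quantize(frame, scale=1000):
    """Round floating-point coordinates to fixed-point integers."""
    return [round(x * scale) for x in frame]


def delta_encode(frames, scale=1000):
    """Store the first frame in full and every later frame as differences."""
    quantized = [quantize(f, scale) for f in frames]
    if not quantized:
        return []
    encoded = [quantized[0]]
    for prev, cur in zip(quantized, quantized[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded


def delta_decode(encoded, scale=1000):
    """Yield frames one at a time, so a reader never holds the whole trajectory."""
    current = None
    for record in encoded:
        if current is None:
            current = list(record)
        else:
            current = [p + d for p, d in zip(current, record)]
        yield [v / scale for v in current]


# Three made-up time steps of one atom, flattened as [x, y, z].
frames = [
    [0.000, 1.000, 2.000],
    [0.001, 1.002, 1.999],
    [0.003, 1.001, 2.000],
]
encoded = delta_encode(frames)
print(encoded[1:])  # small integers, which compress well
for frame in delta_decode(encoded):
    print(frame)
```
<br />
Between consecutive time steps atoms move very little, so the stored differences are small integers. Whether this actually improves reading and rendering performance is exactly what the project would need to measure.<br />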
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
=== Project: Develop a JavaScript version of Open Babel ===<br />
<br />
'''Brief explanation:''' Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.<br />
<br />
'''Expected results''': Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.<br />
<br />
Ideally, the project will adapt a core JavaScript library openbabel.js that allows modules, such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.)<br />
<br />
'''Prerequisites''': Some experience in C++, and also with JavaScript.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
=== Project: Develop a validation and standardization filter ===<br />
<br />
'''Brief explanation''': Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?<br />
<br />
'''Expected results''': Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.<br />
<br />
Such a model could be used as a filter, or as a warning to flag up problematic structures.<br />
<br />
Code could be modeled on MolVS using RDKit [https://molvs.readthedocs.io/en/latest/]<br />
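<br />
One very simple way to build such a model is to count how often each atom environment occurs in the reference set and flag environments that are rare or unseen. The sketch below assumes a deliberately simplified molecule representation, a list of (element, number of neighbours) tuples, rather than real Open Babel or RDKit objects. The function names and the <code>min_count</code> cut-off are likewise illustrative.<br />
<br />
```python
from collections import Counter


def build_model(reference_molecules):
    """Count how often each (element, neighbour count) pair occurs in the reference set."""
    counts = Counter()
    for molecule in reference_molecules:
        counts.update(molecule)
    return counts


def unusual_atoms(model, molecule, min_count=1):
    """Return the atom environments seen fewer than min_count times in the reference set."""
    return [env for env in molecule if model[env] < min_count]


# Tiny made-up reference set: methane, water, and ammonia.
reference = [
    [("C", 4), ("H", 1), ("H", 1), ("H", 1), ("H", 1)],
    [("O", 2), ("H", 1), ("H", 1)],
    [("N", 3), ("H", 1), ("H", 1), ("H", 1)],
]
model = build_model(reference)

# A query containing a 5-coordinate carbon.
query = [("C", 5), ("H", 1), ("H", 1), ("H", 1), ("H", 1), ("H", 1)]
print(unusual_atoms(model, query))
```
<br />
A practical model would use richer environments (charges, bond orders, stereo flags) and a proper statistical score instead of a hard cut-off.<br />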
<br />
'''Prerequisites''': Experience in C++ or Python, and an interest in data science or statistics.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Add color to Open Babel output ===<br />
<br />
'''Brief explanation''': A general framework for specifying color at the terminal would allow us to add color to output, e.g. nitrogen in blue, improve the display of warning messages (red), and highlight substructures.<br />
<br />
'''Expected results''': When writing results to a terminal, it's possible to use a whole range of colors (and styles) to enhance the display. By putting a general framework in place, that works cross-platform, this could be very helpful in a wide range of scenarios. For example, when substructure searching, the matched atoms could be highlighted, which would greatly aid in visualizing the match.<br />
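<br />
On terminals that understand ANSI escape sequences, highlighting comes down to wrapping text in a color code and a reset code. The following Python sketch shows the idea (the project itself would be in C++). The helper names are hypothetical, and matching by character position in a SMILES string is a simplification, since atom indices and string positions differ in general.<br />
<br />
```python
import sys

RESET = "\033[0m"
COLORS = {"red": "\033[31m", "blue": "\033[34m"}  # standard ANSI foreground codes


def colorize(text, color, enabled=True):
    """Wrap text in an ANSI color sequence, or return it unchanged if disabled."""
    if not enabled:
        return text
    return f"{COLORS[color]}{text}{RESET}"


def highlight(smiles, matched, color="red", enabled=True):
    """Color the characters of a string at the given positions."""
    return "".join(
        colorize(ch, color, enabled) if i in matched else ch
        for i, ch in enumerate(smiles)
    )


# Only emit escape codes when writing to an interactive terminal,
# so that piped or redirected output stays clean.
use_color = sys.stdout.isatty()
print(highlight("CCN", {2}, color="blue", enabled=use_color))
```
<br />
The cross-platform part of the project is the harder piece: some consoles need escape-sequence processing to be enabled explicitly, which this sketch does not attempt.<br />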
<br />
'''Prerequisites''': Experience in C++.<br />
<br />
'''Mentor''': Noel O'Boyle (baoilleach at gmail dot com)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on Github for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? Which machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
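<br />
To make the "simpler, yet more flexible" idea concrete, here is a minimal Python sketch in which a single fingerprint() entry point is parameterized by two pluggable pieces: an atom-invariant function and an environment enumerator. Everything in it is an assumption made for illustration (the dictionary-based molecule, the function names, the use of CRC32 for hashing); it is neither the RDKit API nor the design from the cited papers, and the real project would be written in C++.<br />
<br />
```python
import zlib
from typing import Callable, Dict, Iterable, Set

# Hypothetical toy molecule: atom index -> {"symbol": str, "neighbors": [int]}
Molecule = Dict[int, Dict]
Invariant = Callable[[Molecule, int], str]
Enumerator = Callable[[Molecule, Invariant], Iterable[str]]


def symbol_invariant(mol: Molecule, idx: int) -> str:
    """Simplest possible atom invariant: the element symbol."""
    return mol[idx]["symbol"]


def radius_one_environments(mol: Molecule, invariant: Invariant) -> Iterable[str]:
    """Each atom together with the sorted invariants of its neighbors."""
    for idx, atom in mol.items():
        neighbors = sorted(invariant(mol, n) for n in atom["neighbors"])
        yield invariant(mol, idx) + "(" + ",".join(neighbors) + ")"


def bonded_pair_environments(mol: Molecule, invariant: Invariant) -> Iterable[str]:
    """Each bond once, described by the sorted invariants of its two atoms."""
    for idx, atom in mol.items():
        for n in atom["neighbors"]:
            if idx < n:
                yield "-".join(sorted((invariant(mol, idx), invariant(mol, n))))


def fingerprint(
    mol: Molecule,
    environments: Enumerator,
    invariant: Invariant = symbol_invariant,
    n_bits: int = 64,
) -> Set[int]:
    """Hash every environment string and fold it into n_bits positions."""
    return {
        zlib.crc32(env.encode("utf-8")) % n_bits
        for env in environments(mol, invariant)
    }


# Heavy atoms of ethanol: C0-C1-O2
ETHANOL: Molecule = {
    0: {"symbol": "C", "neighbors": [1]},
    1: {"symbol": "C", "neighbors": [0, 2]},
    2: {"symbol": "O", "neighbors": [1]},
}

circular_bits = fingerprint(ETHANOL, radius_one_environments)
pair_bits = fingerprint(ETHANOL, bonded_pair_environments, n_bits=128)
print(sorted(circular_bits), sorted(pair_bits))
```
<br />
Switching fingerprint type means passing a different enumerator or invariant rather than calling a different function with a different parameter list.<br />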
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL; in this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others<br />
<br />
<br />
=== Project: MongoDB integration ===<br />
<br />
'''Brief explanation:''' <br />
MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with the RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities we will keep an eye on performance, to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).<br />
<br />
'''Expected results:''' <br />
A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.<br />
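<br />
As an illustration of the kind of similarity search meant here, the sketch below runs a Tanimoto search over plain Python dictionaries that stand in for database documents. It prunes candidates using the fact that the Tanimoto score of two fingerprints can never exceed the ratio of the smaller to the larger on-bit count. The document layout (name, bits, count) and the fingerprints themselves are made up for this example, and no MongoDB or RDKit calls are used.<br />
<br />
```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(
    query: FrozenSet[int], documents: List[Dict], threshold: float
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs with score >= threshold, best first."""
    hits: List[Tuple[str, float]] = []
    n_query = len(query)
    for doc in documents:
        smaller = min(n_query, doc["count"])
        larger = max(n_query, doc["count"])
        # Upper bound on the score from the bit counts alone: skip the
        # expensive set intersection when the bound is already too low.
        if larger == 0 or smaller / larger < threshold:
            continue
        score = tanimoto(query, doc["bits"])
        if score >= threshold:
            hits.append((doc["name"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up fingerprints; "count" is stored next to "bits" so it could be indexed.
DOCUMENTS: List[Dict] = [
    {"name": "mol_a", "bits": frozenset({1, 2, 3, 4})},
    {"name": "mol_b", "bits": frozenset({1, 2, 3, 9})},
    {"name": "mol_c", "bits": frozenset({7})},
    {"name": "mol_d", "bits": frozenset(range(1, 11))},
]
for doc in DOCUMENTS:
    doc["count"] = len(doc["bits"])

hits = similarity_search(frozenset({1, 2, 3, 4}), DOCUMENTS, threshold=0.5)
print(hits)
```
<br />
In a database the same bound could presumably be expressed as a range filter on an indexed bit-count field, so that most documents are never fetched at all; how well that scales is one of the things the project would need to measure.<br />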
<br />
'''Prerequisites:''' Python<br />
<br />
'''Mentor:''' Marco Stenta (marco.stenta at syngenta.com)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python, and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
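<br />
The difference between a single dense object and linked objects can be sketched as follows. The input fields (name, elements, energy), the chem: vocabulary, and the example.org identifiers are all hypothetical and are not taken from the NWChem JSON or Chemical JSON schemas; only the JSON-LD keywords (@context, @id, @type, @graph) are standard.<br />
<br />
```python
import json
from typing import Dict

# Hypothetical vocabulary; a real project would choose or define an ontology.
CONTEXT: Dict = {
    "chem": "https://example.org/chem#",
    "name": "chem:name",
    "element": "chem:element",
    "energy": "chem:totalEnergy",
    # Values of these two terms are identifiers of other nodes, not literals.
    "atoms": {"@id": "chem:hasAtom", "@type": "@id"},
    "molecule": {"@id": "chem:onMolecule", "@type": "@id"},
}


def to_json_ld(dense: Dict, base: str) -> Dict:
    """Split one dense record into molecule, atom, and calculation nodes."""
    mol_id = f"{base}/molecule/{dense['name']}"
    atom_nodes = [
        {"@id": f"{mol_id}/atom/{i}", "@type": "chem:Atom", "element": element}
        for i, element in enumerate(dense["elements"])
    ]
    molecule_node = {
        "@id": mol_id,
        "@type": "chem:Molecule",
        "name": dense["name"],
        "atoms": [node["@id"] for node in atom_nodes],
    }
    calculation_node = {
        "@id": f"{mol_id}/calculation/0",
        "@type": "chem:Calculation",
        "molecule": mol_id,
        "energy": dense["energy"],
    }
    return {
        "@context": CONTEXT,
        "@graph": [molecule_node, *atom_nodes, calculation_node],
    }


# Made-up dense record standing in for one computed result.
dense_record = {"name": "water", "elements": ["O", "H", "H"], "energy": -76.02}
linked = to_json_ld(dense_record, base="https://example.org/data")
print(json.dumps(linked, indent=2))
```
<br />
Because every node now has its own identifier, an experimental record stored elsewhere can point at the same molecule node instead of duplicating it.<br />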
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them. <br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
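<br />
A minimal, in-process sketch of the facilitator idea is shown below: clients send small state changes (for example a new rotation angle) and the facilitator relays them to everyone else, instead of anyone transmitting screen contents. The class name, the message fields (type, changes, state), and the callback-based "network" are assumptions made for this example; OneMol does not yet define such an API.<br />
<br />
```python
import json
from typing import Callable, Dict, List


class Facilitator:
    """Keeps one authoritative view state and relays changes to clients."""

    def __init__(self) -> None:
        self.state: Dict[str, float] = {"rot_x": 0.0, "rot_y": 0.0, "zoom": 1.0}
        self.clients: Dict[str, Callable[[str], None]] = {}

    def join(self, client_id: str, send: Callable[[str], None]) -> None:
        """Register a client and bring it up to date with a full snapshot."""
        self.clients[client_id] = send
        send(json.dumps({"type": "snapshot", "state": self.state}))

    def submit(self, sender_id: str, message: str) -> None:
        """Apply a small state change and forward it to all other clients."""
        update = json.loads(message)
        self.state.update(update["changes"])
        for client_id, send in self.clients.items():
            if client_id != sender_id:
                send(message)


# Two simulated clients whose "network connection" is just a list.
alice_inbox: List[str] = []
bob_inbox: List[str] = []

facilitator = Facilitator()
facilitator.join("alice", alice_inbox.append)
facilitator.join("bob", bob_inbox.append)

# Alice rotates the molecule: only the changed angle travels over the wire.
rotation = json.dumps({"type": "update", "changes": {"rot_y": 15.0}})
facilitator.submit("alice", rotation)

print(len(rotation), "bytes sent for the rotation")
print(bob_inbox[-1])
```
<br />
The rotation message is a few dozen bytes regardless of how large the molecule or the viewer window is, which is the scalability argument made above.<br />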
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=620GSoC Ideas 20182018-02-16T06:17:29Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
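<br />
The on-demand loading idea can be illustrated with a generator that yields one trajectory frame at a time, so that only the current time step is held in memory. This is a Python sketch of the concept only (the Avogadro 2 implementation would be in C++); the function names and the two-frame sample are made up, and the reader assumes a plain multi-frame XYZ file with no blank lines between frames.<br />
<br />
```python
import io
import math
from typing import Iterator, List, TextIO, Tuple

Atom = Tuple[str, float, float, float]
Frame = List[Atom]


def iter_xyz_frames(handle: TextIO) -> Iterator[Frame]:
    """Yield frames one by one instead of reading the whole file up front."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a blank line, which this sketch treats the same)
        n_atoms = int(header)
        handle.readline()  # skip the comment line of this frame
        frame: Frame = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            frame.append((symbol, float(x), float(y), float(z)))
        yield frame


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms of one frame."""
    return math.dist(a[1:], b[1:])


# Made-up two-frame trajectory of H2 with a stretching bond.
TRAJECTORY = """2
frame 0
H 0.0 0.0 0.0
H 0.0 0.0 0.74
2
frame 1
H 0.0 0.0 0.0
H 0.0 0.0 0.80
"""

bond_lengths = [
    distance(frame[0], frame[1])
    for frame in iter_xyz_frames(io.StringIO(TRAJECTORY))
]
print(bond_lengths)
```
<br />
Analyses such as pair-wise distribution functions can be accumulated in the same streaming fashion, one frame at a time.<br />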
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry store either no coordinates (SMILES, InChI, FASTA sequence) or only 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
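<br />
As a point of reference, the kind of Monte Carlo baseline that a Bayesian optimizer would be compared against can be sketched in a few lines. This is only an illustration: the energy function is a made-up cosine potential with a coupling term between neighbouring dihedrals (so the dimensions are deliberately non-independent), and the names <code>torsion_energy</code> and <code>monte_carlo_search</code> are hypothetical, not Open Babel API.<br />
<br />
```python
import math
import random


def torsion_energy(angles):
    """Toy energy (hypothetical, not a real force field).

    Each dihedral contributes a threefold cosine term, and neighbouring
    dihedrals are coupled so the degrees of freedom are not independent.
    """
    energy = 0.0
    for a in angles:
        energy += 1.0 + math.cos(3.0 * a)
    for a, b in zip(angles, angles[1:]):
        energy += 0.5 * (1.0 - math.cos(a - b))
    return energy


def monte_carlo_search(n_dihedrals, n_steps, step=0.5, temperature=0.6, seed=0):
    """Metropolis Monte Carlo over dihedral angles; returns (best_energy, best_angles)."""
    rng = random.Random(seed)
    current = [rng.uniform(-math.pi, math.pi) for _ in range(n_dihedrals)]
    current_e = torsion_energy(current)
    best, best_e = list(current), current_e
    for _ in range(n_steps):
        trial = list(current)
        i = rng.randrange(n_dihedrals)
        # Perturb one dihedral and wrap it back into [-pi, pi).
        trial[i] = (trial[i] + rng.gauss(0.0, step) + math.pi) % (2.0 * math.pi) - math.pi
        trial_e = torsion_energy(trial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if trial_e <= current_e or rng.random() < math.exp((current_e - trial_e) / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best_e, best


if __name__ == "__main__":
    energy, angles = monte_carlo_search(n_dihedrals=4, n_steps=2000)
    print(f"best energy {energy:.3f} at {[round(a, 2) for a in angles]}")
```
<br />
A Bayesian optimizer would replace the random perturbation with a surrogate model that proposes the next set of angles, and would be judged on how many energy evaluations it needs to match or beat such a baseline.<br />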
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
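<br />
A reader/writer pair could look roughly like the sketch below. Because the schema is still being designed, every key used here (<code>schema_name</code>, <code>molecule</code>, <code>symbols</code>, <code>geometry</code>, <code>properties</code>, <code>scf_total_energy</code>) is an assumption made for illustration, as are the function names and the input dictionary, which is only loosely modeled on cclib attribute names.<br />
<br />
```python
import json

# Minimal element lookup for the example; a real writer would cover the periodic table.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}
NUMBERS = {symbol: number for number, symbol in SYMBOLS.items()}


def to_qc_json(data):
    """Serialize parsed data into a schema-like JSON string (hypothetical layout)."""
    document = {
        "schema_name": "qc_schema_sketch",  # hypothetical identifier
        "molecule": {
            "symbols": [SYMBOLS[n] for n in data["atomnos"]],
            # Flatten [[x, y, z], ...] into [x, y, z, x, y, z, ...].
            "geometry": [c for atom in data["atomcoords"] for c in atom],
        },
        "properties": {"scf_total_energy": data["scfenergy"]},
    }
    return json.dumps(document, indent=2)


def from_qc_json(text):
    """Read the same layout back into the dictionary form used by the writer."""
    document = json.loads(text)
    flat = document["molecule"]["geometry"]
    return {
        "atomnos": [NUMBERS[s] for s in document["molecule"]["symbols"]],
        "atomcoords": [flat[i:i + 3] for i in range(0, len(flat), 3)],
        "scfenergy": document["properties"]["scf_total_energy"],
    }


if __name__ == "__main__":
    # Illustrative values only, not the output of a real calculation.
    parsed = {
        "atomnos": [8, 1, 1],
        "atomcoords": [[0.0, 0.0, 0.1], [0.0, 0.8, -0.5], [0.0, -0.8, -0.5]],
        "scfenergy": -76.0,
    }
    print(to_qc_json(parsed))
```
<br />
Round-trip tests of this kind (write, read back, compare) are also a cheap way to generate concrete feedback for the schema authors.<br />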
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
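<br />
One possible shape for such a refactoring, offered only as a starting point for discussion, is to move each block of logic into a small, documented handler and let <code>extract()</code> dispatch to them. The log line formats, handler names, and the <code>HANDLERS</code> list below are invented for the example and do not reflect the current cclib code.<br />
<br />
```python
import re


def parse_scf_energy(line, data):
    """Append an SCF energy to data['scfenergies'] if the line carries one."""
    match = re.search(r"SCF energy\s*=\s*(-?\d+\.\d+)", line)
    if match:
        data.setdefault("scfenergies", []).append(float(match.group(1)))


def parse_natom(line, data):
    """Store the atom count in data['natom'] if the line carries one."""
    match = re.search(r"Number of atoms\s*=\s*(\d+)", line)
    if match:
        data["natom"] = int(match.group(1))


# Each handler owns one piece of logic and can be tested and documented on its own.
HANDLERS = [parse_scf_energy, parse_natom]


def extract(lines):
    """Run every handler over every line and collect the parsed attributes."""
    data = {}
    for line in lines:
        for handler in HANDLERS:
            handler(line, data)
    return data


if __name__ == "__main__":
    log = ["Number of atoms = 3", "SCF energy = -76.026", "unrelated line"]
    print(extract(log))
```
<br />
Because each handler has its own docstring, the documentation of parsed attributes could in principle be generated from the code rather than maintained by hand.<br />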
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on Github for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? What machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
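<br />
The core of most volumetric techniques is compositing samples along a viewing ray. The sketch below shows front-to-back alpha compositing for a single ray in plain Python, purely to illustrate the arithmetic; an actual 3Dmol.js implementation would presumably run this per pixel in a WebGL shader. The function names and the simple transfer function are assumptions.<br />
<br />
```python
def density_transfer(value):
    """Map a scalar sample to (intensity, opacity); a deliberately simple choice."""
    alpha = min(max(value, 0.0), 1.0) * 0.5
    return 1.0, alpha


def composite_ray(samples, transfer):
    """Front-to-back compositing of scalar samples along one ray.

    Returns (accumulated_intensity, accumulated_opacity).
    """
    color = 0.0
    transmittance = 1.0  # fraction of light still passing through
    for value in samples:
        intensity, alpha = transfer(value)
        color += transmittance * alpha * intensity
        transmittance *= 1.0 - alpha
        if transmittance < 1e-3:
            break  # early ray termination: later samples are invisible
    return color, 1.0 - transmittance


if __name__ == "__main__":
    ray = [0.0, 0.2, 0.9, 1.0, 0.4, 0.0]
    print(composite_ray(ray, density_transfer))
```
<br />
Different rendering techniques mostly differ in how the samples are taken and in the transfer function, which makes this loop a convenient place to compare them.<br />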
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers will make the functionality accessible from Python and Java as well as from the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers will make the reader and writer accessible from Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: neo4j integration ===<br />
<br />
'''Brief explanation:''' <br />
The RDKit already has strong integration with the open-source relational database PostgreSQL; in this project you'll build a similar extension for the open-source graph database neo4j (https://neo4j.com/). The concept of the knowledge graph, which stores the relationships between objects in addition to the objects themselves, has become widespread in data management and integration. This project will allow us to build and query knowledge graphs storing molecular and chemical information.<br />
<br />
'''Expected results:''' <br />
An RDKit extension to neo4j that provides chemical functionality for finding entry points into the graph and to efficiently filter paths using chemical knowledge while traversing the graph.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed. <br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
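<br />
The difference between a single dense object and linked objects can be illustrated with a small sketch. The keywords <code>@context</code>, <code>@vocab</code>, <code>@graph</code>, <code>@id</code> and <code>@type</code> come from JSON-LD itself; everything else (the <code>example.org</code> vocabulary and identifiers, the <code>onMolecule</code> property, the input layout, the function name) is hypothetical and does not reflect the actual NWChem JSON or Chemical JSON formats.<br />
<br />
```python
import json

# Hypothetical vocabulary; a real project would point at a published ontology.
CONTEXT = {"@vocab": "https://example.org/chem#"}


def to_json_ld(doc, base="https://example.org/data/"):
    """Split one dense document into linked nodes with their own identifiers."""
    mol_id = base + "molecule/" + doc["name"]
    calc_id = base + "calculation/" + doc["name"] + "-1"
    molecule = {
        "@id": mol_id,
        "@type": "Molecule",
        "name": doc["name"],
        "atoms": doc["atoms"],
    }
    calculation = {
        "@id": calc_id,
        "@type": "Calculation",
        **doc["calculation"],
        # A link to the molecule node instead of an embedded copy.
        "onMolecule": {"@id": mol_id},
    }
    return {"@context": CONTEXT, "@graph": [molecule, calculation]}


if __name__ == "__main__":
    # Illustrative dense document, not real program output.
    dense = {
        "name": "water",
        "atoms": [{"element": "O"}, {"element": "H"}, {"element": "H"}],
        "calculation": {"method": "DFT", "energy": -76.4},
    }
    print(json.dumps(to_json_ld(dense), indent=2))
```
<br />
Once molecules and calculations carry their own identifiers, a triple-store can link an experimental record to the same molecule node without duplicating it.<br />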
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures, possibly manipulate them. <br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
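<br />
To make the Gaussian-sphere idea concrete, the sketch below evaluates a sum of atom-centered Gaussians on a regular grid in plain Python. It is a CPU reference under simple assumptions (one isotropic Gaussian per atom, made-up function names), not the existing Avogadro code; the point is that every grid value is independent of the others, which is what makes the calculation a natural fit for one GPU work-item per grid point.<br />
<br />
```python
import math


def gaussian_density(point, atoms):
    """Sum of atom-centered Gaussians at one point.

    atoms is a list of ((x, y, z), radius) pairs.
    """
    total = 0.0
    for center, radius in atoms:
        d2 = sum((p - c) ** 2 for p, c in zip(point, center))
        total += math.exp(-d2 / (radius * radius))
    return total


def density_grid(atoms, origin, spacing, shape):
    """Evaluate the density on an nx * ny * nz grid; each value is independent."""
    nx, ny, nz = shape
    return [
        [
            [
                gaussian_density(
                    (origin[0] + i * spacing, origin[1] + j * spacing, origin[2] + k * spacing),
                    atoms,
                )
                for k in range(nz)
            ]
            for j in range(ny)
        ]
        for i in range(nx)
    ]


if __name__ == "__main__":
    # Two illustrative atoms; coordinates and radii are arbitrary.
    atoms = [((0.0, 0.0, 0.0), 1.5), ((1.0, 0.0, 0.0), 1.2)]
    grid = density_grid(atoms, origin=(-2.0, -2.0, -2.0), spacing=1.0, shape=(5, 5, 5))
    print(grid[2][2][2])
```
<br />
An isosurface extracted from such a grid gives a smooth approximation of the molecular surface, and the same reference code can be used to check the output of a GPU kernel.<br />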
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
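<br />
The client and facilitator roles described above can be sketched in-process as follows. This is a toy model under stated assumptions: the class names, the JSON message layout, and the <code>rotation</code>/<code>zoom</code> state keys are all invented, there is no networking or storage module, and conflict handling is ignored. It does show the central point, that a rotation travels as a few bytes of state rather than as a screen update.<br />
<br />
```python
import json


class Facilitator:
    """Keeps one shared viewer state and relays small state changes to clients."""

    def __init__(self):
        self.state = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0}
        self.clients = []

    def join(self, client):
        self.clients.append(client)
        client.facilitator = self
        # A newly joined client receives the full current state once.
        client.receive(json.dumps(self.state))

    def submit(self, sender, message):
        # Record the change, then relay only the change to the other clients.
        self.state.update(json.loads(message))
        for client in self.clients:
            if client is not sender:
                client.receive(message)


class Client:
    """Stands in for the client module embedded in a molecular viewer."""

    def __init__(self):
        self.view = {}
        self.facilitator = None

    def receive(self, message):
        self.view.update(json.loads(message))

    def rotate(self, x, y, z):
        delta = {"rotation": [x, y, z]}
        self.view.update(delta)
        self.facilitator.submit(self, json.dumps(delta))


if __name__ == "__main__":
    hub = Facilitator()
    guide, viewer = Client(), Client()
    hub.join(guide)
    hub.join(viewer)
    guide.rotate(10.0, 0.0, 45.0)
    print(viewer.view)
```
<br />
In a real deployment the <code>receive</code> and <code>submit</code> calls would be replaced by network messages to a hosted facilitator, but the message content could stay this small.<br />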
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace is very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=619GSoC Ideas 20182018-02-16T06:08:34Z<p>Greg.landrum: /* RDKit Project Ideas */ add stub for new project idea (content to come)</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
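<br />
The on-demand loading of time steps described above can be sketched with a generator that reads one frame at a time from an XYZ trajectory. This is a minimal illustration in Python rather than Avogadro's C++; the function name <code>iter_xyz_frames</code> is hypothetical and not part of any Avogadro API, and the sketch assumes a well-formed multi-frame XYZ file (atom count, comment line, then one atom per line).<br />

```python
import io


def iter_xyz_frames(handle):
    """Yield (comment, atoms) one frame at a time from an open XYZ trajectory.

    Only the current frame is held in memory, so a large file is never
    loaded up front. The name and return shape are illustrative assumptions.
    """
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (a blank header line also stops iteration)
        n_atoms = int(header)
        comment = handle.readline().rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        yield comment, atoms


# Usage: frames are parsed only when requested.
trajectory = io.StringIO(
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
for comment, atoms in iter_xyz_frames(trajectory):
    print(comment, len(atoms))
```

A real implementation would probably also index frame offsets so that a viewer can seek to an arbitrary time step without rereading earlier frames.<br />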
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.<br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.<br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Test Framework Overhaul ===<br />
<br />
'''Brief explanation:''' Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.<br />
<br />
'''Expected results:''' A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.<br />
<br />
'''Prerequisites:''' Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
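<br />
One possible shape for such a refactoring is sketched below: each concern becomes a small, documented handler, and <code>extract()</code> only dispatches lines to the handlers. This is an assumption for illustration, not cclib's actual code; the handler names and the log line formats ("SCF energy = ...", "Charge = ...") are hypothetical, and only the result keys <code>scfenergies</code> and <code>charge</code> follow cclib's attribute naming.<br />

```python
import re


def parse_scf_energy(line, data):
    """Append an SCF energy to data['scfenergies'] if the line reports one."""
    match = re.match(r"\s*SCF energy\s*=\s*(-?\d+\.\d+)", line)
    if match:
        data.setdefault("scfenergies", []).append(float(match.group(1)))


def parse_charge(line, data):
    """Store the total charge in data['charge'] if the line reports it."""
    match = re.match(r"\s*Charge\s*=\s*(-?\d+)", line)
    if match:
        data["charge"] = int(match.group(1))


# Hypothetical registry: adding a parsed attribute means adding one handler.
HANDLERS = (parse_scf_energy, parse_charge)


def extract(lines):
    """Run every handler over every line and return the collected data."""
    data = {}
    for line in lines:
        for handler in HANDLERS:
            handler(line, data)
    return data


print(extract(["Charge = 0", "SCF energy = -76.0267", "SCF energy = -76.0268"]))
```

Because each handler carries its own docstring, the same functions could feed generated documentation, and each one can be unit tested in isolation.<br />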
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on Github for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? What machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
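<br />
To make the idea of a single, parameterised entry point concrete, here is a minimal sketch in Python on a toy molecular graph. It does not use the RDKit and is not its API: the function <code>generate_fingerprint</code>, its parameters, and the simplified Morgan-style iterative hashing are assumptions chosen for illustration, and the project itself calls for a C++ implementation.<br />

```python
import hashlib


def _stable_hash(value):
    """Deterministic integer hash of a Python value via its repr."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate_fingerprint(atoms, bonds, atom_invariant, max_radius=2, n_bits=64):
    """Return the set of on-bits for a simplified circular fingerprint.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    atom_invariant: callable (symbol, degree) -> hashable, supplied by the
    caller, so one function covers many fingerprint flavours.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Radius 0: one identifier per atom from the caller's invariant.
    ids = [
        _stable_hash(atom_invariant(atoms[i], len(neighbors[i])))
        for i in range(len(atoms))
    ]
    on_bits = {identifier % n_bits for identifier in ids}

    # Larger radii: combine each atom's identifier with its neighbours'.
    for radius in range(1, max_radius + 1):
        ids = [
            _stable_hash((radius, ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        on_bits.update(identifier % n_bits for identifier in ids)
    return on_bits


# Usage: ethanol heavy atoms, with element and degree as the atom invariant.
bits = generate_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)], lambda s, d: (s, d))
print(sorted(bits))
```

Swapping the <code>atom_invariant</code> callable or the radius changes the fingerprint type without adding a new top-level function, which is the kind of flexibility the project description asks for.<br />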
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: Neo4J integration ===<br />
<br />
'''Brief explanation:''' <br />
<br />
'''Expected results:''' <br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Christian Pilger (christian.pilger at basf.com)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expand the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python. But, more and more developers are using platforms such as Python to sandbox new theories, methods and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite. <br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
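<br />
The difference between a single dense object and linked objects can be illustrated as follows. The sketch converts a small nested record into a JSON-LD document whose nodes reference each other by <code>@id</code>. The keywords <code>@context</code>, <code>@vocab</code>, <code>@graph</code>, <code>@id</code>, and <code>@type</code> are standard JSON-LD; everything else (the <code>example.org</code> vocabulary and identifiers, the field names, and the placeholder energy value) is a made-up assumption and does not reflect the actual NWChem JSON or Chemical JSON layouts.<br />

```python
import json

# A dense record: the molecule is buried inside the calculation document.
# Field names and values are placeholders, not real NWChem output.
dense = {
    "molecule": {"formula": "H2O"},
    "calculation": {"method": "DFT", "energy": -76.4},
}


def to_linked(record, base="https://example.org/data/"):
    """Split a dense record into JSON-LD nodes that link to each other."""
    molecule_id = base + "molecule/1"
    calculation_id = base + "calculation/1"
    molecule = {"@id": molecule_id, "@type": "Molecule", **record["molecule"]}
    calculation = {
        "@id": calculation_id,
        "@type": "Calculation",
        **record["calculation"],
        # A node reference: the calculation points at the molecule by @id.
        "molecule": {"@id": molecule_id},
    }
    return {
        "@context": {"@vocab": "https://example.org/chem#"},
        "@graph": [molecule, calculation],
    }


print(json.dumps(to_linked(dense), indent=2))
```

Once each object has its own identifier, a triple store can hold the molecule once and let several calculations or experimental records refer to it.<br />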
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures, possibly manipulate them.<br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidian Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
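<br />
The rotation example above can be made concrete with a minimal sketch. It assumes a viewer state held as a plain dictionary and a JSON message format; the field names (<code>rotation_deg</code>, <code>zoom</code>, <code>view_update</code>) are hypothetical illustrations, not part of any OneMol specification.<br />
<br />
```python
import json


def view_delta(old_state, new_state):
    """Return only the keys whose values changed between two viewer states."""
    return {k: v for k, v in new_state.items() if old_state.get(k) != v}


# Hypothetical viewer states before and after a simple rotation.
old = {"molecule_id": "example-1", "rotation_deg": [0.0, 0.0, 0.0], "zoom": 1.0}
new = {"molecule_id": "example-1", "rotation_deg": [0.0, 15.0, 0.0], "zoom": 1.0}

# Only the changed viewing angles need to travel to the other clients.
message = json.dumps({"type": "view_update", "changes": view_delta(old, new)})
print(message)
print(len(message.encode("utf-8")), "bytes")
```
<br />
A message like this is a few dozen bytes, whereas a screen-sharing approach would retransmit the whole rendered frame for the same change.<br />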
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=616GSoC Ideas 20182018-02-08T13:39:56Z<p>Greg.landrum: A bit of tweaking of RDKit-related projects</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
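<br />
As an illustration of loading time steps on demand, the sketch below reads a multi-frame XYZ trajectory one frame at a time with a generator. It is written in Python for brevity (the Avogadro 2 implementation would be in C++), it assumes the common XYZ layout of an atom-count line, a comment line, and one <code>symbol x y z</code> line per atom, and the function name <code>iter_xyz_frames</code> is hypothetical.<br />
<br />
```python
import io


def iter_xyz_frames(handle):
    """Yield one (comment, atoms) frame at a time from a multi-frame XYZ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        n_atoms = int(header)
        comment = handle.readline().rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        yield comment, atoms


# A tiny two-frame trajectory standing in for a large file on disk.
trajectory = io.StringIO(
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
frames = iter_xyz_frames(trajectory)
first_comment, first_atoms = next(frames)  # only the first frame is parsed here
print(first_comment, first_atoms)
```
<br />
Because the generator parses a frame only when it is requested, memory use stays proportional to one frame rather than to the whole trajectory.<br />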
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented using PyBind11 with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with PyBind11, SWIG, Boost.Python, or similar packages (SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel relies mainly on a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom; fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.<br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.<br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Bayesian Optimization of Conformer Geometries ===<br />
<br />
'''Brief explanation:''' Most molecules have multiple energetically-accessible geometries (conformations). In even medium-sized molecules, there may be '''thousands''' or '''''millions''''' of possibilities. Intelligent search strategies (i.e., Bayesian optimization) are needed to find the best geometries in the shortest amount of time.<br />
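<br />
To see where "thousands" or "millions" come from, consider a rough estimate that assumes each rotatable bond is sampled at a fixed number of torsion values (three is used here as an illustrative default, not a value taken from Open Babel). The function name is hypothetical.<br />
<br />
```python
def grid_conformer_count(n_rotatable_bonds, states_per_bond=3):
    """Number of conformers on a regular torsion grid: states ** bonds."""
    return states_per_bond ** n_rotatable_bonds


for n in (4, 7, 13):
    print(n, "rotatable bonds ->", grid_conformer_count(n), "grid conformers")
```
<br />
Under this assumption, 7 rotatable bonds already give 2,187 candidates and 13 give 1,594,323, which is why exhaustive enumeration quickly becomes impractical.<br />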
<br />
'''Expected results:''' An efficient implementation of Bayesian optimization of molecular dihedral angles and testing against known molecular geometries (crystal structures) and libraries of conformers. In principle, the goal is to balance exploration of the multiple degrees of freedom and exploitation of known data (i.e., local optimization). A key test is to compare against existing Monte Carlo and genetic algorithm methods already implemented in Open Babel.<br />
<br />
In many molecules, the degrees of freedom (dihedral angles) are non-independent, so detecting correlations between dimensions, dimensional reduction, etc. should likely improve performance. Combining data science and machine learning techniques may allow the code to detect such conditions based on the molecular structure (i.e., this is not a '''completely''' black-box optimization - we know some of the physics involved).<br />
<br />
'''Prerequisites:''' Experience in C++ or Python. Knowledge of data science or statistics (e.g., Bayesian inference, data mining) is ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.<br />
<br />
=== Project: Implement QC JSON schema in cclib ===<br />
<br />
'''Brief explanation:''' Incorporate the [https://github.com/MolSSI/QC_JSON_Schema MolSSI JSON schema], which is currently in the design stage.<br />
<br />
'''Expected results:''' Implement a reader and writer according to the schema, and provide feedback to help drive the schema design to completion. Optionally also improve the code and tests for existing reader/writer classes in cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally some familiarity with JSON, quantum chemistry and computational chemistry programs.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() functions in parsers are long and contain a lot of business logic. They should be refactored into smaller functions for maintainability. <br />
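<br />
One possible shape for such a refactoring is sketched below: each piece of parsing logic becomes a small function, and <code>extract()</code> only dispatches lines to them. This is an assumption-laden illustration, not cclib's actual design; the log markers and handler names are made up, and only the attribute names <code>natom</code> and <code>scfenergies</code> echo cclib's vocabulary.<br />
<br />
```python
def parse_natom(line, data):
    """Store the atom count found at the end of the line."""
    data["natom"] = int(line.split()[-1])


def parse_scf_energy(line, data):
    """Append the SCF energy found at the end of the line."""
    data.setdefault("scfenergies", []).append(float(line.split()[-1]))


# Hypothetical markers; a real parser would use program-specific patterns.
HANDLERS = [
    ("Number of atoms:", parse_natom),
    ("SCF energy:", parse_scf_energy),
]


def extract(lines):
    """Dispatch each line to the first handler whose marker it contains."""
    data = {}
    for line in lines:
        for marker, handler in HANDLERS:
            if marker in line:
                handler(line, data)
                break
    return data


log = ["Number of atoms: 3", "SCF energy: -76.026", "SCF energy: -76.027"]
result = extract(log)
print(result)
```
<br />
Small handlers like these can be unit-tested individually, and their docstrings can feed the documentation.<br />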
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement the best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on Github for parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? What machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
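<br />
One way to picture a single, flexible entry point is a function that takes the atom-typing rule as a parameter instead of hard-coding it. The toy sketch below is pure Python and does not use or mirror the RDKit API; the molecule representation (a list of element symbols plus bond index pairs), the function names, and the CRC32-based hashing are all assumptions made for illustration.<br />
<br />
```python
import zlib


def circular_fingerprint(atoms, bonds, invariant, radius=1, n_bits=64):
    """Hash atom environments up to `radius` bonds into a set of bit positions."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # The pluggable part: how each atom is initially labeled.
    labels = [invariant(atoms[i], len(neighbors[i])) for i in range(len(atoms))]
    bits = set()
    for _ in range(radius + 1):
        for label in labels:
            bits.add(zlib.crc32(label.encode("utf-8")) % n_bits)
        # Grow each environment by one bond, order-independently.
        labels = [
            labels[i] + "(" + ",".join(sorted(labels[j] for j in neighbors[i])) + ")"
            for i in range(len(atoms))
        ]
    return bits


def by_element(symbol, degree):
    return symbol


def by_element_and_degree(symbol, degree):
    return f"{symbol}{degree}"


# Ethanol heavy atoms: C-C-O
atoms, bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
print(sorted(circular_fingerprint(atoms, bonds, by_element)))
print(sorted(circular_fingerprint(atoms, bonds, by_element_and_degree)))
```
<br />
Swapping the <code>invariant</code> callable changes the fingerprint flavor without adding a new top-level function, which is the kind of simplification the project is after.<br />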
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases, plus wrappers for the functions so that they are accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases, plus wrappers for the functions so that they are accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MolVS ===<br />
<br />
'''Brief explanation:''' MolVS (https://molvs.readthedocs.io/en/latest/) provides very useful functionality for molecular validation and standardization. MolVS is built using the RDKit, but in this project we will expand its capabilities and integrate it into the RDKit project. An eventual end goal, though not necessarily one for this project, will be to have a C++ implementation of the algorithm that is part of the core RDKit. Matt Swain (the original author of MolVS) will collaborate with us on this.<br />
<br />
'''Expected results:''' A Python or C++ implementation of an extended version of MolVS that can be integrated into the RDKit core. The extensions will include support for sets of atom types that are to be allowed/disallowed. We will also add ideas borrowed from Standardiser (https://github.com/flatkinson/standardiser) <br />
<br />
'''Prerequisites:''' Python, C++ would be an advantage but is not required<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
== MSDK / MZmine Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools. MZmine [https://mzmine.github.io] is an open-source software for mass-spectrometry data processing. A new version, MZmine 3, which is currently under development, is based on JavaFX for GUI and on MSDK for data processing algorithms. <br />
<br />
<br />
=== Project: MSDK - Feature Detection ===<br />
<br />
'''Brief explanation:''' Provide a native Java implementation of some popular LC-MS feature detection algorithms from the R world (centWave, massifquant, CAMERA, etc.) [https://www.bioconductor.org/packages/3.7/bioc/manuals/xcms/man/xcms.pdf]. Further development of ADAP-3D module [https://github.com/msdk/msdk/tree/master/msdk-featuredetection-adap3d] for intelligent, parameter-less feature detection.<br />
<br />
'''Prerequisites:''' Java, preferably some knowledge about mass spectrometry<br />
<br />
'''Mentor:''' Dmitry Avtonomov <dmitriy.avtonomov@gmail.com>, Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK - Spectral Database Search ===<br />
<br />
'''Brief explanation:''' Develop new MSDK modules for spectral search in offline and online databases (especially MoNA [http://mona.fiehnlab.ucdavis.edu]).<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Gert Wohlgemuth <berlinguyinca@gmail.com><br />
<br />
<br />
=== Project: MSDK - New IO Modules ===<br />
<br />
'''Brief explanation:''' Develop new MSDK-IO modules for currently unsupported file formats, like mzDB [https://github.com/mzdb/mzdb-specs], mz5 [http://software.steenlab.org/mz5/], or imzML [https://ms-imaging.org/wp/imzml/], and improve the existing support for reading native vendor formats. Update mzTab [https://github.com/HUPO-PSI/mzTab] support to version 1.1 with new features for metabolomics. In addition, support for reading ion mobility data can be added to the existing mzML format reader/writer.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com> and/or Adam Tenderholt <atenderholt@gmail.com><br />
<br />
=== Project: MSDK - KNIME integration ===<br />
<br />
'''Brief explanation:''' Develop an integration layer for MSDK algorithms into the workflow platform KNIME [https://www.knime.com].<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MSDK / MZmine - Statistical Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for multivariate statistics and machine learning-based analysis of mass spectrometry results. Part of this project is algorithmic, part of it is GUI development.<br />
<br />
'''Prerequisites:''' Java, preferably basic knowledge about statistics<br />
<br />
'''Mentor:''' Xiuxia Du <Xiuxia.Du@uncc.edu><br />
<br />
<br />
=== Project: MSDK / MZmine - Correlation Analysis ===<br />
<br />
'''Brief explanation:''' Develop new modules for correlation-based identification of related mass spectrometry signals.<br />
<br />
'''Prerequisites:''' Java<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
<br />
=== Project: MZmine - New Visualization Modules ===<br />
<br />
'''Brief explanation:''' Implement new, JavaFX-based visualization modules for MZmine such as Cloud Plot [http://pubs.acs.org/doi/abs/10.1021/ac3029745] or spectral similarity tree imaging.<br />
<br />
'''Prerequisites:''' Java (experience with JavaFX is helpful but not required)<br />
<br />
'''Mentor:''' Tomas Pluskal <plusik@gmail.com><br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python, yet more and more developers are using Python to sandbox new theories, methods, and algorithms. A full Python interface needs to be developed for the NWChem software suite; the extended interface could then be integrated into Jupyter notebooks.<br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
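<br />
The difference between a single dense object and linked objects can be sketched as follows. The keywords <code>@context</code>, <code>@vocab</code>, <code>@graph</code>, <code>@id</code>, and <code>@type</code> are standard JSON-LD; everything else (the <code>example.org</code> identifiers, the property names, and the numeric value) is a placeholder assumption rather than the actual NWChem JSON or Chemical JSON layout.<br />
<br />
```python
import json

# Dense, single-object form (field names and values are placeholders).
dense = {
    "molecule": {"formula": "H2O"},
    "calculation": {"method": "DFT", "energy": -76.0},
}


def to_linked(doc, base="https://example.org/"):
    """Split a dense document into two nodes that reference each other by @id."""
    mol_id = base + "molecule/1"
    calc_id = base + "calculation/1"
    molecule = {"@id": mol_id, "@type": "Molecule"}
    molecule.update(doc["molecule"])
    calculation = {
        "@id": calc_id,
        "@type": "Calculation",
        "onMolecule": {"@id": mol_id},  # a link instead of a nested copy
    }
    calculation.update(doc["calculation"])
    return {"@context": {"@vocab": base + "vocab#"}, "@graph": [molecule, calculation]}


linked = to_linked(dense)
print(json.dumps(linked, indent=2))
```
<br />
Because each node carries its own identifier, the molecule and the calculation map naturally onto subject-predicate-object triples.<br />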
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== DeepChem Project Ideas ==<br />
[https://deepchem.io DeepChem] aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.<br />
<br />
=== Project: Transfer Learning Framework ===<br />
<br />
'''Brief explanation:''' Create easy to use tools for common transfer learning scenarios.<br />
<br />
'''Expected results:''' [https://arxiv.org/abs/1712.02734 ChemNet] discusses a powerful model independent transfer learning protocol. We would want to reproduce the results, and be able to apply the transfer learning protocol to arbitrary TensorGraph models. Jupyter notebook tutorials and blog posts will be expected over the course of the summer.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Data Interfaces ===<br />
<br />
'''Brief explanation:''' Transition deepchem.data.Dataset to tf.data.<br />
<br />
'''Expected results:''' DeepChem data objects were created before tf.data existed. We need to make our existing Featurizers, Transformers, and Models work over tf.data objects. Jupyter notebook tutorials and blog posts for how to use the new improved interfaces.<br />
<br />
'''Prerequisites:''' Python, some Tensorflow<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
=== Project: Model Visualization ===<br />
<br />
'''Brief explanation:''' Node Importance Visualizations from Graph Models<br />
<br />
'''Expected results:''' An argument often used against deep learning methods is that they are not understandable. This project would be to implement [https://github.com/debbiemarkslab/neural-fingerprint-theano visual neural graph fingerprints] into DeepChem. Stretch goals would be to implement [https://arxiv.org/abs/1605.01713 DeepLift] or masking techniques for atom level visualizations.<br />
<br />
'''Prerequisites:''' Python, Tensorflow, rdkit<br />
<br />
'''Mentor:''' Karl Leswing (karl dot leswing at schrodinger dot com)<br />
<br />
<br />
=== Project: Imaging Tools ===<br />
<br />
'''Brief explanation:''' Enable chemical image segmentation and property prediction.<br />
<br />
'''Expected results:''' We want an implementation of [https://arxiv.org/pdf/1505.04597.pdf U-Net], and [https://arxiv.org/pdf/1512.03385.pdf ResNet] inside of the DeepChem framework. We want both pre-trained networks on problems of chemical importance and the image data augmentation techniques used to create the models.<br />
<br />
'''Prerequisites:''' Python, Tensorflow<br />
<br />
'''Mentor:''' Bharath Ramsundar (bharath at datamined dot io)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them.<br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
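<br />
The rotation example can be made concrete with a small sketch. It assumes a hypothetical view state stored as a dictionary (the keys <code>rotation</code>, <code>zoom</code>, and <code>style</code> are illustrative, not part of any OneMol specification) and compares the size of a state-change message with an uncompressed screen frame.<br />
<br />
```python
import json


def view_state_delta(old, new):
    """Return only the view-state keys whose values changed."""
    return {key: value for key, value in new.items() if old.get(key) != value}


def apply_delta(state, delta):
    """Merge a received delta into a client's local view state."""
    merged = dict(state)
    merged.update(delta)
    return merged


# Hypothetical view state before and after a simple rotation.
before = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0, "style": "stick"}
after = {"rotation": [0.0, 15.0, 0.0], "zoom": 1.0, "style": "stick"}

delta = view_state_delta(before, after)
message = json.dumps(delta)

# An uncompressed 1920x1080 RGB frame, for comparison with the message size.
frame_bytes = 1920 * 1080 * 3
print(message, len(message.encode("utf-8")), frame_bytes)
```
<br />
Only the changed viewing angles travel over the network; each client applies the delta to its own copy of the state and re-renders locally.<br />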
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace are very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2018&diff=585GSoC Ideas 20182018-01-15T07:10:46Z<p>Greg.landrum: update RDKit section</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
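<br />
As an illustration of the pair-wise analysis mentioned above, the following minimal sketch histograms inter-particle distances for a single frame. It assumes a small, non-periodic system given as a list of Cartesian coordinates; the function name and the test frame are hypothetical. Turning these counts into a pair distribution function would additionally require normalizing by shell volume and particle density, which is omitted here.<br />
<br />
```python
import math
from itertools import combinations


def pair_distance_histogram(positions, r_max, n_bins):
    """Count particle pairs per distance bin for one frame (no periodic boundaries)."""
    bin_width = r_max / n_bins
    counts = [0] * n_bins
    for a, b in combinations(positions, 2):
        r = math.dist(a, b)
        if r < r_max:
            # Clamp the index to guard against floating-point round-up at the edge.
            index = min(int(r / bin_width), n_bins - 1)
            counts[index] += 1
    return counts


# Made-up four-particle frame used only to exercise the function.
frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (3.0, 0.0, 0.0)]
histogram = pair_distance_histogram(frame, r_max=4.0, n_bins=4)
print(histogram)
```
<br />
The double loop scales quadratically with the number of particles, so for the large systems discussed above a cell list or similar neighbor search would be needed.<br />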
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability.<br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
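<br />
One possible shape for such a refactoring is sketched below. It is a toy example, not cclib's actual API: the class name, the trigger strings, and the handler methods are all hypothetical. The point is only that extract() shrinks to a dispatcher, while each small handler can carry its own docstring and unit test.<br />
<br />
```python
class SketchParser:
    """Toy parser: extract() only dispatches; each handler parses one kind of line."""

    def __init__(self):
        self.data = {}
        # Map a trigger substring to the small method that handles it.
        self.handlers = [
            ("Total energy", self.parse_energy),
            ("Number of atoms", self.parse_natom),
        ]

    def extract(self, line):
        """Pass the line to the first handler whose trigger matches."""
        for trigger, handler in self.handlers:
            if trigger in line:
                handler(line)
                return

    def parse_energy(self, line):
        """Append the last token of the line to the list of energies."""
        self.data.setdefault("energies", []).append(float(line.split()[-1]))

    def parse_natom(self, line):
        """Store the last token of the line as the atom count."""
        self.data["natom"] = int(line.split()[-1])

    def parse(self, lines):
        for line in lines:
            self.extract(line)
        return self.data


# Made-up log lines; real program output looks different.
log = ["Number of atoms: 3", "Total energy = -76.4", "unrelated line"]
result = SketchParser().parse(log)
print(result)
```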
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Discovering computational chemistry content online ===<br />
<br />
'''Brief explanation:''' There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!<br />
<br />
'''Expected results:''' Build a crawler that identifies and indexes computational chemistry content online, and provides the ability to extract data with cclib.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry and web indexing.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Machine learning applied to parsing computational chemistry output ===<br />
<br />
'''Brief explanation:''' Can we teach a machine to parse computational chemistry logfiles at least as well as cclib already does? Which machine learning approach would be most appropriate here? Is it useful to include prior (chemical) knowledge or soft constraints to guide parser learning?<br />
<br />
'''Expected results:''' Identify and implement a machine learning pipeline that attempts to reproduce or complement cclib's various parsers.<br />
<br />
'''Prerequisites:''' Experience with Python, machine learning, and ideally familiarity with computational chemistry.<br />
<br />
'''Mentor:''' Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry store either no coordinates (SMILES, InChI, FASTA sequence) or only 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally in a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.<br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.<br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. A stretch goal could be to complete the MMFF94 implementation described by Paolo Tosco at the 2016 RDKit UGM (https://github.com/rdkit/UGM_2016/blob/master/Presentations/PaoloTosco_OpenMM_RDKit_integration.pdf).<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' t.b.d.<br />
<br />
== MSDK Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass Spectrometry Development Kit ([http://msdk.github.io MSDK]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools.<br />
<br />
== NWChem Project Ideas ==<br />
NWChem is a widely used open-source computational chemistry software package ([http://nwchem-sw.org]) that tackles a wide variety of scientific problems.<br />
<br />
=== Project NWChem-JSON ===<br />
<br />
'''Brief explanation:''' Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics. <br />
<br />
'''Expected results:''' Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.<br />
<br />
'''Prerequisites:''' Experience with Fortran90 and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== Project NWChem-Python-Jupyter Interface ===<br />
<br />
'''Brief explanation:''' Exposing and binding NWChem data structures and computational APIs to Python and utilizing them in Jupyter notebooks<br />
<br />
'''Expected results:''' NWChem currently has a very limited interface with Python, but more and more developers are using platforms such as Python to sandbox new theories, methods, and algorithms. A full Python interface needs to be developed for the NWChem software suite. In addition, the extended Python interface could be integrated into a Jupyter notebook.<br />
<br />
'''Prerequisites:''' Experience with Fortran and Python<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
=== JSON-LD for Chemical Data ===<br />
<br />
'''Brief explanation:''' Transforming NWChem and Chemical JSON formats to JSON-LD<br />
<br />
'''Expected results:''' Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.<br />
<br />
'''Prerequisites:''' Experience with Python and JSON-LD<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Utilizing Virtual Reality in Chemistry Visualization and Modeling ===<br />
<br />
'''Brief explanation:''' Develop a VR application or library that can be used to visualize molecular structures and possibly manipulate them.<br />
<br />
'''Expected results:''' A VR application or library that can be integrated in one of the apps above, focused on molecular structure modeling. The target is both scientific applications and an educational component. If time permits, development of an interface that allows users to manipulate the structures and get a realtime response (using fast molecular force-fields to compute responses) would be a stretch outcome.<br />
<br />
'''Prerequisites:''' Python, C++, VR SDK experience would be nice.<br />
<br />
'''Mentor:''' Bert de Jong (wadejong at lbl dot gov) <br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, including as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL: http://www.kieber-emmons.com/Lumo/) Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
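<br />
To make the "Gaussian spheres" idea concrete, here is a minimal CPU sketch in Python, assuming one common form of Gaussian surface in which each atom contributes exp(-width * (d^2 / R^2 - 1)) and the surface is the isovalue 1. The function name, the width parameter, and the sample atom are illustrative assumptions, not code from Avogadro or VMD. Each grid point is evaluated independently, which is why this kind of loop maps naturally onto one GPU work-item per point.<br />

```python
import math


def gaussian_density(point, atoms, width=1.0):
    """Sum one Gaussian per atom at a single grid point.

    atoms is a list of (x, y, z, radius) tuples; the value equals 1.0
    exactly on the sphere of an isolated atom.
    """
    total = 0.0
    for ax, ay, az, radius in atoms:
        d2 = (point[0] - ax) ** 2 + (point[1] - ay) ** 2 + (point[2] - az) ** 2
        total += math.exp(-width * (d2 / radius**2 - 1.0))
    return total


# Hypothetical single atom of radius 1.5 at the origin.
atoms = [(0.0, 0.0, 0.0, 1.5)]

# Sample along the x axis in 0.5 steps; every point is independent of the others.
profile = [gaussian_density((0.5 * i, 0.0, 0.0), atoms) for i in range(8)]
on_surface = gaussian_density((1.5, 0.0, 0.0), atoms)
```

Points with a value above 1 lie inside the approximate surface and points below 1 lie outside, so an isosurface extractor can be run on the resulting grid.<br />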
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu) and Adam Tenderholt (atenderholt at gmail dot com)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
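<br />
The facilitator role described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the class name, message shapes, and client identifiers are hypothetical, and a real implementation would sit behind a network transport. The point is that a rotation travels as a small state change rather than as a full screen update.<br />

```python
import json


class Facilitator:
    """Keeps one authoritative view state and relays changes to the clients."""

    def __init__(self):
        self.state = {"rotation": [0.0, 0.0, 0.0], "zoom": 1.0}
        self.clients = {}  # client id -> callable that accepts a JSON string

    def join(self, client_id, send):
        """Register a client and bring it up to date with the full state."""
        self.clients[client_id] = send
        send(json.dumps({"type": "full_state", "state": self.state}))

    def update(self, sender_id, delta):
        """Apply a small change and forward only that change to other clients."""
        self.state.update(delta)
        message = json.dumps({"type": "delta", "delta": delta})
        for client_id, send in self.clients.items():
            if client_id != sender_id:
                send(message)


# Two hypothetical clients whose "network" is just a list of received messages.
inbox_a, inbox_b = [], []
hub = Facilitator()
hub.join("a", inbox_a.append)
hub.join("b", inbox_b.append)
hub.update("a", {"rotation": [0.0, 90.0, 0.0]})
```

After the update, client "b" has received a message containing only the new viewing angles, while client "a" is not sent its own change back.<br />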
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=562GSoC Ideas 20172017-02-06T11:58:28Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
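<br />
Loading time steps on demand can be illustrated with a generator that reads one frame of a multi-frame XYZ file at a time. The sketch below is in Python for brevity (the project itself targets C++), and the function name and sample data are assumptions made for illustration; it also assumes well-formed input with no blank lines between frames.<br />

```python
import io


def iter_xyz_frames(handle):
    """Yield (comment, atoms) one frame at a time from a multi-frame XYZ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        n_atoms = int(header)
        comment = handle.readline().rstrip("\n")
        atoms = []
        for _ in range(n_atoms):
            symbol, x, y, z = handle.readline().split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        yield comment, atoms


# A tiny two-frame trajectory held in memory instead of on disk.
trajectory = io.StringIO(
    "2\nstep 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "2\nstep 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
frames = iter_xyz_frames(trajectory)
first_comment, first_atoms = next(frames)  # only the first frame has been read
```

Only the frames actually requested are parsed, so memory use stays proportional to one frame rather than to the whole file.<br />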
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability.<br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.<br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.<br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Google Cardboard for 3Dmol.js ===<br />
<br />
'''Brief explanation:''' Implement low cost virtual reality visualization using Google Cardboard<br />
<br />
'''Expected results:''' [https://en.wikipedia.org/wiki/Google_Cardboard Google Cardboard] is a VR experience using commodity smartphones and either a paperboard/cardboard mount or an inexpensive pre-made mount. The project would produce an implementation using the Cardboard SDK for 3Dmol.js, allowing both individual VR use and synchronized classroom use (e.g., one "guide" and multiple synchronized viewers).<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
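<br />
As one possible reading of "delta compression", the sketch below quantizes coordinates to integers and stores each later frame as the difference from the previous one. It is written in Python for illustration (a browser implementation would be JavaScript), and the function names, the flat coordinate layout, and the scale factor are assumptions. It expects at least one frame.<br />

```python
def delta_encode(frames, scale=1000):
    """Store the first frame as integers and later frames as integer differences.

    frames is a list of flat coordinate lists; scale=1000 keeps three decimals.
    """
    quantized = [[round(v * scale) for v in frame] for frame in frames]
    encoded = [quantized[0]]
    for prev, cur in zip(quantized, quantized[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded


def delta_decode(encoded, scale=1000):
    """Rebuild the frames by accumulating the stored differences."""
    current = list(encoded[0])
    frames = [[v / scale for v in current]]
    for delta in encoded[1:]:
        current = [c + d for c, d in zip(current, delta)]
        frames.append([v / scale for v in current])
    return frames


# Two hypothetical frames for a single particle: x, y, z per frame.
coords = [[0.0, 0.0, 0.74], [0.0, 0.001, 0.745]]
encoded = delta_encode(coords)
decoded = delta_decode(encoded)
```

Between consecutive time steps most particles move very little, so the differences are small integers that a general-purpose compressor can pack far more tightly than raw floating-point text. Whether this actually improves loading and rendering time is exactly what the project would need to measure.<br />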
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming; some experience with OpenGL/WebGL and an MD code is ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu).<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. A stretch goal could be to complete the MMFF94 implementation described by Paolo Tosco at the 2016 RDKit UGM (https://github.com/rdkit/UGM_2016/blob/master/Presentations/PaoloTosco_OpenMM_RDKit_integration.pdf).<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' John Chodera (john.chodera at choderalab dot org)<br />
<br />
=== Project: RDKit - 3Dmol.js Integration ===<br />
<br />
'''Brief explanation:''' 3Dmol.js (http://3dmol.csb.pitt.edu/) is a JavaScript library for visualizing molecular data. The goal of this project is to enable ligand modifications of a protein-ligand complex (http://www.nature.com/nprot/journal/v11/n5/fig_tab/nprot.2016.051_F1.html)<br />
<br />
'''Expected results:''' Python functionality allowing RDKit molecules to be sent to the force fields available in the RDKit to perform ligand modification and energy minimisation inside the binding pocket. Integration of this with a Jupyter-notebook UI based on 3Dmol.js<br />
<br />
'''Prerequisites:''' Python and JavaScript<br />
<br />
'''Mentor:''' Paul Czodrowski (paul.czodrowski at merckgroup dot com)<br />
<br />
== MSDK Project Ideas ==<br />
<br />
Mass spectrometry is an analytical technique that measures the mass of small molecules with high precision. The data coming from mass spectrometry instruments is complex and multi-dimensional. Mass spectrometry development kit (MSDK [http://msdk.github.io]) is a Java library of algorithms for processing such mass spectrometry data. The goals of the library are to provide a flexible data model with Java interfaces for mass-spectrometry related objects (including raw spectra, processed data sets, identification results etc.) and to integrate the existing algorithms that are currently scattered around various Java-based graphical tools.<br />
<br />
=== Project MSDK-IO ===<br />
'''Brief explanation:''' Implementation of fast readers for open XML-based mass spectrometry file formats (mzML, mzXML) and vendor-specific file formats via vendor libraries.<br />
<br />
'''Expected results:''' Unified API for reading mass spectrometry data. Hand-written native Java parsers for mzML and mzXML formats with support for linear and random reads. Support for writing mzML files and caching (performance optimization). Vendor-specific file support via either JNI or custom C programs for Java / native DLLs interop.<br />
<br />
'''Prerequisites:''' Java, C/C++ (for vendor libraries)<br />
<br />
'''Mentor:''' Tomas Pluskal (plusik at gmail dot com) and Dmitry Avtonomov (dmitriy dot avtonomov at gmail dot com)<br />
<br />
=== Project LC/MS Feature Detection ===<br />
'''Brief explanation:''' Develop an intelligent LC/MS (liquid chromatography mass spectrometry) feature detection algorithm that requires zero parameterization from the user; all the parameters that the algorithm might require should be estimated from the data itself.<br />
<br />
'''Expected results:''' A Java implementation (in the context of MSDK API) of a feature detection algorithm that requires no parameters. Visual validation of detection results (BatMass, TOPP View, MZmine etc.). Validation of the algorithm on data from various instruments and settings (TOFs, Orbitraps, Ion Traps, low resolution, high resolution).<br />
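<br />
One small piece of "estimating parameters from the data itself" is deriving a noise cutoff from the signal rather than asking the user for one. The sketch below, in Python for brevity (the project itself targets Java and the MSDK API), uses the median plus three scaled median absolute deviations as the cutoff. This is an illustrative choice rather than the algorithm the project calls for, and the function names and sample trace are hypothetical.<br />

```python
import statistics


def estimate_noise_cutoff(intensities):
    """Estimate a noise cutoff from the signal: median + 3 * scaled MAD."""
    med = statistics.median(intensities)
    mad = statistics.median(abs(v - med) for v in intensities)
    return med + 3 * 1.4826 * mad  # 1.4826 scales MAD to a Gaussian sigma


def find_peaks(intensities):
    """Return indices of local maxima that rise above the estimated cutoff."""
    cutoff = estimate_noise_cutoff(intensities)
    return [
        i
        for i in range(1, len(intensities) - 1)
        if intensities[i] > cutoff
        and intensities[i] >= intensities[i - 1]
        and intensities[i] > intensities[i + 1]
    ]


# Hypothetical chromatogram trace: flat baseline with one elution peak.
chromatogram = [10, 12, 11, 9, 10, 80, 150, 90, 11, 10, 12, 9, 11]
peaks = find_peaks(chromatogram)
```

The median-based estimate is barely affected by the peak itself, so the same code gives a sensible cutoff for traces with very different baselines. Real LC/MS data would also need peak-width and mass-tolerance estimates, which this sketch does not attempt.<br />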
<br />
'''Prerequisites:''' Java, exposure to signal processing is a plus<br />
<br />
'''Mentor:''' Tomas Pluskal (plusik at gmail dot com) and Dmitry Avtonomov (dmitriy dot avtonomov at gmail dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and easily serves up an easy-to-use web interface and web API (REST). The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
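<br />
The "scan a set of directories and continually add new files" step might look roughly like the following Python sketch. The function name, the file extensions, and the in-memory set of known paths are assumptions for illustration; a real service would persist that set in its database and hand each new path to cclib for parsing.<br />

```python
import os
import tempfile


def scan_for_new_files(roots, seen, extensions=(".out", ".log")):
    """Return paths under roots that are not yet in seen, and record them."""
    new_files = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if name.endswith(extensions) and path not in seen:
                    seen.add(path)
                    new_files.append(path)
    return new_files


# Demonstrate with a temporary directory standing in for a results folder.
with tempfile.TemporaryDirectory() as root:
    seen = set()
    open(os.path.join(root, "water.out"), "w").close()
    first = scan_for_new_files([root], seen)   # finds water.out
    second = scan_for_new_files([root], seen)  # nothing new on a repeat scan
    open(os.path.join(root, "notes.txt"), "w").close()
    open(os.path.join(root, "benzene.log"), "w").close()
    third = scan_for_new_files([root], seen)   # finds only benzene.log
```

Calling the function on a timer (or from a file-system watcher) gives the "continually add" behaviour without reprocessing files that are already in the repository.<br />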
<br />
'''Prerequisites:''' Python and JavaScript and experience in server software, e.g. Python and Flask.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, including molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=552GSoC Ideas 20172017-02-03T13:11:41Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
<img src="http://i.imgur.com/119Chwlm.jpg"/><br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (e.g., ribbons and cartoons), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3Dmol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
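<br />
The on-demand loading described above can be illustrated with a minimal sketch. It is written in Python for brevity (the project itself targets C++), it assumes a plain multi-frame XYZ file, and the function names <code>index_xyz_frames</code> and <code>read_frame</code> are hypothetical rather than part of Avogadro 2. One cheap pass records where each frame starts; individual frames are then parsed only when requested.<br />
<br />
```python
import io


def index_xyz_frames(stream):
    """Return the byte offset of every frame in a multi-frame XYZ stream.

    The stream must be opened in binary mode so that tell()/seek() are exact.
    """
    offsets = []
    while True:
        start = stream.tell()
        header = stream.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        n_atoms = int(header.decode())
        # Skip the comment line plus one line per atom without parsing them.
        for _ in range(n_atoms + 1):
            stream.readline()
        offsets.append(start)
    return offsets


def read_frame(stream, offset):
    """Seek to one frame and parse only that frame."""
    stream.seek(offset)
    n_atoms = int(stream.readline().decode())
    comment = stream.readline().decode().strip()
    atoms = []
    for _ in range(n_atoms):
        symbol, x, y, z = stream.readline().decode().split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return comment, atoms


# Two-frame demonstration trajectory held in memory (illustrative data only).
demo = io.BytesIO(
    b"2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    b"2\nframe 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.80\n"
)
offsets = index_xyz_frames(demo)
comment, atoms = read_frame(demo, offsets[1])  # parse only the second frame
```
<br />
Only the offset list stays in memory, so the cost of opening a large trajectory is one sequential scan rather than a full parse.<br />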
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
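<br />
One possible shape for such a refactoring is sketched below. It is not cclib's actual class layout: the class <code>SketchParser</code>, the trigger strings, and the handler names are all hypothetical. The idea is that extract() only dispatches, while each small handler has its own docstring that can feed the documentation.<br />
<br />
```python
class SketchParser:
    """Hypothetical parser illustrating dispatch; not cclib's real API."""

    def __init__(self):
        self.data = {}
        # Map a trigger substring to the small method that handles it.
        self.handlers = [
            ("Number of atoms", self.parse_natom),
            ("SCF energy", self.parse_scf_energy),
        ]

    def extract(self, line):
        """Dispatch one line of output to the first matching handler."""
        for trigger, handler in self.handlers:
            if trigger in line:
                handler(line)
                return

    def parse_natom(self, line):
        """Parse 'Number of atoms: N' into data['natom']."""
        self.data["natom"] = int(line.split(":")[1])

    def parse_scf_energy(self, line):
        """Parse 'SCF energy: E' and append E to data['scfenergies']."""
        energy = float(line.split(":")[1])
        self.data.setdefault("scfenergies", []).append(energy)

    def parse(self, lines):
        """Feed every line through extract() and return the collected data."""
        for line in lines:
            self.extract(line)
        return self.data


# Made-up output lines, used only to exercise the dispatch.
result = SketchParser().parse([
    "Number of atoms: 3",
    "unrelated line",
    "SCF energy: -76.026",
])
```
<br />
Each handler can be unit-tested with a single line of input, which fits the requirement to secure test coverage before refactoring.<br />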
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of the MMTF file format in Open Babel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the Open Babel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel generates the 3D coordinates of molecules atom by atom using a rule-based approach (i.e., expected geometries). Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
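<br />
As a rough illustration of the delta compression mentioned above, the sketch below quantizes coordinates to a fixed precision and stores only the frame-to-frame integer differences. It is written in Python rather than JavaScript for brevity, the function names and the 0.001 precision are assumptions, and it is not an existing 3Dmol.js feature.<br />
<br />
```python
def delta_encode(frames, precision=0.001):
    """Quantize coordinates to integers and store frame-to-frame differences.

    Each frame is a flat list of coordinates; the first frame is stored in
    full and every later frame as the change from its predecessor.
    """
    if not frames:
        return []
    quantized = [[round(x / precision) for x in frame] for frame in frames]
    encoded = [quantized[0]]
    for prev, cur in zip(quantized, quantized[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded


def delta_decode(encoded, precision=0.001):
    """Rebuild the (quantized) coordinates by accumulating the differences."""
    frames = []
    current = None
    for index, frame in enumerate(encoded):
        if index == 0:
            current = list(frame)
        else:
            current = [c + d for c, d in zip(current, frame)]
        frames.append([q * precision for q in current])
    return frames


# Three tiny illustrative frames with one atom each (x, y, z).
frames = [[0.0, 0.0, 0.74], [0.0, 0.001, 0.742], [0.001, 0.001, 0.745]]
encoded = delta_encode(frames)
decoded = delta_decode(encoded)
```
<br />
Because atoms move little between time steps, the differences are small integers, which a general-purpose compressor or a variable-length integer encoding can then pack tightly. The scheme is lossy up to the chosen precision.<br />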
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL and an MD code ideally, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu).<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.<br />
<br />
'''Expected results:''' C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. A stretch goal could be to complete the MMFF94 implementation described by Paolo Tosco at the 2016 RDKit UGM (https://github.com/rdkit/UGM_2016/blob/master/Presentations/PaoloTosco_OpenMM_RDKit_integration.pdf).<br />
<br />
'''Prerequisites:''' C++ and some Python<br />
<br />
'''Mentor:''' John Chodera (john.chodera at choderalab dot org)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and serves up an easy-to-use web interface and web API (REST). The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
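<br />
The directory-scanning requirement can be sketched with the standard library alone. The function name <code>scan_new_files</code> and the default extensions are assumptions made for illustration; in a real deployment each newly found path would presumably be handed to cclib for parsing and then stored.<br />
<br />
```python
import os
import tempfile


def scan_new_files(root, seen, extensions=(".log", ".out")):
    """Return paths under root with a matching extension not seen before.

    `seen` is a set that persists between scans, so repeated calls only
    report files added since the previous call.
    """
    new_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name.lower().endswith(extensions) and path not in seen:
                seen.add(path)
                new_paths.append(path)
    return new_paths


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "water.log"), "w").close()
    open(os.path.join(root, "notes.txt"), "w").close()
    seen = set()
    first_names = [os.path.basename(p) for p in scan_new_files(root, seen)]
    open(os.path.join(root, "benzene.out"), "w").close()
    second_names = [os.path.basename(p) for p in scan_new_files(root, seen)]
```
<br />
Calling the function on a timer gives the "continually add new files" behaviour; a file-system notification library would be an alternative design with lower latency.<br />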
<br />
'''Prerequisites:''' Python and JavaScript and experience in server software, e.g. Python and Flask.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: GPU and Multi-Core Enabled High Performance Force Field Calculations ===<br />
<br />
'''Brief explanation:''' Add integrated molecular mechanics force field simulations in Avogadro 2<br />
<br />
'''Expected results:''' Currently, Avogadro 2 relies on command-line calls to Open Babel to optimize geometries or perform conformer searching. The Open Babel code supports multiple force fields, but has poor performance. A modern implementation of a force field library would be welcome, including OpenMP and/or OpenCL support for highly parallel calculations. The architecture should support constrained geometry optimizations and multiple optimization techniques (i.e., steepest descent, conjugate gradients, quasi-Newton like L-BFGS) and be modular enough to allow new force field implementations as plugins.<br />
<br />
Ideally the code would be implemented in a new library so it can be used by Avogadro, Open Babel, and other codes<br />
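<br />
To make the optimization loop concrete, here is a minimal steepest-descent sketch for a single harmonic bond. It is in Python for brevity (the project itself calls for C++ with OpenMP or OpenCL), the force constant, rest length, and step size are arbitrary illustrative values, and a real force field would sum many such terms and use a line search.<br />
<br />
```python
import math


def bond_energy_and_gradient(a, b, k=1.0, r0=1.0):
    """Harmonic bond E = k/2 (r - r0)^2 and its gradient on both atoms.

    Assumes the two atoms do not coincide (r > 0).
    """
    d = [ai - bi for ai, bi in zip(a, b)]
    r = math.sqrt(sum(x * x for x in d))
    energy = 0.5 * k * (r - r0) ** 2
    scale = k * (r - r0) / r
    grad_a = [scale * x for x in d]
    grad_b = [-g for g in grad_a]
    return energy, grad_a, grad_b


def steepest_descent(a, b, step=0.1, tolerance=1e-8, max_steps=1000):
    """Move both atoms against the gradient until the energy is below tolerance."""
    a, b = list(a), list(b)
    energy, grad_a, grad_b = bond_energy_and_gradient(a, b)
    for _ in range(max_steps):
        if energy < tolerance:
            break
        a = [x - step * g for x, g in zip(a, grad_a)]
        b = [x - step * g for x, g in zip(b, grad_b)]
        energy, grad_a, grad_b = bond_energy_and_gradient(a, b)
    return a, b, energy


# Start with a stretched bond (1.5 length units) and relax it toward r0 = 1.0.
a_final, b_final, e_final = steepest_descent((0.0, 0.0, 0.0), (0.0, 0.0, 1.5))
final_distance = math.dist(a_final, b_final)
```
<br />
Keeping the energy/gradient function separate from the optimizer is what would let other optimizers (conjugate gradients, L-BFGS) and other force field terms be swapped in as plugins.<br />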
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenMP or OpenCL ideally.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu).<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
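<br />
The contrast between sharing state and sharing pixels can be illustrated with a small sketch of a facilitator that relays view updates. Everything here is hypothetical: the message fields, the class <code>Facilitator</code>, and the client names are invented for illustration and do not describe a defined OneMol API.<br />
<br />
```python
import json


def make_view_update(client_id, rotation_quaternion, zoom):
    """Serialize a hypothetical view-state change as a small JSON message."""
    return json.dumps({
        "type": "view",
        "client": client_id,
        "rotation": list(rotation_quaternion),
        "zoom": zoom,
    })


class Facilitator:
    """Keeps the shared view state and relays updates to the other clients."""

    def __init__(self):
        self.view = None
        self.outboxes = {}  # client id -> list of messages waiting to be sent

    def register(self, client_id):
        self.outboxes[client_id] = []

    def receive(self, message):
        update = json.loads(message)
        self.view = update  # the latest update defines the consistent state
        for client_id, outbox in self.outboxes.items():
            if client_id != update["client"]:
                outbox.append(message)


facilitator = Facilitator()
facilitator.register("avogadro-1")
facilitator.register("3dmol-1")
message = make_view_update("avogadro-1", (0.0, 0.7071, 0.0, 0.7071), 1.5)
facilitator.receive(message)

message_bytes = len(message.encode())
full_frame_bytes = 1920 * 1080 * 3  # one uncompressed 24-bit 1080p frame
```
<br />
The rotation travels as a message of roughly a hundred bytes, whereas a screen-sharing approach would have to transmit some encoding of a frame that is several megabytes uncompressed.<br />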
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with Eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
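<br />
For readers unfamiliar with the eigenvalue problem at the heart of an extended Hueckel calculation, the sketch below solves the smallest possible case: two identical orbitals with overlap S, where the generalized problem HC = SCE has a closed-form solution. It is a Python illustration, not YAeHMOP code; the Wolfsberg-Helmholz form with K = 1.75 is the textbook convention and is assumed here, and the numerical inputs are illustrative only.<br />
<br />
```python
def wolfsberg_helmholz(h_ii, h_jj, s_ij, k=1.75):
    """Off-diagonal Hamiltonian element H_ij = K * S_ij * (H_ii + H_jj) / 2."""
    return k * s_ij * (h_ii + h_jj) / 2.0


def two_orbital_levels(h_ii, s_ij, k=1.75):
    """Energies for two identical orbitals with overlap 0 <= s_ij < 1.

    Solving det(H - E S) = 0 for this 2x2 case gives
    E = (H_ii + H_ij) / (1 + S) and E = (H_ii - H_ij) / (1 - S).
    """
    h_ij = wolfsberg_helmholz(h_ii, h_ii, s_ij, k)
    bonding = (h_ii + h_ij) / (1.0 + s_ij)
    antibonding = (h_ii - h_ij) / (1.0 - s_ij)
    return bonding, antibonding


# Illustrative inputs: an orbital energy of -13.6 eV and an assumed overlap of 0.64.
bonding, antibonding = two_orbital_levels(h_ii=-13.6, s_ij=0.64)
```
<br />
For larger systems there is no closed form, which is why the library needs a general symmetric generalized eigenvalue solver such as the one the project proposes to take from Eigen.<br />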
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=551GSoC Ideas 20172017-02-03T13:01:27Z<p>Greg.landrum: /* Miscellaneous Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
<img src="http://i.imgur.com/119Chwlm.jpg"/><br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (e.g., ribbons and cartoons), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3Dmol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability.<br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
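<br />
One possible shape for such a refactoring is sketched below: a short <code>extract()</code> loop dispatches each line to a small, documented handler method. The class <code>ToyParser</code>, its trigger strings, and the log format are invented for illustration and do not reflect cclib's actual parser layout; only the attribute labels <code>natom</code> and <code>scfenergies</code> are borrowed from cclib's naming.<br />
<br />
```python
class ToyParser:
    """Hypothetical parser; not cclib's actual class layout."""

    def __init__(self):
        self.data = {}
        # Map a trigger substring to the small method that handles it.
        self.handlers = [
            ("SCF energy", self.parse_scf_energy),
            ("Number of atoms", self.parse_natom),
        ]

    def extract(self, lines):
        """Dispatch each line to the first handler whose trigger matches."""
        for line in lines:
            for trigger, handler in self.handlers:
                if trigger in line:
                    handler(line)
                    break
        return self.data

    def parse_scf_energy(self, line):
        """Append an SCF energy (units as printed in the made-up log)."""
        self.data.setdefault("scfenergies", []).append(float(line.split()[-1]))

    def parse_natom(self, line):
        """Store the atom count."""
        self.data["natom"] = int(line.split()[-1])


log = ["Number of atoms: 3", "SCF energy = -76.02", "SCF energy = -76.03"]
parsed = ToyParser().extract(log)
```
<br />
Because each handler carries its own docstring, the handler table could also serve as a source for generated documentation.<br />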
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub requesting new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.<br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.<br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
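<br />
To make the delta-compression idea concrete, the sketch below quantizes coordinates to integers and stores each frame as its difference from the previous one; small per-step motions then become small integers that a general-purpose compressor handles well. This is a Python illustration of the technique only, not 3Dmol.js code, and the <code>scale</code> of 1000 (three decimal places) is an assumption.<br />
<br />
```python
def delta_encode(frames, scale=1000):
    """Quantize coordinates to integers and store frame-to-frame differences."""
    quantized = [[round(v * scale) for v in frame] for frame in frames]
    encoded = [quantized[0]]  # the first frame is stored in full
    for prev, cur in zip(quantized, quantized[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded


def delta_decode(encoded, scale=1000):
    """Rebuild the quantized frames by accumulating the stored differences."""
    frames = []
    current = list(encoded[0])
    frames.append([v / scale for v in current])
    for deltas in encoded[1:]:
        current = [p + d for p, d in zip(current, deltas)]
        frames.append([v / scale for v in current])
    return frames


# Each frame is a flat list of coordinates (x, y, z of a single atom here).
frames = [[0.0, 0.0, 0.74], [0.0, 0.001, 0.745], [0.0, 0.002, 0.75]]
encoded = delta_encode(frames)
decoded = delta_decode(encoded)
```
<br />
The scheme is lossy only in the quantization step; the chosen scale bounds the error on every decoded coordinate.<br />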
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming; some experience with OpenGL/WebGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu).<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' <br />
<br />
'''Expected results:'''<br />
<br />
'''Prerequisites:''' <br />
<br />
'''Mentor:''' John Chodera (john.chodera at choderalab dot org)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and easily serves up an easy-to-use web interface and web API (REST). The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
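<br />
The directory-scanning requirement could start from something like the sketch below, which walks a set of root directories and reports only files it has not seen before. The function name <code>scan_for_new_files</code> and the extension filter are assumptions; in a real repository each new path would then be handed to a parser such as cclib and the <code>seen</code> set would be persisted in a database.<br />
<br />
```python
import os
import tempfile


def scan_for_new_files(roots, seen, extensions=(".log", ".out")):
    """Return files under roots that were not seen before; record them in seen."""
    new_files = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if name.endswith(extensions) and path not in seen:
                    seen.add(path)
                    new_files.append(path)
    return new_files


# Demonstration in a throwaway directory: the second scan reports only
# the file added after the first scan.
with tempfile.TemporaryDirectory() as repo_dir:
    seen = set()
    open(os.path.join(repo_dir, "water.log"), "w").close()
    first_pass = scan_for_new_files([repo_dir], seen)
    open(os.path.join(repo_dir, "benzene.out"), "w").close()
    second_pass = scan_for_new_files([repo_dir], seen)
    first_names = [os.path.basename(p) for p in first_pass]
    second_names = [os.path.basename(p) for p in second_pass]
```
<br />
Running the scan on a timer, or from a file-system watcher, would give the continual ingestion described above.<br />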
<br />
'''Prerequisites:''' Python and JavaScript and experience in server software, e.g. Python and Flask.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: GPU and Multi-Core Enabled High Performance Force Field Calculations ===<br />
<br />
'''Brief explanation:''' Add integrated molecular mechanics force field simulations in Avogadro 2<br />
<br />
'''Expected results:''' Currently, Avogadro 2 relies on command-line calls to Open Babel to optimize geometries or perform conformer searching. The Open Babel code supports multiple force fields, but has poor performance. A modern implementation of a force field library would be welcome, including OpenMP and/or OpenCL support for highly parallel calculations. The architecture should support constrained geometry optimizations and multiple optimization techniques (i.e., steepest descent, conjugate gradients, quasi-Newton like L-BFGS) and be modular enough to allow new force field implementations as plugins.<br />
<br />
Ideally the code would be implemented in a new library so it can be used by Avogadro, Open Babel, and other codes.<br />
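<br />
For orientation, the simplest of the optimizers named above, steepest descent, can be sketched on a single harmonic bond term. This is a toy written in Python for brevity rather than the C++ the project calls for; the force constant, equilibrium length, and fixed step size are arbitrary assumptions, and a real library would add line searches, many energy terms, and constraints.<br />
<br />
```python
import math


def bond_energy_and_gradient(coords, k=100.0, r0=1.0):
    """Harmonic bond between atoms 0 and 1: E = 0.5 * k * (r - r0)**2."""
    (x1, y1, z1), (x2, y2, z2) = coords
    d = (x2 - x1, y2 - y1, z2 - z1)
    r = math.sqrt(sum(c * c for c in d))
    energy = 0.5 * k * (r - r0) ** 2
    # dE/dr projected onto the bond vector; equal and opposite on each atom.
    g = [k * (r - r0) * c / r for c in d]
    gradient = [[-c for c in g], g]
    return energy, gradient


def steepest_descent(coords, step=0.004, max_iter=500, tol=1e-8):
    """Move every atom against its gradient until the gradient is tiny."""
    coords = [list(atom) for atom in coords]
    energy = None
    for _ in range(max_iter):
        energy, gradient = bond_energy_and_gradient(coords)
        if max(abs(c) for atom in gradient for c in atom) < tol:
            break
        for atom, grad in zip(coords, gradient):
            for i in range(3):
                atom[i] -= step * grad[i]
    return coords, energy


# Start from a stretched bond (1.5) and relax toward r0 = 1.0.
optimized, final_energy = steepest_descent([(0.0, 0.0, 0.0), (0.0, 0.0, 1.5)])
bond_length = math.dist(optimized[0], optimized[1])
```
<br />
Keeping the energy-and-gradient function separate from the optimizer loop is what would let new force fields and new optimization techniques be swapped in as plugins.<br />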
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu).<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: YAeHMOP as a library ===<br />
<br />
'''Brief explanation:''' YAeHMOP (https://github.com/greglandrum/yaehmop and http://yaehmop.sourceforge.net/) is an open-source package of tools for doing extended Hueckel calculations on molecules and crystals. The software was developed as a series of command line tools that expect to read and write from files. After some years of obscurity, YAeHMOP has recently attracted attention as a plugin for Avogadro. The goal of this project is to modernize aspects of YAeHMOP and make the core computational functionality accessible as a library.<br />
<br />
'''Expected results:''' A library that is callable from C/C++ allowing the construction of an input molecule/crystal, specification of computational parameters, execution of a calculation, and capture of results. Ideally this will include a modernization of the pieces of the code that still rely on f2c-translated Fortran77 code: the eigenvalue solver can likely be replaced with Eigen without too much effort; the functionality to calculate STO integrals will need to be re-written based on the original f77. The library should also include a robust set of regression tests. Stretch goals would be adding Python wrappers to the library and/or creating an RDKit plugin for it.<br />
<br />
'''Prerequisites:''' C/C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=550GSoC Ideas 20172017-02-03T12:49:29Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application; it is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
<img src="http://i.imgur.com/119Chwlm.jpg"/><br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
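<br />
The pair-wise distribution analysis mentioned above starts from a histogram of interatomic distances. The sketch below computes that unnormalized histogram for one frame; it is a Python illustration with a hypothetical function name, it ignores periodic boundaries, and it omits the density normalization that turns counts into g(r).<br />
<br />
```python
import math
from itertools import combinations


def pair_distance_histogram(coords, r_max, n_bins):
    """Count atom pairs per distance bin (unnormalized pair distribution)."""
    width = r_max / n_bins
    counts = [0] * n_bins
    for a, b in combinations(coords, 2):
        r = math.dist(a, b)
        if r < r_max:
            counts[int(r / width)] += 1
    return counts


# Three atoms on a line, 1.0 apart: two pairs at distance 1.0, one at 2.0.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
histogram = pair_distance_histogram(coords, r_max=3.0, n_bins=6)
```
<br />
The double loop scales quadratically with atom count, so systems near the million-particle range would need a cell list or similar spatial partitioning.<br />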
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability.<br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub requesting new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail.<br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling.<br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming; some experience with OpenGL/WebGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu).<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that they are accessible from Python and Java as well as from the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that they are accessible from Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' <br />
<br />
'''Expected results:'''<br />
<br />
'''Prerequisites:''' <br />
<br />
'''Mentor:''' John Chodera (john.chodera at choderalab dot org)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and serves the results through an easy-to-use web interface and a REST web API. The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
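<br />
The directory-scanning requirement could be approached along the following lines. This is only a minimal sketch using the Python standard library: the function name, the set of file extensions, and the in-memory <code>seen</code> set are assumptions made for illustration, and a real repository would hand each new path to cclib for parsing and record it in persistent storage.<br />

```python
import os
from typing import Iterable, List, Set, Tuple


def find_new_files(
    roots: Iterable[str],
    seen: Set[str],
    extensions: Tuple[str, ...] = (".out", ".log"),  # assumed extensions
) -> List[str]:
    """Return paths under the given roots that have not been seen before.

    Paths that are returned are added to `seen`, so calling the function
    repeatedly (e.g., from a timer) only reports files added in between.
    """
    new_files: List[str] = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if path not in seen and name.lower().endswith(extensions):
                    seen.add(path)
                    new_files.append(path)
    return new_files
```

Polling like this is simple and portable; a file-system notification mechanism would reduce latency at the cost of platform-specific code.<br />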
<br />
'''Prerequisites:''' Python and JavaScript and experience in server software, e.g. Python and Flask.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: GPU and Multi-Core Enabled High Performance Force Field Calculations ===<br />
<br />
'''Brief explanation:''' Add integrated molecular mechanics force field simulations in Avogadro 2<br />
<br />
'''Expected results:''' Currently, Avogadro 2 relies on command-line calls to Open Babel to optimize geometries or perform conformer searching. The Open Babel code supports multiple force fields, but has poor performance. A modern implementation of a force field library would be welcome, including OpenMP and/or OpenCL support for highly parallel calculations. The architecture should support constrained geometry optimizations and multiple optimization techniques (e.g., steepest descent, conjugate gradients, and quasi-Newton methods like L-BFGS) and be modular enough to allow new force field implementations as plugins.<br />
<br />
Ideally the code would be implemented in a new library so it can be used by Avogadro, Open Babel, and other codes.<br />
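<br />
To make the simplest of the optimization techniques named above concrete, here is a small steepest-descent sketch in Python (the project itself calls for C++). It minimizes a single harmonic bond-stretch term, E = ½·k·(x − x0)², in one dimension. The force constant, equilibrium length, and step size are arbitrary illustrative values, and none of the function names come from Avogadro or Open Babel.<br />

```python
from typing import Callable, Tuple


def harmonic_bond(x: float, k: float = 1.0, x0: float = 1.0) -> Tuple[float, float]:
    """Energy and gradient of a harmonic bond stretch at bond length x."""
    dx = x - x0
    return 0.5 * k * dx * dx, k * dx


def steepest_descent(
    x: float,
    energy_and_gradient: Callable[[float], Tuple[float, float]] = harmonic_bond,
    step: float = 0.1,
    tolerance: float = 1e-8,
    max_iterations: int = 10_000,
) -> float:
    """Repeatedly step against the gradient until it is nearly zero."""
    for _ in range(max_iterations):
        _energy, gradient = energy_and_gradient(x)
        if abs(gradient) < tolerance:
            break
        x -= step * gradient
    return x
```

Passing the energy function in as a callable hints at the plugin-style modularity requested above: the optimizer does not need to know which force field produced the gradient. Conjugate gradients and L-BFGS would replace the fixed-step update with better search directions.<br />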
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenMP or OpenCL ideally.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu).<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, including molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
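<br />
The efficiency argument above (sending a change in viewing angle instead of a full screen update) can be sketched as a small message exchange. The JSON layout, field names, and function names below are purely hypothetical and are not taken from any OneMol specification; the sketch only shows how little data a rotation needs when clients share state rather than pixels.<br />

```python
import json
from typing import Dict, Sequence


def make_view_update(client_id: str, rotation: Sequence[float]) -> str:
    """Serialize a rotation (as a quaternion) into a compact message."""
    return json.dumps(
        {"type": "view_update", "client": client_id, "rotation": list(rotation)}
    )


def apply_update(state: Dict, message: str) -> Dict:
    """Return a new viewer state with the update applied.

    Only the rotation changes; the molecular data held in the state
    is left untouched and never has to be re-sent.
    """
    update = json.loads(message)
    if update.get("type") == "view_update":
        return dict(state, rotation=update["rotation"])
    return state
```

In this picture the facilitator module would relay such messages to every connected client, and each client would apply them to its local copy of the viewer state.<br />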
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=549GSoC Ideas 20172017-02-03T12:48:56Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
<img src="http://i.imgur.com/119Chwlm.jpg"/><br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (e.g., ribbons and cartoons), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
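<br />
As one example of the particle-movement analysis mentioned above, the sketch below builds a histogram of pairwise distances for a single frame, which is the raw ingredient of a pair-wise distribution function. It is written in Python for brevity (Avogadro 2 itself is C++), ignores periodic boundary conditions, and omits the shell-volume and density normalization a full g(r) needs. The function name and parameters are illustrative assumptions.<br />

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pair_distance_histogram(
    positions: Sequence[Point], r_max: float, n_bins: int
) -> List[int]:
    """Count particle pairs per distance bin, for distances below r_max."""
    bin_width = r_max / n_bins
    counts = [0] * n_bins
    for a, b in combinations(positions, 2):
        r = math.dist(a, b)  # Euclidean distance (Python 3.8+)
        if r < r_max:
            # min() guards against floating-point rounding at the upper edge
            counts[min(int(r / bin_width), n_bins - 1)] += 1
    return counts
```

The double loop over all pairs scales quadratically with the number of particles, so for the million-particle systems mentioned above a cell list or similar spatial partitioning would be needed.<br />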
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub for new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or only 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally in a low-energy conformation. Currently, Open Babel mainly uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom; fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a fragment library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming; some experience with OpenGL/WebGL and an MD code is ideal, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu).<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that they are accessible from Python and Java as well as from the PostgreSQL cartridge.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - MMTF Integration ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in the RDKit. See the similar OpenBabel project for more details.<br />
<br />
'''Expected results:''' A C++ implementation of an MMTF reader and writer for the RDKit along with a robust set of test cases. Wrappers for the functions so that they are accessible from Python and Java.<br />
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' <br />
<br />
'''Expected results:'''<br />
<br />
'''Prerequisites:''' <br />
<br />
'''Mentor:''' John Chodera (john.chodera at choderalab dot org)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and serves the results through an easy-to-use web interface and a REST web API. The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
<br />
'''Prerequisites:''' Python and JavaScript and experience in server software, e.g. Python and Flask.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: GPU and Multi-Core Enabled High Performance Force Field Calculations ===<br />
<br />
'''Brief explanation:''' Add integrated molecular mechanics force field simulations in Avogadro 2<br />
<br />
'''Expected results:''' Currently, Avogadro 2 relies on command-line calls to Open Babel to optimize geometries or perform conformer searching. The Open Babel code supports multiple force fields, but has poor performance. A modern implementation of a force field library would be welcome, including OpenMP and/or OpenCL support for highly parallel calculations. The architecture should support constrained geometry optimizations and multiple optimization techniques (e.g., steepest descent, conjugate gradients, and quasi-Newton methods like L-BFGS) and be modular enough to allow new force field implementations as plugins.<br />
<br />
Ideally the code would be implemented in a new library so it can be used by Avogadro, Open Babel, and other codes.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenMP or OpenCL ideally.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu).<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, including molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=548GSoC Ideas 20172017-02-03T11:58:34Z<p>Greg.landrum: /* RDKit Project Ideas */</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to help guide our students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross platform software components in the Avogadro 2 libraries, along with an end-user application with full source code, and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
<img src="http://i.imgur.com/119Chwlm.jpg"/><br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
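<br />
As a rough illustration of the delta-compression idea mentioned above, the following Python sketch stores the first frame of a trajectory in full and every later frame as integer differences from its predecessor. The function names and the fixed-point scale of 1000 are assumptions made for this example; they are not part of Avogadro 2 or of any existing trajectory format.<br />
<br />
```python
from typing import List, Sequence

Frame = List[int]  # flattened x, y, z coordinates in fixed-point units


def quantize(coords: Sequence[float], scale: int = 1000) -> Frame:
    """Convert floating-point coordinates to integers (here 0.001 units)."""
    return [round(c * scale) for c in coords]


def delta_encode(frames: Sequence[Frame]) -> List[Frame]:
    """Keep the first frame as-is; store later frames as differences."""
    if not frames:
        return []
    encoded = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded


def delta_decode(encoded: Sequence[Frame]) -> List[Frame]:
    """Rebuild absolute frames by accumulating the stored differences."""
    frames: List[Frame] = []
    for i, row in enumerate(encoded):
        if i == 0:
            frames.append(list(row))
        else:
            frames.append([p + d for p, d in zip(frames[-1], row)])
    return frames
```
<br />
Because atoms usually move only a little between time steps, the differences tend to be small integers, which a general-purpose compressor can typically pack more tightly than absolute coordinates. Whether this actually improves reading or rendering performance is the open question the project would investigate.<br />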
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
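<br />
One possible shape for such a refactoring, offered only as a starting point for discussion, is a table of small handler functions that a short extract() loop dispatches to. The sketch below is plain Python and does not use cclib; the log-file format, the handler names, and the HANDLERS table are invented for the example.<br />
<br />
```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Handler = Callable[[str, Iterator[str], Dict[str, Any]], None]


def parse_natom(line: str, lines: Iterator[str], data: Dict[str, Any]) -> None:
    """Handle a line such as 'Number of atoms: 3' (made-up format)."""
    data["natom"] = int(line.split(":")[1])


def parse_energy(line: str, lines: Iterator[str], data: Dict[str, Any]) -> None:
    """Handle a line such as 'SCF energy: -76.02' (made-up format)."""
    data.setdefault("scfenergies", []).append(float(line.split(":")[1]))


# Each entry pairs a trigger substring with the small function that handles it.
HANDLERS: List[Tuple[str, Handler]] = [
    ("Number of atoms:", parse_natom),
    ("SCF energy:", parse_energy),
]


def extract(text: str) -> Dict[str, Any]:
    """Dispatch every line to the first handler whose trigger matches."""
    data: Dict[str, Any] = {}
    lines = iter(text.splitlines())
    for line in lines:
        for trigger, handler in HANDLERS:
            if trigger in line:
                # Handlers receive the iterator so they can consume
                # follow-up lines of a multi-line block if needed.
                handler(line, lines, data)
                break
    return data
```
<br />
Each handler can then carry its own docstring and unit test, which is one way the documentation could be kept in step with the code.<br />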
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub requesting new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL and an MD code ideally, but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu).<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the functions so that it is accessible from within the Python and Java wrappers as well as the PostgreSQL cartridge.<br />
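<br />
To make the "simpler, yet more flexible" idea concrete, here is a toy Python sketch in which one fingerprint() function accepts any number of pluggable feature enumerators and folds their hashed output into a fixed number of bits. It does not use the RDKit: the dictionary-based molecule, the two enumerators, and the choice of CRC32 as the hash are all assumptions made for illustration, and the real design would be in C++.<br />
<br />
```python
import zlib
from typing import Callable, Dict, Iterable, Iterator, Set

Molecule = Dict[str, list]  # toy stand-in: {"atoms": [...], "bonds": [(i, j), ...]}
FeatureFn = Callable[[Molecule], Iterable[str]]


def atom_features(mol: Molecule) -> Iterator[str]:
    """Yield one feature string per atom (element symbol only)."""
    for symbol in mol["atoms"]:
        yield f"atom:{symbol}"


def bond_features(mol: Molecule) -> Iterator[str]:
    """Yield one feature string per bond, independent of atom order."""
    for i, j in mol["bonds"]:
        a, b = sorted((mol["atoms"][i], mol["atoms"][j]))
        yield f"bond:{a}-{b}"


def fingerprint(mol: Molecule, feature_fns: Iterable[FeatureFn],
                n_bits: int = 2048) -> Set[int]:
    """Hash every feature from every enumerator and fold it into n_bits positions."""
    bits: Set[int] = set()
    for fn in feature_fns:
        for feature in fn(mol):
            # crc32 is used because it is stable across runs, unlike built-in hash().
            bits.add(zlib.crc32(feature.encode("utf-8")) % n_bits)
    return bits
```
<br />
A new fingerprint type then only requires a new enumerator, while hashing, folding, and output handling stay in one place.<br />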
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: MMTF integration into the RDKit ===<br />
<br />
'''Brief explanation:''' <br />
<br />
'''Expected results:'''<br />
<br />
'''Prerequisites:''' <br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
=== Project: RDKit - OpenMM Integration ===<br />
<br />
'''Brief explanation:''' <br />
<br />
'''Expected results:'''<br />
<br />
'''Prerequisites:''' <br />
<br />
'''Mentor:''' John Chodera (john.chodera at choderalab dot org)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and easily serves up an easy-to-use web interface and web API (REST). The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
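<br />
The directory-scanning step could look roughly like the Python sketch below, which walks a set of root directories and reports only files it has not seen before. The function name, the extension list, and the in-memory seen set are assumptions for this example; a real repository would persist that state and hand each new file to cclib for parsing, which is outside the sketch.<br />
<br />
```python
import os
from typing import Iterable, List, Set, Tuple


def find_new_files(roots: Iterable[str], seen: Set[str],
                   extensions: Tuple[str, ...] = (".log", ".out")) -> List[str]:
    """Return paths under the given roots that match an extension and are not in seen.

    Every returned path is also added to seen, so a later call reports
    only files that appeared in the meantime.
    """
    new_files: List[str] = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if name.lower().endswith(extensions) and path not in seen:
                    seen.add(path)
                    new_files.append(path)
    return new_files
```
<br />
Calling this function on a timer is the simplest way to "continually add" files; a file-system notification mechanism would be an alternative design.<br />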
<br />
'''Prerequisites:''' Python and JavaScript and experience in server software, e.g. Python and Flask.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: GPU and Multi-Core Enabled High Performance Force Field Calculations ===<br />
<br />
'''Brief explanation:''' Add integrated molecular mechanics force field simulations in Avogadro 2<br />
<br />
'''Expected results:''' Currently, Avogadro 2 relies on command-line calls to Open Babel to optimize geometries or perform conformer searching. The Open Babel code supports multiple force fields, but has poor performance. A modern implementation of a force field library would be welcome, including OpenMP and/or OpenCL support for highly parallel calculations. The architecture should support constrained geometry optimizations and multiple optimization techniques (i.e., steepest descent, conjugate gradients, quasi-Newton like L-BFGS) and be modular enough to allow new force field implementations as plugins.<br />
<br />
Ideally the code would be implemented in a new library so it can be used by Avogadro, Open Babel, and other codes<br />
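<br />
For orientation, the simplest of the optimization techniques named above, steepest descent, is sketched below in Python for a single harmonic bond between two atoms. The energy term E = 0.5 k (r - r0)^2, the fixed step size, and all function names are assumptions chosen to keep the example short; a production library would be written in C++, use a real force field, and add a line search.<br />
<br />
```python
import math
from typing import List, Sequence, Tuple


def bond_energy_and_gradient(a: Sequence[float], b: Sequence[float],
                             k: float, r0: float
                             ) -> Tuple[float, List[float], List[float]]:
    """Harmonic bond energy and its gradient with respect to both atoms.

    Assumes the atoms do not coincide (r > 0).
    """
    d = [bi - ai for ai, bi in zip(a, b)]
    r = math.sqrt(sum(x * x for x in d))
    energy = 0.5 * k * (r - r0) ** 2
    scale = k * (r - r0) / r          # dE/dr divided by r
    grad_b = [scale * x for x in d]   # dE/db points along the bond
    grad_a = [-g for g in grad_b]     # dE/da is equal and opposite
    return energy, grad_a, grad_b


def steepest_descent(a: Sequence[float], b: Sequence[float], k: float = 1.0,
                     r0: float = 1.0, step: float = 0.1, max_iter: int = 200,
                     tol: float = 1e-10) -> Tuple[List[float], List[float], float]:
    """Move both atoms against the gradient until the energy falls below tol."""
    a, b = list(a), list(b)
    for _ in range(max_iter):
        energy, ga, gb = bond_energy_and_gradient(a, b, k, r0)
        if energy < tol:
            break
        a = [x - step * g for x, g in zip(a, ga)]
        b = [x - step * g for x, g in zip(b, gb)]
    final_energy, _, _ = bond_energy_and_gradient(a, b, k, r0)
    return a, b, final_energy
```
<br />
The per-term energy and gradient evaluation is the part that parallelizes naturally over many bonds, angles, and atom pairs, which is where OpenMP or OpenCL would come in.<br />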
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenMP or OpenCL ideally.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu).<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, including molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
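<br />
The contrast with screen-sharing can be illustrated with a small Python sketch in which a client sends only the changed viewing angles and every other client applies that change to its own copy of the viewer state. OneMol does not yet define a message format, so the JSON fields and both function names here are hypothetical.<br />
<br />
```python
import json
from typing import Any, Dict, Sequence


def make_view_update(client_id: str, rotation_deg: Sequence[float]) -> str:
    """Serialize a change of viewing angles as a compact JSON message."""
    return json.dumps({
        "type": "view",               # hypothetical message type
        "client": client_id,
        "rotation": list(rotation_deg),
    })


def apply_update(state: Dict[str, Any], message: str) -> Dict[str, Any]:
    """Return a new viewer state with the update applied; unknown types are ignored."""
    msg = json.loads(message)
    if msg.get("type") == "view":
        new_state = dict(state)
        new_state["rotation"] = msg["rotation"]
        return new_state
    return state
```
<br />
A message like this is a few dozen bytes regardless of how large the molecule is, whereas a screen update scales with the size of the display.<br />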
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3DMol as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting, and web services. Interest and experience with databases like MongoDB or DSpace very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrumhttps://wiki.openchemistry.org/index.php?title=GSoC_Ideas_2017&diff=547GSoC Ideas 20172017-02-03T11:49:51Z<p>Greg.landrum: Add RDKit section and first project</p>
<hr />
<div>==Guidelines==<br />
<br />
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. We intend to concentrate mainly on projects to improve [http://www.openchemistry.org/avogadro2 Avogadro 2], [http://cclib.sourceforge.net cclib], [http://3dmol.csb.pitt.edu/ 3DMol], and [http://openbabel.org/ Open Babel]. We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!<br />
<br />
=== Adding Ideas ===<br />
<br />
When adding a new idea to this page, please try to include the following information:<br />
<br />
* A brief explanation of the idea.<br />
* Expected results/feature additions.<br />
* Any prerequisites for working on the project.<br />
* Links to any further information, discussions, bug reports etc.<br />
* Any special mailing lists if not the standard mailing list for the project<br />
* Your name and email address for contact (if willing to mentor, or nominated mentor).<br />
<br />
=== Proposal Guidelines ===<br />
<br />
Students need to write and submit a proposal. We have added the [[Applying_to_GSoC|applying to GSoC page]] to guide students on what we would like to see in those proposals.<br />
<br />
==Avogadro 2 Project Ideas==<br />
<br />
[http://www.openchemistry.org/projects/avogadro2/ Avogadro 2] is a chemical editor and visualization application, as well as a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.<br />
<br />
=== Project: Integrate with VTK: Volume Rendering and Charts ===<br />
<br />
<img src="http://i.imgur.com/119Chwlm.jpg"/><br />
<br />
'''Brief explanation:''' Volume rendering is great for visualizing electronic structure, and VTK also brings charts for quantitative data<br />
<br />
'''Expected results:''' The Visualization Toolkit (VTK) features a number of rendering techniques for 3D geometry (including volume rendering), and 2D charts (line, points, parallel coordinates, scatter plot matrices). These are all written using a recently updated OpenGL backend, and recent improvements to the Qt5 code in VTK make integration with Avogadro much simpler. This project would focus primarily on exposing the QVTKOpenGLWidget to Avogadro 2, and then exposing volume rendering for 3D electronic structure data, and tables in the charts. Integration would involve developing some interface classes that make it easy to interact with the VTK classes from Avogadro code, and some concrete implementations showing data in these widgets.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Biological Data Visualization ===<br />
<br />
[[File:Biomolecule.png|thumb|300px|Protein Visualization]]<br />
<br />
'''Brief explanation:''' Support for biological data, representations, and visualization<br />
<br />
'''Expected results:''' Add support for molecular fragments on top of the molecule model, extending this to residues, and supporting reading/writing this secondary structure (e.g., PDB format). Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance.<br />
<br />
'''Prerequisites:''' Experience in C++; some experience with OpenGL and biochemistry is ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)<br />
<br />
=== Project: Molecular Dynamics ===<br />
<br />
'''Brief explanation:''' Improve support for running, reading, and analyzing molecular dynamics simulations in Avogadro 2<br />
<br />
'''Expected results:''' Code for running MD simulations in common packages (e.g., OpenMM, GROMACS, etc.) including extracting parameters and atom types from Open Babel. Initial support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS already exists. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front (as is currently the case). Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with OpenGL and an MD code ideal, but not necessary.<br />
<br />
'''Mentor:''' Marcus D. Hanwell (marcus dot hanwell at kitware dot com).<br />
<br />
=== Project: Scripting Bindings ===<br />
<br />
'''Brief explanation:''' Implement an embedded scripting language (e.g., Python or JavaScript) in Avogadro 2<br />
<br />
'''Expected results:''' Create bindings for the C++ libraries in Python or JavaScript / QtScript. This should allow an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python or JavaScript. A Boost.Python implementation existed in Avogadro v1, and some rudimentary support has been re-implemented with the new code base. An ideal solution would connect to QML and Qt to allow scripting to add menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.<br />
<br />
'''Prerequisites:''' Experience in C++ and Python or JavaScript, some experience with SWIG, Boost.Python, or similar packages (pybind11, SIP, PySide, etc.) suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Integrate with RDKit ===<br />
<br />
'''Brief explanation:''' Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization<br />
<br />
'''Expected results:''' RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.<br />
<br />
'''Prerequisites:''' Experience in C++, some experience with Python will be helpful.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
==cclib Project Ideas==<br />
<br />
[http://cclib.github.io cclib] is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs and contained in output files.<br />
<br />
=== Project: Advanced Analysis of Quantum Chemistry Data ===<br />
<br />
'''Brief explanation:''' Implement additional analysis and quantum calculation methods, including ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges.<br />
<br />
'''Expected results:''' The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly multiple partial charge assignment methods exist and can be implemented, including DDEC6.<br />
<br />
'''Suggested Readings:''' <br />
* DDEC6: https://arxiv.org/abs/1512.08270<br />
<br />
'''Prerequisites:''' Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: Refactor parsers ===<br />
<br />
'''Brief explanation:''' The main extract() function of each parser is quite long and should be split into smaller functions for maintainability. <br />
<br />
'''Expected results:''' Ensure test coverage of cclib prior to refactoring, propose a few different approaches and discuss with cclib team, and implement best proposal. Ideally new functions are consistent across parsers and associated docstrings can be used for keeping documentation up-to-date.<br />
<br />
'''Prerequisites:''' Experience with Python.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
=== Project: Implement new parsers ===<br />
<br />
'''Brief explanation:''' There are outstanding issues on GitHub requesting new parsers.<br />
<br />
'''Expected results:''' Generate test data and unit tests, and implement new parsers. For example, allow reading and writing from Molden and AIMALL wfn/wfx files.<br />
<br />
'''Prerequisites:''' Experience with Python, and ideally familiarity with computational chemistry programs.<br />
<br />
'''Mentor:''' Adam Tenderholt (atenderholt at gmail dot com) and possibly Karol Langner (karol.langner at gmail dot com).<br />
<br />
==Open Babel Project Ideas==<br />
<br />
[http://openbabel.org Open Babel] is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.<br />
<br />
=== Project: Implement MMTF format ===<br />
<br />
'''Brief explanation:''' Implementation of MMTF file format in OpenBabel. <br />
<br />
'''Expected results:''' Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
=== Project: Fast, Efficient Fragment-Based Coordinate Generation ===<br />
<br />
'''Brief explanation:''' A critical challenge is generating 3D coordinates for a known molecule. Implement a fragment-based generator to supplement the rule-based algorithm.<br />
<br />
'''Expected results:''' Many representations in chemistry either store no coordinates (SMILES, InChI, FASTA sequence) or 2D coordinates. Performing calculations requires the full 3D reconstruction of a molecule, ideally with a low-energy conformation. Currently, Open Babel uses a rule-based approach (i.e., expected geometries) to generate the 3D coordinates of molecules atom by atom. Fragments are only used for some ring-based structures. For inorganic and organometallic molecules, the rules may fail. <br />
<br />
Importantly, a fragment-based approach is '''highly''' efficient, since fragments can set many atoms at once. The project should generate a library reflecting a balance between efficiency (i.e., many common fragments) and size, as well as an efficient, parallel algorithm for connecting fragments. A knowledge-based fragment approach can also supplement and minimize the need for conformer sampling. <br />
<br />
A recent research paper describing one such approach can be found [http://link.springer.com/article/10.1186/s13321-015-0095-1 here]<br />
<br />
'''Prerequisites:''' Experience in C++ and linear algebra. Knowledge of statistics (e.g., Bayesian inference, data mining), OpenMP or OpenCL ideal.<br />
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)<br />
<br />
==3Dmol.js Project Ideas==<br />
[http://3dmol.csb.pitt.edu 3Dmol.js] is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.<br />
<br />
=== Project: Implement volumetric rendering in 3Dmol.js ===<br />
<br />
[[File:Electron_Density.jpg|thumb|300px|Volumetric Electron Density Maps]]<br />
<br />
'''Brief explanation:''' [http://http.developer.nvidia.com/GPUGems/gpugems_ch39.html Volumetric rendering] provides a way to visualize volumetric data in more detail than simple isosurfaces.<br />
<br />
'''Expected results:''' A number of different volumetric rendering techniques will be implemented and evaluated for a variety of molecular data types.<br />
<br />
'''Prerequisites:''' Familiarity with JavaScript, WebGL and/or OpenGL, and basic matrix algebra.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
=== Project: Molecular Dynamics Visualization ===<br />
<br />
'''Brief explanation:''' Implement high-performance in-browser visualization of molecular dynamics simulations.<br />
<br />
'''Expected results:''' Initial support is already present, with support for reading in basic trajectories from XYZ files, and static .gro files for GROMACS. Extend this to more fully support the needs of molecular dynamics, reading in trajectory files, ideally loading in time steps on demand for large files rather than loading the entire file in up front. Investigate whether compression techniques (e.g., delta compression) can improve reading and rendering performance. Investigate ways to support generating input, and dealing with extremely large systems (over one million particles). Add support for characterizing particle movement (e.g., pair-wise distribution functions), rare events, and visualizing these in addition to simple trajectory animations.<br />
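<br />
As a rough illustration of the delta compression idea, the sketch below quantizes coordinates to integers and stores the first frame plus per-frame differences. It is written in Python for brevity rather than JavaScript, treats each frame as a flat list of coordinates, and the function names and the 0.001 precision are assumptions for this example only.<br />
<br />
```python
def encode_trajectory(frames, precision=0.001):
    """Quantize coordinates and store the first frame plus integer deltas."""
    quantized = [[round(x / precision) for x in frame] for frame in frames]
    if not quantized:
        return [], precision
    encoded = [quantized[0]]
    for prev, cur in zip(quantized, quantized[1:]):
        # Atoms move little between time steps, so deltas are small integers.
        encoded.append([c - p for p, c in zip(prev, cur)])
    return encoded, precision


def decode_trajectory(encoded, precision):
    """Rebuild the frames by accumulating deltas onto the first frame."""
    frames, current = [], None
    for i, row in enumerate(encoded):
        current = list(row) if i == 0 else [c + d for c, d in zip(current, row)]
        frames.append([q * precision for q in current])
    return frames


frames = [[0.0, 1.0, 2.0], [0.001, 1.002, 2.0], [0.003, 1.002, 1.999]]
encoded, precision = encode_trajectory(frames)
decoded = decode_trajectory(encoded, precision)
```
<br />
The small integer deltas compress well with a general-purpose entropy coder, and frames can be decoded one at a time, which fits on-demand loading of time steps.<br />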
<br />
'''Prerequisites:''' Experience with JavaScript and client-server programming. Some experience with OpenGL/WebGL and an MD code is ideal but not necessary.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu)<br />
<br />
== RDKit Project Ideas ==<br />
<br />
[http://www.rdkit.org The RDKit] is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.<br />
<br />
=== Project: Create a generalized fingerprinting function ===<br />
<br />
'''Brief explanation:''' The RDKit provides access to a broad range of fingerprints but the interface for accessing them is complex, with multiple functions taking variable parameters. The idea here is to implement a significantly simpler, yet more flexible, method for generating fingerprints. We'll borrow ideas from two published papers while doing this: http://pubs.acs.org/doi/abs/10.1021/ci100062n and http://dx.doi.org/10.1016/j.jmgm.2010.05.008<br />
<br />
'''Expected results:''' A C++ implementation of the new fingerprinting functionality along with a robust set of test cases. Wrappers for the new functions so that they are accessible from Python and Java as well as from the PostgreSQL cartridge.<br />
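<br />
One possible shape for such an interface is a single generator object parameterized by a feature-enumeration function, with hashing and folding shared by all fingerprint types. The sketch below is a toy in pure Python and does not use or mirror the RDKit API: the molecule is a hypothetical dictionary of atom symbols and bonded index pairs, and every name in it is an assumption made for illustration.<br />
<br />
```python
import zlib


class FingerprintGenerator:
    """Combine a feature enumerator with shared hashing and folding."""

    def __init__(self, enumerate_features, n_bits=64):
        self.enumerate_features = enumerate_features
        self.n_bits = n_bits

    def on_bits(self, mol):
        # crc32 is deterministic across runs, unlike the built-in hash().
        return {zlib.crc32(f.encode("utf-8")) % self.n_bits
                for f in self.enumerate_features(mol)}


def atom_features(mol):
    """One feature per atom, described by its element symbol."""
    return (f"atom:{symbol}" for symbol in mol["atoms"])


def bond_features(mol):
    """One feature per bond, described by the sorted pair of elements."""
    for i, j in mol["bonds"]:
        a, b = sorted((mol["atoms"][i], mol["atoms"][j]))
        yield f"bond:{a}-{b}"


def tanimoto(bits_a, bits_b):
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


# Hypothetical toy molecules: heavy atoms only.
ethanol = {"atoms": ["C", "C", "O"], "bonds": [(0, 1), (1, 2)]}
methanol = {"atoms": ["C", "O"], "bonds": [(0, 1)]}

atom_fp = FingerprintGenerator(atom_features)
bond_fp = FingerprintGenerator(bond_features, n_bits=128)
similarity = tanimoto(bond_fp.on_bits(ethanol), bond_fp.on_bits(methanol))
```
<br />
Adding a new fingerprint type then means writing one new enumeration function rather than a new top-level function with its own parameter list.<br />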
<br />
'''Prerequisites:''' C++<br />
<br />
'''Mentor:''' Greg Landrum (greg.landrum at t5informatics dot com)<br />
<br />
== Miscellaneous Project Ideas ==<br />
<br />
These ideas would likely benefit two or more projects.<br />
<br />
=== Project: Computational Chemistry Web Repository ===<br />
<br />
'''Brief explanation:''' Implement an easy-to-deploy web server repository for computational chemistry.<br />
<br />
'''Expected results:''' Many chemistry researchers implement open data repositories to serve up their research results (e.g., https://pqr.pitt.edu/ or https://openchemistry.kitware.com) but each implementation is a little different. Create a web repository solution that uses cclib to parse a wide range of computational chemistry files and provides an easy-to-use web interface and web API (REST). The tool should be able to scan a set of directories and continually add new input and output files to the repository.<br />
<br />
'''Prerequisites:''' Python and JavaScript, plus experience with server software (e.g., Flask).<br />
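<br />
The directory-scanning part could look roughly like the sketch below. <code>RepositoryScanner</code> and its parameters are hypothetical; the parser is passed in as a callable, and <code>os.path.getsize</code> is only a stand-in so the example runs without extra packages. In a real deployment the callable would presumably wrap cclib (for example its <code>ccread</code> function).<br />
<br />
```python
import os
import tempfile


class RepositoryScanner:
    """Walk a set of directories and parse files not seen in earlier scans."""

    def __init__(self, roots, parse, extensions=(".out", ".log")):
        self.roots = list(roots)
        self.parse = parse                 # callable: path -> parsed record
        self.extensions = tuple(extensions)
        self.records = {}                  # path -> parsed record

    def scan(self):
        """Parse new files and return the list of paths that were added."""
        added = []
        for root in self.roots:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in sorted(filenames):
                    path = os.path.join(dirpath, name)
                    if name.endswith(self.extensions) and path not in self.records:
                        self.records[path] = self.parse(path)
                        added.append(path)
        return added


# Demo in a temporary directory; os.path.getsize stands in for a real parser.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "water.log"), "w").close()
    scanner = RepositoryScanner([tmp], parse=os.path.getsize)
    first_names = [os.path.basename(p) for p in scanner.scan()]
    open(os.path.join(tmp, "benzene.out"), "w").close()
    second_names = [os.path.basename(p) for p in scanner.scan()]
```
<br />
Calling <code>scan()</code> on a timer, or from a file-system watcher, would give the continual updates described above.<br />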
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu)<br />
<br />
=== Project: GPU and Multi-Core Enabled High Performance Force Field Calculations ===<br />
<br />
'''Brief explanation:''' Add integrated molecular mechanics force field simulations in Avogadro 2.<br />
<br />
'''Expected results:''' Currently, Avogadro 2 relies on command-line calls to Open Babel to optimize geometries or perform conformer searching. The Open Babel code supports multiple force fields, but has poor performance. A modern implementation of a force field library would be welcome, including OpenMP and/or OpenCL support for highly parallel calculations. The architecture should support constrained geometry optimizations and multiple optimization techniques (i.e., steepest descent, conjugate gradients, and quasi-Newton methods such as L-BFGS) and be modular enough to allow new force field implementations as plugins.<br />
<br />
Ideally the code would be implemented in a new library so it can be used by Avogadro, Open Babel, and other codes.<br />
<br />
'''Prerequisites:''' Experience in C++, ideally with some experience in OpenMP or OpenCL.<br />
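<br />
The modular design can be sketched as an optimizer that only needs an object exposing an energy-and-gradient method. The example below is a toy in Python rather than the C++ the project calls for: it uses a single harmonic bond with made-up parameters and plain steepest descent, and the class and method names are assumptions, not an existing Avogadro or Open Babel interface.<br />
<br />
```python
import math


class HarmonicBond:
    """Toy force field term: E = 0.5 * k * (r - r0)^2 for one bond."""

    def __init__(self, i, j, k, r0):
        self.i, self.j, self.k, self.r0 = i, j, k, r0

    def energy_and_gradient(self, coords):
        a, b = coords[self.i], coords[self.j]
        diff = [x - y for x, y in zip(a, b)]
        r = math.sqrt(sum(d * d for d in diff))  # assumes the atoms do not overlap
        energy = 0.5 * self.k * (r - self.r0) ** 2
        grad = [[0.0, 0.0, 0.0] for _ in coords]
        scale = self.k * (r - self.r0) / r
        for axis, d in enumerate(diff):
            grad[self.i][axis] += scale * d
            grad[self.j][axis] -= scale * d
        return energy, grad


def steepest_descent(term, coords, step=0.001, max_iter=1000, tol=1e-8):
    """Move every atom against its gradient until the largest component is below tol."""
    coords = [list(p) for p in coords]
    for _ in range(max_iter):
        _energy, grad = term.energy_and_gradient(coords)
        if max(abs(g) for row in grad for g in row) < tol:
            break
        coords = [[x - step * g for x, g in zip(p, row)]
                  for p, row in zip(coords, grad)]
    return coords, term.energy_and_gradient(coords)[0]


# Made-up parameters: a bond stretched to 2.0 relaxes toward r0 = 1.5.
bond = HarmonicBond(0, 1, k=100.0, r0=1.5)
coords, energy = steepest_descent(bond, [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], step=0.002)
final_r = math.dist(coords[0], coords[1])
```
<br />
The fixed step must be small relative to 1/k or the iteration oscillates; a line search or L-BFGS would remove that restriction, and any other term with the same method could be passed to the same optimizer.<br />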
<br />
'''Mentor:''' Geoff Hutchison (geoffh at pitt dot edu).<br />
<br />
===Project: GPU Accelerated Calculation of Molecular Surfaces and QM Data===<br />
<br />
'''Brief explanation:''' Leverage a generic GPU/CPU language (e.g., OpenCL or CUDA) to efficiently generate molecular surface data, such as molecular orbitals or electron (spin) density. This may include methods to approximate the molecular surface using compressed forms (e.g., Gaussian spheres). Current code uses multiple CPU cores, but competing codes can produce near-instantaneous rendering using OpenCL (e.g., http://www.kieber-emmons.com/Lumo/). Similar code exists in VMD.<br />
<br />
Additional performance improvements may come through efficient surface generation techniques used in other work (e.g., using the Euclidean Distance Transform).<br />
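<br />
To make the Gaussian-sphere idea concrete, here is a minimal sketch of the per-point work such a kernel would do, written in plain Python for readability. The functional form exp(-d²/radius²), the isovalue, and the function names are assumptions chosen for illustration, not the form used by any particular code.<br />
<br />
```python
import math


def gaussian_density(point, atoms):
    """Sum of spherical Gaussians exp(-d^2 / radius^2), one per atom."""
    total = 0.0
    for center, radius in atoms:
        d2 = sum((p - c) ** 2 for p, c in zip(point, center))
        total += math.exp(-d2 / radius ** 2)
    return total


def inside_surface(point, atoms, isovalue=math.exp(-1.0)):
    """With this isovalue, an isolated atom's surface lies at d = radius."""
    return gaussian_density(point, atoms) >= isovalue


# One atom at the origin with an assumed radius of 1.5.
atoms = [((0.0, 0.0, 0.0), 1.5)]
center_value = gaussian_density((0.0, 0.0, 0.0), atoms)
near = inside_surface((0.75, 0.0, 0.0), atoms)
far = inside_surface((3.0, 0.0, 0.0), atoms)
```
<br />
Each grid point is evaluated independently of the others, which is why this kind of calculation maps naturally onto one GPU work-item per point.<br />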
<br />
'''Expected results:''' Generate appropriate kernels that can be used in any language that supports OpenCL (C, C++, Python, etc.) across multiple platforms.<br />
<br />
'''Prerequisites:''' General programming experience, and ideally experience in chemistry and matrix manipulations.<br />
<br />
'''Suggested Readings:''' <br />
* http://www.ks.uiuc.edu/Publications/Papers/paper.cgi?tbcode=STON2009<br />
* http://zhanglab.ccmb.med.umich.edu/EDTSurf/<br />
* https://www.cgl.ethz.ch/Downloads/Publications/Papers/2009/Jan09/Jan09.pdf<br />
<br />
'''Mentor:''' Geoffrey Hutchison (geoffh at pitt dot edu)<br />
<br />
===Project: OneMol: Google Docs & YouTube for Molecules ===<br />
[[File:OneMolsm.png|right]]<br />
'''Brief explanation:''' There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.<br />
<br />
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.<br />
<br />
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.<br />
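<br />
The client and facilitator roles can be sketched in a few lines. This is an in-process toy with hypothetical class names and a made-up message format (JSON dictionaries of changed state); it leaves out the storage module and all networking, and is not a proposed OneMol API.<br />
<br />
```python
import json


class Facilitator:
    """Keeps one shared viewer state and relays small state changes to clients."""

    def __init__(self):
        self.state = {}
        self.clients = []

    def join(self, client):
        self.clients.append(client)
        client.receive(json.dumps(self.state))  # bring the newcomer up to date

    def publish(self, sender, change):
        self.state.update(change)
        message = json.dumps(change)
        for client in self.clients:
            if client is not sender:
                client.receive(message)


class Client:
    """Stand-in for the client module embedded in a molecular viewer."""

    def __init__(self, name):
        self.name = name
        self.view = {}

    def receive(self, message):
        self.view.update(json.loads(message))


facilitator = Facilitator()
alice, bob = Client("alice"), Client("bob")
facilitator.join(alice)
facilitator.join(bob)

# Alice rotates the molecule: only the new orientation is sent, not pixels.
change = {"rotation": [0.0, 0.0, 0.7071, 0.7071]}
alice.view.update(change)
facilitator.publish(alice, change)

carol = Client("carol")
facilitator.join(carol)  # a late joiner receives the full current state
```
<br />
The rotation message is a few dozen bytes, compared with the full screen update that screen-sharing would need for the same change.<br />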
<br />
'''Expected results:''' Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).<br />
<br />
'''Prerequisites:''' Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace would be very helpful.<br />
<br />
'''Mentor:''' David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)</div>Greg.landrum