GSoC Ideas 2020
Guidelines
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2020, there is no guarantee that we will be selected again for GSoC in 2020.
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!
Adding Ideas
When adding a new idea to this page, please try to include the following information:
- A brief explanation of the idea.
- Expected results/feature additions.
- Any prerequisites for working on the project.
- Links to any further information, discussions, bug reports etc.
- Any special mailing lists, if not the standard mailing list for the project.
- Your name and email address for contact (if willing to mentor, or nominated mentor).
Proposal Guidelines
Students need to write and submit a proposal. We have added the Applying to GSoC page to help guide students on what we would like to see in those proposals.
Avogadro 2 Project Ideas
Avogadro 2 is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++ using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.
Project: Biological Data Visualization
Brief explanation: Support for biological data, representations, and visualization
Expected results: Rudimentary support for residues and for reading secondary structure (e.g., from PDB format) is now present. Additional rendering modes for secondary biological structures (e.g., ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels are desired. Code and algorithms may be adapted from 3Dmol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments, would be ideal. Extending that to porting builders for fragment-based building blocks would be a big plus once basic support for rendering is in place.
Prerequisites: Experience in C++. Some experience with OpenGL and biochemistry is ideal, but not necessary.
Mentor: Marcus D. Hanwell (marcus dot hanwell at kitware dot com) or Geoffrey Hutchison (geoffh at pitt.edu)
Project: Scripting Bindings
Brief explanation: Implement an embedded scripting language (i.e., Python) in Avogadro 2
Expected results: Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Initial Python bindings have been re-implemented using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI elements like menu items, windows, etc., and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.
Example scripts and documentation are highly encouraged.
Prerequisites: Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (marcus dot hanwell at kitware dot com)
Project: Integrate with RDKit
Brief explanation: Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization
Expected results: RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.
Prerequisites: Experience in C++, some experience with Python will be helpful.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project: Improve Avogadro Quantum Processing & Analysis
Brief explanation: Visualizing quantum mechanical data like orbitals, electron density, etc. is slow. Replace Avogadro's current orbital rendering with the efficient Gau2Grid library [1] and add analysis tools.
Expected results: A very fast real-time rendering of volumetric quantum chemical data within Avogadro, ideally including processing and analysis of surfaces / volumes, orbitals, etc. For example, sometimes the gradient or the Laplacian of a surface is useful. Add tools to add / subtract or join / intersect surfaces and map properties (e.g., electrostatic potential mapped onto the electron density).
Prerequisites: Experience in C++, an understanding of vectorization and intrinsics would be helpful.
Mentor: Daniel G. A. Smith (dgasmith at vt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)
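As an illustration of the kind of volumetric analysis mentioned above, here is a minimal sketch of a Laplacian evaluated on a regular grid with second-order central differences. It is written in plain Python for readability and is not Avogadro or Gau2Grid code; the function names `laplacian` and `sample`, the nested-list grid layout, and the test function are all assumptions made for this example. A real implementation would be in C++ and vectorized.

```python
def laplacian(grid, spacing):
    """Central-difference Laplacian on the interior points of a 3D grid.

    `grid` is a nested list indexed as grid[i][j][k] (hypothetical layout).
    Boundary points are left at 0.0 because they lack a full set of neighbours.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    inv_h2 = 1.0 / (spacing * spacing)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                neighbours = (
                    grid[i - 1][j][k] + grid[i + 1][j][k]
                    + grid[i][j - 1][k] + grid[i][j + 1][k]
                    + grid[i][j][k - 1] + grid[i][j][k + 1]
                )
                out[i][j][k] = (neighbours - 6.0 * grid[i][j][k]) * inv_h2
    return out


def sample(func, n, spacing):
    """Sample func(x, y, z) on an n x n x n grid starting at the origin."""
    return [
        [[func(i * spacing, j * spacing, k * spacing) for k in range(n)]
         for j in range(n)]
        for i in range(n)
    ]


# f = x^2 + y^2 + z^2 has an analytic Laplacian of 6 everywhere.
grid = sample(lambda x, y, z: x * x + y * y + z * z, 5, 0.5)
lap = laplacian(grid, 0.5)
```

Central differences are exact for quadratic functions, so the interior values of `lap` can be checked against the analytic result; the same stencil applies to electron density or orbital values sampled on a cube-file style grid.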
Open Babel Project Ideas
Open Babel is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
Project: Integrate CoordGen library
Expected results: Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project: Implement MMTF format
Brief explanation: Implementation of MMTF file format in OpenBabel.
Expected results: Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the Open Babel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)
Project: Test Framework Overhaul
Brief explanation: Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of Open Babel.
Expected results: A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.
Prerequisites: Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.
Mentor: Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), and the Open Babel development community.
Project: Develop a JavaScript version of Open Babel
Brief explanation: Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.
Expected results: Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).
Prerequisites: Some experience in C++, and also with JavaScript.
Mentor: Noel O'Boyle (baoilleach at gmail dot com)
Project: Develop a validation and standardization filter
Brief explanation: Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?
Expected results: Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.
Such a model could be used as a filter, or as a warning to flag up problematic structures.
Code could be modeled on MolVS using RDKit [2].
Prerequisites: Experience in C++ or Python, and an interest in data science or statistics.
Mentor: Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)
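To make the "how normal/unusual is this structure" idea concrete, here is a deliberately simple sketch that learns element frequencies from a reference set and flags elements that are rare in it. It is only an illustration under strong assumptions: molecules are represented by plain formula strings rather than molecular graphs, the helper names (`elements_in`, `build_model`, `unusual_elements`) and the 5% threshold are hypothetical, and a real model would also look at valence, coordination, and stereochemistry using a toolkit such as Open Babel or RDKit.

```python
import re
from collections import Counter

# Matches element symbols such as "C", "N", or "Ru" in a formula string.
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def elements_in(formula):
    """Return the set of element symbols in a formula string like 'C6H6'."""
    return set(ELEMENT_PATTERN.findall(formula))


def build_model(reference_formulas):
    """Count in how many reference molecules each element occurs."""
    counts = Counter()
    for formula in reference_formulas:
        counts.update(elements_in(formula))
    return counts, len(reference_formulas)


def unusual_elements(formula, model, threshold=0.05):
    """List elements found in fewer than `threshold` of the reference set."""
    counts, total = model
    return sorted(
        element for element in elements_in(formula)
        if counts[element] / total < threshold
    )


# Toy "drug-like" reference set: aspirin, caffeine, ibuprofen, paracetamol.
reference = ["C9H8O4", "C8H10N4O2", "C13H18O2", "C8H9NO2"]
model = build_model(reference)

flagged = unusual_elements("C10H10Ru", model)  # ruthenocene
clean = unusual_elements("C6H6", model)        # benzene
```

With this reference set the ruthenium atom is flagged while benzene passes, mirroring the example in the project description. The same pattern (fit statistics on a reference set, then score a query) extends to richer features.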
cclib Project Ideas
cclib is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.
Project: Support for QCSchema JSON output
Brief explanation: The library already allows importing and exporting data between several formats. The QCSchema is a new JSON format that tries to standardize the way computational chemistry data is written and shared, so supporting the effort can be useful.
Expected results: Implement JSON output that conforms to the conventions of the MolSSI QCSchema.
Suggested readings:
- This cclib issue and the references there.
Prerequisites: Experience with Python, some experience with physics and chemistry also recommended.
Mentor: Eric Berquist (eric.john.berquist at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
Project: Advanced Analysis of Quantum Chemistry Data
Brief explanation: The current cclib library offers some calculation methods, including fragment analysis and some charge models. Many modern analysis techniques exist to partition electron density, including computing gradients, Laplacians, ELF (electron localization function), Bader's AIM analysis, etc. Similarly, multiple partial charge assignment methods exist and can be implemented, including DDEC6.
Expected results: Implement additional analysis and quantum calculation methods, such as ELF (electron localization function), AIM (Bader's Atoms-in-Molecules) techniques, and/or DDEC6 atomic charges, with examples and tests.
Prerequisites: Experience with Python and linear algebra (including numpy, scipy), some experience with numerical methods suggested.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) and/or Karol Langner (karol.langner at gmail dot com)
Project: Implement new parsers
Brief explanation: There are outstanding issues on GitHub for supporting more programs, and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA).
Expected results: Generate test data and unit tests, and implement new parsers.
Prerequisites: Experience with Python, and ideally familiarity with computational chemistry programs.
Mentor: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
Project: Discovering computational chemistry content online
Brief explanation: There are tens or hundreds of thousands of computational chemistry results available online - let's mine them!
Expected results: Build a crawler that identifies and indexes computational chemistry logfiles online, and provides the ability to extract the data they contain with cclib.
Prerequisites: Experience with Python, and ideally familiarity with computational chemistry and web indexing.
Mentor: Karol Langner (karol.langner at gmail dot com)
3Dmol.js Project Ideas
3Dmol.js is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.
Project: Improve 3Dmol.js
Brief explanation: Make significant improvements to 3Dmol.js functionality or performance.
Expected results: This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.
Prerequisites: Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.
Mentor: David Koes (dkoes@pitt.edu)
gnina Project Ideas
gnina is a C/C++ framework for applying deep learning to molecular docking.
Project: Improve gnina
Brief explanation: Make significant improvements to gnina functionality or performance.
Expected results: This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.
Prerequisites: Experience with CUDA/C/C++ programming and the basics of deep learning.
Mentor: David Koes (dkoes@pitt.edu)
RDKit Project Ideas
The RDKit is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, and C#. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.
Project: Integrate trained neural networks into the RDKit
Brief explanation: There's a lot of work going on to train and use neural networks that use the RDKit. It would be great to be able to use some of those trained networks from inside the RDKit itself. A couple of examples that immediately come to mind here are ANI-2X (https://chemrxiv.org/articles/Extending_the_Applicability_of_the_ANI_Deep_Learning_Molecular_Potential_to_Sulfur_and_Halogens/11819268) and CDDD (https://github.com/jrwnter/cddd). The idea in this project would be to create the required Python and C++ infrastructure to translate a trained neural network into a form that can be used from C++ and then integrate ANI-2X using that infrastructure. As a stretch goal, the trained network for CDDD would be integrated.
Expected results: Code (probably in Python) to translate a trained neural network using one of the standard NN libraries to a form that can be used from C++. Code to actually execute the network in C++. A port of the ANI-2X network to the RDKit's ForceField library using this new code. Wrappers for the new functionality so that it is accessible from within the Python and SWIG (Java and C#) wrappers. A comprehensive set of tests for the new functionality.
Prerequisites: C++, Python
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
Project: Implement a generalized file reader
Brief explanation: Implementation of a flexible generic interface for reading molecular file formats (things like .smi, .sdf, and the compressed versions thereof). The reader should recognize the file format automatically so that the user does not need to worry about this.
Expected results: A C++ implementation of a generalized file reader for the RDKit along with a robust set of test cases. Wrappers for the reader so that it is accessible from within the Python and SWIG (Java and C#) wrappers.
Prerequisites: C++
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
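The automatic format recognition described for the generalized file reader could start from the file name, as in the minimal sketch below. This is not RDKit code: the names `KNOWN_FORMATS`, `detect_format`, and `open_text` are hypothetical, only the `.smi` and `.sdf` extensions named above plus gzip compression are handled, and the project itself calls for a C++ implementation that might also inspect file contents.

```python
import gzip
import os

# Hypothetical lookup tables; a real reader would cover many more formats.
KNOWN_FORMATS = {".smi": "smiles", ".sdf": "sdf"}
COMPRESSION_SUFFIXES = {".gz"}


def detect_format(filename):
    """Guess (format_name, is_compressed) from the file name alone."""
    base, ext = os.path.splitext(filename.lower())
    compressed = ext in COMPRESSION_SUFFIXES
    if compressed:
        # Strip the compression suffix and look at the inner extension.
        base, ext = os.path.splitext(base)
    if ext not in KNOWN_FORMATS:
        raise ValueError(f"unrecognized molecular file extension: {filename!r}")
    return KNOWN_FORMATS[ext], compressed


def open_text(filename):
    """Open a plain or gzip-compressed file as text, based on its name."""
    _, compressed = detect_format(filename)
    opener = gzip.open if compressed else open
    return opener(filename, "rt")


sdf_guess = detect_format("ligands.sdf.gz")
smi_guess = detect_format("library.SMI")
```

A caller would then dispatch on the returned format name to the matching parser, so the user never has to state the format or whether the file is compressed.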
Project: Implement Molecular Interaction Fields calculations in the RDKit
Brief explanation: There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.
Expected results: A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the new functionality so that it is accessible from within the Python and SWIG (Java and C#) wrappers.
Prerequisites: C++
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
Project: RDKit+OpenMM GPU Molecular Force Fields
Brief explanation: OpenMM (http://openmm.org/) is a high-performance toolkit for force-field based molecular simulation that includes GPU and CPU support. The goal of this project is to make it easy to use OpenMM force fields to minimize the energies of or perform molecular dynamics calculations on RDKit molecules.
Expected results: OpenMM supports a wide range of force fields, but not the classical MMFF94 or UFF methods implemented in RDKit. Needed is C++ functionality allowing RDKit molecules to be sent to OpenMM for minimization and/or to perform molecular dynamics. A robust set of regression tests for this functionality. Python wrappers around the new functionality. The work would likely involve completing the MMFF94 implementation described by Paolo Tosco at the 2017 RDKit UGM (https://github.com/rdkit/UGM_2017/blob/master/Presentations/Tosco_RDKit_OpenMM_integration.pdf) and extending to other force fields like UFF.
Prerequisites: C++ and some Python
Mentor: TBA, likely Geoff Hutchison (geoffh at pitt.edu) and others
Project: MongoDB integration
Brief explanation: MongoDB (https://www.mongodb.com/) is an open-source, cross-platform, document-oriented NoSQL database program optimized for performance. Its flexible schema can accommodate hierarchical relationships between chemical compounds. To enable chemical intelligence in MongoDB queries, an integration with RDKit is necessary. This project will allow us to build a framework to perform similarity, substructure, and identity searches. We will leverage the document structure of the database to store multiple representations of each molecule. While adding new functionality and developing existing capabilities, we will keep an eye on performance to ensure optimal scalability (indexing, shards, multiprocessing, etc.). We will learn from, and possibly build upon, other work that has been done for chemistry integration into MongoDB (e.g. http://wiki.openchemistry.org/MongoChem and http://blog.matt-swain.com/post/87093745652/chemical-similarity-search-in-mongodb).
Expected results: A stable and performant RDKit extension to MongoDB that provides chemical functionalities on a document database.
Prerequisites: Python
Mentor: Marco Stenta (marco.stenta at syngenta.com)
NWChem Project Ideas
NWChem is a widely used open-source computational chemistry software package ([3]) that tackles a wide variety of scientific problems.
Project: NWChem-JSON
Brief explanation: Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics.
Expected results: Expanding the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. Questions as to how to handle large data structures in conjunction with JSON need to be addressed.
Prerequisites: Experience with Fortran90 and Python
Mentor: Bert de Jong (wadejong at lbl dot gov)
Project: NWChem-Python-Jupyter Interface
Brief explanation: Exposing and binding NWChem data structures and computational APIs to Python and utilizing those in Jupyter notebooks.
Expected results: NWChem currently has a very limited interface with Python. However, more and more developers are using platforms such as Python to sandbox new theories, methods, and algorithms. In addition, the extended Python interface could be integrated into a Jupyter notebook. A full Python interface needs to be developed for the NWChem software suite.
Prerequisites: Experience with Fortran and Python
Mentor: Bert de Jong (wadejong at lbl dot gov)
Project: JSON-LD for Chemical Data
Brief explanation: Transforming NWChem and Chemical JSON formats to JSON-LD
Expected results: Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.
Prerequisites: Experience with Python and JSON-LD
Mentor: Bert de Jong (wadejong at lbl dot gov)
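The difference between a single dense object and linked objects can be sketched as follows. The input layout here is a simplified, hypothetical stand-in (it is not the actual NWChem JSON or Chemical JSON schema), the vocabulary URL and the property names `hasAtom` and `partOf` are placeholders, and the function name `to_json_ld` is made up for this example. Only the `@context`, `@graph`, `@id`, and `@type` keywords come from JSON-LD itself.

```python
import json

# Placeholder vocabulary; a real project would map terms to a chemistry ontology.
CONTEXT = {"@vocab": "https://example.org/chem#"}


def to_json_ld(dense, base_id):
    """Split a dense molecule document into linked JSON-LD nodes."""
    molecule_id = f"{base_id}/molecule"
    atom_nodes = []
    for index, atom in enumerate(dense["atoms"]):
        atom_nodes.append({
            "@id": f"{base_id}/atom/{index}",
            "@type": "Atom",
            "element": atom["element"],
            # Link back to the parent molecule by identifier, not by nesting.
            "partOf": {"@id": molecule_id},
        })
    molecule = {
        "@id": molecule_id,
        "@type": "Molecule",
        "name": dense["name"],
        "hasAtom": [{"@id": node["@id"]} for node in atom_nodes],
    }
    return {"@context": CONTEXT, "@graph": [molecule] + atom_nodes}


# Hypothetical dense input document.
dense_doc = {
    "name": "water",
    "atoms": [{"element": "O"}, {"element": "H"}, {"element": "H"}],
}
linked = to_json_ld(dense_doc, "https://example.org/calc/42")
serialized = json.dumps(linked, indent=2)
```

Because every atom and molecule now carries its own identifier, each node can be stored as triples and referenced from other documents, which is the alignment with triple-stores and knowledge graphs that the project aims for.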
DeepChem Project Ideas
DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.
Project: Dynamic DeepChem
Brief explanation: Lay the groundwork for a version of DeepChem based on Jax
Expected results: DeepChem was originally built on Theano and later ported to TensorFlow's graph mode. We are currently working on porting it to use eager mode by default. It seems sensible that the next big transition will be to more powerful automatic differentiation frameworks like Jax. This project would require students to implement core DeepChem models such as graph convolutions in Jax and demonstrate that they can be saved and loaded. This work would likely find its way into the next major version of DeepChem.
Prerequisites: Python, TensorFlow
Mentor: Bharath Ramsundar (bharath dot ramsundar at gmail dot com)
Project: Improvements to Transfer Learning
Brief explanation: Expand out DeepChem's transfer learning framework and machinery.
Expected results: ChemNet discusses a powerful model independent transfer learning protocol. We had a GSoC student expand out this framework over last summer (post). We'd like to see work expanding this framework out further and adding in new ideas, perhaps borrowing from recent research on transformers.
Prerequisites: Python, TensorFlow
Mentor: Bharath Ramsundar (bharath dot ramsundar at gmail dot com)
Miscellaneous Project Ideas
These ideas would likely benefit two or more projects.
Project: OneMol: Google Docs & YouTube for Molecules
Brief explanation: There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.
Expected results: Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).
Prerequisites: Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace are very helpful.
Mentor: David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)
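The client and facilitator roles described above can be illustrated with a small in-process sketch. It is not the OneMol API, which this project would define: the class names, the `rotate` method, and the dictionary-based view state are assumptions for illustration, network transport is replaced by direct method calls, and the storage module is omitted. The point it shows is that a rotation travels as a small state delta instead of a full screen update.

```python
class Facilitator:
    """Keeps the shared viewer state consistent across all clients."""

    def __init__(self):
        self.state = {}
        self.clients = []

    def join(self, client):
        self.clients.append(client)
        # Bring a newly joined client up to date with the current state.
        client.apply(dict(self.state))

    def publish(self, sender, delta):
        self.state.update(delta)
        for client in self.clients:
            if client is not sender:
                client.apply(delta)


class Client:
    """Stands in for the client module embedded in a molecular viewer."""

    def __init__(self, name, facilitator):
        self.name = name
        self.view = {}
        self.facilitator = facilitator
        facilitator.join(self)

    def apply(self, delta):
        self.view.update(delta)

    def rotate(self, angles):
        # Only the changed viewing angles are sent, not rendered pixels.
        delta = {"rotation": angles}
        self.apply(delta)
        self.facilitator.publish(self, delta)


facilitator = Facilitator()
alice = Client("alice", facilitator)
bob = Client("bob", facilitator)
alice.rotate((10.0, 0.0, 45.0))
late_joiner = Client("carol", facilitator)
```

After the rotation, every client (including one that joins later) holds the same viewing angles. In a deployed system the facilitator would sit behind a publicly reachable server, as the description anticipates.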
Psi4 Project Ideas
Psi4 is an open-source hybrid Python/C++ suite of ab initio quantum chemistry programs designed for efficient, high-accuracy simulations of a variety of molecular properties.
Project: Quantum Chemistry with Deep Learning Toolkits
Brief explanation: Integrate GPU tensor tools like TensorFlow and PyTorch with Psi4NumPy (https://github.com/psi4/psi4numpy) to explore the performance of these high-level tools for quantum chemistry.
Expected results: A small module that can evaluate quantum chemistry on GPUs.
Prerequisites: Tensorflow or PyTorch knowledge, linear algebra, and an understanding of general tensor contraction. No quantum chemistry knowledge required.
Mentor: Daniel G. A. Smith (dgasmith at vt.edu)
Project: Parallelization of Task Graph Computations
Brief explanation: Improve Psi4's task graph computation integration with the MolSSI QCFractal (https://github.com/MolSSI/QCFractal) project for massively parallel quantum chemistry.
Expected results: Massively parallel implementations of crystal computations and n-body interactions.
Prerequisites: Python experience and task-graph experience (such as Dask). An understanding of quantum chemistry would be helpful.
Mentor: Lori Burns (lori.burns at gmail.com) or Roberto Di Remigio (roberto.diremigio at gmail.com)
Project: Avogadro visualization integration
Brief explanation: Integration of Psi4 volumetric data with Avogadro's rendering tools.
Expected results: Automatic integration of Psi4's volumetric data such as cube files, F-SAPT energy decomposition analysis routines, and vibrational frequencies.
Prerequisites: Python experience and Avogadro integration, a small amount of quantum chemistry understanding would be helpful.
Mentor: Justin Turney (justin.turney at gmail.com) or Andrew James (amjames2 at vt.edu)
MSDK / MZmine Project Ideas
Project: New Visualization Modules
Brief explanation: Implement new, JavaFX-based visualization modules for MZmine [4] such as 3D plot and Cloud Plot.
Expected results: A replacement module for the aging and barely functional 3D visualizer [5], as well as new visualization tools for data analysis.
Prerequisites: Java, JavaFX (preferred), experience with 3D graphics helpful but not required.
Mentor: Tomas Pluskal (plusik at gmail.com)