GSoC Ideas 2024
Revision as of 14:05, 5 February 2024
Guidelines
Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas.
One important factor is that GSoC in 2024 includes both shorter projects (~175 hours) and longer projects (~350 hours). You should consider the appropriate timeline for your project proposal. We have indicated in the project totals where we suggest particular lengths.
Contributors can also decide on the number of weeks (e.g., spreading the project time over multiple weeks).
If you are unsure of the scope of a project, please reach out and discuss BEFORE the proposal deadline.
When possible, please submit drafts a week or more in advance of the proposal deadline so that we can make suggestions on your proposal.
We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!
Adding Ideas
When adding a new idea to this page, please try to include the following information:
- Size of the project (~90, ~175 or ~350 hours of work)
- A brief explanation of the idea.
- Expected results/feature additions.
- Any prerequisites for working on the project.
- Links to any further information, discussions, bug reports etc.
- Any special mailing lists if not the standard mailing list for the project
- Your name and email address for contact (if willing to mentor, or nominated mentor).
Proposal Guidelines
Students need to write and submit a proposal. We have added the "Applying to GSoC" page to help guide our students on what we would like to see in those proposals.
Avogadro 2 Project Ideas
Avogadro 2 is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++, using principles of modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.
Project [90, 175, or 350 hours]: Automation & Scripting Bindings
Brief explanation: Improve automation or implement an embedded scripting language (i.e., Python) in Avogadro 2
Expected results: Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Python bindings exist, using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.
A larger project might include recording UI events to translate into Python code.
Example scripts and documentation are highly encouraged.
Prerequisites: Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project [175 or 350 hours]: Tools for Interactive Molecular Dynamics
Brief explanation: Building solvent boxes and implementing standard molecular dynamics using the in-progress optimization framework. The scope could be 175 or 350 hours - please discuss what scale of project you have in mind.
Expected results: Avogadro (v1) has interactive force field optimization allowing building and manipulation (e.g., push-pull atoms into position). Some users call this 'video game mode' ;-) A new optimization framework is in progress, including calling external programs for energies and forces. The project would enable building out MD simulations, including tools to add water or solvent boxes, build larger systems (e.g., via PackMol integration) and implement simple MD integration and thermostats.
Prerequisites: Experience in C++, ideally with knowledge of molecular dynamics methods and tools. Some Python would be helpful.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
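As a rough illustration of the "simple MD integration and thermostats" mentioned in the Interactive Molecular Dynamics project above, here is a minimal Python sketch of the velocity Verlet scheme for a single particle. It is not Avogadro code: the function names, the 1D harmonic force, and the crude velocity-rescaling step are all assumptions chosen to keep the example self-contained. A real implementation would be in C++ and would take forces from the optimization framework.

```python
import math


def harmonic_force(x, k=1.0):
    """Force of a 1D harmonic spring, F = -k x (stand-in for a real force field)."""
    return -k * x


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Advance one particle by n_steps using the velocity Verlet scheme."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
    return x, v


def total_energy(x, v, mass, k=1.0):
    """Kinetic plus potential energy of the harmonic test system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


def rescale_velocity(v, mass, target_kinetic):
    """Crude thermostat: scale v so the kinetic energy equals target_kinetic."""
    current = 0.5 * mass * v * v
    if current == 0.0:
        return v  # nothing to scale
    return v * math.sqrt(target_kinetic / current)
```

For this test system the total energy should stay close to its starting value over many steps, which is a convenient sanity check before moving to real force fields.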
Project [350 hours]: Improved Rendering with Geometry Shaders
Brief explanation: Our current rendering code needs updating to the OpenGL Core Profile and optimization with geometry shaders.
Expected results: An efficient GPU-enabled surface generation and rendering framework using geometry shaders to provide dynamic level of detail, improved depth-of-focus and rendering quality.
Prerequisites: Experience in C++, ideally with knowledge of OpenGL shaders. Some understanding of quantum chemistry would be helpful.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project [175 or 350 hours]: Improved Selection and/or Molecular Find-and-Replace
Brief explanation: Improved support for selection tools (e.g., "select all water molecules" and "select everything within X Å of my mouse-click") and tools for bulk editing of structures using a "find and replace" interface (e.g., replace 10% of all gold atoms with silver or "change this molecular pattern to this new functional group")
Expected results: A set of new selection tools and commands and/or a set of tools to substitute atoms and molecular fragments in the builder. For example, the Wilmer group published "MOFUN" (https://doi.org/10.1039/D2DD00044J) a package to replace specific fragments, and many packages such as Open Babel and RDKit support SMIRKS or reaction SMILES rules. Ideas to improve the usability are highly welcome.
Prerequisites: Experience in C++, some experience with Python will be helpful.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
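To make the selection and find-and-replace examples above more concrete ("select everything within X Å" and "replace 10% of all gold atoms with silver"), here is a minimal Python sketch. The atom representation (a list of element/position pairs) and all function names are hypothetical and are not part of the Avogadro API.

```python
import math
import random


def select_within(atoms, center, radius):
    """Indices of atoms whose distance from center is at most radius (in Å)."""
    return [i for i, (_, pos) in enumerate(atoms)
            if math.dist(pos, center) <= radius]


def select_element(atoms, element):
    """Indices of all atoms of the given element."""
    return [i for i, (el, _) in enumerate(atoms) if el == element]


def replace_fraction(atoms, old, new, fraction, seed=0):
    """Return a copy of atoms with a random fraction of 'old' changed to 'new'."""
    candidates = select_element(atoms, old)
    count = min(len(candidates), round(fraction * len(candidates)))
    chosen = set(random.Random(seed).sample(candidates, count))
    return [(new if i in chosen else el, pos)
            for i, (el, pos) in enumerate(atoms)]
```

A fixed seed keeps the substitution reproducible, which would matter if such an edit had to be undone or replayed.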
Project [175 hours]: Integrate with RDKit
Brief explanation: Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization
Expected results: RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.
Prerequisites: Experience in C++, some experience with Python will be helpful.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Open Babel Project Ideas
Open Babel is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.
Project [175 hours]: Integrate CoordGen library
Expected results: Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project [90 hours]: Implement MMTF format
Brief explanation: Implementation of MMTF file format in OpenBabel.
Expected results: Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.
Mentor: Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)
Project [90 or 175 hours]: Test Framework Improvements
Brief explanation: Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.
Expected results: A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.
Prerequisites: Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.
Mentor: Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.
Project [350 hours]: Develop a JavaScript version of Open Babel
Brief explanation: Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.
Expected results: Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.
Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).
Prerequisites: Some experience in C++, and also with JavaScript.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
Project [350 hours]: Develop a validation and standardization filter
Brief explanation: Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter or warn about problems (e.g., undefined stereo centers)?
Expected results: Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.
Such a model could be used as a filter, or as a warning to flag up problematic structures.
To be clear, "model" here does not mean a machine learning model, but rather a set of filters. (An ML-based model for tautomers might be useful, however.)
Code could be modeled on MolVS using RDKit.
Prerequisites: Experience in C++ or Python, and an interest in data science or statistics.
Mentor: Geoff Hutchison (geoffh at pitt dot edu)
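The "set of filters" idea from the validation project above could look something like the following Python sketch. It only covers the two examples given in the text (an element not seen in the reference set, and a carbon with more than four neighbors). The molecule representation (a list of element symbols plus a list of bonded index pairs) and the function names are assumptions for illustration and do not come from Open Babel, RDKit, or MolVS.

```python
from collections import Counter


def element_counts(reference_molecules):
    """Count how often each element appears across the reference set."""
    counts = Counter()
    for elements, _bonds in reference_molecules:
        counts.update(elements)
    return counts


def check_molecule(elements, bonds, reference_counts, max_carbon_neighbors=4):
    """Return a list of human-readable warnings for a query molecule."""
    warnings = []
    # Filter 1: elements never seen in the reference set are unusual.
    for el in sorted(set(elements)):
        if reference_counts[el] == 0:
            warnings.append(f"element {el} not seen in reference set")
    # Filter 2: carbons with too many explicit neighbors are unusual.
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    for idx, el in enumerate(elements):
        if el == "C" and degree[idx] > max_carbon_neighbors:
            warnings.append(f"atom {idx}: carbon with {degree[idx]} neighbors")
    return warnings
```

Returning warnings rather than a yes/no answer lets the same checks serve either as a hard filter or as a flag for problematic structures.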
cclib Project Ideas
cclib is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.
Project [175 or 350 hours]: Implement new parsers
Brief explanation: There are outstanding issues on GitHub for supporting more programs (e.g. CFOUR, xtb, NBO, GAMESS dat, MRCC, DIRAC), and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA). There may also be more programs missing that haven't been considered.
Expected results: Implement parsers for one or more new programs/formats, generate test data, and write unit and regression tests for each parser.
Prerequisites: Experience with Python, basic familiarity with computational chemistry programs, and access to the program(s) needed to generate the test data.
Mentors: Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
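For applicants new to output-file parsing, the "parser plus unit test" pattern described in the parsers project above can be sketched in a few lines of Python. The log format here (lines of the form `TOTAL ENERGY = <number>`) is invented for illustration and does not correspond to any specific program, and the sketch does not use cclib's own parser base classes.

```python
import re

# Hypothetical log line, e.g. "  TOTAL ENERGY =   -76.026765"
ENERGY_LINE = re.compile(r"^\s*TOTAL ENERGY\s*=\s*(-?\d+\.\d+)\s*$")


def parse_energies(lines):
    """Collect every total energy found in an iterable of log lines."""
    energies = []
    for line in lines:
        match = ENERGY_LINE.match(line)
        if match:
            energies.append(float(match.group(1)))
    return energies


def test_parse_energies():
    """Tiny unit test using an in-memory stand-in for a log file."""
    log = [
        "some header text\n",
        "  TOTAL ENERGY =   -76.026765\n",
        "TOTAL ENERGY = -76.026800\n",
    ]
    assert parse_energies(log) == [-76.026765, -76.0268]
```

A real parser would be tested against actual output files generated by the program in question, which is why access to the program is listed as a prerequisite.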
Project [175 or 350 hours]: Implement new bridges
Brief explanation: There are outstanding issues on GitHub for more integrations with external programs (e.g. chemfiles, RDKit) via their Python bindings. There may also be more programs missing that haven't been considered.
Expected results: Implement bridges for one or more new programs, along with writing unit tests and documentation for each bridge.
Prerequisites: Experience with Python and ideally familiarity with the program that is being bridged.
Mentors: Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
Project [350 hours]: Implement new methods
Brief explanation: There are outstanding issues on GitHub for more analysis methods being added directly to cclib (e.g. calculating geometric parameters). There may also be other methods that are desirable to include which haven't been considered.
Expected results: Implement one or more new methods, along with writing unit tests and documentation for each method.
Prerequisites: Experience with Python and familiarity with the method(s) being added, depending on the complexity of the method.
Mentors: Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
Project [350 hours]: Additional visualization for OpenChemVault
Brief explanation: OpenChemVault (https://github.com/cclib/openchemvault) is capable of parsing output files, storing them, and displaying geometries, but any sort of additional visualization (such as plotting molecular orbitals or spectra) is missing. The capabilities of GaussSum (http://gausssum.sourceforge.net/) are a possible starting point.
Expected results: Implement one or more new visualizations for the OpenChemVault web interface.
Prerequisites: Experience with Python and familiarity with common visualizations that are desirable for computational chemistry outputs. No previous experience with JavaScript is necessary.
Mentors: Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com) and/or Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com)
RDKit Project Ideas
The RDKit is a BSD licensed open source cheminformatics toolkit written in C++ with wrappers for use from Python, Java, C#, and JavaScript. The RDKit also provides "cartridge" functionality that allows chemical searching in the open-source relational database PostgreSQL.
Project [350 hours]: Implement Molecular Interaction Fields calculations in the RDKit
Brief explanation: There is an old PR for the RDKit that implements molecular interaction fields: https://github.com/rdkit/rdkit/pull/318. This was never merged because the author ran out of time. At this point a lot of work would be required to update and finish this PR, but the results would be super useful for the RDKit community.
Expected results: A C++ implementation of the GRID calculator code along with a robust set of test cases. Wrappers for the calculator so that it is accessible from within the Python and SWIG (Java and C#) wrappers.
Prerequisites: C++
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
Project [175 or 350 hours]: Implement additional fingerprints in the RDKit
Brief explanation: There are a number of chemical fingerprint types which it would be useful to have natively available in the RDKit; in this project you will implement one or more of them. Some ideas for fingerprints to be included are:
- Pubchem fingerprint: https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf
- CSFP: https://doi.org/10.1021/acs.jcim.9b00571
- Physicochemical property fingerprints: porting the existing Python implementation (https://github.com/rdkit/rdkit/blob/7153918af4dff37c768577441c5286b425e6bf3d/rdkit/Chem/AtomPairs/Sheridan.py) to C++
The number to be implemented depends on whether you are doing this as a 175 or 350 hour project.
Expected results: A C++ implementation of the new fingerprints along with a robust set of test cases. Wrappers for the calculators so that they are accessible from within the Python and SWIG (Java and C#) wrappers.
Prerequisites: C++ and some Python
Mentor: Greg Landrum (greg.landrum at t5informatics dot com)
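For readers unfamiliar with fingerprints in general, the following Python sketch shows the basic idea of folding a set of features into a fixed-length bit vector and comparing two vectors with the Tanimoto coefficient. It is a generic hashed-fingerprint illustration only: it does not implement any of the fingerprints listed in the project above, it does not use the RDKit API, and the feature strings and function names are made up.

```python
import hashlib


def fold_features(features, n_bits=64):
    """Hash each feature string to a bit position and set that bit."""
    bits = 0
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits |= 1 << index
    return bits


def tanimoto(a, b):
    """Tanimoto similarity of two bit vectors stored as Python integers."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union
```

Using a stable hash such as SHA-256 (rather than Python's built-in `hash`) keeps the bit positions reproducible between runs, which is the kind of property a test suite for a new fingerprint would want to check.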
3Dmol.js Project Ideas
3Dmol.js is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.
Project [175 hours]: More cartoon options for nucleic acids
Brief explanation: Implement additional visualizations of nucleic acids.
Expected results: See https://github.com/3dmol/3Dmol.js/issues/559
Prerequisites: Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.
Mentor: David Koes (dkoes@pitt.edu)
Project [175 or 350 hours]: Improve 3Dmol.js
Brief explanation: Make significant improvements to 3Dmol.js functionality or performance.
Expected results: This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.
Prerequisites: Experience with JavaScript and client-server programming, some experience with OpenGL/WebGL ideal, but not necessary.
Mentor: David Koes (dkoes@pitt.edu)
gnina Project Ideas
gnina is a C/C++ framework for applying deep learning to molecular docking.
Project [175 or 350 hours]: Improve gnina
Brief explanation: Make significant improvements to gnina functionality or performance.
Expected results: This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.
Prerequisites: Experience with CUDA/C/C++ programming and the basics of deep learning.
Mentor: David Koes (dkoes@pitt.edu)
DeepChem Project Ideas
DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology. Additional project ideas are discussed at https://forum.deepchem.io/
Project [350 hours]: Layer Documentation
Brief explanation: DeepChem is moving towards a concept of first class layers. Improving the documentation for existing layers will help us make our current collection of layers more useful for the community.
Expected results: This project should also add a tutorial for using the layers to the DeepChem tutorial series, and should plan to add a few new layers as well.
Prerequisites: PyTorch/TensorFlow, Python
Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)
Project [350 hours]: PyTorch Porting
Brief explanation: DeepChem is shifting towards using PyTorch as its primary backend, but many models are still implemented in TensorFlow. A good project could be to pick a TensorFlow model or two, then port its layers and model into PyTorch along with suitable unit tests.
Expected results: At least one model should be ported from TensorFlow to PyTorch successfully with associated unit tests. See https://github.com/deepchem/deepchem/issues/2863
Prerequisites: PyTorch/TensorFlow, Python
Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)
Project [350 hours]: HuggingFace Integration
Brief explanation: Last year, we had a few student projects explore HuggingFace/DeepChem integration, but these projects were not able to merge HuggingFace models into DeepChem.
Expected results: This project would create a working HuggingFace model in DeepChem along with tutorials on how to use HuggingFace with DeepChem.
Prerequisites: PyTorch/TensorFlow, Python
Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)
Project [350 hours]: Improved PINNs Support
Brief explanation: One of the exciting new features in DeepChem 2.6.0 is support for PINNs, a class of techniques to solve PDEs with neural networks. The API for this class is still rudimentary: it supports only a limited class of models and requires hand-coding the loss.
Expected results: Extend the API to allow for a broader class of PDEs to be implemented. I’d suggest using Schrodinger’s equation as a test since Schrodinger can be solved in 1D as a toy and extended to arbitrarily high dimensions for larger molecules.
Prerequisites: PyTorch/TensorFlow, Python
Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)
Project [350 hours]: Improved Equivariance Support
Brief explanation: DeepChem has no support for equivariant models. Given the increasing importance of equivariance for scientific machine learning, this is a major oversight.
Expected results: This project would aim to add a tutorial about equivariant modeling and add an equivariant model to DeepChem. You may want to use e3nn or another library to facilitate implementation.
Prerequisites: PyTorch/TensorFlow, Python
Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)
Project [350 hours]: Improved Antibody Support
Brief explanation: DeepChem at present doesn’t have much tooling or support for working with antibodies.
Expected results: This project would add suitable antibody datasets to MoleculeNet and create a tutorial walking users through antibody design and modeling with DeepChem. If necessary, students may add antibody-specific models as well.
Prerequisites: PyTorch/TensorFlow, Python
Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)
Miscellaneous Project Ideas
These ideas would likely benefit two or more projects.
Project [350 hours]: Computational Chemistry Repository Server
Brief explanation: A web server to store and organize computational chemistry results is badly needed. Previous work includes MongoChemServer (https://github.com/OpenChemistry/mongochemserver) and OpenChemVault (https://github.com/cclib/openchemvault). The former includes visualizations for vibrations, spectra, etc. but is based on the outdated Girder data management platform. The latter handles its own data management, but doesn't offer the same range of visualizations.
Expected results: An extensible, maintainable server framework built on FastAPI and a modern database, which can use cclib to import computational files from any source, provide basic charts and visualizations (e.g., potential energy curves, spectra, vibrations). Suggested directions include the use of columnar data stores and frameworks such as Apache Arrow, Ibis, and similar that can offer native bindings across languages coupled with fast binary transport/storage.
Assuming the project is successful, a manuscript will be prepared for publication in an appropriate journal.
Prerequisites: Experience with Python and FastAPI and at least some understanding of computational chemistry.
Mentor: Geoffrey Hutchison (geoffh at pitt.edu) and Marcus Hanwell
Project [350 hours]: OneMol: Google Docs & YouTube for Molecules
Brief explanation: There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol compliant applications will be able to manipulate and view molecular data in real time so that changes made by one client will be propagated to other clients.
File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.
The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.
Expected results: Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).
Prerequisites: Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace would be very helpful.
Mentor: David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)
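The contrast drawn in the OneMol description above, between sending a full screen update and sending only the change in viewer state, can be sketched in Python as follows. This is a toy, in-process illustration under stated assumptions: the class names, the JSON message shape, and the `rotation` key are all hypothetical, there is no networking or storage module, and nothing here reflects an existing OneMol API.

```python
import json


class Client:
    """Stand-in for a molecular viewer that holds a local copy of the view state."""

    def __init__(self, name):
        self.name = name
        self.view = {}

    def receive(self, message):
        # Apply a small JSON delta instead of redrawing from a screen image.
        self.view.update(json.loads(message))


class Facilitator:
    """Keeps the shared viewer state and relays deltas to the other clients."""

    def __init__(self):
        self.state = {}
        self.clients = []

    def join(self, client):
        self.clients.append(client)
        client.receive(json.dumps(self.state))  # bring a late joiner up to date

    def publish(self, sender, delta):
        self.state.update(delta)
        message = json.dumps(delta)
        for client in self.clients:
            if client is not sender:
                client.receive(message)
```

In this sketch a rotation is transmitted as a few bytes of JSON (for example `{"rotation": [0, 90, 0]}`), and a client that joins later receives the accumulated state in one message.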