GSoC Ideas 2021: Difference between revisions

From wiki.openchemistry.org
Revision as of 19:54, 14 March 2021

Guidelines

Open Chemistry is an umbrella for projects in chemistry, materials science, biochemistry, and related areas. While we have participated in the last few Google Summer of Code programs and will apply again in 2021, there is no guarantee that we will be selected again for GSoC in 2021.

One important factor is that GSoC in 2021 will focus on shorter projects. You should consider the shorter timeline in your proposal.

We have gathered a pool of interested mentors together who are seasoned developers in each of these projects. We welcome original ideas in addition to what's listed here - please suggest something interesting for open source chemistry!

Adding Ideas

When adding a new idea to this page, please try to include the following information:

  • A brief explanation of the idea.
  • Expected results/feature additions.
  • Any prerequisites for working on the project.
  • Links to any further information, discussions, bug reports etc.
  • Any special mailing lists if not the standard mailing list for the project
  • Your name and email address for contact (if willing to mentor, or nominated mentor).

Proposal Guidelines

Students need to write and submit a proposal. We have added the "Applying to GSoC" page to help guide students on what we would like to see in those proposals.

Avogadro 2 Project Ideas

Avogadro 2 is a chemical editor and visualization application. It is also a set of reusable software libraries written in C++, designed around modularity for maximum reuse. We offer permissively licensed, open source, cross-platform software components in the Avogadro 2 libraries, along with an end-user application with full source code and binaries.

Project: Python-based Compute and Data Server

Brief explanation: Avogadro would be more powerful with a local compute and data server.

Expected results: A number of projects have built servers for larger projects that can also do compute, Jupyter, etc. Python has a number of lightweight web frameworks, such as FastAPI, in which RESTful APIs can be developed rapidly. Using one of these as a basis, along with PostgreSQL, EdgeDB, or another database technology, the project would build a lightweight data layer for storing, searching, and visualizing data. Ideally this would be packaged in a container and be deployable to the cloud or runnable locally via pip or conda. A stretch goal would be to implement simple queuing and execution of jobs within the server API, reusing existing Python projects to handle queuing, execution, etc.
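To make the "lightweight data layer" idea concrete, here is a minimal sketch of storing and searching molecule records. It uses the standard-library sqlite3 module purely as a stand-in for PostgreSQL or EdgeDB, and the table layout and function names (open_store, add_molecule, find_by_formula) are hypothetical, not part of Avogadro or any existing server.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny molecule store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS molecules ("
        "id INTEGER PRIMARY KEY, name TEXT, formula TEXT, data TEXT)"
    )
    return conn


def add_molecule(conn, name, formula, data):
    """Store one record; `data` is any JSON-serializable payload."""
    cur = conn.execute(
        "INSERT INTO molecules (name, formula, data) VALUES (?, ?, ?)",
        (name, formula, json.dumps(data)),
    )
    conn.commit()
    return cur.lastrowid


def find_by_formula(conn, formula):
    """Return (name, data) pairs for every record with the given formula."""
    rows = conn.execute(
        "SELECT name, data FROM molecules WHERE formula = ? ORDER BY id",
        (formula,),
    ).fetchall()
    return [(name, json.loads(data)) for name, data in rows]
```

In a real server, functions like these would sit behind REST endpoints; the web framework and the database are choices left to the proposal.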

Prerequisites: Experience in Python, some experience with C++/Qt and RESTful APIs.

Mentor: Marcus D. Hanwell (mhanwell at bnl.gov)

Project: Biological Data Visualization


Brief explanation: Efficient biomolecular visualization, including surfaces, cartoons, etc. would be ideal

Expected results: Support for residues, and reading secondary structure (e.g., PDB format) is now present. Additional rendering modes for secondary biological structures (i.e. ribbons, cartoons, etc.), building up a biomolecule from residues, and adding residue labels are desired. Code and algorithms may be adapted from 3DMol.js. Since biological molecules are often large (10^3 to 10^6 atoms and bonds), such implementations should be highly efficient and optimized, adopting symmetry and other techniques to improve interactivity and rendering performance. General extension of Avogadro for editing/interacting with biological data structures, and/or structures with named fragments would be ideal. Extending that to porting builders for fragment based building blocks would be a big plus once basic support for rendering is in place.

Prerequisites: Experience in C++. Some experience with OpenGL and biochemistry is ideal, but not necessary.

Mentor: Marcus D. Hanwell (mhanwell at bnl.gov) or Geoffrey Hutchison (geoffh at pitt.edu)

Project: Scripting Bindings

Brief explanation: Implement an embedded scripting language (i.e., Python) in Avogadro 2

Expected results: Enable an embedded scripting console as well as support for implementing modular extensions (tools, rendering, etc.) in Python. Python bindings exist, using PyBind11 with the new codebase, and the Avogadro 2 core libraries are pip installable. Extending the coverage of the API from the rudimentary parts of core/io would be a good starting point. An ideal solution would connect to PySide2, to allow scripting to add UI like menu items, windows, etc. and provide documentation and example scripts. The interface should be maintainable as new classes and methods are added.

Example scripts and documentation are highly encouraged.

Prerequisites: Experience in C++ and Python, some experience with PyBind11, Qt for Python, PySide suggested.

Mentor: Geoff Hutchison (geoffh at pitt dot edu) or Marcus D. Hanwell (mhanwell at bnl dot gov)

Project: Integrate with RDKit

Brief explanation: Integrate the RDKit toolkit into Avogadro for conformer sampling and force field optimization

Expected results: RDKit is a BSD-licensed cheminformatics toolkit with a wide range of features useful for Avogadro 2. Most notably, RDKit offers efficient and accurate 3D coordinate generation, conformer sampling, and force field optimization. Implement a connection between Avogadro objects (molecules and atoms) and RDKit objects and implement conformer sampling and force field optimization code.

Prerequisites: Experience in C++; some experience with Python will be helpful.

Mentor: Geoff Hutchison (geoffh at pitt dot edu)

Project: Tools for Interactive Molecular Dynamics

Brief explanation: Building solvent boxes, implementing standard molecular dynamics using in-progress optimization framework.

Expected results: Avogadro (v1) has interactive force field optimization allowing building and manipulation (e.g., push-pull atoms into position). Some users call this 'video game mode' ;-) A new optimization framework is in progress, including calling external programs for energies and forces. The project would enable building out MD simulations, including tools to add water or solvent boxes, build larger systems (e.g., via PackMol integration) and implement simple MD integration and thermostats.
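As an illustration of what "simple MD integration" means, the following is a minimal velocity Verlet sketch for a single particle in one dimension. It is written in Python for brevity (the project itself targets C++), omits any thermostat, and uses a harmonic force as a made-up test system. None of the names come from Avogadro.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate 1D motion with velocity Verlet; returns a list of (x, v)."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


def harmonic_force(x, k=1.0):
    """Force of a harmonic spring centered at the origin (test system only)."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the harmonic test system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x
```

Checking that total_energy stays nearly constant along the trajectory is a common sanity test for an integrator before thermostats or solvent boxes are added on top.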

Prerequisites: Experience in C++, ideally with knowledge of molecular dynamics methods and tools. Some Python would be helpful.

Mentor: Geoff Hutchison (geoffh at pitt dot edu)

Open Babel Project Ideas

Open Babel is an open toolbox for chemistry, designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling, chemistry, solid-state materials, biochemistry, or related areas.

Project: Integrate CoordGen library

Expected results: Schrodinger has released a BSD-licensed library for 2D chemical structure layout (https://github.com/schrodinger/coordgenlibs) and it has been successfully integrated into RDKit. The student will be responsible for integrating CoordGen into Open Babel. Code will be written in C++.

Mentor: Geoff Hutchison (geoffh at pitt dot edu)

Project: Implement MMTF format

Brief explanation: Implementation of MMTF file format in OpenBabel.

Expected results: Macromolecular Transmission Format (MMTF) is a new compact binary format to transmit and store biomolecular structural data quickly and accurately (http://mmtf.rcsb.org). Your task is to implement support for this format in the OpenBabel open-source cheminformatics toolkit (http://openbabel.org). Code will be written in C++.

Mentor: Geoff Hutchison (geoffh at pitt dot edu) or David Koes (dkoes at pitt dot edu)

Project: Test Framework Overhaul

Brief explanation: Automated testing is an important part of maintaining code quality. This project will improve the current testing regime of openbabel.

Expected results: A comprehensive test framework that automates the generation of unit tests for all supported languages and simplifies the creation of new test cases will be implemented. The student will be responsible for choosing the most appropriate framework, porting existing test cases, and expanding the test suite to enhance code coverage.

Prerequisites: Experience in C++. Knowledge of modern software engineering practices or test frameworks is ideal.

Mentor: Geoff Hutchison (geoffh at pitt dot edu), David Koes (dkoes at pitt dot edu), the OpenBabel development community.

Project: Develop a JavaScript version of Open Babel

Brief explanation: Building on existing work, you will use Emscripten to compile the C++ codebase of Open Babel to JavaScript. This will make it easy to write in-browser applications that need cheminformatics functionality.

Expected results: Following from work described in a recent paper (https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00434), a JavaScript version of the Open Babel toolkit will be created. The generation of any necessary wrappers should be automated to allow it to track changes in the Open Babel API.

Ideally, the project will adapt a core JavaScript library, openbabel.js, that allows modules such as file formats to be imported separately (e.g., smilesformat.js, pdbformat.js, xyzformat.js, etc.).

Prerequisites: Some experience in C++, and also with JavaScript.

Mentor: Noel O'Boyle (baoilleach at gmail dot com)

Project: Develop a validation and standardization filter

Brief explanation: Given a particular molecular structure, can we say how chemically plausible it is, and use this to filter out or warn about problems (e.g., undefined stereo centers)?

Expected results: Given a set of reference structures (e.g. ChEMBL), it should be possible to build a model that can say how normal/unusual a query structure is. For example, given a set of drug-like molecules, a molecule with a ruthenium atom might be considered unusual; or given any set of molecules, a 5-coordinate carbon is unusual.

Such a model could be used as a filter, or as a warning to flag up problematic structures.
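One very simple form such a model could take is sketched below: it records how often each element appears in a reference set and flags query elements that are rare or absent. This is only a toy illustration under that assumption. A real model would look at much more than element counts (valence, stereo, substructures), and the function names are hypothetical.

```python
from collections import Counter


def build_element_model(reference_molecules):
    """Fraction of reference molecules that contain each element.

    Each molecule is given as a list of element symbols.
    """
    counts = Counter()
    for atoms in reference_molecules:
        counts.update(set(atoms))  # count each element once per molecule
    n = len(reference_molecules)
    return {element: count / n for element, count in counts.items()}


def flag_unusual(atoms, model, threshold=0.05):
    """Return the query's elements whose reference frequency is below threshold."""
    return sorted(el for el in set(atoms) if model.get(el, 0.0) < threshold)
```

Used as a filter, a non-empty result from flag_unusual would reject the structure; used as a warning, the returned elements can be reported to the user.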

Code could be modeled on MolVS, which is built on RDKit.

Prerequisites: Experience in C++ or Python, and an interest in data science or statistics.

Mentor: Noel O'Boyle (baoilleach at gmail dot com) or Geoff Hutchison (geoffh at pitt dot edu)

cclib Project Ideas

cclib is an open source library, written in Python, for parsing and interpreting the results of computational chemistry packages. The goals of cclib are centered around the reuse of data obtained from these programs when stored in program-specific output files.

Project: Implement new parsers

Brief explanation: There are outstanding issues on GitHub for supporting more programs (e.g. CFOUR, xtb, NBO, GAMESS dat, MRCC, DIRAC), and parsing binary files for various QM programs (e.g. Gaussian, NWChem, and ORCA). There may also be more programs missing that haven't been considered.

Expected results: Implement parsers for one or more new programs/formats, generate test data, and write unit and regression tests for each parser.
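The general shape of a line-based parser can be sketched as follows. The log line format matched here ("Total SCF energy = ...") and the function name are assumptions made for illustration only. This is not cclib's parser API, which has its own base classes and attribute conventions.

```python
import re

# Hypothetical log line, e.g. "  Total SCF energy =   -1.50000".
ENERGY_RE = re.compile(r"^\s*Total SCF energy\s*=\s*(-?\d+\.\d+)")


def parse_scf_energies(lines):
    """Collect every SCF energy found in an iterable of log lines."""
    energies = []
    for line in lines:
        match = ENERGY_RE.match(line)
        if match:
            energies.append(float(match.group(1)))
    return energies
```

A unit test for such a parser would feed it a small, checked-in output file and compare the extracted values against numbers read by hand from that file, which is why access to the program that generates the test data matters.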

Prerequisites: Experience with Python, basic familiarity with computational chemistry programs, and access to the program(s) needed to generate the test data.

Mentors: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)

Project: Implement new bridges

Brief explanation: There are outstanding issues on GitHub for more integrations with external programs (e.g. chemfiles, RDKit) via their Python bindings. There may also be more programs missing that haven't been considered.

Expected results: Implement bridges for one or more new programs, along with writing unit tests and documentation for each bridge.

Prerequisites: Experience with Python and ideally familiarity with the program that is being bridged.

Mentors: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)

Project: Implement new methods

Brief explanation: There are outstanding issues on GitHub for more analysis methods being added directly to cclib (e.g. calculating geometric parameters). There may also be other methods that are desirable to include which haven't been considered.
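As an example of the kind of geometric parameters meant here, the sketch below computes a bond length and a bond angle from Cartesian coordinates. The function names are illustrative and are not existing cclib methods. math.dist requires Python 3.8 or later.

```python
import math


def distance(a, b):
    """Distance between two points given as (x, y, z) sequences."""
    return math.dist(a, b)


def angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))
```

The results are in the same length unit as the input coordinates (for distance) and in degrees (for angle).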

Expected results: Implement one or more new methods, along with writing unit tests and documentation for each method.

Prerequisites: Experience with Python and familiarity with the method(s) being added, depending on the complexity of the method.

Mentors: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)

Project: Julia bindings

Brief explanation: The Julia programming language (https://julialang.org/) is growing in popularity for computational chemistry as a language that both production-level computation and analysis can be performed in seamlessly. In order to analyze computational chemistry outputs from traditional programs in Julia, rather than reimplement all cclib functionality in Julia, we should be able to call cclib from Julia directly and reuse its core functionality.

Expected results: Julia bindings to cclib IO functionality and a Julia-native representation of cclib data objects, with each cclib attribute accessible as a native Julia type. The bindings should be available on the default Julia package registry. The remainder of the project is more open-ended, but an example application of using the bindings would be ideal.

Prerequisites: Experience with Python and/or Julia, and ideally some familiarity with important quantities from computational chemistry outputs.

Mentors: Eric Berquist (eric.john.berquist at gmail dot com)

Project: Additional visualization for OpenChemVault

Brief explanation: OpenChemVault (https://github.com/cclib/openchemvault) is capable of parsing output files, storing them, and displaying geometries, but any sort of additional visualization (such as plotting molecular orbitals or spectra) is missing. The capabilities of GaussSum (http://gausssum.sourceforge.net/) are a possible starting point.

Expected results: Implement one or more new visualizations for the OpenChemVault web interface.

Prerequisites: Experience with Python and familiarity with common visualizations that are desirable for computational chemistry outputs. No previous experience with JavaScript is necessary.

Mentors: Adam Tenderholt (atenderholt at gmail dot com) and/or Karol Langner (karol.langner at gmail dot com) and/or Eric Berquist (eric.john.berquist at gmail dot com) and/or Shiv Upadhyay (shivnupadhyay at gmail dot com)

QC-Devs Project Ideas

QC-Devs (https://qcdevs.org/) develops various free, open-source, and cross-platform libraries for scientific computing, especially theoretical and computational chemistry. Our goal is to make programming accessible to chemists and promote precepts of sustainable software development. The two main pieces of the QC-Devs ecosystem are:

  • HORTON (electronic structure theory): https://quantumelephant.org/
  • ChemTools (molecular structure and reactivity): https://chemtools.org/

All our repositories are hosted under the Theochem organization (https://github.com/theochem) on GitHub.

Project: Visualization of Molecular Structure and Reactivity

Brief Explanation: ChemTools (https://github.com/theochem/chemtools) is a post-processing library for extracting chemical insight from quantum chemistry calculations. Currently, ChemTools relies on Visual Molecular Dynamics (VMD) and Matplotlib for visualization. ChemTools has the functionality to generate visualization scripts for VMD, so the user can easily generate informative plots like iso-surface of electron density colored by electrostatic potential.

Expected Results: Add functionality to ChemTools to generate visualization scripts for PyMol, IQMol, and Avogadro. The current functionality for VMD can be used as a template.

Difficulty Level: Intermediate

Relevant Skills: Experience with Python and visualization

Mentor: Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), Gabriela Sánchez Díaz (sanchezg at mcmaster dot ca), and Esteban Vohringer-Martinez (estebanvohringer at qcmmlab dot com)

Project: Extended Interoperability of ChemTools and Quantum Chemistry Software

Brief Explanation: ChemTools (https://github.com/theochem/chemtools) is a post-processing library for extracting chemical insight from quantum chemistry calculations. Currently, ChemTools relies on modules of the HORTON library to compute the basic quantities required for its analysis. The goal of this project is to extend the interoperability of ChemTools, so that it can use the Psi4 & PySCF packages and take advantage of their features.

Expected Results: Writing wrappers for Psi4 and PySCF to compute various quantum mechanical properties and provide those properties to ChemTools for further analysis. The current wrappers for HORTON can be used as a template. Both Psi4 & PySCF have Python interfaces.

Difficulty Level: Intermediate

Relevant Skills: Experience with Python, Numpy

Mentor: Ali Tehrani (19at27 at queensu dot ca), Gabriela Sánchez Díaz (sanchezg at mcmaster dot ca), and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca)

Project: Visualize Chemical Reactions

Brief Explanation: GOpt (https://github.com/theochem/gopt) is a Python library for optimizing molecular structures and determining chemical reaction pathways. Currently, GOpt can output a series of chemically relevant numerical structures (e.g., structures along the intrinsic reaction coordinate; optimization trajectories), but there is no interface to visualize these structures or perform structural or chemical analysis of them. The goal of this project is to generate visualization scripts for Avogadro, PyMol and/or IQMol, all of which can provide animations of reaction pathways and optimization trajectories. A stretch goal is to provide a workflow linking GOpt to ChemTools (https://github.com/theochem/chemtools), so that structural and reactivity indicators can be computed and visualized along reaction pathways.

Expected Results: Add functionality to GOpt to generate visualization scripts for Avogadro, PyMol and/or IQMol. (Stretch goal: Interface GOpt and ChemTools to facilitate chemical reaction path analysis.)
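A common first step toward animating a pathway is to write the structures as a multi-frame XYZ file, which molecular viewers can typically load as a trajectory. The sketch below assumes each frame is a list of (symbol, x, y, z) tuples. That data layout and the function name are assumptions for illustration, not GOpt's actual output structures.

```python
def frames_to_xyz(frames, comments=None):
    """Serialize a list of frames into multi-frame XYZ text.

    Each frame is a list of (symbol, x, y, z) tuples. Every XYZ block holds
    the atom count, one comment line, and then one line per atom.
    """
    blocks = []
    for index, frame in enumerate(frames):
        comment = comments[index] if comments else f"frame {index}"
        lines = [str(len(frame)), comment]
        for symbol, x, y, z in frame:
            lines.append(f"{symbol:<2s} {x:14.8f} {y:14.8f} {z:14.8f}")
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"
```

The comment line of each frame is a convenient place to record the step index or energy along the reaction coordinate.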

Difficulty Level: Easy

Relevant Skills: Experience with Python

Mentor: Derrick Yang (yxt1991 at gmail dot com) and Paul Ayers (ayers at mcmaster dot ca)

Project: Extended interoperability of GOpt and Quantum Chemistry Software

Brief Explanation: GOpt (https://github.com/theochem/gopt) is a Python library for optimizing molecular structures and determining chemical reaction pathways. Currently, it obtains the required information (e.g. atomic forces and Hessian matrix) for optimization from the Gaussian quantum chemistry package. The goal of this project is to make it possible for GOpt to use Psi4, PySCF, ORCA, and NWChem at every step of the optimization.

Expected Results: Expanding the scope of the GOpt library by increasing the number of quantum chemistry packages it can use for studying chemical reactions. You are expected to use IOData (https://github.com/theochem/iodata), which is a Python library for parsing, storing, and writing various quantum chemistry file formats and generating input files for quantum chemistry packages. This involves:

  • GOpt using IOData to write an appropriate input file for each of the above-mentioned quantum chemistry packages.
  • GOpt using IOData to parse the (formatted) output files from these quantum chemistry packages to extract the necessary information (energy, gradient, Hessian, etc.).
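The write-input / run / parse-output cycle described above can be organized around a small registry of backends, one per package. The sketch below is purely illustrative: the "demo" input and output formats are invented, and in the actual project the writer and parser would delegate to IOData rather than use hand-written functions like these.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

Atom = Tuple[str, float, float, float]


class Backend(NamedTuple):
    """Pair of functions for one package (hypothetical interface)."""
    write_input: Callable[[List[Atom]], str]
    parse_output: Callable[[str], dict]


def _write_demo(atoms):
    # Invented input format, for illustration only.
    body = "\n".join(f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms)
    return f"geometry\n{body}\nend\ntask gradient\n"


def _parse_demo(text):
    # Invented output format: "ENERGY <e>" and "GRAD <gx> <gy> <gz>" lines.
    result = {"energy": None, "gradient": []}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ENERGY":
            result["energy"] = float(fields[1])
        elif fields[0] == "GRAD":
            result["gradient"].append([float(v) for v in fields[1:4]])
    return result


BACKENDS: Dict[str, Backend] = {"demo": Backend(_write_demo, _parse_demo)}


def run_step(package, atoms, execute):
    """One optimization step: write input, run it via `execute`, parse output."""
    backend = BACKENDS[package]
    return backend.parse_output(execute(backend.write_input(atoms)))
```

Keeping the optimizer dependent only on run_step means that supporting another package amounts to registering one more Backend entry.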

Difficulty Level: Intermediate

Relevant Skills: Experience with Python.

Mentor: Derrick Yang (yxt1991 at gmail dot com), Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca), and Paul Ayers (ayers at mcmaster dot ca)

Project: Implement Workflows for Calculation and Usage of Databases of Isolated Atom Densities

Brief Explanation: A database of atomic electron densities is often used to analyze electron densities of gas-phase molecules or condensed phases. In practice, there are many ways to calculate the electron density, using different theoretical models and computational tools. As a consequence, such a database is not a one-time effort, but rather a procedure that is regularly repeated with different computational settings and theoretical models. Setting up and processing such calculations by hand (for different elements, ions, spin states, ...) is extremely tedious and error-prone. The implementation of an easy-to-use workflow would heavily reduce the burden of researchers who make use of such databases. This project also aims to facilitate the exchange and archival of atomic density databases.

Expected Results:

  • Extension of Denspart (https://github.com/theochem/denspart) with a database that can store (spherical) atomic electron densities together with atomic metadata. This program currently uses a hard-coded database.
  • Development and implementation of a JSON specification for archival and exchange of atomic density databases.
  • Implementation of a workflow for setting up new databases. This involves (i) the generation of input files for existing quantum chemistry codes together with a suitable job script to execute the calculations on an HPC and (ii) processing the outputs of these calculations.

This workflow will be implemented using other packages in the HORTON project, such as IOData, Grid, and GBasis (see https://github.com/theochem).
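To give a feel for what a JSON exchange format might look like, here is a minimal sketch of one record per atom or ion, with a round trip through the standard json module. The field names and the schema_version key are hypothetical. Designing the actual specification is part of the project.

```python
import json


def make_record(element, charge, multiplicity, method, radii, densities):
    """Build one database entry (hypothetical field names)."""
    if len(radii) != len(densities):
        raise ValueError("radii and densities must have the same length")
    return {
        "element": element,
        "charge": charge,
        "multiplicity": multiplicity,
        "method": method,                 # free-text level of theory
        "radial_grid": list(radii),       # radial grid points
        "density": list(densities),       # spherical density on that grid
    }


def dump_database(records):
    """Serialize records to JSON text with a version tag for future changes."""
    return json.dumps({"schema_version": 1, "records": records},
                      indent=2, sort_keys=True)


def load_database(text):
    """Parse JSON text produced by dump_database and return the records."""
    data = json.loads(text)
    if data.get("schema_version") != 1:
        raise ValueError("unsupported schema version")
    return data["records"]
```

Storing the computational settings next to the numerical data is what lets a database be regenerated or compared when the theoretical model changes.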

Difficulty Level: Intermediate

Relevant Skills: Experience with Python, NumPy

Mentor: Toon Verstraelen (Toon.Verstraelen at ugent dot be) and Farnaz Heidar-Zadeh (farnaz.heidarzadeh at queensu dot ca)

Project: Orthogonal Procrustes for Rectangular Matrices

Brief Explanation: Procrustes (https://github.com/theochem/procrustes) is a library for finding the optimal transformation that makes two matrices as close as possible to each other. Procrustes analysis has numerous applications in object recognition, though our primary interest pertains to its utility for quantifying chemical and physical (dis)similarity of molecular structures. Currently, when two input matrices have different numbers of columns, the smaller matrix is augmented by columns of zeros (zero-padding). An alternative to this artificial approach was recently proposed for the special case of orthogonal transformations [SIAM Journal on Matrix Analysis and Applications, vol. 41, pp. 957-983 (2020)]. The goal of this project is to implement the SCFRTR method (algorithm 5.1) from this reference in the Procrustes library.

Expected Results: Extension of Procrustes to include the SCFRTR algorithm as an alternative to zero-padding for unbalanced orthogonal Procrustes problems.
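For orientation, the zero-padded formulation that the project aims to improve on can be written as follows. This is the textbook orthogonal Procrustes problem with a right-hand transformation; the symbols A, B, and Q and the choice of which matrix is transformed are notational assumptions here, not necessarily the library's conventions.

```latex
% A is m x n, B is m x k with k < n; B is padded with n - k zero columns.
\tilde{B} = \begin{bmatrix} B & 0_{m \times (n-k)} \end{bmatrix},
\qquad
\min_{Q \in \mathbb{R}^{n \times n}} \; \lVert A Q - \tilde{B} \rVert_F^2
\quad \text{subject to} \quad Q^{\mathsf{T}} Q = I .

% Closed-form solution via the singular value decomposition:
A^{\mathsf{T}} \tilde{B} = U \Sigma V^{\mathsf{T}}
\quad \Longrightarrow \quad
Q^{\star} = U V^{\mathsf{T}} .
```

The padding makes the problem square so that this closed form applies, but the added zero columns are artificial targets, which is the motivation for treating the unbalanced problem directly.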

Difficulty Level: Advanced

Relevant Skills: Experience with Python, NumPy, and numerical analysis

Mentor: Ali Tehrani (19at27 at queensu dot ca), David Kim (david.kim.91 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca)

Project: Faster Molecular Integrals with Density-Fitting

Brief Explanation: GBasis (https://github.com/theochem/gbasis) is a library for evaluating and analytically integrating Gaussian-type orbitals and their related quantities, especially molecular integrals. In many applications, the computational bottleneck is the evaluation of two-electron integrals, as the number of two-electron integrals grows as the fourth power of the basis-set size. By introducing an auxiliary, density-fitting, basis, this power is reduced to the third power of the basis-set size, which in many cases eliminates the computational bottleneck, since there are often other facets of the computation that scale more severely than this. The goal of this project is to implement density-fitting methods into GBasis.

Expected Results: Extension of GBasis to support density fitting. This involves expanding products of basis functions in the auxiliary basis, evaluating 2-electron integrals in the auxiliary basis, and using these two entities to construct molecular integrals more efficiently.
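The three ingredients listed above correspond to the standard density-fitting (resolution-of-the-identity) approximation in the Coulomb metric, sketched below. The notation (a, b, c, d for orbital basis functions and P, Q for auxiliary functions) is an assumption for illustration and is not taken from GBasis.

```latex
% Expand each orbital product in the auxiliary basis {P}:
|ab) \approx \sum_{P} c^{ab}_{P} \, |P),
\qquad
c^{ab}_{P} = \sum_{Q} (ab|Q) \, [\mathbf{J}^{-1}]_{QP},
\qquad
J_{PQ} = (P|Q) .

% Four-index integrals assembled from three- and two-index quantities:
(ab|cd) \approx \sum_{P,Q} (ab|P) \, [\mathbf{J}^{-1}]_{PQ} \, (Q|cd) .
```

Only the three-index integrals (ab|P) and the two-index matrix J need to be evaluated explicitly, which is where the reduction from fourth-power to third-power scaling in the basis-set size comes from.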

Difficulty Level: Intermediate to Advanced

Relevant Skills: Experience with Python, NumPy

Mentor: Ali Tehrani (19at27 at queensu dot ca), David Kim (david.kim.91 at gmail dot com), Paul Ayers (ayers at mcmaster dot ca)

3Dmol.js Project Ideas

3Dmol.js is a modern, object-oriented JavaScript library for visualizing molecular data that is forked from GLmol. A particular emphasis is placed on performance.

Project: Improve 3Dmol.js

Brief explanation: Make significant improvements to 3Dmol.js functionality or performance.

Expected results: This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.

Prerequisites: Experience with JavaScript and client-server programming. Some experience with OpenGL/WebGL is ideal but not necessary.

Mentor: David Koes (dkoes@pitt.edu)

gnina Project Ideas

gnina is a C/C++ framework for applying deep learning to molecular docking.

Project: Improve gnina

Brief explanation: Make significant improvements to gnina functionality or performance.

Expected results: This is an open-ended project that must be driven by the applicant. A strong proposal will identify significant shortcomings in the current code and explain how they will be addressed. The GitHub Issues page may provide some ideas. A proposal must include a significant initial pull request.

Prerequisites: Experience with CUDA/C/C++ programming and the basics of deep learning.

Mentor: David Koes (dkoes@pitt.edu)

NWChem Project Ideas

NWChem is a widely used open-source computational chemistry package that tackles a wide variety of scientific problems.

Project: NWChem-JSON

Brief explanation: Expansion of JSON capabilities in NWChem to plane wave DFT dynamics and molecular dynamics.

Expected results: Expand the JSON output generator in the NWChem source to include plane wave DFT dynamics and classical molecular dynamics capabilities. In addition, the "Python NWChem output to JSON converter" needs to be expanded to include these capabilities. The latter strongly overlaps with cclib's project ideas for building a complete set of Python parsers. The question of how to handle large data structures in conjunction with JSON also needs to be addressed.
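One possible way to approach the large-data question is to stream a trajectory as JSON Lines, one JSON object per time step, so that neither the writer nor the reader holds the whole run in memory. The sketch below uses only the Python standard library. The field names ("step", "energy", "coordinates") and the generated frames are hypothetical placeholders, not the NWChem JSON schema.

```python
import io
import json


def write_frames(stream, frames):
    """Write one JSON object per line; frames can be any iterable or generator."""
    count = 0
    for frame in frames:
        stream.write(json.dumps(frame) + "\n")
        count += 1
    return count


def read_frames(stream):
    """Yield frames one at a time instead of loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def fake_trajectory(n_steps):
    """Hypothetical stand-in for frames produced by a dynamics run."""
    for step in range(n_steps):
        yield {
            "step": step,
            "energy": -76.0 - 0.001 * step,  # placeholder value
            "coordinates": [[0.0, 0.0, 0.1 * step], [0.0, 0.0, 1.0]],
        }


# An in-memory buffer keeps the example self-contained; a real run would use a file.
buffer = io.StringIO()
written = write_frames(buffer, fake_trajectory(3))
buffer.seek(0)
steps = [frame["step"] for frame in read_frames(buffer)]
```

The trade-off is that the output is no longer a single JSON document, so consumers that expect one object per file would need a small adapter.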

Prerequisites: Experience with Fortran90 and Python

Mentor: Bert de Jong (wadejong at lbl dot gov)

Project: NWChem-Python-Jupyter Interface

Brief explanation: Expose and bind NWChem data structures and computational APIs to Python, and use them in Jupyter notebooks.

Expected results: NWChem currently has a very limited interface with Python, yet more and more developers use platforms such as Python to sandbox new theories, methods, and algorithms. A full Python interface needs to be developed for the NWChem software suite. The extended interface could then be integrated into a Jupyter notebook.

Prerequisites: Experience with Fortran and Python

Mentor: Bert de Jong (wadejong at lbl dot gov)

Project: JSON-LD for Chemical Data

Brief explanation: Transforming NWChem and Chemical JSON formats to JSON-LD

Expected results: Refactoring NWChem JSON and Chemical JSON formats to utilize JSON-LD. Currently the JSON documents that are created are single dense objects, even though they could be handled as linked objects. This transformation will enable the generated computational data and objects to be more naturally aligned with triple-stores and knowledge graphs when connecting with experimental data.
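To illustrate the difference between a single dense object and linked objects, the sketch below splits a small nested record into JSON-LD nodes that refer to each other by "@id". The dense layout is a simplified stand-in, not the actual NWChem JSON or Chemical JSON schema. The vocabulary and base IRIs under example.org, the property names, and the numerical values are all hypothetical placeholders.

```python
import json

# Simplified dense record (hypothetical layout and placeholder values).
dense = {
    "name": "water",
    "atoms": [
        {"element": "O", "xyz": [0.0, 0.0, 0.117]},
        {"element": "H", "xyz": [0.0, 0.757, -0.469]},
        {"element": "H", "xyz": [0.0, -0.757, -0.469]},
    ],
    "calculation": {"method": "DFT", "totalEnergy": -76.4},
}

VOCAB = "https://example.org/chem#"  # hypothetical vocabulary IRI
BASE = "https://example.org/data"    # hypothetical base IRI for node identifiers


def to_linked(record, base=BASE):
    """Turn the dense record into a JSON-LD graph of separately addressable nodes."""
    mol_id = f"{base}/molecule/{record['name']}"
    atom_nodes = [
        {
            "@id": f"{mol_id}/atom/{index}",
            "@type": "Atom",
            "element": atom["element"],
            "xyz": atom["xyz"],
        }
        for index, atom in enumerate(record["atoms"])
    ]
    molecule_node = {
        "@id": mol_id,
        "@type": "Molecule",
        "name": record["name"],
        # Links to the atom nodes rather than nested copies of them.
        "hasAtom": [{"@id": node["@id"]} for node in atom_nodes],
    }
    calculation_node = {
        "@id": f"{mol_id}/calculation/0",
        "@type": "Calculation",
        "onMolecule": {"@id": mol_id},
        **record["calculation"],
    }
    context = {
        "@vocab": VOCAB,
        # JSON-LD arrays are unordered unless declared as lists.
        "xyz": {"@container": "@list"},
    }
    return {
        "@context": context,
        "@graph": [molecule_node, *atom_nodes, calculation_node],
    }


linked = to_linked(dense)
serialized = json.dumps(linked, indent=2)
```

Because every node carries its own identifier, a triple-store or knowledge graph can attach further statements, such as an experimental measurement, to the same molecule node without rewriting the computational record.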

Prerequisites: Experience with Python and JSON-LD

Mentor: Bert de Jong (wadejong at lbl dot gov)

DeepChem Project Ideas

DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology. Additional project ideas are discussed at https://forum.deepchem.io/t/google-summer-of-code-ideas/356.

Project: PyTorch Lightning Implementation

Brief explanation: Allow for implementation of DeepChem models in PyTorch Lightning.

Expected results: PyTorch Lightning is a popular framework for PyTorch. This project would look into enabling the easy construction of PyTorch Lightning-based models for DeepChem. A completed project should include a good test suite and a Jupyter notebook tutorial for implementing PyTorch Lightning models in DeepChem.

Prerequisites: PyTorch Lightning, Python

Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)

Project: Semiconductor Modeling Support

Brief explanation: Add support for semiconductor modeling deep learning tools.

Expected results: This project would involve implementing semiconductor models from https://arxiv.org/ftp/arxiv/papers/2101/2101.04383.pdf. These models should be added to DeepChem along with suitable tests and a Jupyter notebook usage tutorial.

Prerequisites: PyTorch/TensorFlow, Python

Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)

Project: Protein Language Models

Brief explanation: Add support for protein language models.

Expected results: This project would implement a language model for protein sequence modeling, using a transformer or another suitable language model on a dataset like UniProt. Models should be added to DeepChem along with suitable tests and a good Jupyter notebook usage tutorial.
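Whatever model is chosen, protein sequences first have to be turned into integer tokens. The sketch below shows one simple way to do that in plain Python, with one token per residue and padding to a common length. It is not a DeepChem API: the vocabulary layout, the special tokens, and the function names are assumptions made for illustration.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical special tokens: padding and unknown residue.
SPECIAL_TOKENS = ["<pad>", "<unk>"]
VOCAB = {token: index for index, token in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode(sequence):
    """Map each residue to its integer id; unrecognized letters map to <unk>."""
    unknown = VOCAB["<unk>"]
    return [VOCAB.get(residue, unknown) for residue in sequence.upper()]


def pad_batch(sequences):
    """Encode several sequences and right-pad them to the longest one."""
    encoded = [encode(sequence) for sequence in sequences]
    width = max(len(ids) for ids in encoded)
    pad = VOCAB["<pad>"]
    return [ids + [pad] * (width - len(ids)) for ids in encoded]


# Two short made-up sequences; "X" is not in the vocabulary and becomes <unk>.
batch = pad_batch(["MKT", "ACDEFX"])
```

A rectangular batch like this can be converted directly into a tensor for a transformer-style model. Real datasets would also need a policy for very long sequences, which this sketch leaves out.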

Prerequisites: PyTorch/TensorFlow, Python

Mentor: Bharath Ramsundar (bharath at deepforestsci dot com)

Miscellaneous Project Ideas

These ideas would likely benefit two or more projects.


Project: OneMol: Google Docs & YouTube for Molecules

Brief explanation: There is a huge need in the research community for improved collaboration tools on web and desktop. OneMol will provide an open API for collaborating on molecular data that both Avogadro and 3Dmol.js will support as reference implementations. OneMol-compliant applications will be able to manipulate and view molecular data in real time, so that changes made by one client are propagated to the other clients.

File-sharing is a means for sharing data, but it does not share real-time interactions; each user’s data exists in its own isolated environment. Screen-sharing provides a common viewpoint for all participants, but allowing others to interact with the data requires granting access to the host workstation. This approach is needlessly inefficient for the task of collaborating on molecular data, and this inefficiency introduces scalability issues. For example, a simple rotation necessitates a full screen update when the fundamental change in state was a simple change in viewing angles.

The OneMol framework consists of three main components: a client module, embedded in a molecular viewer; a facilitator module that enforces a consistent viewer state between all the clients; and a storage module that stores the raw molecular data. All three modules may coexist on the same machine within the same application. However, we anticipate a more common modality will be to use a publicly hosted facilitator server, since this simplifies network connectivity in the face of firewalls and network address translation.
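The three components can be sketched in-process in Python. This is a toy illustration of the idea rather than the OneMol API, which is not specified here. All class, method, and field names are hypothetical, and a real facilitator would communicate over a network rather than through direct method calls.

```python
class Storage:
    """Storage module: holds the raw molecular data, keyed by an identifier."""

    def __init__(self):
        self._molecules = {}

    def put(self, key, data):
        self._molecules[key] = data

    def get(self, key):
        return self._molecules[key]


class Client:
    """Client module: lives inside a viewer and tracks its current view state."""

    def __init__(self, name):
        self.name = name
        self.view = {}

    def apply(self, update):
        self.view.update(update)


class Facilitator:
    """Facilitator module: keeps the viewer state consistent between clients."""

    def __init__(self):
        self.state = {}
        self.clients = []

    def join(self, client):
        self.clients.append(client)
        client.apply(dict(self.state))  # bring a late joiner up to date

    def publish(self, sender, update):
        # Only the small state change is forwarded, not a rendered screen.
        self.state.update(update)
        for client in self.clients:
            if client is not sender:
                client.apply(update)


storage = Storage()
storage.put("mol-1", "raw molecular data placeholder")

facilitator = Facilitator()
alice, bob = Client("alice"), Client("bob")
facilitator.join(alice)
facilitator.join(bob)

# Alice loads a molecule and rotates it; each change travels as a tiny message.
for update in ({"molecule": "mol-1"}, {"rotation": [0.0, 90.0, 0.0]}):
    alice.apply(update)
    facilitator.publish(alice, update)

carol = Client("carol")
facilitator.join(carol)  # joins late and receives the accumulated state
```

Each client fetches the raw data from storage once by its key, and afterwards a rotation costs a message of a few numbers instead of a full screen update.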

Expected results: Prototype web services to allow web and/or desktop collaboration using 3Dmol.js as a viewer, likely integrating with existing storage systems (e.g., MongoChem or PQR).

Prerequisites: Experience with scripting and web services. Interest in and experience with databases like MongoDB or DSpace would be very helpful.

Mentor: David Koes (dkoes@pitt.edu) or Geoffrey Hutchison (geoffh at pitt.edu)