Neural message passing for Quantum Chemistry
Created: June 06, 2020
Modified: March 21, 2022

Neural message passing for Quantum Chemistry

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • Gilmer et al. paper, 2017.
  • Experiments on QM9.
    • Unlike SMILES strings, QM9 includes molecular geometry (3D atom positions).
  • General formulation of message passing neural networks. Vertices and edges have associated feature vectors.
    • The message at a vertex is a sum over neighbors of some nonlinear function M of the vertex and neighbor features as well as the edge features.
    • The vertex's feature vector is updated to equal some nonlinear function U of its current features and the message vector (equations written out after this list).
    • Edge features are fixed and are not updated.
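Written out, the paper's general formulation is (h_v^t is the hidden state of vertex v at step t, e_vw the feature vector of edge (v, w), and N(v) the neighbors of v):

```latex
\begin{align}
  m_v^{t+1} &= \sum_{w \in N(v)} M_t\big(h_v^t,\, h_w^t,\, e_{vw}\big) && \text{message: sum over neighbors} \\
  h_v^{t+1} &= U_t\big(h_v^t,\, m_v^{t+1}\big)                         && \text{vertex update}
\end{align}
```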
  • A cute trick is to add a special 'master' node that's connected to all nodes and can act as a global scratch space. This allows faraway nodes to communicate without the time complexity of a fully connected graph.
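A minimal sketch of the master-node trick (my own illustration, not the paper's code; node_feats and edge_index are assumed array layouts):

```python
import numpy as np

def add_master_node(node_feats: np.ndarray, edge_index: np.ndarray):
    """Append a 'master' node connected to every existing node.

    node_feats: (n, d) node feature matrix.
    edge_index: (2, m) directed edges as (source, target) row indices.
    The master node gets a zero feature vector purely as a placeholder.
    """
    n, d = node_feats.shape
    node_feats = np.vstack([node_feats, np.zeros((1, d))])    # master is node n
    others = np.arange(n)
    to_master = np.stack([others, np.full(n, n)])             # every node -> master
    from_master = np.stack([np.full(n, n), others])           # master -> every node
    edge_index = np.concatenate([edge_index, to_master, from_master], axis=1)
    return node_feats, edge_index
```

This adds only one node and 2n extra edges, versus the roughly n^2 edges of a fully connected graph.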
  • 'Separate towers' trick. Using matrix multiplication in messages has cost O(d^2), where d is the dimension of the feature/message vector. Instead we can split each node's state into k separate vectors of size d/k, do updates for each of these, and at the end pass the k vectors at each node through a 'mixing function' g represented as a neural net (see the sketch after this sub-list).
    • Q: Isn't the mixing network still going to be expensive because it will include d x d matrices?
    • In practice, they only seem to gain a factor of two from this.
    • But: generalization is actually better. Maybe because of fewer params, or it approximates ensembling?
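A rough sketch of the towers idea (my reconstruction; it assumes d is divisible by k, abstracts each tower's message/update as a single linear layer, and makes the mixing function g a small MLP):

```python
import torch
import torch.nn as nn

class Towers(nn.Module):
    """Split the d-dim node state into k towers of size d/k, transform each
    tower with a cheap (d/k x d/k) map, then mix the towers with a shared g."""

    def __init__(self, d: int, k: int):
        super().__init__()
        assert d % k == 0
        self.dk = d // k
        # One small (d/k x d/k) transform per tower instead of one d x d.
        self.per_tower = nn.ModuleList(nn.Linear(self.dk, self.dk) for _ in range(k))
        # Mixing function g, applied to the concatenated towers at each node.
        self.mix = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, d)
        chunks = torch.split(h, self.dk, dim=-1)                    # k pieces of (num_nodes, d/k)
        chunks = [torch.relu(f(c)) for f, c in zip(self.per_tower, chunks)]
        return self.mix(torch.cat(chunks, dim=-1))                  # back to (num_nodes, d)
```

The per-tower transforms cost k * (d/k)^2 = d^2/k multiply-adds instead of d^2; the mixing net g still operates on full d-dimensional vectors, which is exactly the concern in the question above.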
  • Node features: atom type, atomic number, accepts electrons, donates electrons, aromatic (??), orbital hybridization, number of hydrogens
  • Edge features: bond type, discrete distance bins, and a raw continuous distance feature.
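For illustration, a toy version of the distance part of the edge features (the bin edges here are invented, not the paper's; a one-hot bond-type vector would be concatenated alongside in the same way):

```python
import numpy as np

# Hypothetical bin edges in angstroms -- placeholders, not the paper's values.
BIN_EDGES = np.array([1.0, 1.5, 2.0, 3.0, 4.0])

def edge_distance_features(pos_i: np.ndarray, pos_j: np.ndarray) -> np.ndarray:
    """One-hot distance bin concatenated with the raw continuous distance."""
    dist = float(np.linalg.norm(pos_i - pos_j))
    one_hot = np.zeros(len(BIN_EDGES) + 1)
    one_hot[np.searchsorted(BIN_EDGES, dist)] = 1.0
    return np.concatenate([one_hot, [dist]])
```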
  • Findings:
    • Including distance features on edges is important. Without distance features (i.e., a sparse graph), it helped a lot to use a master node or a set2set output (set2set learns a transformation of the node features and uses attention to combine them in a permutation-invariant way).
    • explicitly modeling hydrogens is important (but slow)
    • one model per target was better than multi-task (why?)
    • Best message function was the edge network: the message from neighbor w is A(e) h_w, where A(e) is a neural network that takes the edge feature vector e and returns a d x d matrix, and h_w is the neighbor's hidden state (sketch below).
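A minimal sketch of the edge-network message function (my own implementation of the idea, not the authors' code):

```python
import torch
import torch.nn as nn

class EdgeNetworkMessage(nn.Module):
    """Message A(e) h_w: a small MLP maps the edge features e to a d x d
    matrix, which is applied to the neighbor's hidden state h_w."""

    def __init__(self, edge_dim: int, d: int, hidden: int = 64):
        super().__init__()
        self.d = d
        self.A = nn.Sequential(
            nn.Linear(edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, d * d),
        )

    def forward(self, h_w: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h_w: (num_edges, d) neighbor states; e: (num_edges, edge_dim)
        A = self.A(e).view(-1, self.d, self.d)                 # (num_edges, d, d)
        return torch.bmm(A, h_w.unsqueeze(-1)).squeeze(-1)     # (num_edges, d)
```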
  • Challenges:
    • generalizing to large molecules. The best methods (using distance info) require a fully connected graph, which doesn't scale.