attention: Nonlinear Function
Created: March 08, 2020
Modified: January 24, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

One of the best ideas in machine learning. (I even thought so in 2011!)

  • There are two common mechanisms: 'soft' and 'hard'.
    • In both cases, a layer produces a set of 'value' and corresponding 'key' vectors (both, I assume, functions of some underlying activations) and a set of 'query' vectors matching the keys.
      • The structure of the network might depend on the inputs---e.g., in language modeling the values are associated with positions in the sentence, whereas in graph modeling they'd be associated with nodes of a graph.
    • In soft attention, the output for a query is a combination of the input values, weighted by 'compatibility' (e.g., softmax of the dot products between the query and the keys).
    • In hard attention, the output is the input value with the highest compatibility to the query. This can't be trained by exact gradient descent; it requires gradient estimators (REINFORCE, REBAR, etc.), which may be high-variance. (Both variants are sketched just below.)
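
A minimal NumPy sketch of the two mechanisms (the shapes and function names here are my own, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(queries, keys, values):
    """queries: [m, d], keys: [n, d], values: [n, dv] -> outputs: [m, dv]."""
    # Compatibility of each query with each key: scaled dot product.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # [m, n]
    weights = softmax(scores, axis=-1)  # each row is a distribution over values
    return weights @ values             # per-query convex combination of values

def hard_attention(queries, keys, values):
    """For each query, return the single most compatible value."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # [m, n]
    # argmax is non-differentiable, hence the need for gradient estimators.
    return values[scores.argmax(axis=-1)]  # [m, dv]
```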
  • Attention is in some ways a neural relaxation of 'dynamic control flow'. Any given data at layer i might be picked up and processed by some head at layer (i+1), or it might be thrown away, depending on the contents of layer i (which in turn determine the attention queries).
  • Attention achieves:
    • Concentrated statistical strength. A model of a small part of the input will need many fewer parameters than a model of the entire input, so if we can concentrate on a small part, we can learn from less data.
    • Allocation of computation. Layer i has a fixed amount of computation; attention allows it to learn what aspects of its input it should focus the computation on.
    • Dynamically reconfigurable architecture. Any given instantiation of (hard) attention weights defines an architecture.
      • Attention thus subsumes the idea of capsules and dynamic routing. (more or less? but what does this mean and what are the connections?)
      • Connections to continuous structure learning?
    • Dynamically sized inputs. We can handle sentences, action sequences, graphs, anything that can produce a set of keys.
    • Parallel training. With a causal mask, a transformer processes every position of a training sequence at once, rather than stepping through it sequentially like an RNN; generation itself is still one position at a time. (Sketch below.)
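
The parallelism point, as a sketch (reusing `softmax` and `np` from above; the weight matrices `Wq`, `Wk`, `Wv` are hypothetical placeholders): a causal mask scores all (position, earlier-position) pairs in one matrix product, so a whole training sequence is handled in a single pass.

```python
def causal_self_attention(x, Wq, Wk, Wv):
    """x: [n, d] -- all n positions of the sequence are processed in one shot."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # [n, n], all pairs at once
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)         # position i can't see j > i
    return softmax(scores, axis=-1) @ v                # still parallel over positions
```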
  • There are connections to von Neumann architecture---attention keys are like the names of registers, and the network determines which registers to 'read' from---except that the addressing is content-based rather than strictly indexed.
  • Self-attention (in the form of transformers) is state of the art for language modeling.
  • It seems as though self-attention might also subsume graph neural networks, since graph message passing is just the special case of attention in which each node attends exactly to the states of its graph neighbors. (Sketch below.)
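
As a toy check of that claim (again reusing `softmax`; the adjacency convention is my own assumption): masking the attention scores with the graph's adjacency matrix restricts each node to a softmax-weighted aggregation over its neighbors, i.e., a basic GAT-style message-passing step.

```python
def graph_attention_step(node_states, adjacency, Wq, Wk, Wv):
    """node_states: [n, d]; adjacency: [n, n] bool, with adjacency[i, j] = True
    iff node i may attend to node j. Include self-loops so that no row is
    all-False (an all-False row would make the softmax ill-defined)."""
    q, k, v = node_states @ Wq, node_states @ Wk, node_states @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # [n, n] compatibilities
    scores = np.where(adjacency, scores, -np.inf)  # zero weight on non-neighbors
    weights = softmax(scores, axis=-1)             # normalize over neighbors only
    return weights @ v                             # aggregate neighbors' messages
```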
  • There are intuitive connections to consciousness: 'self-awareness' has a similar flavor to 'self-attention'. Meditation has an aspect of releasing control over your attention mechanism, allowing you to better perceive bottom-up stimuli. (or are you training yourself to have more control over your attention?? AI that can meditate)
    • Doug Hofstadter's idea of 'strange loops', or recursive self-reference, seems like it could fit in here somewhere.
  • Research ideas:
    • Can attention help in probabilistic inference? In the same way that it generalizes graph message passing, can we learn belief-propagation-like algorithms that work even better?
    • Can attention help in probabilistic modeling? We can certainly use attention networks as encoders / amortized inference mechanisms. Attention networks can also define a sort of autoregressive model (as in transformers), though I don't think these are typically easily invertible (in principle they shouldn't be, since attention loses information).

But what about dynamic control flow? There should be a rough mapping between PCFG generative models (including for kernels?) and transformer encoders or decoders.