fast weights: Nonlinear Function
Created: May 23, 2021
Modified: October 27, 2022

fast weights

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

On an evolutionary timescale, it's useful to evolve structures that can learn quickly. The nervous system is an evolved organ system for fast behavioral adaptation, much faster than evolution could work.

But the nervous system itself has 'weights' and 'activations'. It is flexible and can represent many different circuits, with some principle (that must be local) for adapting the weights over time so that the circuit changes towards one that achieves better outcomes. But even a fixed circuit will itself have fluctuating activations, and these fluctuations will reflect processing of very recent stimuli and short-term decision making. Perhaps all of conscious thought lives in this regime.

There may be a sharp distinction between learning within a single lifetime vs. evolution over many generations. But is there really a sharp distinction between updating the weights (or the connectivity pattern) and updating the activations? I mean, I guess those are separate physical things; the connectivity patterns are literally visible with a microscope. But conceptually---changes in the activations (short-term memory) must drive changes in the weights (long-term memory)?

I think there's a lot of room for improvement in today's ML models on this axis. Transformers should be able to permanently learn from their interactions. Really what we need is a better notion of memory for neural architectures.

Thesis: the current totally-separated model of a gradient-ascent algorithm on the outside operating on the weights of a circuit executing on the inside may be a useful model, but whatever the brain actually does has got to be messier and more smudged-together than that.

There's no 'God' to implement the gradient ascent, so somehow the activations themselves are burning in new connectivity patterns.

It's possible that an explicit two-level nested-loop setup will end up just working as a model for modern AI. But I don't think it can, because gradient ascent isn't efficient enough. Taking a gradient step is not as good as doing a Bayesian update; a single step often isn't even enough to remember the training example at all. Our responses to experience need to be more like Bayesian updates than like gradient steps. If something important happens, I need it to burn itself into my memory. In some sense I really want to burn in the new evidence and all of its logical consequences. Maybe you get something closer to this with training for consistency, where you sometimes do training steps that don't use any data at all but just try to ensure that the different parts of the model remain consistent.
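
To make the "consistency training" idea concrete, here's a minimal sketch (my own illustration, not from these notes): a toy model with two routes to the same answer, a direct head and a two-hop head, where a training step uses no data at all, only random probes and a penalty for internal disagreement. All the names and shapes are placeholder assumptions.

```python
import jax
import jax.numpy as jnp

def init_params(key, dim=16):
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        "direct": jax.random.normal(k1, (dim, dim)) * 0.1,  # one-hop answer
        "step1":  jax.random.normal(k2, (dim, dim)) * 0.1,  # first hop
        "step2":  jax.random.normal(k3, (dim, dim)) * 0.1,  # second hop
    }

def consistency_loss(params, probes):
    direct = probes @ params["direct"]                              # direct route
    composed = jnp.tanh(probes @ params["step1"]) @ params["step2"]  # two-hop route
    return jnp.mean((direct - composed) ** 2)                       # disagreement penalty

@jax.jit
def consistency_step(params, key, lr=1e-2):
    probes = jax.random.normal(key, (32, 16))  # no data: random probes
    grads = jax.grad(consistency_loss)(params, probes)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
```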

I guess as long as the gradient descent level is itself just a model of what the brain's doing, rather than a literal algorithm being carried out, then there's no need to limit ourselves to discrete steps. We might as well think about it as gradient flow, and then the notion of a single step isn't really relevant; all that matters is how fast the flow moves in a given period of time. Importantly, that rate should somehow be affected by the activations! They should tell us how 'fast' to let our weights flow at any given point in time.
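
As a toy sketch of "the activations set the flow rate" (my construction, not from the notes): one Euler step of a weight flow whose local speed is gated by an activation-derived surprise signal. The gate, the model, and the shapes are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def forward(w, x):
    return jnp.tanh(x @ w)  # activations of a toy layer

def loss(w, x, y):
    return jnp.mean((forward(w, x) - y) ** 2)

def flow_step(w, x, y, dt=1e-2):
    acts = forward(w, x)
    surprise = jnp.mean((acts - y) ** 2)    # activation-level mismatch (here just the instantaneous error)
    rate = jax.nn.sigmoid(10.0 * surprise)  # flow faster when surprised
    g = jax.grad(loss)(w, x, y)
    return w - dt * rate * g                # Euler step of the gated flow
```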

  • We could model this as letting a model choose to take multiple consecutive gradient steps on a datapoint, or somewhat equivalently, to choose to use a larger or smaller learning rate for a given point.
  • How would we train the model to do this? It's a meta-learning problem. The general setting is that we see the gradient $\nabla_w$ and the current weights $w$, and we have to define a parameterized function $w' \leftarrow f_\theta(\nabla_w, w)$ to update the weights. Now $\theta$ are the meta-learned parameters. We can hope to compute gradients to them, maybe. I guess in some sense this is the entire field of meta-learning? Or at least a big part of it. (A minimal sketch of this setup appears after this list.)
  • Here we've assumed that the optimizer parameters $\theta$ are separate from and 'slower' than the weights $w$ that we're updating. The optimizer parameters control how the weights are updated at each optimization step, but they themselves are only updated in the outer loop.
  • It seems like this can't possibly be efficient. It means we need to run the entire circuit many many times to get any kind of useful training signal to the optimizer. But suppose we had intermediate losses at each layer. Then we could get a local signal to the optimizer many times within a given execution.
  • Maybe we could say there is a coupled 'flow' on the optimizer parameters $\theta$ and the object-level weights $w$. The discretization scheme where we follow the inner flow for $n$ steps before taking a single step in the outer flow is not the only discretization scheme for these two flows. Could others be better?
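
Here's the promised sketch of the nested-loop setup (my own construction under placeholder assumptions, not anything from these notes): a tiny parameterization of $w' \leftarrow f_\theta(\nabla_w, w)$, an inner loop that applies it for a few steps, and one outer meta-gradient step on $\theta$ taken by differentiating through the whole inner loop. The task (linear regression) and all shapes are illustrative.

```python
import jax
import jax.numpy as jnp

def inner_loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)  # placeholder object-level task

def f_theta(theta, grad_w, w):
    # Deliberately tiny learned update rule: a learned step size plus a learned weight decay.
    return w - theta["lr"] * grad_w - theta["decay"] * w

def inner_loop(theta, w, batches):
    for x, y in batches:                      # "fast" loop over the object-level weights w
        g = jax.grad(inner_loss)(w, x, y)
        w = f_theta(theta, g, w)
    return w

def outer_loss(theta, w0, batches, x_val, y_val):
    w = inner_loop(theta, w0, batches)        # differentiate through the inner steps
    return inner_loss(w, x_val, y_val)        # how good are the resulting weights?

# Toy data for the placeholder task.
key = jax.random.PRNGKey(0)
kx, kw, kv = jax.random.split(key, 3)
w_true = jax.random.normal(kw, (8,))
xs = jax.random.normal(kx, (5, 32, 8))
batches = [(x, x @ w_true) for x in xs]
x_val = jax.random.normal(kv, (32, 8))
y_val = x_val @ w_true

theta = {"lr": jnp.array(0.05), "decay": jnp.array(0.0)}
w0 = jnp.zeros(8)

# "Slow" outer step on the optimizer parameters theta.
meta_grads = jax.grad(outer_loss)(theta, w0, batches, x_val, y_val)
theta = jax.tree_util.tree_map(lambda t, g: t - 1e-2 * g, theta, meta_grads)
```

This makes the inefficiency worry in the list above concrete: one outer step on $\theta$ costs a full inner loop, which is exactly the "run the entire circuit many times for one signal to the optimizer" problem.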

In order to work in practice,