Created: October 17, 2022
Modified: October 17, 2022

dual metareasoning

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

From a conversation about attention, multiplicative interaction, and meta-reasoning:

At some level, a lot of the AI problem relates to metareasoning: systems that can determine for themselves what to 'think' about, i.e., which computations to apply to a given input. (My PhD advisor, Stuart Russell, used to tell the story of a metareasoning talk he gave at Stanford, with Donald Knuth in the audience, featuring a slide consisting of just the word 'Algorithms' with a big red X over it.) I think MCTS is the most successful modern instantiation of learning to 'allocate computation' in a direct sense. But you can invert the question and ask: instead of learning which computations to apply to a given input, say we start with a fixed set of computational gates and then learn which parts of the input to route through each one. That's basically transformers?

I want to explore this thesis. Does it make sense to think of self-attention as a kind of dual formulation of metareasoning?
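
To make the 'routing' picture concrete, here's a minimal sketch of single-head self-attention (plain numpy, not any particular library's API): the projection matrices play the role of fixed gates, and the attention weights form an input-dependent distribution over which parts of the input get routed through them.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d) inputs; Wq, Wk, Wv: fixed (d, d) 'gates'."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[1])
    # Softmax over keys: for each query position, an input-dependent
    # distribution over which inputs to route through the value gate.
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V  # routed mixture of gated inputs
```

Note the inversion: the gates (Wq, Wk, Wv) are input-independent; only the routing weights A vary with the input.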


Another way to frame this is that metareasoning is about choosing which thoughts to have. If multiple thoughts arise, then attention selects which thoughts are elaborated. And we know from the Library of Babel thought experiment that selecting thoughts is in principle just as powerful as generating them: the library already contains every possible text, so all of the difficulty lies in finding the right one.


Let's restrict to the one-shot setting where we're trying to efficiently compute an approximation to some function $f(x)$, for example, the optimal next move in a chess game. This will involve a sequence of computational steps $g_t$ applied to a sequence of computational states $s_t$:

$$s_{t+1} = g_t(s_t)$$

We assume that the individual steps $g_t$ are constant-time, e.g., represented by a fixed circuit. The initial computational state is a function of the input, $s_0 = \text{encode}(x)$, and the final output is a function of the final state, $y(x) = \text{decode}(s_T)$.
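
This model is literally just a loop; the following sketch (with hypothetical encode, decode, and gates stand-ins) transcribes the definitions above:

```python
def run(x, gates, encode, decode, T):
    """Apply a fixed sequence of constant-time steps g_0, ..., g_{T-1}."""
    s = encode(x)          # s_0 = encode(x)
    for t in range(T):
        s = gates[t](s)    # s_{t+1} = g_t(s_t)
    return decode(s)       # y(x) = decode(s_T)
```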

In the domain of metareasoning, we can ask for the optimal policy on computations: what policy $g_t = \pi(s_t)$ optimizes the quality $\ell(y^* - y(x))$ of the final result? Note that the chosen computations $g_t$ are (by assumption) input-dependent, since they are functions of the state $s_t$, while the policy $\pi$ itself is not.
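
A minimal sketch of the contrast, with hypothetical names (pi, gate_library): the policy inspects the current state and picks the next computation from a fixed library, so the realized sequence of gates varies with the input even though pi is fixed.

```python
def run_with_policy(x, pi, gate_library, encode, decode, T):
    """Metareasoning version: pi chooses which gate to apply next."""
    s = encode(x)
    for t in range(T):
        g = gate_library[pi(s)]  # g_t = pi(s_t): state-dependent choice
        s = g(s)                 # s_{t+1} = g_t(s_t)
    return decode(s)

# The metareasoning objective is then to choose pi minimizing the loss
# l(y* - y(x)) of the final answer, averaged over inputs x.
```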