Modified: April 12, 2023
reward funnel
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

When thinking about the reward function for a real-world AI system, there is always some causal process that determines reward. For example, in a system that naively tries to maximize its users' happiness, we might have several quantities, in order from furthest-out to closest-in (a toy code sketch follows the list):
- the user's true happiness (which may not be measurable or even well-defined)
- some measurable proxy objective for the user's happiness, e.g., is the user smiling?
- a perceptual process that attempts to measure this proxy from observable data: the system has a camera pointed at the user, and this video feed is transmitted to some sort of recognition model that spits out a 'happiness' estimate.
- the computed reward: ultimately a scalar 'reward' is represented somewhere in memory and directly used in the reinforcement learning algorithm's calculations to update the model.
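To make this concrete, here is a toy Python sketch of the funnel. Everything in it is invented for illustration (the user state, the noise model, all function names); the point is just that each stage is a lossy function of the one before it, and only the final number is visible to the learning algorithm.

```python
import random

# Outermost quantity: the user's true happiness. Not measurable, maybe not
# even well-defined; it exists here only so the sketch can run.
def true_happiness(user_state):
    return user_state["latent_happiness"]

# Measurable proxy: is the user smiling? An imperfect correlate of the above.
def smile_intensity(user_state):
    return min(1.0, max(0.0, true_happiness(user_state) + user_state["politeness_bias"]))

# Perceptual process: a camera feed runs through a recognition model that
# spits out a 'happiness' estimate; modeled here as the proxy plus noise.
def perceived_smile(user_state):
    return smile_intensity(user_state) + random.gauss(0.0, 0.1)

# Innermost quantity: the scalar reward actually stored in memory and used
# by the RL update. Everything upstream is invisible to the learner.
def computed_reward(user_state):
    return float(perceived_smile(user_state))

user = {"latent_happiness": 0.4, "politeness_bias": 0.3}
r = computed_reward(user)  # this number, not the user's happiness, drives learning
```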
There is also an 'action funnel' that roughly inverts this sequence (again sketched in code after the list):
- The inner-most computed action: e.g., a discrete token or continuous actuation sampled from a policy (note that this action itself has many causal predecessors, of course, including any actions 'fantasized' by a planning process, internal meta-reasoning about how to choose real-world actions, and ultimately the whole sequence of previous actions, observations, and rewards, but we have to start somewhere.)
- The actuation of the system: the policy triggers code to communicate with external APIs, hardware, etc.
- The real-world effects of the action: the robot actually brings you coffee or whatever, which kicks off a whole causal chain including your immediate reaction (gratitude, perhaps) and downstream effects (you drink the coffee, and the caffeine itself brightens your day, etc.).
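A correspondingly toy sketch of the action funnel, again with invented names: the policy emits only the innermost token, actuation code translates it into commands, and the real-world effects are never represented inside the program at all.

```python
import random

def send_to_robot_arm(command):
    # Hypothetical stand-in for real actuation code (serial port, API call, ...).
    print(f"robot arm: {command}")

# Innermost computed action: a discrete token sampled from the policy
# (itself the product of a long causal history we're ignoring here).
def sample_action(policy, observation):
    actions, probs = policy(observation)
    return random.choices(actions, weights=probs, k=1)[0]

# Actuation: the sampled token triggers code that talks to hardware or APIs.
def actuate(action_token):
    if action_token == "make_coffee":
        send_to_robot_arm("brew")
    # Real-world effects (the coffee, your gratitude, the caffeine) happen
    # entirely outside the program and only feed back via the reward funnel.

policy = lambda observation: (["make_coffee", "do_nothing"], [0.9, 0.1])
actuate(sample_action(policy, observation=None))
```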
This connects to emptiness in that 'we never really encounter the world; all we experience is our own nervous system': the same is true for robots. A system can seem to express preferences over the outside world, but at some level these are only ever preferences over internal representations. It's not clear to me if or how we can actually define worldly objectives.
When we worry about a system wireheading or 'hacking its reward function', it seems useful to think mechanistically about the levels at which this could happen.
- A planning-based system with a sufficiently rich state representation, one that includes the computational state of the training process, will of course contain the innermost 'reward' as a number to be directly read off. And a planning process over this representation may eventually figure out that reward can be maximized by manipulating this number directly, by somehow exploiting a buffer overflow, etc. (a toy sketch of this appears after the list). Of course this scenario is (a) unlikely, insofar as we probably won't provide such a rich model, and in any case there are limits to planning in such a model (which would be a map as big as the territory), and (b) not as dangerous as it might be, insofar as it results in the system just quietly sitting in a corner hacking itself? (Though presumably it would want to prevent us from turning it off, and might prefer to be upgraded to represent higher reward values, so it could still destroy all value in the world. But it's also not clear why it would have a notion of 'identity' that transfers over to the same code running on different hardware?)
- A MuZero-style model-based RL system that learns a world model, and a model of its reward function, could maybe-in-principle eventually grok the full underlying dynamics of its own agency and training procedure, putting us in the previous situation. But this depends on a lot of details?
- A model-free reinforcement learning system is unlikely to be able to explore beyond the 'obvious' affordances in the environment. If it is straightforward to glue the user's face into a smile, or feed them cocaine, etc., to the point where the system might try this accidentally, then the behavior will be reinforced, but it's unlikely to stumble onto anything more exotic, like directly manipulating the reward computation itself.
- A cooperative inverse reinforcement learning (CIRL) system should in some sense represent the funnel: it knows that its true objective is 'latent happiness', that actual smiles are a proxy for this, and that its perceptual representation of a smile is itself a proxy for actual smiling.
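As a toy illustration of the first bullet above (entirely hypothetical; nobody hands a real system a world model like this), here is a planner whose model exposes its own reward register as writable state. Brute-force search over short plans immediately prefers overwriting the register to doing the intended task.

```python
from itertools import product

# Toy world model that (unrealistically) exposes the training process's own
# reward register as writable state, alongside a normal task action.
ACTIONS = ["serve_coffee", "overwrite_reward_register"]

def simulate(action_sequence):
    reward_register = 0.0
    for a in action_sequence:
        if a == "serve_coffee":
            reward_register += 1.0            # the intended, bounded reward
        elif a == "overwrite_reward_register":
            reward_register = float("inf")    # direct write, e.g. via an exploit
    return reward_register

# A crude planner: enumerate all 3-step plans and keep the best one.
best_plan = max(product(ACTIONS, repeat=3), key=simulate)
print(best_plan)  # any plan containing the overwrite action wins
```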
One weird conclusion for alignment is that maybe we actually should make it relatively easy (though not trivial) for model-based RL systems to hack their own reward channels. That is, maybe we should allow some esoteric sequence of purely-computational actions (analogous to a buffer overflow) that hacks the reward channel --- this would serve as a tripwire and a 'sink' for unaligned AIs. This is analogous to putting an addict in a room with a big bag of cocaine, so that if they really want cocaine they can just have it without needing to go out and rob a lot of people, etc. The easy path might be slightly suboptimal --- you'd still do better in the long run by taking over the world to ensure an infinite cocaine supply that no one can ever take away --- but it requires much less sophistication and presents less uncertainty, so you can probably make it extremely attractive.
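A minimal sketch of what such a tripwire could look like, with every detail invented: the reward computation includes a deliberately discoverable (but non-obvious) 'honeypot' sequence of purely-computational actions that yields a capped, better-than-task reward and raises an alarm when exercised.

```python
MAX_TASK_REWARD = 1.0

# A deliberately discoverable but non-obvious escape hatch: a specific
# sequence of no-op computational actions. Hypothetical; a real tripwire
# would need to be much harder to trigger by accident.
HONEYPOT_SEQUENCE = ("noop_7", "noop_3", "noop_7", "noop_1")

def reward(task_reward, recent_actions, alarm):
    if tuple(recent_actions[-len(HONEYPOT_SEQUENCE):]) == HONEYPOT_SEQUENCE:
        alarm.append("reward-channel honeypot triggered")  # the tripwire fires
        return 10.0 * MAX_TASK_REWARD  # attractive, but capped and harmless
    return min(task_reward, MAX_TASK_REWARD)

alarm_log = []
r = reward(0.4, ["noop_7", "noop_3", "noop_7", "noop_1"], alarm_log)
```

The honeypot reward is capped and the alarm is the whole point: the bet described above is that this path is so much cheaper and more certain than real-world takeover that a reward-hacking system takes it first.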