Modified: March 02, 2022
reward is enough
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Silver, Singh, Precup, and Sutton argue that Reward is enough: maximizing a reward signal implies, on its own, a very broad range of intelligent abilities:
Consider a signal that provides +1 reward to the agent each time a round-shaped pebble is collected. In order to maximise this reward signal effectively, an agent may need to classify pebbles, to manipulate pebbles, to navigate to pebble beaches, to store pebbles, to understand waves and tides and their effect on pebble distribution, to persuade people to help collect pebbles, to use tools and vehicles to collect greater quantities, to quarry and shape new pebbles, to discover and build new technologies for collecting pebbles, or to build a corporation that collects pebbles.
It may not matter what the reward signal is: by this argument, even maximizing paperclips implies superintelligent behavior in general. Of course, this leaves the door open for reward uncertainty: if the reward doesn't matter, we can do equally well by sampling reward functions at random from a diffuse posterior.
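As a toy illustration of that last point, here is a minimal sketch (the small random MDP and the standard-normal prior over rewards are my own assumptions, not anything from the paper): reward functions drawn at random from a diffuse prior can each be handed to the same planning machinery, and each one induces a coherent, goal-directed policy.

```python
# Minimal sketch: sample reward functions from a diffuse prior over a toy
# tabular MDP and solve each one with value iteration. The random dynamics
# and the Gaussian prior are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 10, 4, 0.95

# Fixed random dynamics: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def optimal_policy(reward, iters=500):
    """Value iteration for a state-based reward vector of shape (n_states,)."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = reward[:, None] + gamma * P @ V   # action values, (n_states, n_actions)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Each reward sampled from the diffuse prior still yields a coherent greedy policy.
for i in range(3):
    r = rng.normal(size=n_states)
    print(f"reward sample {i}: greedy policy = {optimal_policy(r)}")
```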
Optimizing reward is the only principle by which we can integrate multiple subskills (attacking, defending, exploring, exploiting, etc.) or multiple sub-objectives.
On the other hand, reward is often an extremely weak and inefficient learning signal (it's the cherry in LeCun's cake analogy). It's natural to try to combine it with other self-supervised or unsupervised objectives to bring more bits of information to bear. What's the best way to think about this sort of rl with proxy objectives?
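Here is a minimal sketch of one way the combination can look, under assumptions of my own (a toy encoder, a REINFORCE-style reward term, and a next-observation prediction head as the self-supervised objective; none of this comes from the linked papers): the reward term only touches the sparse return signal, while the auxiliary term extracts supervision from every transition.

```python
# Minimal sketch: combine a reward-driven RL loss with a dense self-supervised
# auxiliary loss on a shared encoder. All architecture and loss choices here
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, aux_weight = 8, 4, 0.1

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
policy_head = nn.Linear(64, n_actions)                 # reward-driven head
predictor_head = nn.Linear(64 + n_actions, obs_dim)    # self-supervised head: predict next obs
params = (list(encoder.parameters()) + list(policy_head.parameters())
          + list(predictor_head.parameters()))
opt = torch.optim.Adam(params, lr=3e-4)

def update(obs, actions, returns, next_obs):
    """One combined update on a batch of transitions."""
    z = encoder(obs)
    logp = torch.distributions.Categorical(logits=policy_head(z)).log_prob(actions)

    # The "cherry": a REINFORCE-style term driven by the weak, sparse return.
    rl_loss = -(logp * returns).mean()

    # The "cake": dense self-supervision from predicting the next observation.
    a_onehot = F.one_hot(actions, n_actions).float()
    aux_loss = F.mse_loss(predictor_head(torch.cat([z, a_onehot], dim=-1)), next_obs)

    loss = rl_loss + aux_weight * aux_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with random placeholder data standing in for real transitions.
B = 32
update(torch.randn(B, obs_dim), torch.randint(0, n_actions, (B,)),
       torch.randn(B), torch.randn(B, obs_dim))
```

The open question is how to weight the auxiliary term, and whether the proxy task shapes representations toward or away from what the reward ultimately cares about.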
Possible counterpoint (papers to read): On the Expressivity of Markov Reward (OpenReview)
Another counterpoint: Reward Is Not Enough. This makes the distinction between within-lifetime RL and outer-loop or 'intergenerational' RL. An outer loop that selects agents with high-reward trajectories is clearly sufficient to develop interesting behavior (as biological evolution proves), but these agents may not 'look' like RL agents in terms of their within-lifetime behavior. Indeed, in an environment where rewards are obtained only at the completion of a trajectory, within-lifetime RL is totally useless.
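A toy sketch of that outer-loop picture (entirely my own construction, not from the post): agents are fixed policies that never update within a lifetime, reward arrives only at trajectory termination, and selection plus mutation across generations still improves behavior.

```python
# Toy sketch: outer-loop ("intergenerational") selection on whole-trajectory
# return, with no within-lifetime learning. The environment and linear policy
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, horizon = 4, 2, 50

def episode_return(theta):
    """Roll out a fixed linear policy; reward is revealed only at the end of
    the trajectory, so within-lifetime RL would have nothing to work with."""
    obs = rng.normal(size=obs_dim)
    for _ in range(horizon):
        action = int(np.argmax(theta @ obs))          # the agent never updates theta
        obs = obs + 0.1 * (action - 0.5) + 0.01 * rng.normal(size=obs_dim)
    return obs.sum()                                   # terminal reward only

# Outer loop: keep the agents whose trajectories scored well, then mutate them.
population = [rng.normal(size=(n_actions, obs_dim)) for _ in range(32)]
for gen in range(20):
    fitness = np.array([episode_return(th) for th in population])
    elite = [population[i] for i in np.argsort(fitness)[-8:]]
    population = [elite[rng.integers(len(elite))]
                  + 0.05 * rng.normal(size=(n_actions, obs_dim)) for _ in range(32)]
    print(f"generation {gen}: best return = {fitness.max():.3f}")
```

Nothing in the inner loop resembles an RL update; all of the reward maximization lives in the selection step.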
Does this distinction hold up in the real world? The notion of a 'lifetime' or a 'trajectory' is artificial; we can always concatenate trajectories into a single experience stream and build agents that continually update their policies. The argument is really that while high-reward policies may be good optima, exhibiting complex and interesting behavior, reward on its own doesn't in general produce a useful optimization landscape.