cooperative inverse reinforcement learning
Created: January 16, 2021
Modified: April 05, 2023

cooperative inverse reinforcement learning

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.


The CIRL setting models value alignment as a game played by a human and a robot, where both players receive a payoff equal to the shared reward $r_\theta$, but only the human actually has access to the reward parameter $\theta$. A solution consists of a pair of policies $(\pi^H, \pi^R)$. There may be one or more such pairs that achieve the optimal expected reward (if there is more than one, a coordination problem arises).
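
As a concrete picture of this setup, here is a minimal sketch of the game loop in Python; the `CIRLGame` container, the toy `play_episode` roll-out, and all names in it are illustrative assumptions of mine, not code from any CIRL paper.

```python
# Minimal sketch of the two-player CIRL setup: a shared MDP, a shared payoff
# r_theta, and a reward parameter theta that only the human policy observes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CIRLGame:
    human_actions: list
    robot_actions: list
    transition: Callable   # transition(s, a_H, a_R) -> next state
    reward: Callable       # reward(s, a_H, a_R, theta) -> float, the shared payoff r_theta
    theta_prior: dict      # p_0(theta): probability of each discrete reward parameter

def play_episode(game, theta, pi_H, pi_R, s0, horizon=20):
    """Roll out one episode: both players are paid the shared reward r_theta,
    but only pi_H gets to condition on theta."""
    s, total, history = s0, 0.0, []
    for _ in range(horizon):
        a_H = pi_H(s, theta, history)   # human policy sees the reward parameter
        a_R = pi_R(s, history)          # robot policy sees only the observable history
        total += game.reward(s, a_H, a_R, theta)
        history.append((s, a_H, a_R))
        s = game.transition(s, a_H, a_R)
    return total
```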

While the policies $\pi^H$ and $\pi^R$ are defined in full generality as functions of the entire history, one can show that R's posterior on $\theta$ is a sufficient statistic for its policy, so that an optimal $\pi^R$ can always be written as a function of just this belief and the current state (previous states are relevant only insofar as they influence this belief). The belief is a function of the reward prior $p_0(\theta)$ and the likelihood of action sequences under $\pi^H$.
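
A sketch of this belief update for a discrete set of reward parameters; `human_likelihood(a_H, s, theta)` is an assumed model of how likely the human is to take action `a_H` in state `s` given `theta`:

```python
def update_belief(belief, s, a_H, human_likelihood):
    """Bayes update of a belief {theta: probability} after observing the
    human take action a_H in state s."""
    posterior = {theta: p * human_likelihood(a_H, s, theta) for theta, p in belief.items()}
    z = sum(posterior.values())
    if z == 0:
        return belief   # observation impossible under the model; leave the belief unchanged
    return {theta: p / z for theta, p in posterior.items()}
```

An optimal robot policy can then be written as `pi_R(s, belief)` rather than `pi_R(history)`: earlier states matter only through the belief they induce.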

Problems with CIRL as a practical approach to alignment include:

  • Actually solving this game is intractable in any nontrivial setting.
  • Even if it were tractable to solve the game, the human probably couldn't enact their half of the solution (the policy $\pi^H$ found by the solver). Actual human behavior is messy and often suboptimal.
  • Modeling suboptimal human behavior is hard to do correctly in general (all models are wrong), and any such model opens up the possibility that the human's decision to switch off the robot is interpreted as suboptimal and ignored (a common model and this failure mode are sketched just after this list).
  • Most models of reward functions will give probability zero to the 'true human value function', which likely isn't even a meaningful thing to talk about, but which even if it were would presumably be rather complex and not representable as a linear combination of features or whatever. So we'd need to use something like a Solomonoff prior.
  • The CIRL formulation inherits the general issues with Bayesian decision-theoretic intelligence. It assumes a model of the world (the shared MDP and its feature representation, on which the reward is presumably defined) and is blind to issues with this model, even though all models are wrong. In defining an idealized solution to an intractable problem, it ignores that computation matters: approximate solutions lose the guarantees that made the formalism attractive in the first place.
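
On the suboptimal-human point above: a common (though by no means uniquely correct) choice is a Boltzmann-rational model, where the human picks actions with probability proportional to $\exp(\beta Q)$. A sketch, with `Q_human` an assumed helper giving the modelled action values:

```python
import math

def boltzmann_likelihood(a_H, s, theta, actions, Q_human, beta=5.0):
    """P(human takes a_H | s, theta) under a Boltzmann-rational model:
    probability proportional to exp(beta * Q_human(s, a, theta))."""
    logits = [beta * Q_human(s, a, theta) for a in actions]
    m = max(logits)                                  # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return weights[actions.index(a_H)] / sum(weights)
```

The failure mode falls out directly: if `Q_human` rates 'switch the robot off' poorly under every $\theta$ the model takes seriously, the observed shutdown gets a uniformly low likelihood, barely shifts the posterior, and the robot has little reason to comply.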

What happens if you place a reinforcement learning agent in a CIRL game? We know that optimal human policies will often look like 'teaching', but fixing any human policy creates a POMDP in which the optimal solution for the robot is the best-response policy (which forms an optimal CIRL pair iff the human policy is consistent with such a pair). Therefore any outer-loop optimization over robot policies will have its optimum at this best-response policy (assuming it is representable within the policy search space), provided we can in fact do the optimization: instantiate the game repeatedly with different reward parameters (forcing the robot to treat the reward as unknown and internally model its uncertainty), and evaluate the reward of each trajectory in order to select the best policy. These requirements unfortunately don't hold for the reward function of 'actual human values in the real world', which we (by assumption) can't numerically evaluate, and can't generate counterfactual versions of either.
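
Under those assumptions (we can sample reward parameters and score whole trajectories), the outer loop would look roughly like the following; `candidate_robot_policies` and the brute-force argmax are placeholders for whatever policy search is actually used, and `play_episode` is the toy roll-out sketched earlier:

```python
import random

def outer_loop(candidate_robot_policies, game, pi_H, s0, n_episodes=1000):
    """Score each robot policy by its average shared reward over games with a
    freshly sampled theta (which the robot never observes), and keep the best."""
    thetas, probs = zip(*game.theta_prior.items())

    def score(pi_R):
        total = 0.0
        for _ in range(n_episodes):
            theta = random.choices(thetas, weights=probs)[0]
            total += play_episode(game, theta, pi_H, pi_R, s0)
        return total / n_episodes

    return max(candidate_robot_policies, key=score)
```

The optimum of this loop is the best response to `pi_H`, which is why the argument needs both the repeated instantiation with different $\theta$ and the ability to evaluate the reward of each trajectory.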

But if you trained such an agent with many different humans, then in principle the optimum in the limit would be something that effectively learns a prior over the space of realized human values, and is incentivized to learn through interaction which values this particular human happens to have. This is probably not realizable either (training with many humans is expensive, and you'd probably need a lot of trials for the model to really learn the distribution and to learn smart questioning behavior, so you could only do this in relatively simple/short domains), but it'd be cool to see 'emergent' CIRL-type behavior; this would be a step beyond mesa-optimizers.