value learning
Created: April 06, 2023
Modified: April 07, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Notes on the Alignment Forum's Value Learning sequence curated by Rohin Shah.

ambitious value learning: the idea of learning 'the human value function', or some analogous function that can serve as a complete specification for an aligned AI's behavior.

Christiano: The Easy Goal Inference Problem is Still Hard:

  • Modeling human mistakes is key to any possible 'extra oomph' of inverse reinforcement learning over simple imitation learning - the goal is not just to mimic humans but to do what humans are trying to do, more effectively. I could quibble that even a fully-rational human would still face physical limitations that a machine might not (I might be walking directly towards my destination, but still prefer to have a car pick me up so I get there sooner). But in a general sense it's clearly true that "The error model isn’t an afterthought — it’s the main affair."
  • "I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior."

More damning: the whole enterprise of value learning is probably ill-defined.

"If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for."

Armstrong: Humans can be assigned any values whatsoever:

  • this seems like a pretty pedantic argument that you can push arbitrary information about the policy from the value function into the 'irrationality model', so that goal inference is impossible (see the sketch after this list). But it doesn't seem to engage with the clear intuition that human behavior can be compressed at least to some extent by modeling us as goal-directed agents. The arguments against this hinge on technicalities of Kolmogorov complexity, but there's no reason we'd need to choose that as the regularization technique (and lots of reasons not to), so this isn't a compelling impossibility result.
  • I think the argument also fails on its own terms: it assumes that memorizing a policy $\pi^H(s)$ requires a fixed number of additional bits, as does memorizing a reward function, so the two are indistinguishable in an asymptotic sense. But a tabular policy in an infinite world is itself infinitely large! It may be finitely representable by exploiting structure, eg that it's generated by some finite reward function, but in relying on this we concede the point that the reward representation has lower complexity.
  • Still, the weak version of this argument is that it's not always clear where to draw the line between a value function and an irrationality model, and even if we speculate that a good compressor of human behavior will capture some aspects of value, it's far from clear how we'd identify these from a black-box representation. (seems like an interesting research idea, related to eliciting latent knowledge?)
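
A minimal sketch of the degeneracy (my own toy construction, not Armstrong's formal argument): the same observed policy is generated exactly by a rational planner paired with reward R, an anti-rational planner paired with -R, and a constant reward paired with a planner that simply memorizes the policy, so behavioral data assigns all three decompositions identical likelihood.

```python
# Toy illustration (assumed construction, not from the post): three
# (planner, reward) pairs that induce exactly the same policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 4, 3, 5.0
R = rng.normal(size=(n_states, n_actions))  # stand-in "true" reward over (state, action)

def boltzmann_max(values, beta):
    """Planner that softly maximizes its reward argument."""
    z = beta * (values - values.max(axis=-1, keepdims=True))
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def boltzmann_min(values, beta):
    """Planner that softly minimizes its reward argument."""
    return boltzmann_max(-values, beta)

pi_1 = boltzmann_max(R, beta)   # rational planner, reward R
pi_2 = boltzmann_min(-R, beta)  # anti-rational planner, reward -R
pi_3 = pi_1.copy()              # constant-zero reward; the policy itself is
                                # baked into the "irrationality model"

assert np.allclose(pi_1, pi_2) and np.allclose(pi_1, pi_3)
# Any dataset of (state, action) pairs has identical likelihood under all three
# decompositions, so they can only be separated by a prior over (planner, reward)
# pairs, not by behavior alone.
```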

Steinhardt: Model mis-specification and inverse reinforcement learning:

  • Actually doing inverse reinforcement learning on human values in nontrivial settings is hard. You need a model of the 'human action space' (which can be specified at multiple levels --- any of which will generally be wrong because all models are wrong --- and none of which are directly observable, so there's also a perceptual task of inferring actions from eg video frames), the relevant state space of the world (with similar pitfalls), and a model of the information available to the human actor (since this can 'explain away' motivation for taking a particular action). In general you will get all of these wrong, so you're doing IRL in a mis-specified model, and you shouldn't expect to get out the 'true reward function' or necessarily anything that will generalize beyond the settings in which you collected the data. A toy illustration of what mis-specification does to the inferred reward follows this list.
  • Mis-specification is a different problem from identifiability, though both are serious issues.
  • Inferring rewards for long-term plans is difficult because we rarely observe panel data recording the same person(s) over a long period of time; it's much more common to see cross-sectional data that records a different population at each moment (for example: polling a presidential race, we might see that Trump's support decreased by 2 points since last week, but we don't know if that's because 2% of Trump voters switched, or if 10% of voters switched one way and 8% the other way so it canceled out).
  • Getting value functions 'right' may have a negligible effect on predictive accuracy in training situations. Since a human raising a turkey feeds it every day until Thanksgiving, a model that predicts the human always wants to feed the turkey will be correct 364/365 of the time, while a model predicting slaughter would be more complicated and harder to fit, and so in practice it might easily be wrong more often than the naive model.
  • Human values don't necessarily correspond to human behavior. If a person is addicted to cocaine, do they 'value' cocaine? Many people would hope that a helpful robot in this situation would try to help them quit cocaine, as a loving family member would do.
  • We could try to formalize this as: the longer a human spends deliberating on a decision, the more likely the decision is to reflect their values. Values are defined not just as human choices, but as choices under reflection. Of course, actually modeling and implementing this is challenging, probably more so than 'base' inverse reinforcement learning.
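
A toy sketch of the mis-specification point (my own construction, not from Steinhardt's post): a Boltzmann-rational demonstrator picks arms of a bandit with inverse temperature beta_true, and an IRL procedure that assumes a different rationality level recovers systematically distorted reward gaps. The names (beta_true, beta_assumed) are illustrative.

```python
# Assumed toy setup: IRL on a 3-armed bandit with a mis-specified rationality model.
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([1.0, 0.5, 0.0])
beta_true, beta_assumed = 5.0, 1.0   # the demonstrator is more decisive than we assume

def boltzmann(r, beta):
    z = beta * (r - r.max())
    p = np.exp(z)
    return p / p.sum()

# Demonstrations generated by the true model
counts = rng.multinomial(10_000, boltzmann(true_reward, beta_true))
freq = counts / counts.sum()

# Maximum-likelihood reward under the *assumed* model: matching the empirical
# action frequencies gives beta_assumed * (r_i - r_0) = log(freq_i / freq_0).
inferred_gaps = np.log(freq / freq[0]) / beta_assumed

print("true reward gaps:    ", true_reward - true_reward[0])  # [0, -0.5, -1]
print("inferred reward gaps:", inferred_gaps)                 # roughly [0, -2.5, -5]
# Here the damage is a (relatively benign) rescaling by ~beta_true/beta_assumed;
# mis-specifying the action space, state space, or the demonstrator's information
# distorts the inferred reward in far less graceful ways.
```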

Shah: Future directions in ambitious value learning:

  • Can look for mild assumptions that still enable joint learning of preferences and biases.
  • Can consider 'multimodal' value learning that looks at both behavior and speech (professed values).
  • Can tolerate mild misspecification by not fully optimizing the utility function (eg quantilizers, which pick an action at random from the top q fraction of candidate actions under some base distribution rather than taking the argmax; a sketch follows this list). Reward uncertainty is also an option here, which Shah dismisses for reasons that don't make sense to me (he's worried about uncertainty collapsing to a point, but it's easy to consider models where this can't happen, eg with nonstationary preferences).
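
A minimal sketch of the quantilizer idea (generic formulation, my own code rather than anything from Shah's post): sample candidate actions from some trusted base distribution, then pick uniformly at random from the top q fraction by the learned utility instead of taking the argmax of a possibly mis-specified utility.

```python
# Hypothetical helper: quantilize() is my own illustrative name, not a library API.
import numpy as np

def quantilize(actions, utility, q=0.1, rng=None):
    """Pick an action uniformly at random from the top q fraction by utility."""
    rng = rng or np.random.default_rng()
    scores = np.array([utility(a) for a in actions])
    cutoff = np.quantile(scores, 1.0 - q)
    top = [a for a, s in zip(actions, scores) if s >= cutoff]
    return top[rng.integers(len(top))]

# Usage: 'actions' are samples from a base distribution of human-plausible behavior,
# and 'utility' is the learned (possibly wrong) value model. Because we never
# optimize the proxy all the way to its argmax, mild mis-specification is harder
# to exploit.
rng = np.random.default_rng(0)
candidate_actions = rng.normal(size=100)     # stand-in for base-distribution samples
proxy_utility = lambda a: -(a - 1.0) ** 2    # stand-in for a learned value model
print(quantilize(candidate_actions, proxy_utility, q=0.05, rng=rng))
```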

It's interesting to think about how current GPT-like models are aligned. In the pretraining stage, they learn from human behavior: this is framed as behavioral cloning, but with a model powerful enough that it could plausibly develop an internal mesa-optimizer and model the behavior of various literary agents (both authors and characters!). This doesn't explicitly separate out a 'mistake model' --- we just predict what Dickens did write, not what he 'was trying to write' or 'would have written if he were a better writer' --- although it's possible that we could get at something like that with prompting! But it does get at some plausibly broad model of 'human value functions' for language, by training on text written by a huge swath of the humans who have ever written anything down. Then the RLHF step fine-tunes on stated preferences: humans directly tell the model what kind of behavior they think is good or bad.
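
For concreteness, the preference signal in that second step is typically distilled into a reward model with a pairwise Bradley-Terry style loss, roughly as sketched below (a generic formulation, not any particular lab's implementation); the policy is then fine-tuned against the learned reward, usually with a KL penalty toward the pretrained model.

```python
# Generic sketch of the RLHF reward-modeling objective (assumed standard
# Bradley-Terry pairwise loss; details vary across implementations).
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Scalar reward-model scores for the human-preferred vs. rejected response."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up scores for two preference pairs:
print(reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])))
# The fine-tuned policy is then trained (eg with PPO) to maximize the learned
# reward, with a KL penalty that keeps it close to the pretrained model so it
# doesn't over-optimize the proxy.
```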