Modified: April 12, 2023
worldly objective
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

This may be a central point of confusion: how do we define AI systems that have preferences about the real world, so that their goals and objectives really map onto the world? Let me explain why this seems nontrivial.
In the standard model of rational agency, the reward is defined as a function of the state (or state-action pair) in a Markov decision process. It is assumed that the state (or at least, enough of it to define the reward) is directly observed by the agent. (Or at least, by the search or learning procedure that generates the agent's policy: when we train a model to play Atari games, the policy may be a feedforward network that doesn't directly see the score it's trying to optimize, but the outer-loop RL algorithm has full access to the simulator state, so it can do the necessary updates. I'll generally use 'agent' in the broad sense to include any outer-loop optimization procedures we might build in.)
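As a concrete illustration, here is a minimal sketch with a made-up toy environment (not any particular RL library): the policy only ever sees a lossy observation, while the outer training loop adjusts it using rewards computed from the simulator's true state.

```python
# Minimal sketch: a made-up toy environment, not any particular RL library.
# The policy sees only a lossy observation, but the outer training loop gets a
# reward computed from the simulator's true state: the standard MDP setup.

import random

class ToyEnv:
    """The 'true' world state is an integer position on a line."""
    def __init__(self):
        self.state = 0

    def observe(self):
        # Lossy perceptual channel: the policy only sees the sign of the state.
        return (self.state > 0) - (self.state < 0)

    def step(self, action):
        self.state += action              # action is -1 or +1
        reward = float(self.state)        # reward is a function of the true state
        return self.observe(), reward

def policy(obs, bias):
    """Stand-in policy: a biased coin flip over actions. It receives only the
    observation (unused in this trivial version), never the true state."""
    return 1 if random.random() < bias else -1

# Outer loop: it adjusts the policy using returns computed from the full
# simulator state, even though the policy itself never sees that state.
bias = 0.5
for episode in range(200):
    env, obs, ret = ToyEnv(), 0, 0.0
    for t in range(10):
        obs, r = env.step(policy(obs, bias))
        ret += r
    bias = min(max(bias + 0.001 * ret, 0.05), 0.95)  # crude score-following update
```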
But an embedded agent can never observe or even hope to model the full world state (it lives inside the world, so this would be a map as big as the territory). It has some perceptual system that constructs observations; these constructions are all it ever encounters of the world. Like humans, its experiences and concepts are only ever empty mental representations, never reality itself. It can't usefully have preferences over world states, because it never experiences world states.
The classic POMDP approach to dealing with incomplete information is to reason in terms of belief states: we assume a prior over world states and an 'observation model' that specifies how those states generate observations, which gives us a likelihood of (and, by Bayesian updating, a posterior over) world states given our observation sequence. This posterior, the 'belief state', is a sufficient statistic for acting in the POMDP (a toy sketch of the update follows this list). But this approach is unrealistic for embedded agents, because:
- The 'observation model' isn't a naturally occurring thing. It needs to be provided exogenously, as part of the setup; it can't be learned strictly from the observation sequence (if we do try to learn it from the observation sequence, the 'real world' becomes a latent variable, not the sort of solid thing we can have meaningful preferences over). In practice we often can specify a plausible observation model, because we ourselves designed the perceptual system, but it will always be somewhat wrong: a) the world may affect the agent through side channels we didn't design, and b) the 'type signature' of any observation model assumes some ontology or model for the world states it takes as input, which will necessarily be incomplete and limiting. In reality we can never completely represent the world (for the reasons discussed above); we have, and need, many models.
- Even if we commit to an observation model, we now have the problem of representing belief states, which are probability distributions over world states. But even representing one world state is (by assumption) impossible! So a naive sample-based representation, in terms of 'sets of plausible world states', is impossible.
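To make concrete what the classic recipe demands, here is a minimal sketch of the belief-state update for an invented two-state problem (the states, transition matrix, and observation model are all made up for illustration). Everything the update needs, an enumerated state space plus exogenous transition and observation models, has to be handed over up front, which is exactly what the objections above say an embedded agent doesn't get.

```python
# Minimal sketch of a discrete Bayes filter for an invented two-state POMDP.
# The state space, transition model T, and observation model O are all supplied
# up front: exactly the ingredients an embedded agent doesn't get for free.

import numpy as np

states = ["rainy", "sunny"]
belief = np.array([0.5, 0.5])           # prior over world states

# P(next_state | state): rows = current state, columns = next state
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

# P(observation | state): rows = state, columns = observation (umbrella, no umbrella)
O = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def belief_update(belief, obs_index):
    """One step of prediction plus Bayesian correction; returns the new belief."""
    predicted = T.T @ belief                    # push the belief through the dynamics
    unnormalized = O[:, obs_index] * predicted  # weight by observation likelihood
    return unnormalized / unnormalized.sum()    # posterior over hidden states

for obs in [0, 0, 1]:   # observed: umbrella, umbrella, no umbrella
    belief = belief_update(belief, obs)
    print(dict(zip(states, belief.round(3))))
```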
The best we can hope for is a high-level representation of a belief state through abstraction, the way that saying "the water in this cup has a temperature of 80 degrees C" efficiently represents a belief about possible underlying configurations of molecules in phase space without explicitly representing any particular such configuration. In general, the abstractions we can talk about are those either directly built into our observation stream, or latent variables that we construct from that stream and identify as 'sufficiently stable' to be worth reifying (for example, the physical concept of temperature).
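As a toy version of the temperature example (the constants and distribution are standard physics; treating the single parameter as a compact belief representation is the point being illustrated), one abstract number stands in for a belief over an astronomical number of microstates while still letting us sample concrete consequences on demand.

```python
# Toy version of the temperature example. The constants and the Maxwell-Boltzmann
# form are standard physics; treating the single parameter as a compact 'belief
# representation' over microstates is the point being illustrated.

import numpy as np

K_B = 1.380649e-23      # Boltzmann constant, J/K
M_WATER = 2.99e-26      # approximate mass of one water molecule, kg

def sample_molecular_speeds(temperature_k, n=5):
    """Sample plausible molecular speeds implied by the abstract temperature."""
    sigma = np.sqrt(K_B * temperature_k / M_WATER)   # per-axis velocity std dev
    velocities = np.random.normal(0.0, sigma, size=(n, 3))
    return np.linalg.norm(velocities, axis=1)        # speeds in m/s

# 'The water is at 80 C' compresses a belief over a vast space of molecular
# configurations into one number, yet still lets us sample concrete microstate
# properties on demand.
print(sample_molecular_speeds(temperature_k=353.15))
```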
We can certainly build agents that try to understand their experience through compression of a perceptual stream, and that develop world models as latent variables (this is basically the predictive processing model of cognition). If the predictive task is the agent's sole objective, the predictive agent will mostly just want to make the world predictable, which is dangerous in the limit, since humans are unpredictable. But more likely we would build a predictive subsystem with its own internal learning mechanisms, and then try to define the agent's "main goals" in terms of the learned world model(s). This requires the models to have ontologies similar enough to our own that our values map onto them, and to be interpretable enough that we can do this mapping. But any goals we define are only as sharp and stable as the world model and the perceptual processes that generate it.
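Here is a purely structural sketch of that architecture (every name and number is hypothetical): a predictive subsystem compresses the observation stream into a latent 'world model', and the agent's main goal is then scored against that learned latent rather than against raw observations.

```python
# Structural sketch only; every name and number here is hypothetical. A
# predictive subsystem compresses the observation stream into a latent 'world
# model', and the agent's main goal is then scored against that latent.

import numpy as np

class PredictiveWorldModel:
    """Toy 'world model': an exponential moving average serves as the latent
    state, updated only to reduce next-observation prediction error."""
    def __init__(self, lr=0.1):
        self.latent = 0.0
        self.lr = lr

    def update(self, observation):
        error = observation - self.latent   # prediction error
        self.latent += self.lr * error      # compress the stream into the latent
        return error

def goal_in_latent_space(latent, target=10.0):
    """A 'worldly' objective expressed in the model's ontology: prefer latents
    near a target value. It is only as meaningful as the latent itself."""
    return -abs(latent - target)

model = PredictiveWorldModel()
for obs in np.random.normal(8.0, 1.0, size=50):   # an invented observation stream
    model.update(obs)
print(model.latent, goal_in_latent_space(model.latent))
```

The fragility described above shows up directly in this structure: the goal only means what we want it to mean insofar as the learned latent tracks the feature of the world we had in mind.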
Of course, people have worldly objectives all the time! We want to make the world a better place, to end wars, to end poverty and cure disease, to manufacture appropriate numbers of paperclips for our organizational needs. And notably, these don't seem to be preferences over fixed world models. In economics, for example, we understand that quantities such as GDP, which we can define and measure, are not themselves the real targets; that over-optimizing them is counterproductive (Goodhart's law; a toy sketch follows the list below); and that we need to constantly be working to better measure and model the true thing we do want to optimize. So there is still a reward funnel:
- the things we can measure and directly optimize (GDP)
- things we think we care about, for which our measures are a proxy (global prosperity, itself a proxy for the wellbeing of all sentient beings)
- aspects of 'true reality' which may not be in our hypothesis space (if it turns out we live in a simulation, then we may decide we have goals for other branches of the simulation or for the 'outside', so that global prosperity within our branch seems relatively less critical as a goal).
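Here is a toy Goodhart simulation (every quantity is invented): a greedy optimizer that only sees the measurable proxy pours all its effort into inflating the measurement, and the thing we actually care about gets worse even as the proxy climbs.

```python
# Toy Goodhart simulation; every quantity here is invented. A greedy optimizer
# that only sees the measurable proxy pours all its effort into inflating the
# measurement, and the thing we actually care about gets worse as the proxy climbs.

real_output = 0.0   # genuine prosperity (what we care about, unmeasured)
gaming = 0.0        # effort spent inflating the metric itself

def proxy(real_output, gaming):
    return real_output + 2.0 * gaming      # the 'GDP' we can measure

def true_value(real_output, gaming):
    return real_output - 0.5 * gaming      # gaming quietly erodes the real goal

for step in range(10):
    # One unit of effort per step, allocated to whichever action raises the proxy most.
    gain_if_real = proxy(real_output + 1, gaming) - proxy(real_output, gaming)
    gain_if_game = proxy(real_output, gaming + 1) - proxy(real_output, gaming)
    if gain_if_game > gain_if_real:
        gaming += 1
    else:
        real_output += 1
    print(f"step {step}: proxy = {proxy(real_output, gaming):5.1f}, "
          f"true value = {true_value(real_output, gaming):5.1f}")
```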
Perceptual objectives do map onto worldly objectives in some strange way. If we give a system the objective 'maximize the number of paperclips detected in your visual field', that objective does by extension prefer certain states of the world over others: it's just a weird grab bag including states that actually have more paperclips, states where the system has hacked itself to bypass the perceptual system and maximize the paperclip counter, etc. So it's not that we can't express preferences that favor certain world states over others. It's that the obvious ways to do this don't exactly let us choose the world states; they all end up favoring weirdly-structured sets of states that don't necessarily encode what we want.
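A sketch of that grab bag, using a deliberately cartoonish world representation (everything here is made up for illustration): a reward defined on the detector output gives top marks both to worlds that really contain paperclips and to worlds where the detector has simply been tampered with.

```python
# Sketch of the 'weird grab bag', using a deliberately cartoonish world
# representation. A reward defined on the detector output gives top marks both
# to worlds with real paperclips and to worlds where the detector was tampered with.

from dataclasses import dataclass

@dataclass
class World:
    actual_paperclips: int
    sensor_hacked: bool = False

def detected_paperclips(world: World) -> int:
    # The perceptual channel: what the agent's visual field reports.
    return 10_000 if world.sensor_hacked else world.actual_paperclips

def perceptual_reward(world: World) -> int:
    return detected_paperclips(world)

honest_world = World(actual_paperclips=100)
hacked_world = World(actual_paperclips=0, sensor_hacked=True)

# Both worlds score well under the perceptual objective,
# but only one contains the thing we actually wanted.
print(perceptual_reward(honest_world), perceptual_reward(hacked_world))
```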
Related work:
- this post seems to cover similar themes: https://www.lesswrong.com/posts/RorXWkriXwErvJtvn/agi-will-have-learnt-utility-functions