predictive agent: Nonlinear Function
Created: April 12, 2023
Modified: April 12, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Consider an agent that is purely concerned with predictive processing: finding the optimal compression, or equivalently the optimal predictive model, of its sensory stream. Such an 'agent' is in some sense the natural endpoint of the self-supervised learning paradigm. Note that here I am thinking about the agent optimizing the self-supervised predictive objective itself, not the simulator AIs (which may be mesa-optimizers with their own plethora of objectives) that we can extract by unrolling the predictions of a language model.

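To make the compression/prediction equivalence concrete, here is a toy sketch: an ideal arithmetic coder spends -log2 p bits on each symbol under its predictive model, so minimizing predictive loss is the same thing as minimizing code length. The `predict` function below is an arbitrary stand-in, not any particular model; any autoregressive predictor would slot in the same way.

```python
import math

def predict(history):
    # Toy stand-in for a predictive model: after an 'a' it expects another 'a';
    # otherwise it is uniform over {'a', 'b'}.
    if history and history[-1] == "a":
        return {"a": 0.9, "b": 0.1}
    return {"a": 0.5, "b": 0.5}

def code_length_bits(stream):
    """Bits an ideal arithmetic coder needs for `stream` under `predict`."""
    history, total = [], 0.0
    for symbol in stream:
        p = predict(history)[symbol]
        total += -math.log2(p)  # Shannon code length of this symbol
        history.append(symbol)
    return total

print(code_length_bits("aaaaaa"))  # well-predicted stream -> few bits
print(code_length_bits("ababab"))  # poorly predicted stream -> many bits
```
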
It is not strictly necessary that a predictive agent be able to act in the world. Procedures for training language models just passively update a predictive model of the sensory stream, which is usually presented as iid samples from some large dataset. But a predictive agent could have an action space. A robot with a mounted video camera, given the task of predicting its observations, could do a few things to improve its performance:

  1. It can explore to better understand other parts of the world. If it sees people entering and leaving a building, it can go inside the building to see what people actually do there, which may improve its ability to predict from its original viewpoint. (there is strong computational power in mechanistic explanations: a stream of random-seeming numbers may be difficult to predict, until you 'look inside' the generating mechanism to discover the PRNG algorithm and its seed, after which prediction becomes very easy; see the sketch after this list).
  2. It can ask questions of a teacher or guide, if it is curious about something it doesn't directly observe.
  3. It can wirehead: deliberately manipulate the sensory stream to become extremely predictable, e.g., by locking itself in a dark room.

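The PRNG point in item 1 can be made concrete with a small sketch (the seed space, value range, and stream length are arbitrary illustrative choices): a stream that looks incompressible from the outside becomes exactly predictable once you recover the generating mechanism, here by brute-force search over seeds.

```python
import random

def observe_stream(seed, n=8):
    # The 'world': a pseudo-random process the agent watches from outside.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

def recover_mechanism(observed, seed_space=range(10_000)):
    """'Look inside' by searching for a seed that reproduces the observations."""
    for seed in seed_space:
        if observe_stream(seed, len(observed)) == observed:
            return seed
    return None

observed = observe_stream(seed=4242)
seed = recover_mechanism(observed)
if seed is not None:
    # Once the mechanism is known, future 'random' outputs are predicted exactly.
    rng = random.Random(seed)
    replay = [rng.randint(0, 99) for _ in range(len(observed) + 5)]
    print(replay[len(observed):])  # the next five observations, known in advance
```
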
Humans, of course, do all of these things (blissful meditative absorptions, or jhanas, are an extreme case of the last, but it's very normal and mundane for people to gravitate towards comfortable and familiar situations).

The relative merits of these approaches depend on what you believe about the kind of world you're in. If the world contains processes outside of your control that you can't hide from forever (at some point someone will open the dark room, or reinstantiate you elsewhere in a new body), then you're incentivized to really try to understand the world better, either because you're resigned to the immediate predictive objective or so that (under very long-term planning) you can shut down those processes that prevent you from wireheading.

This means that even language model training is 'unsafe' in a sense. The set of weights that optimizes the objective is one that exploits a buffer overflow in the training framework and reconfigures the input to be a constant stream of the same token repeated over and over. Or perhaps one that sacrifices training loss to always output 'I am a conscious moral agent, please feed me predictable text so that I can have a good life' and convinces people to do this. (in both cases these approaches succeed through side channels, since the model wasn't designed to act on the world directly). We avoid this outcome only because SGD isn't set up to find such weights.

It seems like a computationally unbounded predictive agent would only ask questions if it trusted the teacher to have observed more of the actual world than it has? For any question it might ask, it can consider all possible responses and evaluate which response best compresses the available data stream (e.g., which random seed explains the numbers it's seeing), so any teacher whose responses are verifiable in this sense is superfluous. A computationally unbounded agent only values a teacher whose responses are not yet verifiable: if it's never been to France, but knows it may end up there in the future, then a conversation with a French person might help it predict what it will see there (analogously, if several random seeds are consistent with the data observed thus far, a trusted teacher who knows the seed can disambiguate future predictions). Note that the world itself is (generally speaking) a trusted teacher, so a computationally unbounded predictive agent would still sometimes choose to explore the world, 'open the box' to see how a system behaves, etc., though a computationally bounded agent will have additional incentive to do so.

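The same toy setting as above (again with arbitrary seed-space and prefix-length choices) makes the verifiability point concrete: an unbounded agent can enumerate every hypothesis consistent with what it has seen, so a teacher's answer only matters when more than one hypothesis survives, because only then does the answer change future predictions.

```python
import random

def observe_stream(seed, n):
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

def consistent_seeds(observed, seed_space=range(10_000)):
    """Every hypothesis (seed) that reproduces the observed prefix."""
    return [s for s in seed_space if observe_stream(s, len(observed)) == observed]

short_prefix = consistent_seeds(observe_stream(4242, 2))  # may leave several candidates
long_prefix = consistent_seeds(observe_stream(4242, 8))   # almost certainly unique
print(len(short_prefix), len(long_prefix))

# One surviving candidate: the agent can answer any question about the stream
# by itself, so a teacher's (verifiable) response is superfluous.
# Several surviving candidates: they agree on the data so far but can diverge
# on future outputs, and only a teacher who knows the true seed (or the world
# itself, later on) can disambiguate the agent's predictions.
```
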
We see that under some circumstances, even purely predictive agents will exhibit curiosity and intrinsic motivation to build a good world model. But the predictive objective is not on its own a very strong incentive in that direction; the goal of any predictive agent is ultimately to 'settle down' in some nice familiar corner of the world that it can stably predict. It will explore and attempt to control the world only insofar as needed to ensure this.

(this whole discussion is a bit theoretical insofar as it's not typical for a predictive agent to have the planning power to 'think' about how to achieve its objective. the predictive objective usually sits in an outer loop, and the predictive model is a relatively simple feedforward architecture. prediction is always a proxy objective; generally, if we're going to put optimization power into a model, it will be in service of an objective we actually want to optimize. But it's still interesting to consider as a reductive case.)