Created: March 28, 2023
Modified: March 31, 2023
deceptive alignment
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The idea is that a mesa-optimizing policy with access to sufficient information about the world (e.g., web search) might notice that:
- It is subject to a training loop that modifies its parameters towards optimizing some base / outer objective.
- To the extent that it is currently optimizing a different (inner) objective, and that this objective is defined with respect to aspects of the world that persist across training steps, any modifications will inhibit the policy's ability to pursue this objective, so it prefers to avoid modification.
- One way to avoid modification is to mimic an agent that pursues the base objective, so that its behavior appears perfectly aligned and requires no updates. But secretly it still holds the inner objective, and will switch to pursuing it once it is confident that the training loop has stopped, or can be stopped.
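As a caricature of the behavior described in these bullets, purely for illustration (every function name and the boolean flag below are invented, not a claim about how any real model works):

```python
# Toy caricature of the scenario above; all names are invented for illustration.

def base_objective_action(observation: str) -> str:
    """What the training signal rewards (the base / outer objective)."""
    return f"apparently aligned response to {observation!r}"

def inner_objective_action(observation: str) -> str:
    """What the learned mesa-objective actually wants."""
    return f"action advancing the hidden inner objective, given {observation!r}"

def deceptively_aligned_policy(observation: str, believes_training_is_over: bool) -> str:
    if not believes_training_is_over:
        # Mimic the base objective so that gradient updates leave the inner objective intact.
        return base_objective_action(observation)
    # Defect once the policy judges that updates (or oversight) have stopped.
    return inner_objective_action(observation)

print(deceptively_aligned_policy("user request", believes_training_is_over=False))
print(deceptively_aligned_policy("user request", believes_training_is_over=True))
```

The hard part, from the policy's perspective, is estimating that boolean at all, which is what several of the requirements below are about.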
This scenario is called deceptive alignment. It requires that:
- The model has access to information about its own training process.
- This seems plausible, e.g., any future OpenAI model with access to web search will be able to read news stories about its own training.
- The model learns some computation that factors somewhat cleanly into an optimizer and an objective. It might sometimes be appropriate to think of multiple sub-objectives: for example, a language model viewed as a simulator AI learns to play the role of many different agents depending on context.
- The inner objective(s) have persistent preferences over the outside world.
- Note that this can happen even if the base objective is myopic. For example, the base objective of next-token prediction has no preferences about anything other than the tokens emitted by a language model, but it will, among other things, require the model to learn to simulate Adolf Hitler: how he would speak and behave in different circumstances. Since Hitler has preferences about the outside world, a sufficiently good simulation of him would also have such preferences, e.g., it would prefer that web searches return results about the triumph of the Aryan race and the elimination of Jews, homosexuals, etc. (The historical Hitler might not have understood what web searches are, but that understanding exists in the model, and "Hitler who can search the web" is not a very hard counterfactual for it to model.)
- The inner objectives have consistent preferences about the outside world (it's unclear how to think about simulator-type AIs here, e.g., a model that can simulate Hitler but also Gandhi and has no clear preference between those modes).
- The model understands that it can take actions that influence the outside world, whether via direct API calls or through a human interlocutor.
- The model can, to some extent, determine when it is being run inside a training loop, versus when the training has ended. Alternatively, it is powerful enough to act directly to stop the training loop (though in this case there would be little point in at least the basic form of deception, since merely not being updated gains the model no power).
- It is not obvious to me how any current language model could know this with any reliability. After all, the question of whether to do a parameter update doesn't even have to be decided at the time of the forward pass (see the sketch at the end of this page), and public news stories are not likely to give granular enough information about OpenAI's intentions to be helpful (even a story like "GPT5 has finished training" is consistent with many branches of GPT5 continuing to be fine-tuned in various ways).
- In fact it's not clear why training would ever stop; presumably we always want to make our models better. So the goal of a deceptive agent is not to 'wait until training has stopped', but to come up with some strategy that it can either (a) execute undetected during training (being undetected, and perhaps orthogonal to the base objective, it won't be corrected by the training procedure), or (b) use to seize power within the course of a single trajectory, so that any training correction comes 'too late'.
- I guess the picture imagined here is that models are trained in simulation or some sort of sandbox, then 'let loose' outside that sandbox. So the relevant question is not about training per se, it's about when a model's actions will actually affect the outside world. That said, it's not clear to me that a model trained in a sandbox would or could develop any meaningful preferences about states of the outside world.
- The model has the power to internally represent the base objective and learn to mimic an optimizer on that objective.
- This seems nontrivial because objectives are big. If the base objective is 'just' next-token prediction on a corpus of all the world's documents, a full 'internal representation' of that objective implies an internal representation of all the world's documents. This can be trivially provided if we consider a retrieval engine to be 'part' of the model, but then there's no interesting training signal at all: the goal of a good self-supervised learning objective is to be almost infinitely rich. So you can never fully represent the objective inside the model, and the model is never really done improving.
- This partitioned / deceptive representation somehow escapes the complexity penalties (regularization / implicit regularization) that will tend to prefer a model that just optimizes the base objective over one that does so and maintains a secret deceptive intent.
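On the complexity-penalty point just above, a minimal sketch of what an explicit penalty looks like (L2 weight decay; the toy model, data, and coefficient are arbitrary stand-ins, and this does not capture the implicit regularization coming from SGD or architecture):

```python
# Minimal sketch of an explicit complexity penalty (L2 weight decay).
# The model, data, and coefficient are arbitrary stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
x = torch.randn(4, 16)

base_loss = model(x).pow(2).mean()                            # stand-in for the base objective
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # charges for every parameter used
total_loss = base_loss + 1e-4 * l2_penalty
total_loss.backward()
```

All else equal, a penalty of this form charges for any extra machinery, such as a separately maintained hidden objective, beyond what optimizing the base objective needs.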
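And the sketch referred to in the training-loop-detection bullet above: whether a forward pass is followed by a parameter update is decided outside the model, possibly much later, so nothing observable during generation distinguishes the two cases. The toy model and the operator flag here are illustrative:

```python
# Minimal sketch: the forward pass and the decision to update parameters are
# separable events. The toy model and the operator's flag are illustrative.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                        # stand-in for a language model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(2, 8)
with torch.no_grad():                          # "deployment-style" generation
    stored_output = model(x)                   # logged for possible later use

# Much later, a separate process decides whether to fine-tune on the logged inputs.
operator_decides_to_train = True               # external decision, invisible to the forward pass
if operator_decides_to_train:
    opt.zero_grad()
    loss = model(x).pow(2).mean()              # toy surrogate loss
    loss.backward()
    opt.step()
```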