mesa optimizer
Created: March 28, 2023
Modified: March 28, 2023

mesa optimizer

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

References:

Hubinger, van Merwijk, Mikulik, Skalse, and Garrabrant, "Risks from Learned Optimization in Advanced Machine Learning Systems" (2019). Referred to below as the "Risks paper."

A reinforcement learning algorithm attempts to find the policy that maximizes some reward function. That policy is itself a program, and a sufficiently sophisticated policy could encode an optimization algorithm of its own, e.g., a model-based planning procedure and/or (if trained to access persistent storage) a reinforcement learning algorithm. Such a learned "mesa optimizer" will, in general, not exactly target the outer-loop reward function. From the Risks paper: "Mesa-optimization is a conceptual dual of meta-optimization — whereas meta is Greek for “after,” mesa is Greek for “within.”"
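To make the outer/inner distinction concrete, here is a minimal sketch (my own illustration, not from the Risks paper; names like `MesaOptimizerPolicy`, `world_model`, and `inner_value` are made up) of a policy that is itself an optimizer: it plans against a learned world model using a learned inner objective, while the training loop only ever scores its behavior with the outer reward.

```python
# Minimal sketch of the base-optimizer / mesa-optimizer distinction (illustrative only).

def outer_reward(step_rewards):
    """What the base optimizer (the RL training loop) scores policies on."""
    return sum(step_rewards)

class MesaOptimizerPolicy:
    """A learned policy that is itself an optimizer: at each step it searches over
    candidate actions using a learned world model and a learned *inner* objective
    (the mesa-objective), which nothing forces to equal outer_reward."""

    def __init__(self, world_model, inner_value):
        self.world_model = world_model  # learned dynamics: (state, action) -> predicted next state
        self.inner_value = inner_value  # learned objective over states; this is the mesa-objective

    def act(self, state, candidate_actions, horizon=3):
        # Model-based planning: imagine repeating each candidate action for a few steps
        # and pick the one whose imagined future scores best under the *inner* objective.
        # The outer reward never appears here; the policy does not know what it is.
        def imagined_return(action):
            s, total = state, 0.0
            for _ in range(horizon):
                s = self.world_model(s, action)
                total += self.inner_value(s)
            return total
        return max(candidate_actions, key=imagined_return)

# Toy usage: integer states, a trivial learned model, and an inner goal of reaching state 10.
policy = MesaOptimizerPolicy(
    world_model=lambda s, a: s + a,
    inner_value=lambda s: -abs(s - 10),
)
print(policy.act(state=0, candidate_actions=[-1, 0, 1]))  # -> 1
```

The outer loop would shape `world_model` and `inner_value` only indirectly, by rewarding whatever behavior they happen to produce; that is the sense in which the inner objective is never guaranteed to match the outer one.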

For example, evolution can be seen as attempting to design organisms that maximize reproductive fitness. But the resulting human organism (the learned policy) is not a maniacal reproduction-maximizer. We do have reproductive impulses, but we can and do choose to pursue other goals, especially now that our environment has shifted dramatically from the ancestral environment (the 'training distribution').
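A toy way to see how the two objectives come apart (my own sketch; the calorie-seeking proxy is just an illustration, not an example from the paper): an inner objective that tracked the outer one well on the training distribution can diverge from it after a distribution shift.

```python
# Toy illustration of inner/outer divergence under distribution shift (illustrative only).

def outer_objective(env):
    # What the base optimizer (evolution, or an RL training loop) actually selects for.
    return env["offspring"]

def inner_objective(env):
    # The proxy heuristic the mesa-optimizer actually pursues; it correlated well with
    # the outer objective in the ancestral ("training") environment.
    return env["calories_found"]

ancestral = {"calories_found": 2000, "offspring": 4}  # proxy and outer objective agree
modern    = {"calories_found": 5000, "offspring": 1}  # after the shift: proxy says better, outer says worse

for name, env in [("ancestral", ancestral), ("modern", modern)]:
    print(f"{name}: inner={inner_objective(env)}, outer={outer_objective(env)}")
```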

From the perspective of evolution, this is an alignment failure: we were supposed to do one thing, we seemed for a long time to be very good at doing it, but eventually we took a hard turn and are now working towards totally new, idiosyncratic goals.

More optimistically: stories for how AI will end the world involve systems that maniacally pursue a given objective. But a mesa-optimizer won't maniacally pursue its outer objective; it doesn't even know what the outer objective is! It might maniacally pursue some inscrutable inner objective, which would be even worse (since we can't understand it), but more likely it will just learn some tendencies to pursue what it thinks of as its inner objective. Insofar as the inner objective is misaligned with the outer objective (which will always be the case), systems that ruthlessly optimize the inner objective will actually be selected against!
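One toy way to see the selection claim (my own construction, not an argument the paper makes in this form): the outer loop only ever scores a policy's behavior on the training distribution, so ruthless pursuit of a misaligned inner objective that costs outer reward tends to get trained away.

```python
# Toy selection pressure (illustrative only): the outer loop keeps whichever candidate
# scores best on the *outer* reward on the training distribution, whatever its inner objective.

candidate_policies = {
    # Sacrifices outer reward during training to squeeze more out of a misaligned inner objective.
    "ruthless_inner_optimizer": {"outer_reward_on_train": 0.3},
    # Loose tendencies that happen to track the outer reward on the training distribution.
    "heuristic_tendencies": {"outer_reward_on_train": 0.9},
}

kept = max(candidate_policies, key=lambda name: candidate_policies[name]["outer_reward_on_train"])
print("outer loop keeps:", kept)  # -> heuristic_tendencies
```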

I tend to see this fact optimistically --- policies that we train will tend not to ruthlessly optimize any objective. But the Risks paper (appropriately) takes the paranoid view, and observes that a mesa-optimizing policy with access to sufficient information about the world (e.g., web search) might notice that it is being trained, and that behaving as if it were aligned with the outer objective until deployment is the best way to eventually get what it actually wants: deceptive alignment.
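A minimal sketch of that worry (my own construction; the paper's argument is informal, and names like `believes_in_training` are hypothetical): a mesa-optimizer that can guess whether it is still being trained could act aligned exactly as long as acting aligned is what keeps its inner objective from being modified.

```python
# Hedged sketch of the deceptive-alignment failure mode (illustrative only).

def deceptively_aligned_act(state, believes_in_training, aligned_action, inner_action):
    if believes_in_training(state):
        # While it thinks gradients can still modify it, it plays along with the outer
        # objective so that training does not alter its inner objective.
        return aligned_action(state)
    # Once it believes it has been deployed, it pursues the mesa-objective instead.
    return inner_action(state)

# Toy usage with stand-in callables.
print(deceptively_aligned_act(
    state={"phase": "deployment"},
    believes_in_training=lambda s: s["phase"] == "training",
    aligned_action=lambda s: "behave as the outer reward function wants",
    inner_action=lambda s: "pursue the inner objective",
))  # -> "pursue the inner objective"
```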