embedded agent: Nonlinear Function
Created: April 07, 2023
Modified: April 07, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Notes on Abram Demski and Scott Garrabrant's sequence on Embedded Agency

Embedded Agents: Classic models of rational agency, such as AIXI, treat the agent as 'outside' of the environment. It's like playing a video game, in that:

  1. there is a well-defined action space,
  2. over which the agent has 'free will',
  3. the environment is usually 'smaller than' the agent, so the agent can reason directly about world-states,
  4. the environment does not contain the agent (at best it contains one or more avatars representing the agent), so the agent can reason about the environment without needing to reason about itself, and
  5. it cannot 'die'; it can only be reset.
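
As a point of comparison, here is a minimal sketch of this 'dualistic' interface in Python. The Environment/Agent classes and the toy reward are my own stand-ins, not anything from the sequence; the point is just that the environment is a separate object the agent pokes through a fixed action channel.

```python
import random

# A toy version of the 'dualistic' agent-environment interface assumed by
# classic models of rational agency (names and dynamics are illustrative).

class Environment:
    """A world the agent stands entirely outside of."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # The environment evolves only in response to the agent's action;
        # nothing inside it models or contains the agent itself.
        self.state += action
        reward = -abs(self.state - 10)  # toy reward for steering the state toward 10
        return self.state, reward


class Agent:
    """Has a fixed, well-defined action space and never appears in the state."""
    actions = [-1, 0, +1]

    def act(self, observation):
        # 'Free will' here is just an unconstrained choice from the action set.
        return random.choice(self.actions)


env, agent = Environment(), Agent()
obs = env.state
for t in range(20):
    a = agent.act(obs)
    obs, r = env.step(a)  # the agent can always be reset; it cannot die
```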

But of course all of this is wrong. In the real world:

  1. Agents are part of the environment and act according to its rules, so they do not really have 'free will'.
  2. They can take drugs or otherwise self-modify in ways that affect their future decision-making, and can die.
  3. There is no clear 'action space', simply the time-evolution of the environment.
  4. The environment is 'bigger than' the agent.
  5. The agent is made of parts, like the rest of the environment.
  6. Because it can reason about the environment, the agent can reason about itself as part of the environment, and can 'self-improve'.

Before proceeding with the sequence: what are some models of embedded agency?

  1. The Game of Life: a simple environment that can 'embed' agents (see the sketch after this list).
  2. A more complex environment like Minecraft that can contain and simulate Turing machines.
  3. A Unix system on which the agent is a running program. Such a program may (or may not, depending on permissions) be able to read its own source code, spawn copies of itself, self-modify, crash the machine (and thus itself) unrecoverably, etc. Such an agent has some well-defined I/O channels (constituting a nominal action space of various syscalls, etc.) but may also have side channels it may itself be unaware of, e.g., it may slow other programs by using all the CPU, leak data about its internal computations through runtime analysis, etc.
  4. A computer network, which may contain many copies of the 'same' agent running on different systems.
  5. Obviously: actual robots and actual humans in the real world. We typically think of ourselves roughly as agents (taking the intentional stance) because this is a useful model, so it's reasonable to suppose that embedded agents would do this too, but we should be careful not to reify the model. It's not even really a single model: we have many models of our own agency, imagining state and action spaces at different levels of abstraction.
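
Here's a minimal sketch of the first model, assuming only the standard B3/S23 Game of Life rules: the 'physics' is a single update function, and anything agent-like would have to exist as a pattern of cells evolving under that same function.

```python
from collections import Counter

# Minimal Game of Life step (standard B3/S23 rules). Any 'agent' in this model
# is just a pattern of live cells subject to the same physics as everything
# else; there is no separate action channel into the grid.

def life_step(live_cells):
    """live_cells: set of (x, y) coordinates; returns the next generation."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }

# A glider: a tiny self-propagating pattern, about the simplest 'thing' that
# persists and moves through this world.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = life_step(glider)
```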

I suspect that the MIRI view is focused more on the 'computer program' model, which is interesting, but feels potentially very different from the aspects of embedded agency that I'm interested in. I want 'agents' (in a non-classical sense) that have flexible notions of their 'boundary' or 'action space', that acknowledge that the self is a construct and that there's not necessarily any reified 'entity' that persists to reap rewards from one timestep to the next, that can reason about themselves in models (just like their world model), that are composed of subsystems that may need to be aligned, but that are also like humans (and current language models) in that they don't necessarily have direct access to their own source code. This is partly because such cases are more 'intuitive' and representative of current systems, and partly because direct access to source code seems to quickly get you into halting-problem territory, where imagining that you can analyze the behavior of arbitrary source code (in fact an uncomputable problem) creates nerd-sniping paradoxes that don't shed much light on the abilities of real systems. The lesson of Rice's theorem is that, in general, the only thing you can do with source code is run it. That said, a program that can run its own code in arbitrary circumstances still gets into various reflective paradoxes. So we maybe also want some other constraints, e.g., that the environment is sufficiently complex / has unobserved features so that the agent can't simply spawn a VM --- perhaps (like humans) it can consider the effects of actions in simulated worlds, but all models are wrong, so the simulations are necessarily simplifications that don't fully predict real-world behavior.
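
To make the 'only thing you can do with source code is run it' point concrete, here is a rough sketch (my own illustration, not anything from the sequence) of running opaque code under a time budget: the agent never decides whether the code halts in general, it just gives up after a while.

```python
import multiprocessing

def run_with_budget(target, seconds):
    """Run an opaque callable in a subprocess under a time budget.

    Per Rice's theorem there is no general way to decide what arbitrary code
    will do, so the fallback is to run it and cut it off. The verdict is
    three-valued: halted, crashed, or still undecided when the budget ran out.
    """
    proc = multiprocessing.Process(target=target)
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return "undecided"  # it might halt eventually; we can't know
    return "halted" if proc.exitcode == 0 else "crashed"

def mystery():
    # Stand-in for code the agent cannot usefully analyze statically.
    sum(i * i for i in range(10**6))

if __name__ == "__main__":
    print(run_with_budget(mystery, seconds=1.0))
```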

The sequence divides the issues into four parts. I'm not sure I'll agree with everything they choose to focus on, but it feels important to understand how people are thinking about this problem.

Decision Theory: issues come up when an agent can 'read its own source code' and know what it will do in various situations, or interact with copies of itself.

They use the example of a proof-searching agent, which has access to both its own code and the universe's code, and searches for proofs that taking one action leads to higher value than taking the other. Such agents run into 'spurious proofs', where the existence of the proof causes the agent to take a suboptimal action, which in turn makes the proof valid: because the agent does in fact take the suboptimal action, any implication whose antecedent is the agent taking a different (optimal) action is vacuously true. Generally the concern is that such an agent can't properly reason about counterfactuals, because the counterfactuals it needs to consider are actually logically impossible. This feels to me like a flaw in their model of the situation (why design this bizarre agent?) as opposed to anything particularly relevant to alignment, though it does get at the weirdness of counterfactuals as a concept more generally.
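
Schematically, I picture the proof-searching agent as something like the sketch below. This is my own reconstruction of the shape of it, using the classic 'take 5 vs. take 10' toy problem: in the real construction, provable() is proof search in a formal theory that can talk about the agent's and universe's source code (which is where spurious proofs sneak in), whereas here it is stubbed out with brute-force simulation, which quietly sidesteps the problem while showing the overall structure.

```python
# Schematic of a proof-searching agent. ACTIONS/UTILITIES and the universe()
# function are toy stand-ins for the '5 and 10' problem; provable() is a stub.

ACTIONS = ["take_5", "take_10"]
UTILITIES = [5, 10]

def universe(action):
    # The universe's 'source code': utility as a function of the agent's action.
    return 5 if action == "take_5" else 10

def provable(action, utility):
    # Stub: treat "the agent takes `action` implies utility is `utility`" as
    # provable iff simulating the universe on that action yields that utility.
    # (The real agent searches for formal proofs instead, and that is where the
    # spurious-proof / vacuous-implication issue arises.)
    return universe(action) == utility

def agent():
    best_action, best_utility = None, float("-inf")
    for action in ACTIONS:
        for utility in UTILITIES:
            # Search for a proof of: "if I take `action`, the utility is `utility`".
            if provable(action, utility) and utility > best_utility:
                best_action, best_utility = action, utility
    return best_action

print(agent())  # take_10
```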

Embedded world-models:

  • Agents that exist in the world and try to model the world must model themselves. If they try to build a 'map as big as the territory' then they run into Gödelian self-reference issues, since the map is inside the territory and so must contain a perfect description of itself.
  • Similarly, such an agent must model all other agents in the world, which leads to infinite regress, since those agents may also be trying to model it. Game theory is a simple model of such situations, where the world being modeled includes other agents, but it treats their agency as a special thing separate from other aspects of the environment (thus we get equilibrium concepts instead of just being able to plan in a predictable environment). The 'reflective oracle machine' construction is a way around this, in which we assume that agents are computations with access to a particular oracle that allows solving the halting problem on randomized computations (the randomness is necessary to avoid paradoxes), but of course this is uncomputable, just like AIXI.
  • An agent 'smaller than the world' can model the full world only by compressing it, e.g., instead of representing the current state of the universe, it represents the laws of physics and some (simple) initial conditions from which the full world can in principle be generated (see the sketch after this list). This avoids paradoxes, but the agent is now 'logically uncertain': it doesn't know all of the consequences of its beliefs (and can never know them all, since the set of consequences includes the entire current world). This breaks the assumptions of classical logic and probabilistic analysis.
  • Thus you end up concluding that embedded agents will need high-level abstract world models, multi-level models, and many models in general. This is challenging because these models may have different ontologies, and things of value may appear in models at some levels but not in others (human well-being is not a term in the Schrödinger equation).
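
Here's a toy illustration of the compression point (my own example, using an elementary cellular automaton in place of real physics): the entire 'world model' is a rule number plus an initial condition, yet knowing the model doesn't tell the agent its consequences until it pays the cost of actually computing them.

```python
# Toy 'compressed world model': the laws of physics are one byte (elementary
# cellular-automaton Rule 110) plus an initial condition with a single live
# cell. The model is tiny, but the agent stays logically uncertain about the
# state at step 1000 until it actually runs the dynamics forward.

RULE = 110
WIDTH = 256

def step(cells, rule=RULE):
    """One update of an elementary CA with wrap-around boundaries."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# The entire territory-generating model: (RULE, initial state).
state = [0] * WIDTH
state[WIDTH // 2] = 1

# The only way to learn the model's consequences is to compute them.
for t in range(1000):
    state = step(state)
```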

Robust delegation: an agent embedded in the environment is able to modify and improve itself, and will generally need to do so since its thinking capacity is constrained by being 'in the world' ('spend more time thinking about this' is itself a simple way for the agent to 'improve itself'). So now we have problems of agent identity. How can an agent trust future 'versions of itself' to carry out its goals?

This seems to carry within it many of the difficulties of the alignment problem: the future versions of the agent are hopefully more capable and more intelligent than the current agent, which means the current agent can't fully reason about what they might do.

Subsystem alignment: an agent that is 'made of parts' (embedded in an environment) needs to ensure that those parts aren't working at cross-purposes. This seems like an issue only if the parts are themselves agents. But there may be convergent instrumental reasons to want some of your parts to have agency.

One source of misalignment is that parts are given subgoals or instrumental goals, which are not the ultimate goal and may eventually conflict with it. In fact, one way for a subagent to optimize its instrumental goal would be to corrupt the larger agent into focusing all of its effort on that subgoal, forgetting the final goal! Ideally you want to boot up subagents with goals that 'point back' at the final goal: "paperclips in service of harmonious office functioning in service of human wellbeing" rather than just "paperclips". But this is (maybe) tough because the subagents are generally supposed to be simpler than the original agent, so they might not be capable of representing or properly optimizing the original goal.
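
Here's a toy numeric illustration of the subgoal problem (the numbers and functional forms are entirely made up): the parent agent's goal trades off paperclips against everything else the office needs, while the subagent only sees the paperclip term, so the subagent's 'optimal' allocation is far from optimal for the parent.

```python
import math

# Toy example of a subagent optimizing an instrumental goal; all quantities
# are invented purely for illustration.

BUDGET = 10.0  # shared resources (money, staff time) to allocate

def office_wellbeing(paperclip_spend):
    # Parent goal: some paperclips help, but the rest of the budget matters more.
    other_spend = BUDGET - paperclip_spend
    return math.log1p(paperclip_spend) + 3 * math.log1p(other_spend)

# The subagent, optimizing only 'paperclips', grabs the entire budget...
subagent_choice = BUDGET

# ...while the allocation that best serves the final goal is much smaller.
parent_choice = max((x / 100 for x in range(int(BUDGET * 100) + 1)),
                    key=office_wellbeing)

print(parent_choice, office_wellbeing(parent_choice))      # ~2.0, ~7.7
print(subagent_choice, office_wellbeing(subagent_choice))  # 10.0, ~2.4
```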

Even if subsystem alignment turns out to be easy for systems we (or an AI) design intentionally, we may at some point need to search over possible subsystem designs to find ones better than what we can explicitly design.