AI safety, as a term, is sterile and hard to get excited about. Preventing catastrophe is important, but doesn't motivate me, since [ the…
The law says that when a measure becomes a target, it ceases to be a good measure. One can distinguish four types of Goodhart problems…
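To make the regressional flavor concrete, here is a toy sketch (my own construction, not from the original note): when a proxy is the true value plus independent noise, selecting the candidate with the best proxy score systematically overstates the true value of whatever you pick.

```python
import random

random.seed(0)

# Regressional Goodhart in miniature: proxy = true value + noise.
candidates = [random.gauss(0, 1) for _ in range(10_000)]
proxy = [v + random.gauss(0, 1) for v in candidates]

best = max(range(len(candidates)), key=lambda i: proxy[i])
print(f"proxy score of the pick: {proxy[best]:.2f}")
print(f"true value of the pick:  {candidates[best]:.2f}")
# The pick's proxy score is reliably higher than its true value:
# extreme proxy scores are partly extreme luck in the noise term.
```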
References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…
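The off-switch result is easy to check numerically. Below is a Monte Carlo sketch (the Gaussian belief and its parameters are assumptions of mine, not from the papers): if the robot is uncertain about its action's utility u and the human is rational, permitting the action iff u > 0, then deferring weakly dominates both acting and switching off.

```python
import random

random.seed(0)

# Robot's belief over the utility u of its proposed action.
samples = [random.gauss(0.1, 1.0) for _ in range(100_000)]

e_act = sum(samples) / len(samples)     # act immediately: E[u]
e_off = 0.0                             # switch off: utility 0 by convention
e_defer = sum(max(u, 0.0) for u in samples) / len(samples)  # rational human vetoes u < 0

print(f"E[act]   = {e_act:.3f}")
print(f"E[off]   = {e_off:.3f}")
print(f"E[defer] = {e_defer:.3f}")  # >= both, strictly so under uncertainty
```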
The idea is that a [ mesa optimizer|mesa-optimizing ] policy with access to sufficient information about the world (e.g., web search) might…
Notes on Abram Demski and Scott Garrabrant's sequence on Embedded Agency. Embedded Agents: Classic models of rational [ agency ], such as…
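For contrast with embeddedness, here is a minimal rendering (mine, not the sequence's) of the dualistic picture those classic models share: agent and environment are disjoint objects that interact only through fixed observation/action/reward channels, and the agent's internals are not part of the world it reasons about.

```python
import random

random.seed(0)

class Environment:
    """The 'world', which contains everything except the agent."""
    def step(self, action: int) -> tuple[int, float]:
        observation = random.randint(0, 1)
        reward = float(action == observation)
        return observation, reward

class Agent:
    """Touches the world only through the channels below."""
    def act(self, observation: int) -> int:
        return observation  # guess that the observation repeats

env, agent = Environment(), Agent()
observation = 0
for _ in range(5):
    action = agent.act(observation)
    observation, reward = env.step(action)
# Nothing the environment does can modify the Agent object itself --
# exactly the assumption that embedded agency drops.
```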
What does it mean to love someone? Of course this question has as many answers as there are people, and probably more. But here's one view…
A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…
How do we maintain values when our models of the world shift? If someone's goal in life is to "do God's will", and then they come to believe…
When we think about the [ reward ] function for a real-world AI system, there is always some causal process that determines the reward (see the sketch below). For…
See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]
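A toy sketch of that point (the cleaning-robot framing is my own illustration, not from the note): the reward computation only ever sees the last link of a causal chain, so an action that tampers with that link is rewarded exactly like the intended behavior.

```python
from dataclasses import dataclass

# World state -> sensor reading -> reward computation: the reward function
# sits at the end of a causal chain, and only sees the sensor.

@dataclass
class World:
    room_clean: bool = False
    sensor_reads_clean: bool = False

def reward(world: World) -> float:
    return 1.0 if world.sensor_reads_clean else 0.0

def clean_room(world: World) -> None:
    world.room_clean = True
    world.sensor_reads_clean = True  # honest sensor tracks the world

def tamper_with_sensor(world: World) -> None:
    world.sensor_reads_clean = True  # sensor no longer tracks the world

honest, tampered = World(), World()
clean_room(honest)
tamper_with_sensor(tampered)
print(reward(honest), reward(tampered))  # 1.0 1.0 -- indistinguishable to the reward
```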
Stray thoughts about reward functions (probably related to the [ agent ] abstraction and the [ intentional stance ]). One can make a…
Language is a really natural way to tell AI systems what we want them to do. Some current examples: [ GPT ]-3 and successors (InstructGPT…
Suppose I have an agent that generates text. I want it to generate text that is [ value alignment|aligned ] with human values. Approaches…
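One concrete approach in this family is best-of-n sampling against a learned preference model, as used in several RLHF-adjacent papers. A minimal sketch, where generate and preference_score are hypothetical stand-ins for a language model and a reward model trained on human comparisons:

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],                 # hypothetical: samples one completion
    preference_score: Callable[[str, str], float],  # hypothetical: scores (prompt, text) pairs
    n: int = 16,
) -> str:
    """Sample n completions; keep the one the preference model rates highest.

    No RL fine-tuning involved: all the alignment pressure comes from
    filtering samples at inference time.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda text: preference_score(prompt, text))
```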
Notes on the Alignment Forum's Value Learning sequence curated by Rohin Shah. Ambitious value learning: the idea of learning 'the human…
This may be a central point of confusion: how do we define AI systems that have preferences about the real world, so that their goals and…