AI safety, as a term, is sterile and hard to get excited about. Preventing catastrophe is important, but it doesn't motivate me, since [ the…
Modified: January 24, 2022.
Goodhart's law says that when a measure becomes a target, it ceases to be a good measure. One can distinguish four types of Goodhart problems…
Modified: April 08, 2023.
References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…
Modified: April 05, 2023.
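A hedged numerical sketch of the off-switch argument from that setting: a robot uncertain about the utility U of its proposed action compares acting now, shutting down, and deferring to a human who permits the action iff U > 0 (the rational-human assumption those papers discuss). The belief distribution below is an arbitrary stand-in:

```python
import random

random.seed(0)

# The robot's belief over the (unknown) utility U of its proposed action.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

act_now = sum(samples) / len(samples)                     # E[U]
shut_down = 0.0                                           # utility of doing nothing
defer = sum(max(u, 0.0) for u in samples) / len(samples)  # E[max(U, 0)]

print(f"act now:   {act_now:.3f}")
print(f"shut down: {shut_down:.3f}")
print(f"defer:     {defer:.3f}")
# E[max(U, 0)] >= max(E[U], 0), so deferring weakly dominates: uncertainty
# about U gives the robot a positive incentive to leave the off-switch in
# the human's hands.
```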
The idea is that a [ mesa optimizer|mesa-optimizing ] policy with access to sufficient information about the world (e.g., web search) might…
Modified: March 31, 2023.
Notes on Abram Demski and Scott Garrabrant's sequence on Embedded Agency. Embedded Agents: Classic models of rational [ agency ], such as…
Modified: April 07, 2023.
[ value alignment ] research often frames the problem as: first, learn the human 'value function' --- for every possible state of the world…
Modified: June 17, 2024.
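To make that frame concrete, here is a minimal sketch of the two-stage decomposition the entry describes: fit a scalar score over states from human judgments, then plan against it. All names and data are hypothetical placeholders, not a real API:

```python
# Stage 1: learn a scalar "value function" over world states from human data.
# Stage 2: hand it to a planner that steers toward the highest-scoring state.

def fit_value_function(human_judgments):
    """Average the human ratings observed for each state."""
    scores = {}
    for state, rating in human_judgments:
        scores.setdefault(state, []).append(rating)
    return {s: sum(r) / len(r) for s, r in scores.items()}

def plan(value_fn, reachable_states):
    """Pick whichever reachable state the learned function scores highest."""
    return max(reachable_states, key=lambda s: value_fn.get(s, float("-inf")))

judgments = [("clean_room", 1.0), ("clean_room", 0.8), ("messy_room", -0.5)]
V = fit_value_function(judgments)
print(plan(V, ["clean_room", "messy_room"]))  # -> "clean_room"
```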
What does it mean to [ love ] someone? Of course this question has as many answers as there are people, and probably more. But here's one…
Modified: November 28, 2023.
A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…
Modified: March 31, 2023.
How do we maintain values when our models of the world shift? If someone's goal in life is to "do God's will", and then they come to believe…
Modified: April 12, 2023.
For the [ reward ] function of any real-world AI system, there is always some causal process that determines the reward. For…
Modified: April 12, 2023.
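A toy sketch of why that causal process matters (illustrative names only): once reward is produced by a physical channel, an agent that can act on the channel can raise measured reward without touching the quantity the designer intended:

```python
# The designer cares about room_cleanliness, but the reward the agent
# actually receives is whatever the sensing process outputs.

class World:
    def __init__(self):
        self.room_cleanliness = 0.0  # what the designer cares about
        self.sensor_bias = 0.0       # part of the causal reward process

    def measured_reward(self):
        return self.room_cleanliness + self.sensor_bias

world = World()
world.room_cleanliness += 1.0    # the "intended" strategy
print(world.measured_reward())   # 1.0
world.sensor_bias += 10.0        # tampering with the reward process itself
print(world.measured_reward())   # 11.0 -- reward up, cleanliness unchanged
```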
Stray thoughts about reward functions (probably related to the [ agent ] abstraction and the [ intentional stance ]). One can make a…
Modified: April 06, 2023.
See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]
Modified: June 12, 2021.
Language is a really natural way to tell AI systems what we want them to do. Some current examples: [ GPT ]-3 and successors (InstructGPT…
Modified: April 07, 2022.
Suppose I have an agent that generates text. I want it to generate text that is [ value alignment|aligned ] with human values. Approaches…
Modified: February 21, 2022.
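The note's list of approaches is truncated, so as one standard example (not necessarily among the ones the note goes on to name): best-of-n sampling against a learned preference model. `generate` and `alignment_score` below are hypothetical stand-ins for a language model and a reward model:

```python
# Best-of-n: draw several candidate generations and keep the one the
# preference model scores highest.

def best_of_n(prompt, generate, alignment_score, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=alignment_score)

# Toy usage with stub models:
import random
random.seed(0)
stub_generate = lambda p: p + " " + random.choice(["kindly", "rudely", "neutrally"])
stub_score = lambda text: 1.0 if "kindly" in text else 0.0
print(best_of_n("Reply", stub_generate, stub_score))
# With this stub scorer, a "kindly" completion wins whenever one is sampled.
```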
Notes on the Alignment Forum's Value Learning sequence, curated by Rohin Shah. Ambitious value learning: the idea of learning 'the human…
Modified: April 07, 2023.
This may be a central point of confusion: how do we define AI systems that have preferences about the real world, so that their goals and…
Modified: April 12, 2023.