Created: March 31, 2023
Modified: March 31, 2023
objectives are big
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A very incomplete and maybe nonsensical intuition I want to explore.
Classically, people talk about very simple reward functions like 'number of paperclips on earth', or maybe even more trivially 'number of times the reward button is pressed'. In applied robotics, maybe something like 'minimize the distance between the current and target configurations'. The function is assumed to be representable with very few bits. If we are in a Markov decision process, reward is a function of the (state, action) tuple, and it's usually assumed to be a very simple, easily-specified function.
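To make that classical picture concrete, here is a minimal sketch (my own illustration, not from any particular source) of a reward defined as a tiny function of a (state, action) pair, along the lines of the robotics example:

```python
from dataclasses import dataclass

@dataclass
class State:
    # Hypothetical robot-arm state: current and target joint angles.
    joints: tuple[float, ...]
    target: tuple[float, ...]

def reward(state: State, action: object) -> float:
    """The classical 'small' reward: negative squared distance between the
    current and target configurations. The whole objective fits in one line."""
    return -sum((j - t) ** 2 for j, t in zip(state.joints, state.target))
```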
This intuition drives worries about deceptive alignment, in which an agent learns an internal objective that differs from the simple reward function it is trained on and behaves as though aligned in order to keep pursuing that objective.
- loss functions: decompose into a measurement of the current configuration (maybe complex), a target configuration (maybe complex), and a distance metric (maybe simple); sketched below
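A rough sketch of that decomposition (function names are mine, purely for illustration): the only genuinely "small" piece is the metric, while the measurement and target components can each hide arbitrary complexity.

```python
import numpy as np

def measure_configuration(observation: np.ndarray) -> np.ndarray:
    # Maybe complex: in practice this could be a whole perception stack
    # mapping raw sensor data to a pose estimate. Stubbed as identity here.
    return observation

def target_configuration() -> np.ndarray:
    # Maybe complex: the desired configuration may itself come from planning,
    # preference modeling, etc. Stubbed as a constant 7-DoF pose here.
    return np.zeros(7)

def distance(current: np.ndarray, target: np.ndarray) -> float:
    # Maybe simple: a plain Euclidean metric.
    return float(np.linalg.norm(current - target))

def loss(observation: np.ndarray) -> float:
    # The loss as a whole inherits the complexity of its first two parts.
    return distance(measure_configuration(observation), target_configuration())
```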