Created: March 31, 2023
Modified: March 31, 2023
objectives are big
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A very incomplete and maybe nonsensical intuition I want to explore.
Classically, people talk about very simple reward functions like 'number of paperclips on earth', or maybe even more trivially 'number of times the reward button is pressed'. In applied robotics, maybe something like 'minimize the distance between the current and target configurations'. The function is assumed to be representable with very few bits. If we are in a Markov decision process, reward is a function of the (state, action) tuple, and it's usually assumed to be a very simple, easily-specified function.
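To make that classical picture concrete, here is a minimal sketch (my own illustration, not from any particular source) of a reward defined as a tiny function of a (state, action) pair, along the lines of the robotics example:

```python
from dataclasses import dataclass

@dataclass
class State:
    # Hypothetical robot-arm state: current and target joint angles.
    joints: tuple[float, ...]
    target: tuple[float, ...]

def reward(state: State, action: object) -> float:
    """The classical 'small' reward: negative squared distance between the
    current and target configurations. The whole objective fits in one line."""
    return -sum((j - t) ** 2 for j, t in zip(state.joints, state.target))
```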
This intuition drives worries about deceptive alignment, in which an agent learns an internal objective that differs from the simple reward function it is trained on and behaves as though aligned in order to keep pursuing that objective.
- loss functions: decompose into a measurement of the current configuration (maybe complex), a target configuration (maybe complex), and a distance metric (maybe simple); sketched below
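A rough sketch of that decomposition (function names are mine, purely for illustration): the only genuinely "small" piece is the metric, while the measurement and target components can each hide arbitrary complexity.

```python
import numpy as np

def measure_configuration(observation: np.ndarray) -> np.ndarray:
    # Maybe complex: in practice this could be a whole perception stack
    # mapping raw sensor data to a pose estimate. Stubbed as identity here.
    return observation

def target_configuration() -> np.ndarray:
    # Maybe complex: the desired configuration may itself come from planning,
    # preference modeling, etc. Stubbed as a constant 7-DoF pose here.
    return np.zeros(7)

def distance(current: np.ndarray, target: np.ndarray) -> float:
    # Maybe simple: a plain Euclidean metric.
    return float(np.linalg.norm(current - target))

def loss(observation: np.ndarray) -> float:
    # The loss as a whole inherits the complexity of its first two parts.
    return distance(measure_configuration(observation), target_configuration())
```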