Modified: March 03, 2022
rl with proxy objectives
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Suppose we want to maximize reward, but we only get a couple of bits of reward data every few hundred or thousand actions, whereas we get massive environmental feedback (an entire video frame, say) at every action.
It would be natural to come up with a multi-objective loss, where we try to maximize reward and to reconstruct observations. This may work in practice but is unsatisfying, because ultimately reward is enough: the reason to reconstruct observations is to maximize reward, so we should never be willing to trade off reward for some other loss.
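As a rough sketch of what that multi-objective loss might look like (all the names here, `encoder`, `policy_head`, `decoder`, `recon_coef`, are made up for illustration, assuming image observations and a discrete policy trained by policy gradient):

```python
import torch
import torch.nn as nn

# Hypothetical modules: an encoder shared by the policy and a reconstruction head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
policy_head = nn.Linear(256, 4)        # logits over 4 discrete actions
decoder = nn.Linear(256, 64 * 64 * 3)  # reconstructs the flattened frame

recon_coef = 0.1  # arbitrary trade-off weight: exactly the unsatisfying part

def multi_objective_loss(obs, action, advantage):
    """Policy-gradient term (maximize reward) plus reconstruction term."""
    z = encoder(obs)
    logp = torch.log_softmax(policy_head(z), dim=-1)
    logp_taken = logp.gather(1, action.unsqueeze(1)).squeeze(1)
    rl_loss = -(logp_taken * advantage).mean()
    recon_loss = ((decoder(z) - obs.flatten(1)) ** 2).mean()
    return rl_loss + recon_coef * recon_loss
```

The problem is visible right in `recon_coef`: any nonzero weight means we are, at the margin, willing to give up some reward to reconstruct pixels better.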
Another practical approach is unsupervised pretraining on a reconstruction objective, followed by fine-tuning on reward. This might work fine, but is also unsatisfying since it fails to use the observations that come in at test time. A good system should be able to learn to model and maximize reward in a novel environment; the meta-level shape of machine learning is that it should never stop learning.
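Continuing the same hypothetical setup, the pretrain-then-fine-tune recipe would look roughly like this (dummy tensors stand in for the real observation stream); the critique is that phase 2 never learns anything from the observations it sees:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
decoder = nn.Linear(256, 64 * 64 * 3)
policy_head = nn.Linear(256, 4)

# Phase 1: unsupervised pretraining on reconstruction alone.
pretrain_batches = [torch.rand(8, 3, 64, 64) for _ in range(10)]  # stand-in for logged observations
pretrain_opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=3e-4)
for obs in pretrain_batches:
    recon_loss = ((decoder(encoder(obs)) - obs.flatten(1)) ** 2).mean()
    pretrain_opt.zero_grad()
    recon_loss.backward()
    pretrain_opt.step()

# Phase 2: fine-tune on reward only. The decoder is dropped, and from here on
# fresh observations are only ever used for acting, never for learning to model.
finetune_opt = torch.optim.Adam([*encoder.parameters(), *policy_head.parameters()], lr=3e-4)
```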
We want an approach that ultimately maximizes reward, but benefits from all the additional data we can throw at it. TODO: is reward shaping part of the answer here?
- What would the potential function be? Say we train a self-supervised model with whatever loss we choose, then use its representation to predict the value of each state, and use that prediction as the potential (see the sketch below).
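A minimal sketch of that idea (the module names and the value head are my own illustration, not a specific published method): train a self-supervised encoder on whatever auxiliary loss we like, fit a value predictor on top of its representation, and use that prediction as the potential. Potential-based shaping of the form r + γΦ(s') − Φ(s) (Ng et al., 1999) leaves the optimal policy unchanged, so the extra signal can speed learning without ever trading off reward.

```python
import torch
import torch.nn as nn

gamma = 0.99

# Hypothetical pieces: a self-supervised encoder (trained separately on a
# reconstruction/contrastive loss) and a head predicting state value from its features.
ssl_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
value_head = nn.Linear(256, 1)

def potential(obs):
    """Phi(s): predicted value of the state, from the self-supervised representation."""
    with torch.no_grad():
        return value_head(ssl_encoder(obs)).squeeze(-1)

def shaped_reward(reward, obs, next_obs, done):
    """r + gamma * Phi(s') - Phi(s); Phi of a terminal state is taken as zero."""
    next_phi = torch.where(done, torch.zeros_like(reward), potential(next_obs))
    return reward + gamma * next_phi - potential(obs)
```

The shaped reward would then feed into whatever RL algorithm is already in use; the shaping term changes how quickly it learns, not what it ultimately optimizes.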