values all the way down
Created: March 31, 2022
Modified: October 16, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The standard Markov decision process formalism includes a reward function $R(s_t, a_t, s_{t+1})$; the total (discounted) reward across a trajectory is its return. For any policy $\pi$ and initial state $s$ we can also define a value function $V_\pi(s) = \mathbb{E}_{\pi}\left[\sum_t \gamma^t r_t\right]$, which gives the expected return of a trajectory starting at $s$ and following $\pi$.
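
To keep these two objects concrete, here is a minimal sketch in code. The `env.reset_to`/`env.step` interface and the `policy` callable are my own assumptions rather than any particular library's API: `discounted_return` computes a trajectory's return, and `monte_carlo_value` estimates $V_\pi(s)$ by averaging the returns of rollouts started from $s$.

```python
GAMMA = 0.99  # discount factor

def discounted_return(rewards, gamma=GAMMA):
    """Return of one trajectory: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def monte_carlo_value(env, policy, state, gamma=GAMMA, n_rollouts=100):
    """Estimate V_pi(state) by averaging the returns of rollouts from it."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset_to(state)       # hypothetical: reset env to a chosen state
        rewards, done = [], False
        while not done:
            a = policy(s)             # policy maps state -> action
            s, r, done = env.step(a)  # hypothetical step signature
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)
```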

Reward and value are importantly different: the reward is given, but value must be computed or estimated. Formally speaking, values are downstream of rewards. But in other ways they have a very similar flavor.

An intuition I have about the meta-level shape of machine learning is that reward and value are actually not fundamentally different. This would suggest that the standard formalism is somehow misleading or incomplete. Why do I think this and what are the implications?

A few initial observations:

  1. In real life, rewards are not given. We somehow infer our goals and preferences from experience. So the MDP formalism that treats reward as a definite and fixed component of the environment is not a good model of human decision-making.
  2. Value functions nominally predict future rewards, but in practice with temporal difference learning they mostly predict future values. Can it just be 'values all the way down': an infinite oscillating loop of state values predicting each other? (See the TD(0) sketch after this list.)
  3. Values are aggregations of reward over multiple timesteps. What if each individual 'reward' is itself an aggregation of sub-rewards, the 'value' of some policy in a sub-MDP invoked by that action? (A toy version is sketched below.)
  4. The insight of distributional RL is that it's helpful to maintain a distribution over the return from a state rather than just its expected value. The insight of reward uncertainty (e.g. in cooperative inverse reinforcement learning) is that it's helpful to maintain posterior distributions over the reward. Perhaps there's a useful lens in which these are in some sense the same insight? (This seems like a stretch, because CIRL is about epistemic uncertainty over rewards, while the distributions in distributional RL track aleatoric uncertainty.)
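
On point 2, a minimal tabular TD(0) sketch makes the bootstrapping explicit (the `(s, r, s_next, done)` transition format is just an assumption for illustration): each update moves $V(s)$ toward $r + \gamma V(s')$, so value estimates are mostly trained against other value estimates, with reward entering one step at a time.

```python
from collections import defaultdict

def td0_sweep(V, transitions, alpha=0.1, gamma=0.99):
    """One tabular TD(0) pass over (s, r, s_next, done) transitions.

    The target r + gamma * V[s_next] bootstraps off the current estimate of
    the next state's value: V is trained mostly against other entries of V,
    with reward entering only one step at a time.
    """
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])
    return V

V = defaultdict(float)  # value table, all states start at 0.0
```

Monte Carlo targets (as in the earlier sketch) use actual sampled returns instead; TD trades that for targets built out of other value estimates.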

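And a toy version of point 3, reusing `monte_carlo_value` from the sketch above; `make_sub_mdp` and `sub_policy` are hypothetical names, there only to make the recursion concrete:

```python
def nested_reward(outer_state, outer_action, make_sub_mdp, sub_policy):
    """Outer-level 'reward' computed as the value of a sub-MDP.

    make_sub_mdp and sub_policy are hypothetical; monte_carlo_value is the
    estimator sketched earlier. The point is only that the scalar returned
    here is itself an aggregation of lower-level rewards.
    """
    sub_env, sub_start = make_sub_mdp(outer_state, outer_action)
    return monte_carlo_value(sub_env, sub_policy, sub_start)
```
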
TODO: think this through more; do a literature search for existing writing in this space.