Modified: March 03, 2022
reward shaping
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Suppose we have a Markov decision process in which we get reward only at the very end of a long trajectory. Until that point, we have no way to know if we're on the right track. Such an MDP is very hard to solve, because we essentially have to try all possible trajectories.
Imagine that we modified the reward function to give little bits of reward at each step along the optimal trajectory. The same trajectory is optimal, but now we can find it with a simple greedy approach.
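As a concrete toy illustration (my own sketch, not from any particular source): take a chain of states $0, \dots, N$ where only the final transition into state $N$ pays off, and shape it with the potential $\Phi(s) = s$. Every step toward the goal then earns immediate reward, so one-step greedy action selection already follows the optimal trajectory.

```python
# Toy chain MDP: states 0..N, actions -1 (step left) and +1 (step right).
# Original reward: 1 only on the transition that reaches state N, 0 otherwise.
# Shaped reward: add Phi(s') - Phi(s) with the potential Phi(s) = s.
N = 10

def reward(s, a, s_next):
    return 1.0 if s_next == N else 0.0

def phi(s):
    return float(s)  # the potential increases toward the goal

def shaped_reward(s, a, s_next):
    return reward(s, a, s_next) + phi(s_next) - phi(s)

def step(s, a):
    return min(max(s + a, 0), N)  # clip to the ends of the chain

# One-step greedy comparison from the start state:
s = 0
for a in (-1, +1):
    s_next = step(s, a)
    print(a, reward(s, a, s_next), shaped_reward(s, a, s_next))
# The original reward can't distinguish the two actions (0.0 vs 0.0), but the
# shaped reward immediately prefers stepping right (1.0 vs 0.0), so a greedy
# agent follows the optimal trajectory without any long-horizon search.
```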
In general, given a reward function, what is the equivalence class of reward functions that share the same optimal policy? Let's think about it:
- We can multiply all rewards by an arbitrary positive constant.
- We can add an arbitrary constant to all rewards if the environment is fixed-length, or infinite-horizon with a fixed discount factor (the difference between Heaven and Hell matters if you're choosing how much time to stay there, but otherwise it's just a constant shift to all final rewards). Both transformations are easy to check numerically; see the sketch after this list.
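Here's a minimal numeric check of those two transformations (a throwaway sketch with made-up rewards): in a fixed-horizon problem where the reward depends only on the action, applying $r \mapsto a \cdot r + b$ with $a > 0$ rescales and shifts every trajectory's return identically, so the best trajectory doesn't change.

```python
import itertools

# Fixed horizon T, small action set, per-step reward depending only on the action.
# Because every trajectory has the same length, r -> a*r + b (a > 0) shifts every
# trajectory's return by the same amount and rescales it by the same factor,
# leaving the ranking of trajectories -- and hence the optimal policy -- unchanged.
T = 3
actions = [0, 1, 2]
reward = {0: -1.0, 1: 0.5, 2: 2.0}  # arbitrary per-step rewards

def best_trajectory(r):
    trajs = itertools.product(actions, repeat=T)
    return max(trajs, key=lambda traj: sum(r[act] for act in traj))

a, b = 3.0, -10.0  # arbitrary positive scale and arbitrary shift
transformed = {act: a * r + b for act, r in reward.items()}

print(best_trajectory(reward))       # (2, 2, 2)
print(best_trajectory(transformed))  # (2, 2, 2) -- same optimal behavior
```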
Can we do more? Suppose we have a reward function $R(s, a, s')$ defined on transitions. The reward of a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$ is the sum of rewards from all transitions, $R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t, s_{t+1})$. Now consider the "shaped reward"

$$R'(s, a, s') = R(s, a, s') + \Phi(s') - \Phi(s)$$

for some potential function $\Phi$ on states. Let's continue to assume that the total number of steps $T$ is fixed. Does this preserve the optimal policy?
Theorem (sufficiency of potentials): For any trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$ the potential functions mostly cancel out: the total shaped reward

$$R'(\tau) = \sum_{t=0}^{T-1}\left[R(s_t, a_t, s_{t+1}) + \Phi(s_{t+1}) - \Phi(s_t)\right] = R(\tau) + \Phi(s_T) - \Phi(s_0)$$

looks pretty similar to the original reward. Without loss of generality we can assume that all trajectories start at the same state $s_0$ (if there's some probability distribution over initial states, we define a dummy initial state that transitions to those states with the appropriate probability) and end at the same state $s_T$ (a sink state that we transition to deterministically after $T$ steps regardless of the previous state), so $\Phi(s_0)$ and $\Phi(s_T)$ can be treated as constants, guaranteeing that $R'(\tau) = R(\tau) + c$ for any trajectory, where $c = \Phi(s_T) - \Phi(s_0)$ (and we can choose $\Phi(s_0) = \Phi(s_T) = 0$ to simplify things further if we like). Since the value of a policy is just the expected value of trajectories under that policy,

$$V^\pi = \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right], \qquad V'^\pi = \mathbb{E}_{\tau \sim \pi}\left[R'(\tau)\right],$$

the shaped values are simply offset by the same constant: $V'^\pi = V^\pi + c$. Thus any policy that's optimal under $R'$ is optimal under $R$, and vice versa.
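A quick numerical sanity check of the telescoping argument (again a throwaway sketch with arbitrary made-up rewards and potentials): with a fixed start state and a fixed sink state, the shaped return exceeds the original return by the same constant $\Phi(s_T) - \Phi(s_0)$ on every trajectory.

```python
import random

# For any trajectory, summing the shaped rewards
# R'(s, a, s') = R(s, a, s') + Phi(s') - Phi(s)
# gives the original return plus Phi(s_T) - Phi(s_0), which is the same
# constant for every trajectory once the start and sink states are fixed.
random.seed(0)
num_states, T = 5, 8

phi = [random.uniform(-1, 1) for _ in range(num_states)]  # arbitrary potential
R = {}                                                    # arbitrary transition rewards

def reward(s, s_next):
    # cache a random reward per (s, s') pair so repeated visits are consistent
    return R.setdefault((s, s_next), random.uniform(-1, 1))

for _ in range(5):  # several random trajectories of length T
    states = [0] + [random.randrange(num_states) for _ in range(T - 1)] + [num_states - 1]
    original = sum(reward(s, sn) for s, sn in zip(states, states[1:]))
    shaped = sum(reward(s, sn) + phi[sn] - phi[s] for s, sn in zip(states, states[1:]))
    # Both columns print the same value, and it's identical across trajectories.
    print(round(shaped - original, 10), round(phi[states[-1]] - phi[states[0]], 10))
```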
Q: what is the optimal potential function?