A few (relatively uninformed) thoughts about on- vs. off-policy [reinforcement learning]. Advantages of on-policy learning: On-policy…
Suppose we want to maximize reward, but we only get a couple of bits of reward data every few hundred or few thousand actions, whereas we get…