off-policy
Created: March 31, 2022
Modified: April 23, 2022

off-policy

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A few (relatively uninformed) thoughts about on- vs off-policy reinforcement learning.

Advantages of on-policy learning:

  • On-policy learning can be exponentially more sample-efficient, because it focuses attention on the parts of the state space that actually matter for achieving our goal.

Why do off-policy learning?

  • Off-policy training is analogous to unsupervised pretraining and potentially useful for the same reasons. On-policy experience is expensive, while off-policy traces can be stored to build up increasingly large training sets.
  • Experience replay is inherently off-policy, since the previous policy that generated the stored transitions is not the same as (though it may be similar to) your current policy; see the sketch below.
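
To make the replay-buffer point concrete, here is a minimal sketch in Python (the class and its fields are illustrative, not any particular library's API). Storing the behavior policy's action probability alongside each transition is what later makes importance-weight corrections possible:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay-buffer sketch (illustrative, not a real library API)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, behavior_prob):
        # Keep the probability the *behavior* policy assigned to `action`,
        # so that later learners can importance-weight this transition.
        self.buffer.append((state, action, reward, next_state, behavior_prob))

    def sample(self, batch_size):
        # By the time we sample, the policy that generated these transitions
        # is already stale, which is why replay is inherently off-policy.
        return random.sample(self.buffer, batch_size)
```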

Not all off-policy learning is created equal. Off-policy experience is useful to the extent that the demonstration policy shares our goals or at least explores the same regions of state space. It may not be very helpful to learn from the experience of a policy that explores an entirely different region of state space to achieve entirely different goals; by contrast, learning from a policy that is very similar to our target policy (for example, an ε-greedy version of it, or the slightly stale policies behind a replay buffer) might be 'almost as good' as on-policy training. Importance-weight corrections formalize this intuition.
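
As a minimal sketch of the importance-weight correction alluded to above (the trajectory format and function names are my own assumptions), ordinary importance sampling reweights each trajectory's return by the ratio of target to behavior action probabilities:

```python
import numpy as np

def importance_weighted_return(episodes, target_prob, behavior_prob):
    """Ordinary importance-sampling estimate of the target policy's expected return.

    episodes: list of trajectories, each a list of (state, action, reward) tuples.
    target_prob(s, a):   probability the target policy would take action a in state s.
    behavior_prob(s, a): probability the demonstration policy took action a in state s.
    """
    estimates = []
    for episode in episodes:
        # Per-trajectory weight: product of per-step probability ratios.
        rho = np.prod([target_prob(s, a) / behavior_prob(s, a)
                       for s, a, _ in episode])
        ret = sum(r for _, _, r in episode)  # undiscounted return, for simplicity
        estimates.append(rho * ret)
    return np.mean(estimates)
```

The closer the behavior policy is to the target policy, the closer these ratios stay to 1 and the lower the estimator's variance; a behavior policy that explores an entirely different region of state space drives most weights toward zero and a few toward huge values, which is the quantitative version of "not very helpful".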

Crucial distinctions in off-policy training:

  • Do we know the action probabilities of the demonstration policy? (if so, we can do importance weighting)
  • Did the demonstration policy see the same observations that we see?

Off-policy learning of causal effects

On-policy learning over a series of trials is obviously capable of learning causal effects, because we directly observe the result of our actions.

The off-policy case is more subtle. If the demonstration policy was able to observe any state features that we don't, those features can confound our view of its decision making. For example, suppose we observe a person's total lifespan (the reward) along with a single feature indicating whether they decided to smoke or not. But the policy that generated the data got to see another feature indicating whether the person is in good health, and is (for whatever reason) configured so that a person will smoke if and only if they're otherwise healthy. We might erroneously conclude from this data that smoking increases lifespan, because all the smokers that we see are healthier than the non-smokers (even after accounting for the negative effects of smoking).
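
A toy simulation of that scenario (the specific numbers are invented purely to reproduce the paradox, not taken from any real data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder: is the person otherwise healthy? We never observe this.
healthy = rng.random(n) < 0.5
# The data-generating policy smokes if and only if the person is healthy.
smokes = healthy.copy()
# Lifespan: being healthy adds years, smoking subtracts years.
lifespan = 70 + 10 * healthy - 5 * smokes + rng.normal(0, 2, n)

# Naive comparison on the observed feature alone:
print(lifespan[smokes].mean())    # ~75: smokers (all healthy) appear to live longer...
print(lifespan[~smokes].mean())   # ~70: ...than non-smokers, despite smoking's -5 effect.
```

Importance weighting over the features we observe can't repair this on its own, because the confounding feature (`healthy`) never appears in our data.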

We can view off-policy experience as representing a randomized controlled trial, even if we ourselves didn't run the trial, as long as:

  • we have explicit access to the action (assignment) probabilities, so that we can apply importance weighting (even if the probabilities depend on other state features, importance weights will correct for this; e.g., we can query importance-weighted expectations under the policy where everyone is assigned with 50/50 probability independent of their state), OR