state values, then action values: Nonlinear Function
Created: March 29, 2022
Modified: March 29, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A common pattern in reinforcement learning pedagogy is to develop some idea first in the context of estimating state values V(s_t), and then extend it to estimate action values Q(s_t, a_t). For example: moving from temporal difference learning on states, to SARSA and Q-learning on actions.
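
To make that pattern concrete, here is a minimal tabular sketch (the names and the ALPHA/GAMMA hyperparameters are my own choices, not from any particular source); the SARSA and Q-learning updates are literally the TD(0) update applied to (state, action) keys instead of states:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99     # assumed learning rate and discount factor

V = defaultdict(float)       # state values V(s)
Q = defaultdict(float)       # action values Q(s, a), keyed by (s, a)

def td0_update(s, r, s_next):
    """Tabular TD(0): update the value of the state we were just in."""
    target = r + GAMMA * V[s_next]
    V[s] += ALPHA * (target - V[s])

def sarsa_update(s, a, r, s_next, a_next):
    """SARSA: the same update, but on (state, action) pairs."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, actions):
    """Q-learning: bootstrap off the greedy next action rather than the one taken."""
    target = r + GAMMA * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```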

Such an extension is always possible, since we can view the state-action pair (s_t, a_t) as an element of an augmented state space: for example, by treating action selection and action execution as separate steps, so that our trajectories look like (s_0, \emptyset) \rightarrow (s_0, a_0) \rightarrow (s_1, \emptyset) \rightarrow (s_1, a_1) \rightarrow \ldots. But why do things this way?
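
Here is a sketch of that construction, assuming a hypothetical environment with a gym-style reset()/step() interface that returns (next_state, reward, done); the wrapper name and details are illustrative only:

```python
class SelectThenExecute:
    """Splits each environment step into an action-selection step and an
    action-execution step. The augmented state is (s, None) before an action
    has been chosen and (s, a) after; state values of the augmented states
    (s, a) are exactly the action values Q(s, a)."""

    def __init__(self, env):
        self.env = env
        self.state = None      # underlying environment state s
        self.pending = None    # chosen-but-not-yet-executed action

    def reset(self):
        self.state, self.pending = self.env.reset(), None
        return (self.state, None)                      # (s_0, emptyset)

    def step(self, action=None):
        if self.pending is None:
            # Selection step: record the action; no reward, no real transition.
            self.pending = action
            return (self.state, action), 0.0, False    # (s_t, a_t)
        # Execution step: now actually advance the underlying environment.
        s_next, reward, done = self.env.step(self.pending)
        self.state, self.pending = s_next, None
        return (s_next, None), reward, done            # (s_{t+1}, emptyset)
```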

State-value estimates V(s_t) are a little bit easier to think about, just because there are fewer moving parts. But they're not directly useful for control. Control requires us to choose actions, so we need to know how good the actions are. This is generally why we ultimately end up formulating RL algorithms in terms of Q-values.
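
A small sketch of the difference: acting greedily from Q is a plain argmax, while acting greedily from V needs a one-step lookahead through a model. Here model(s, a) -> (expected_reward, next_state) is an assumed interface, not any library's API:

```python
def greedy_from_q(Q, s, actions):
    # Action values make control direct: just take the argmax over actions.
    return max(actions, key=lambda a: Q[(s, a)])

def greedy_from_v(V, s, actions, model, gamma=0.99):
    # State values alone are not enough: we need a model of the dynamics
    # to evaluate each action by looking one step ahead.
    def lookahead(a):
        expected_reward, s_next = model(s, a)
        return expected_reward + gamma * V[s_next]
    return max(actions, key=lookahead)
```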

When would state values be useful?

  1. In planning or model-based RL settings where transition dynamics are available.
  2. Sharing statistical strength across actions from a single state may aid generalization, even if we ultimately care about the action values.
  3. As a baseline for policy gradient methods, and to estimate the advantage function via the temporal difference error (sketched below).
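
For the third point, here is a minimal sketch of using a learned state-value baseline to form one-step advantage estimates from the temporal difference error (the function name and the default gamma are my own choices):

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step advantage estimates A_t ~= r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards[t] is the reward received after acting in s_t, and values holds
    V(s_0), ..., V(s_T) for the whole trajectory (use 0 for a terminal state).
    """
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values[:-1], values[1:])]
```

A policy gradient method would then weight the log-probability gradients of the taken actions by these advantages rather than by raw returns, which is exactly the variance-reduction role of the state-value baseline.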