Created: April 23, 2022
Modified: April 23, 2022
reinforcement learning notation
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There tends to be a lot going on in RL algorithms, with a whole mess of different quantities defined across timesteps. It's useful to try to standardize notation. I'll attempt to use this notation consistently in my notes and to update it as needed.
Notation | Quantity | Notes |
---|---|---|
$\mathcal{S}$, $\mathcal{A}$ | state and action spaces | |
$\gamma$ | discount factor | |
$p(s' \mid s, a)$ | dynamics | |
$r(s, a, s')$ | reward function | may be simplified to $r(s, a)$ or $r(s)$ when appropriate |
$(s_t, a_t, r_{t+1}, s_{t+1})$ | state transition | reward is associated with the next timestep |
$\tau$ | trajectory of length $T$ | |
$G_t$ | empirical return from timestep $t$ | see the code sketch after the table |
$G_t^{(n)}$, $G_t^\lambda$ | $n$-step or $\lambda$-averaged estimate of the return | typically depends on approximate values $\hat{v}$; may be written $G_t^\lambda(\hat{v})$ when this dependence is salient |
$\delta_t$ | temporal difference error | may indicate the specific value function as $\delta_t^{\hat{v}}$ if not clear from context |
$\pi_\theta$ | policy with parameters $\theta$ | |
$\pi^*$ | optimal policy | |
$d^\pi$ | discounted state occupancy distribution under $\pi$ | |
$\mathcal{H}_t$ | shorthand for policy entropy $\mathcal{H}[\pi(\cdot \mid s_t)]$ | |
$v^\pi$, $q^\pi$, $A^\pi$ | state and action values and advantage for policy $\pi$ | these are the true values (which we may not know), not approximations |
$v^*$, $q^*$ | values under the optimal policy | |
$\hat{v}_\phi$, $\hat{q}_\phi$ | approximate values (estimates) with parameters $\phi$ | |
$J(\pi)$ or $J(\theta)$ | shorthand for the RL objective $\mathbb{E}_{\tau \sim \pi}[G_0]$, or equivalently $\mathbb{E}_{s_0}[v^\pi(s_0)]$ | |
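
To make a few of these definitions concrete, here's a minimal sketch in plain Python (the trajectory, values, and variable names below are made up for illustration, not tied to any particular codebase) computing the empirical return $G_t$, the one-step TD error $\delta_t$, and the $\lambda$-return $G_t^\lambda$ on a short recorded trajectory.

```python
# Minimal illustration of the notation above on a length-T trajectory.
# rewards[t] is r_t, the reward received on entering s_t, so rewards[0] is unused.

gamma = 0.9          # discount factor
lam = 0.8            # lambda for the lambda-return
T = 3                # trajectory length

rewards = [None, 1.0, 0.0, 2.0]   # r_1, r_2, r_3 (made-up numbers)
v_hat = [0.5, 1.2, 1.8, 0.0]      # approximate values v_hat(s_0)..v_hat(s_3); terminal value is 0

# Empirical return G_t = r_{t+1} + gamma * G_{t+1}, with G_T = 0.
G = [0.0] * (T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t + 1] + gamma * G[t + 1]

# One-step TD error delta_t = r_{t+1} + gamma * v_hat(s_{t+1}) - v_hat(s_t).
delta = [rewards[t + 1] + gamma * v_hat[t + 1] - v_hat[t] for t in range(T)]

# lambda-return, also computed backwards:
# G_t^lambda = r_{t+1} + gamma * ((1 - lam) * v_hat(s_{t+1}) + lam * G_{t+1}^lambda).
G_lam = [0.0] * (T + 1)
for t in reversed(range(T)):
    G_lam[t] = rewards[t + 1] + gamma * ((1 - lam) * v_hat[t + 1] + lam * G_lam[t + 1])

print("empirical returns G_t:  ", G[:T])
print("TD errors delta_t:      ", delta)
print("lambda-returns G_t^lam: ", G_lam[:T])
```

Note that with $\lambda = 1$ the $\lambda$-return recursion reduces to the empirical return $G_t$, and with $\lambda = 0$ it reduces to the one-step target $r_{t+1} + \gamma \hat{v}(s_{t+1})$.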