reinforcement learning notation
Created: April 23, 2022
Modified: April 23, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There tends to be a lot going on in RL algorithms, with a whole mess of different quantities defined across timesteps. It's useful to try to standardize notation. I'll attempt to use this notation consistently in my notes and to update it as needed.

| Notation | Quantity | Notes |
| --- | --- | --- |
| $\mathcal{S}, \mathcal{A}$ | state and action spaces | |
| $\gamma$ | discount factor | |
| $p(s' \vert s, a)$ | dynamics | |
| $r(s, a, s')$ | reward function | may be simplified to $r(s)$ or $r(s, a)$ when appropriate |
| $(s_t, a_t, r_{t+1}, s_{t+1})$ | state transition | reward is associated with the next timestep |
| $\tau = (s_0, \ldots, s_{T-1}, a_{T-1}, r_T)$ | trajectory of length $T$ | |
| $v_t = \sum_{k=1}^{T-t} \gamma^{k-1} r_{t+k}$ | empirical return from $(s_t, a_t)$ | |
| $v_t^{(n)}, v_t^\lambda$ | $n$-step or $\lambda$-averaged estimate of the return | typically depends on approximate values $V_\phi$; may be written $v^{(n)}_{\phi, t}$ when this dependence is salient |
| $\delta_t = v_t - V(s_t)$ | temporal difference error | may indicate a specific value function as $\delta^\pi_t, \delta^*_t, \delta_{\phi, t}$ if not clear from context |
| $\pi_\theta(a \vert s)$ | policy with parameters $\theta$ | |
| $\pi^*(a \vert s)$ | optimal policy | |
| $d_{\pi,\gamma}(s) \propto \mathbb{E}_\pi \sum_{t=0}^\infty \gamma^t \mathbb{1}[s_t = s]$ | discounted state occupancy distribution under $\pi$ | |
| $H_\pi[\cdot \vert s_t]$ | shorthand for the policy entropy $H(\pi(\cdot \vert s_t))$ | |
| $V^\pi(s), Q^\pi(s, a), A^\pi(s, a)$ | state and action values and advantage for policy $\pi$ | these are the true values (which we may not know), not approximations |
| $V^*(s), Q^*(s, a), A^*(s, a)$ | values under the optimal policy $\pi^*$ | |
| $V_\phi(s), Q_\phi(s, a), A_\phi(s, a)$ | approximate values (estimates) with parameters $\phi$ | |
| $J(\pi_\theta)$ or $J(\theta)$ | shorthand for the RL objective | $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right]$, or equivalently $J(\pi_\theta) = \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right]$ |
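To make the return-related definitions concrete, here is a minimal NumPy sketch (not part of the original notes; the function names, toy rewards, and toy value estimates are all hypothetical) computing the empirical return $v_t$, an $n$-step return $v_t^{(n)}$ bootstrapped from approximate values $V_\phi$, the $\lambda$-return $v_t^\lambda$, and the TD errors $\delta_{\phi, t} = v_t - V_\phi(s_t)$ for a single episodic trajectory of length $T$.

```python
import numpy as np

def empirical_returns(rewards, gamma):
    """v_t = sum_{k=1}^{T-t} gamma^{k-1} r_{t+k}, for t = 0, ..., T-1.

    rewards[t] holds r_{t+1}, the reward that follows action a_t.
    """
    T = len(rewards)
    v = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        v[t] = running
    return v

def n_step_returns(rewards, values, gamma, n):
    """v_t^{(n)}: sum the first n discounted rewards, then bootstrap with V_phi(s_{t+n})."""
    T = len(rewards)
    v = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)
        discounts = gamma ** np.arange(horizon)
        v[t] = np.dot(discounts, rewards[t:t + horizon])
        if t + n < T:  # past the end of an episodic trajectory, V(s_T) is treated as 0
            v[t] += gamma ** n * values[t + n]
    return v

def lambda_returns(rewards, values, gamma, lam):
    """v_t^lambda via the recursion v_t = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * v_{t+1})."""
    T = len(rewards)
    v = np.zeros(T)
    next_return = 0.0  # the lambda-return past the end of the trajectory is 0
    for t in reversed(range(T)):
        bootstrap = values[t + 1] if t + 1 < T else 0.0
        next_return = rewards[t] + gamma * ((1 - lam) * bootstrap + lam * next_return)
        v[t] = next_return
    return v

# Toy trajectory of length T = 5 with made-up rewards and value estimates.
rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])   # r_1, ..., r_T
values = np.array([0.5, 0.6, 0.9, 0.4, 1.5])    # V_phi(s_0), ..., V_phi(s_{T-1})
gamma, lam = 0.99, 0.95

v = empirical_returns(rewards, gamma)            # Monte Carlo returns v_t
v_n = n_step_returns(rewards, values, gamma, n=2)
v_lam = lambda_returns(rewards, values, gamma, lam)
deltas = v - values                              # TD errors delta_{phi, t} = v_t - V_phi(s_t)
```

Averaging $v_0$ over trajectories sampled from different start states $s_0$ then gives a simple Monte Carlo estimate of the objective $J(\pi_\theta) = \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right]$.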
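Similarly, the discounted state occupancy $d_{\pi,\gamma}$ can be estimated from trajectories sampled under $\pi$ by weighting each visit to a state at time $t$ by $\gamma^t$ and normalizing, which follows the proportionality in the table directly. The sketch below is again hypothetical and assumes a small discrete state space indexed by integers.

```python
import numpy as np

def occupancy_estimate(state_trajectories, num_states, gamma):
    """Monte Carlo estimate of d_{pi,gamma}(s) from trajectories sampled under pi.

    Each visit to state s at time t contributes weight gamma^t; normalizing recovers
    the distribution proportional to E_pi sum_t gamma^t 1[s_t = s].
    """
    d = np.zeros(num_states)
    for states in state_trajectories:      # each element is a list [s_0, s_1, ..., s_{T-1}]
        for t, s in enumerate(states):
            d[s] += gamma ** t
    return d / d.sum()

# Toy example: two short trajectories over a 3-state space.
trajs = [[0, 1, 1, 2], [0, 2, 2]]
print(occupancy_estimate(trajs, num_states=3, gamma=0.9))
```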