reinforcement learning notation
Created: April 23, 2022
Modified: April 23, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There tends to be a lot going on in RL algorithms, with a whole mess of different quantities defined across timesteps. It's useful to try to standardize notation. I'll attempt to use this notation consistently in my notes and to update it as needed.

| Notation | Quantity | Notes |
| --- | --- | --- |
| $\mathcal{S}, \mathcal{A}$ | state and action spaces | |
| $\gamma$ | discount factor | |
| $p(s' \vert s, a)$ | dynamics | |
| $r(s, a, s')$ | reward function | may be simplified to $r(s)$ or $r(s, a)$ when appropriate |
| $(s_t, a_t, r_{t+1}, s_{t+1})$ | state transition | reward is associated with the next timestep |
| $\tau = (s_0, \ldots, s_{T-1}, a_{T-1}, r_T)$ | trajectory of length $T$ | |
| $v_t = \sum_{k=1}^{T-t} \gamma^{k-1} r_{t+k}$ | empirical return from $(s_t, a_t)$ | |
| $v_t^{(n)}, v_t^\lambda$ | $n$-step or $\lambda$-averaged estimate of the return | typically depends on approximate values $V_\phi$; may be written $v^{(n)}_{\phi, t}$ when this dependence is salient |
| $\delta_t = v_t - V(s_t)$ | temporal difference error | may indicate a specific value function as $\delta^\pi_t, \delta^*_t, \delta_{\phi, t}$ if not clear from context |
| $\pi_\theta(a \vert s)$ | policy with parameters $\theta$ | |
| $\pi^*(a \vert s)$ | optimal policy | |
| $d_{\pi,\gamma}(s) \propto \mathbb{E}_\pi \sum_{t=0}^\infty \gamma^t \mathbb{1}[s_t = s]$ | discounted state occupancy distribution under $\pi$ | |
| $H_\pi[\cdot \vert s_t]$ | shorthand for the policy entropy $H(\pi(\cdot \vert s_t))$ | |
| $V^\pi(s), Q^\pi(s, a), A^\pi(s, a)$ | state and action values and advantage for policy $\pi$ | these are the true values (which we may not know), not approximations |
| $V^*(s), Q^*(s, a), A^*(s, a)$ | values under the optimal policy $\pi^*$ | |
| $V_\phi(s), Q_\phi(s, a), A_\phi(s, a)$ | approximate values (estimates) with parameters $\phi$ | |
| $J(\pi_\theta)$ or $J(\theta)$ | shorthand for the RL objective | $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right]$, or equivalently $J(\pi_\theta) = \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right]$ |
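To make the return-related definitions concrete, here is a minimal NumPy sketch (not part of the original notes; the function names, toy rewards, and toy value estimates are all hypothetical) computing the empirical return $v_t$, an $n$-step return $v_t^{(n)}$ bootstrapped from approximate values $V_\phi$, the $\lambda$-return $v_t^\lambda$, and the TD errors $\delta_{\phi, t} = v_t - V_\phi(s_t)$ for a single episodic trajectory of length $T$.

```python
import numpy as np

def empirical_returns(rewards, gamma):
    """v_t = sum_{k=1}^{T-t} gamma^{k-1} r_{t+k}, for t = 0, ..., T-1.

    rewards[t] holds r_{t+1}, the reward that follows action a_t.
    """
    T = len(rewards)
    v = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        v[t] = running
    return v

def n_step_returns(rewards, values, gamma, n):
    """v_t^{(n)}: sum the first n discounted rewards, then bootstrap with V_phi(s_{t+n})."""
    T = len(rewards)
    v = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)
        discounts = gamma ** np.arange(horizon)
        v[t] = np.dot(discounts, rewards[t:t + horizon])
        if t + n < T:  # past the end of an episodic trajectory, V(s_T) is treated as 0
            v[t] += gamma ** n * values[t + n]
    return v

def lambda_returns(rewards, values, gamma, lam):
    """v_t^lambda via the recursion v_t = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * v_{t+1})."""
    T = len(rewards)
    v = np.zeros(T)
    next_return = 0.0  # the lambda-return past the end of the trajectory is 0
    for t in reversed(range(T)):
        bootstrap = values[t + 1] if t + 1 < T else 0.0
        next_return = rewards[t] + gamma * ((1 - lam) * bootstrap + lam * next_return)
        v[t] = next_return
    return v

# Toy trajectory of length T = 5 with made-up rewards and value estimates.
rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])   # r_1, ..., r_T
values = np.array([0.5, 0.6, 0.9, 0.4, 1.5])    # V_phi(s_0), ..., V_phi(s_{T-1})
gamma, lam = 0.99, 0.95

v = empirical_returns(rewards, gamma)            # Monte Carlo returns v_t
v_n = n_step_returns(rewards, values, gamma, n=2)
v_lam = lambda_returns(rewards, values, gamma, lam)
deltas = v - values                              # TD errors delta_{phi, t} = v_t - V_phi(s_t)
```

Averaging $v_0$ over trajectories sampled from different start states $s_0$ then gives a simple Monte Carlo estimate of the objective $J(\pi_\theta) = \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right]$.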
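Similarly, the discounted state occupancy $d_{\pi,\gamma}$ can be estimated from trajectories sampled under $\pi$ by weighting each visit to a state at time $t$ by $\gamma^t$ and normalizing, which follows the proportionality in the table directly. The sketch below is again hypothetical and assumes a small discrete state space indexed by integers.

```python
import numpy as np

def occupancy_estimate(state_trajectories, num_states, gamma):
    """Monte Carlo estimate of d_{pi,gamma}(s) from trajectories sampled under pi.

    Each visit to state s at time t contributes weight gamma^t; normalizing recovers
    the distribution proportional to E_pi sum_t gamma^t 1[s_t = s].
    """
    d = np.zeros(num_states)
    for states in state_trajectories:      # each element is a list [s_0, s_1, ..., s_{T-1}]
        for t, s in enumerate(states):
            d[s] += gamma ** t
    return d / d.sum()

# Toy example: two short trajectories over a 3-state space.
trajs = [[0, 1, 1, 2], [0, 2, 2]]
print(occupancy_estimate(trajs, num_states=3, gamma=0.9))
```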