Q-learning: Nonlinear Function
Created: March 29, 2022
Modified: April 23, 2022

Q-learning

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Following the pattern of estimating state values and then action values, the one-step temporal-difference update for action values

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s, a)\right]$$

is called SARSA. Presumably you can define multi-step versions of SARSA analogous to $n$-step TD updates, and SARSA($\lambda$) analogous to TD($\lambda$).
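In tabular form this is a one-line update. Here's a minimal sketch in numpy; the table sizes, step size, and discount are arbitrary placeholders, not anything canonical:

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    """One-step SARSA: bootstrap from the action actually taken at s_next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```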

Both TD state-value estimation methods and SARSA are considered on-policy, because they expect that the transition is representative of the policy $\pi$ under evaluation, and they converge to the state and action values of that policy. We can modify SARSA to be off-policy by replacing the action $a_{t+1}$ actually taken with the estimated optimal action:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s, a)\right]$$

This is the Q-learning update.
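The only change from the SARSA sketch above is bootstrapping from the max rather than from the action actually taken (reusing `Q`, `alpha`, and `gamma` from that sketch):

```python
def q_learning_update(s, a, r, s_next):
    """Q-learning: bootstrap from the greedy action at s_next, regardless of
    which action the behavior policy will actually take there."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```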

Deep Q-learning

(references: original Atari paper, Julien Vitay's notes)

Like all TD updates, Q-learning and SARSA can also be derived as stochastic gradient descent steps on a squared loss, for example,

$$L(\phi) = \frac{1}{2}\left(\left(r_{t+1} + \gamma \max_{a'} Q_\phi(s_{t+1}, a')\right) - Q_\phi(s, a)\right)^2$$

so that each step attempts to bring our approximation $Q_\phi$ closer to an improved target. Unfortunately we lose the standard SGD convergence guarantees, because the state transitions we see are not iid (they are typically correlated over time), and because the target changes as we update the parameters $\phi$, so we are not actually optimizing any fixed objective. This leads to instability, but it can be addressed with two tricks: experience replay and an infrequently updated target network.
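To make this concrete, here's a minimal sketch of one training step using both tricks, in PyTorch. The tiny network, the buffer layout, and all hyperparameters below are placeholders of my own, not values from the papers:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Tiny stand-in Q-network: state in, one Q-value per discrete action out.
obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)  # stores (s, a, r, s_next, done) tuples

def train_step(batch_size=32):
    """One SGD step on the squared TD error, sampling iid from the replay
    buffer and computing the target with the frozen target network."""
    batch = random.sample(replay, batch_size)
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next, r, done = s.float(), s_next.float(), r.float(), done.float()

    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target is treated as a constant
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    loss = 0.5 * (target - q_sa).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (e.g. every few thousand steps) refresh the target network:
# target_net.load_state_dict(q_net.state_dict())
```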

Architecture of $Q$-networks. We can have a deep network take as input both $s_t$ and $a_t$ and produce a single estimated $Q$-value, or it can take just $s_t$ and produce estimated $Q$-values for all possible actions $a_t$. (Insofar as these correspond to feeding in the action at the first layer versus the last layer, respectively, we could in principle also consider schemes that feed in the action at some intermediate layer; this could be used to support a richer action space while still sharing some of the state representation across actions from the same state.) The latter is usually preferred when we have a small number of discrete actions, since we can maximize over $Q$-values using a single forward pass.
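A minimal sketch of the second option in PyTorch (sizes are arbitrary); one forward pass gives the $Q$-values for every action, so both the max and the argmax are cheap:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Takes only the state and outputs one Q-value per discrete action."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)

q = QNetwork(obs_dim=4, n_actions=2)
greedy_action = q(torch.zeros(1, 4)).argmax(dim=1)  # single forward pass
```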

An even better approach is the 'duelling network' architecture: have the second-to-last layer of the network separately predict $V(s)$ and the advantages for each action $A(s, a)$, and then define a final parameterless layer to simply combine the two,

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'\in \mathcal{A}} A(s, a')$$

subtracting the mean advantage so that the value function is identifiable (otherwise we could add a constant to $V$ and subtract the same constant from $A$ without changing the results).
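A sketch of that head in PyTorch (again with made-up sizes), where the combination layer has no parameters of its own:

```python
import torch
import torch.nn as nn

class DuellingQNetwork(nn.Module):
    """Shared trunk, then separate V(s) and A(s, a) heads, combined with the
    mean-advantage subtraction so the decomposition is identifiable."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)               # V(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value_head(h)                       # (batch, 1)
        a = self.advantage_head(h)                   # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)   # Q(s, a)
```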

The original Deep Q-learning worked impressively well in Atari games but was very sample-inefficient. Some improvements proposed in later papers were to use duelling networks, prioritized experience replay, and double Q-learning, described below.

Double Q-learning

(references: Double Q-learning (van Hasselt 2010) and Deep RL with Double Q-learning (van Hasselt, Guez, Silver 2015))

In noisy environments, $Q$-learning tends to overestimate values, because it targets the maximum of the estimated $Q$-values and noise affects that maximum asymmetrically.

To see this, suppose that all actions from some state have true value $Q^*(s, a) = 1$, but that we only see unbiased estimates of their values. We expect that some of these estimates will be less than 1 and a roughly equal number will be greater than 1. Importantly, it is extremely unlikely that they are all less than 1, meaning that, with high probability, their maximum will be greater than 1. So we'll overestimate the maximum $Q$-value even if our individual estimates were totally unbiased. (More rigorously, this follows from Jensen's inequality and the convexity of the $\max$ function.)
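A quick numpy simulation of this example (the action count and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions = 100_000, 10

# True Q*(s, a) = 1 for every action; we observe unbiased noisy estimates.
noisy_q = 1.0 + rng.normal(0.0, 0.5, size=(n_trials, n_actions))

print(noisy_q.mean())              # ~1.0: each estimate is unbiased...
print(noisy_q.max(axis=1).mean())  # noticeably above 1: ...but their max is biased upward
```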

It's not obvious that this is a problem: it would be fine if the overestimation were uniform, and indeed we know that overestimating values can be useful to encourage exploration. But the practical evidence is that the overestimation is not uniform, and that addressing the issue helps in practice.

One way to do this is to use the double estimator, which requires two independent unbiased estimators of the action values. We pick an action using the first, but evaluate it using the second. This ensures that we get an unbiased estimate of the value of whatever action we pick. In the example above, we would pick whatever action $a^*$ maximizes our first noisy estimates, but then report an unbiased estimate of its value (which is just $1$), so the double estimator is unbiased. It can be shown that the double estimator is not unbiased in general (it tends to slightly underestimate the maximum), but it is unbiased if the quantities being estimated are iid (van Hasselt, 2010).
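Here's the same toy simulation as above, extended with the double estimator; again everything is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions = 100_000, 10

# Two independent sets of unbiased noisy estimates of Q*(s, a) = 1.
est_1 = 1.0 + rng.normal(0.0, 0.5, size=(n_trials, n_actions))
est_2 = 1.0 + rng.normal(0.0, 0.5, size=(n_trials, n_actions))

# Single estimator: select and evaluate with the same estimates.
print(est_1.max(axis=1).mean())                   # noticeably above 1

# Double estimator: select with est_1, evaluate with est_2.
picked = est_1.argmax(axis=1)
print(est_2[np.arange(n_trials), picked].mean())  # ~1.0, as claimed
```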

In the deep RL setting, where we are already maintaining a target network in addition to our online network, we can just use these as our two sources of estimates (van Hasselt, Guez, Silver 2015). That is, instead of targeting

$$r_{t+1} + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a')$$

we target

$$r_{t+1} + \gamma Q_{\theta'}(s_{t+1}, \text{argmax}_{a'} Q_\theta(s_{t+1}, a'))$$

still using the target network to compute the target value, but for an action selected by the online network. We are fudging things slightly from the 'pure' double Q-learning setting, because the online and target networks are not totally independent of each other, but apparently this still works pretty well.
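As a sketch (assuming online and target networks like the ones above, here just called `q_net` and `target_net`), the target computation changes like this:

```python
import torch

def double_dqn_target(r, s_next, done, q_net, target_net, gamma=0.99):
    """Double DQN target: the online network selects the action, the target
    network evaluates it. (Plain DQN instead takes target_net(s_next).max(1).)"""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # select
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluate
        return r + gamma * (1 - done) * q_eval
```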