The problem of [ exposure bias ] (where an autoregressive sequence model goes off the rails of its training distribution) comes up as a…
Yann LeCun's famous cake analogy is: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake…
Following the pattern of [ state values, then action values ], the one-step [ temporal difference ] update for action values is called…
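For reference, that one-step action-value update (typically called SARSA, written here in standard notation with step size $\alpha$ and discount $\gamma$) is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$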
(references: https://julien-vitay.net/deeprl/ActorCritic.html) Advantage actor-critic: The advantage function is a 'centered' version of…
In reinforcement learning, the advantage of a state-action pair under a policy is the improvement in value from taking action a (and…
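In symbols, the standard definition is

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$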
References: Cooperative Inverse Reinforcement Learning The Off-Switch Game Incorrigibility in the CIRL Framework The CIRL setting models…
paper: Chen, Lu, et al. 2021, https://arxiv.org/abs/2106.01345 Trajectories are represented as sequences $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T)$, where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go, i.e…
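A minimal sketch of how such a sequence could be assembled from a recorded trajectory (the function name and flat-list representation are illustrative, not from the paper's code):

```python
def to_decision_transformer_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples, Decision Transformer style."""
    # Returns-to-go: R_hat[t] = sum of rewards from step t to the end (undiscounted).
    returns_to_go = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        returns_to_go.append(running)
    returns_to_go.reverse()

    sequence = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        sequence.extend([rtg, s, a])
    return sequence
```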
Deep deterministic policy gradient (DDPG) is an interesting RL algorithm with a somewhat misleading name. Although its name indicates that…
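For reference, the actor update in DDPG follows the deterministic policy gradient, roughly (with critic $Q_\phi$, deterministic actor $\mu_\theta$, and states sampled from the replay buffer $\mathcal{D}$):

$$\nabla_\theta J \approx \mathbb{E}_{s \sim \mathcal{D}}\!\left[\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)}\; \nabla_\theta \mu_\theta(s)\right]$$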
Notes from John Schulman's Berkeley course on deep [ reinforcement learning ], Spring 2016. Value- vs. policy-based learning: Value-based…
Maybe a stupid idea, but I wonder if the idea behind differentiable physics simulators (like Brax) can be extended more broadly to rich…
References: Direct Preference Optimization: Your Language Model is Secretly a Reward Model This seems like a compelling reframing of…
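For reference, the DPO objective (roughly as in the paper, with preferred/dispreferred completions $y_w, y_l$, reference model $\pi_{\mathrm{ref}}$, and inverse-temperature $\beta$) is:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$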
A few ways to think about eligibility traces: an explicit accounting of credit assignment a [ sufficient statistic ] for the history of the…
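A minimal tabular TD($\lambda$) sketch that makes the trace explicit as a running record of credit (all names and the env interface are illustrative assumptions, not from any particular library):

```python
import numpy as np

def td_lambda_episode(env, V, policy, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    Assumes an env with reset() -> state and step(a) -> (state, reward, done),
    integer states indexing the value array V, and a behavior policy state -> action.
    """
    traces = np.zeros_like(V)          # one trace per state: an explicit credit-assignment record
    state = env.reset()
    done = False
    while not done:
        next_state, reward, done = env.step(policy(state))
        # TD error: how much better/worse this step went than the current value guess predicted.
        delta = reward + gamma * V[next_state] * (not done) - V[state]
        traces[state] += 1.0           # the just-visited state accumulates credit
        V += alpha * delta * traces    # every state is updated in proportion to its trace
        traces *= gamma * lam          # credit decays geometrically with time
        state = next_state
    return V
```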
The state transitions we observe in [ reinforcement learning ] are typically correlated over time, both within a trajectory (obviously) and…
For any reward function $r$ and policy $\pi$, consider the entropy-regularized reward $\tilde{r}(s, a) = r(s, a) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s)\big)$. Taking as our objective the (expected, discounted…
References: Risks from Learned Optimization in Advanced Machine Learning Systems A [ reinforcement learning ] algorithm attempts to find the…
References: Gu et al., Continuous Deep Q-Learning with Model-based Acceleration (2016). Instead of modeling $Q(s, a)$ directly, we build a network…
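The normalized advantage function decomposition from the paper is, roughly: the network outputs a value $V(s)$, a mean action $\mu(s)$, and a state-dependent positive-definite matrix $P(s)$, with

$$Q(s, a) = V(s) + A(s, a), \qquad A(s, a) = -\tfrac{1}{2}\,(a - \mu(s))^\top P(s)\,(a - \mu(s)),$$

so that $\arg\max_a Q(s, a) = \mu(s)$ is available in closed form.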
A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…
A few (relatively uninformed) thoughts about on- vs off-policy [ reinforcement learning ]. Advantages of on-policy learning: On-policy…
(see also my [ deep RL notes ] from John Schulman's class several years ago, which cover much of the same material) We can approach…
references: paper: https://arxiv.org/abs/1707.06347 great blog post on implementation details: https://iclr-blog-track.github.io/2022/0…
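For reference, PPO's clipped surrogate objective, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$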
The policy gradient theorem says that $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$. For simplicity we'll assume a fixed initial state and fixed-length finite trajectories, but the…
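The key step behind this is the log-derivative (score-function) trick, together with the fact that the dynamics terms in $\log p_\theta(\tau)$ do not depend on $\theta$:

$$\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right], \qquad \nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$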
Note: see [ reinforcement learning notation ] for a guide to the notation I'm attempting to use throughout my RL notes. Three paradigmatic…
There tends to be a lot going on in RL algorithms, with a whole mess of different quantities defined across timesteps. It's useful to try to…
Silver, Singh, Precup, and Sutton argue that Reward is enough: maximizing a reward signal implies, on its own, a very broad range of…
Suppose we have a [ Markov decision process ] in which we get reward only at the very end of a long trajectory. Until that point, we have no…
When thinking about the [ reward ] function for a real-world AI system, there is always some causal process that determines reward. For…
Things that might be useful to log in a [ reinforcement learning ] algorithm: Return of each trajectory. (summarize as mean/std/min/max…
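A minimal sketch of the kind of per-batch summary this suggests (names and the print-based logging are illustrative):

```python
import numpy as np

def summarize_returns(returns, step):
    """Log summary statistics of per-trajectory returns for one batch/iteration."""
    returns = np.asarray(returns, dtype=np.float64)
    stats = {
        "return/mean": returns.mean(),
        "return/std": returns.std(),
        "return/min": returns.min(),
        "return/max": returns.max(),
    }
    print(f"step {step}: " + ", ".join(f"{k}={v:.2f}" for k, v in stats.items()))
    return stats
```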
See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]
Suppose we want to maximize reward, but we only get a couple of bits of reward data every few hundred or thousand actions, whereas we get…
A common pattern in [ reinforcement learning ] pedagogy is to develop some idea first in the context of estimating state values, and then…
A general issue with [ temporal difference ] learning methods, which 'update a guess towards a guess', is that they can end up 'chasing…
From David Silver's slides: TD-learning 'updates a guess towards a guess'. Sutton and Barto define the temporal difference error as the…
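For reference, that definition (in Sutton and Barto's notation) is

$$\delta_t = R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t).$$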
(notes loosely based on the Berkeley deep RL course lecture) Setup: RL with policy gradients. The basic setup is that we want to optimize…
The standard [ Markov decision process ] formalism includes a reward function $r$; the total (discounted) reward across a trajectory is its…
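In symbols, under one common convention,

$$R(\tau) = \sum_{t=0}^{T} \gamma^t\, r(s_t, a_t).$$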
Reference: Mahmood et al., 2014. Weighted importance sampling for off-policy learning with linear function approximation Here's a situation…
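For context, the two basic off-policy estimators of a target-policy expectation, given observed returns $G_i$ and per-trajectory importance ratios $\rho_i$ (products of per-step ratios between target policy $\pi$ and behavior policy $b$), are ordinary and weighted importance sampling:

$$\hat{V}_{\mathrm{OIS}} = \frac{1}{n}\sum_{i=1}^{n} \rho_i\, G_i, \qquad \hat{V}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} \rho_i\, G_i}{\sum_{i=1}^{n} \rho_i}, \qquad \rho_i = \prod_{t} \frac{\pi(a_t^{(i)} \mid s_t^{(i)})}{b(a_t^{(i)} \mid s_t^{(i)})}.$$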