The problem of [ exposure bias ] (where an autoregressive sequence model goes off the rails of its training distribution) comes up as a…

Yann LeCun's famous cake analogy: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake…

Following the pattern of [ state values, then action values ], the one-step [ temporal difference ] update for action values is called…
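
A minimal tabular sketch of the one-step action-value TD update (the function name and step-size values here are illustrative, not from the note):

```python
import numpy as np

def action_value_td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step TD update for action values:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next, a_next]
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((3, 2))  # 3 states, 2 actions
action_value_td_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```

Note that the bootstrap target uses the action actually taken at the next state, which is what makes this an on-policy update.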

(references: https://julien-vitay.net/deeprl/ActorCritic.html) Advantage actor-critic: the advantage function is a 'centered' version of…

In reinforcement learning, the advantage of a state-action pair under a policy is the improvement in value from taking action a (and…
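
Concretely, with a tabular Q and a stochastic policy the advantage at a state is just the action values recentered by their policy-weighted mean (names here are illustrative):

```python
import numpy as np

def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted
    average of the action values at state s."""
    v = np.dot(policy_probs, q_values)
    return q_values - v

A = advantages(np.array([1.0, 3.0]), np.array([0.5, 0.5]))
```

By construction the advantages average to zero under the policy, which is what makes them 'centered'.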

References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…

paper: Chen, Lu, et al. 2021, https://arxiv.org/abs/2106.01345 Trajectories are represented as sequences $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots)$, where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go, i.e…
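
The return-to-go labels for a trajectory can be computed with a single backward pass over the rewards (a minimal sketch; the helper name is mine):

```python
def returns_to_go(rewards):
    """Return-to-go at step t: the (undiscounted) sum of rewards
    from t to the end of the trajectory."""
    rtg, acc = [], 0.0
    for r in reversed(rewards):
        acc += r
        rtg.append(acc)
    return rtg[::-1]

rtg = returns_to_go([1.0, 0.0, 2.0])
```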

Deep deterministic policy gradient (DDPG) is an interesting RL algorithm with a somewhat misleading name. Although its name indicates that…

Notes from John Schulman's Berkeley course on deep [ reinforcement learning ], Spring 2016. Value vs Policy-based learning Value-based…

Maybe a stupid idea, but I wonder if the idea behind differentiable physics simulators (like Brax) can be extended more broadly to rich…

References: Direct Preference Optimization: Your Language Model is Secretly a Reward Model. This seems like a compelling reframing of…
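
For reference, the DPO objective for a single preference pair is a logistic loss on the difference of policy-vs-reference log-ratios between the chosen (w) and rejected (l) responses. A small sketch, with all names illustrative and log-probabilities assumed given:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

loss = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1)
```

When the policy and reference agree on both responses the margin is zero and the loss is log 2; improving the chosen response relative to the reference drives the loss down.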

A few ways to think about eligibility traces: an explicit accounting of credit assignment a [ sufficient statistic ] for the history of the…
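
The 'explicit accounting of credit assignment' view shows up directly in the tabular TD(lambda) update, where the trace decays old state visits and the TD error is broadcast to every recently visited state (a minimal sketch with illustrative names and constants):

```python
import numpy as np

def td_lambda_step(V, z, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One step of tabular TD(lambda) with an accumulating eligibility
    trace: z records recently visited states, and the TD error updates
    each state's value in proportion to its trace."""
    z *= gamma * lam           # decay all traces
    z[s] += 1.0                # bump the current state's trace
    delta = r + gamma * V[s_next] - V[s]
    V += alpha * delta * z
    return V, z

V, z = np.zeros(4), np.zeros(4)
V, z = td_lambda_step(V, z, s=0, r=1.0, s_next=1)
```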

The state transitions we observe in [ reinforcement learning ] are typically correlated over time, both within a trajectory (obviously) and…
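
A standard way to break this temporal correlation is a replay buffer sampled uniformly at random; a minimal sketch (class and method names are mine):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions; uniform random sampling
    decorrelates the minibatches drawn for learning updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.add(t, 0, 0.0, t + 1, False)
batch = buf.sample(8)
```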

For any reward function $r$ and policy $\pi$, consider the entropy-regularized reward $\tilde{r}(s, a) = r(s, a) - \log \pi(a \mid s)$. Taking as our objective the (expected, discounted…
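
One common convention is to augment the per-step reward with $-\log \pi(a \mid s)$, whose expectation under the policy is the entropy of $\pi(\cdot \mid s)$; the note's exact convention may differ (e.g. a temperature coefficient). A tiny sketch with illustrative names:

```python
import numpy as np

def entropy_regularized_reward(r, policy_probs, a):
    """Augment the reward with -log pi(a|s); averaged over actions
    drawn from the policy, this adds the entropy of pi(.|s)."""
    return r - np.log(policy_probs[a])

r_tilde = entropy_regularized_reward(1.0, np.array([0.5, 0.5]), a=0)
```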

References: Risks from Learned Optimization in Advanced Machine Learning Systems. A [ reinforcement learning ] algorithm attempts to find the…

References: Gu et al., Continuous Deep Q-Learning with Model-based Acceleration (2016). Instead of modeling $Q(s, a)$ directly, we build a network…

A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…

A few (relatively uninformed) thoughts about on- vs off-policy [ reinforcement learning ]. Advantages of on-policy learning: On-policy…

(see also my [ deep RL notes ] from John Schulman's class several years ago, which cover much of the same material) We can approach…

references: paper: https://arxiv.org/abs/1707.06347 great blog post on implementation details: https://iclr-blog-track.github.io/2022/0…

The policy gradient theorem says that $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$. For simplicity we'll assume a fixed initial state and fixed-length finite trajectories, but the…
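
The theorem's score-function form can be checked numerically for the simplest possible case: a one-step softmax policy over discrete actions, where the Monte-Carlo average of grad log pi(a) times the reward estimates the true gradient (a sketch; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_gradient(theta, reward_per_action, n_samples=5000):
    """Score-function (REINFORCE) estimate of grad_theta E[R] for a
    one-step softmax policy: average grad log pi(a) * R(a) over samples."""
    z = np.exp(theta - theta.max())
    pi = z / z.sum()
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(theta), p=pi)
        grad_logp = -pi              # d/dtheta log softmax(theta)[a]
        grad_logp = grad_logp.copy()
        grad_logp[a] += 1.0
        grad += grad_logp * reward_per_action[a]
    return grad / n_samples

g = reinforce_gradient(np.zeros(2), np.array([0.0, 1.0]))
```

The estimate pushes probability mass toward the higher-reward action, and the softmax gradient components sum to zero.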

Note: see [ reinforcement learning notation ] for a guide to the notation I'm attempting to use throughout my RL notes. Three paradigmatic…

There tends to be a lot going on in RL algorithms, with a whole mess of different quantities defined across timesteps. It's useful to try to…

Silver, Singh, Precup, and Sutton argue that Reward is enough: maximizing a reward signal implies, on its own, a very broad range of…

Suppose we have a [ Markov decision process ] in which we get reward only at the very end of a long trajectory. Until that point, we have no…

When thinking about the [ reward ] function for a real-world AI system, there is always some causal process that determines reward. For…

Things that might be useful to log in a [ reinforcement learning ] algorithm: Return of each trajectory. (summarize as mean/std/min/max…
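
The per-trajectory summary could be as simple as (function and key names are mine, not from the note):

```python
import numpy as np

def summarize_returns(returns):
    """Per-batch summary statistics of trajectory returns,
    suitable for logging at each iteration."""
    r = np.asarray(returns, dtype=float)
    return {"mean": float(r.mean()), "std": float(r.std()),
            "min": float(r.min()), "max": float(r.max())}

stats = summarize_returns([1.0, 2.0, 3.0])
```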

See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]

Suppose we want to maximize reward, but we only get a couple bits of reward data every few hundreds/thousands of actions, whereas we get…

A common pattern in [ reinforcement learning ] pedagogy is to develop some idea first in the context of estimating state values , and then…

A general issue with [ temporal difference ] learning methods, which 'update a guess towards a guess', is that they can end up 'chasing…

From David Silver's slides: TD-learning 'updates a guess towards a guess'. Sutton and Barto define the temporal difference error as the…

(notes loosely based on the Berkeley deep RL course lecture) Setup: RL with policy gradients The basic setup is that we want to optimize…

The standard [ Markov decision process ] formalism includes a reward function ; the total (discounted) reward across a trajectory is its…

Reference: Mahmood et al., 2014. Weighted importance sampling for off-policy learning with linear function approximation Here's a situation…
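
The basic contrast between the ordinary and weighted estimators, given per-trajectory returns and importance ratios rho = pi_target / pi_behavior, can be sketched as (names are illustrative):

```python
import numpy as np

def is_estimates(returns, rho):
    """Ordinary vs. weighted importance-sampling estimates of the
    target-policy value from behavior-policy returns. The ordinary
    estimator is unbiased but high-variance; the weighted estimator
    normalizes by the total ratio mass, trading bias for variance."""
    returns, rho = np.asarray(returns), np.asarray(rho)
    ordinary = np.mean(rho * returns)
    weighted = np.sum(rho * returns) / np.sum(rho)
    return ordinary, weighted

ordinary, weighted = is_estimates([1.0, 0.0], [2.0, 0.5])
```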