The problem of [ exposure bias ] (where an autoregressive sequence model goes off the rails of its training distribution) comes up as a…
Modified: October 13, 2022.
Yann LeCun's famous cake analogy: "If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake…
Modified: June 12, 2021.
Following the pattern of [ state values, then action values ], the one-step [ temporal difference ] update for action values is called…
Modified: April 23, 2022.
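For reference, the standard one-step TD update for action values (Sarsa, in Sutton and Barto's terminology) looks like the following; the step size $\alpha$ and discount $\gamma$ are the usual symbols, not notation taken from this note:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$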
In a bandit setting, in each round we see a context $x$, choose an action $a$, and receive a reward $r$ sampled from a distribution with some…
Modified: March 26, 2025.
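A minimal sketch of that interaction loop (the linear reward model, the epsilon-greedy learner, and all names below are illustrative assumptions; to keep it short, the learner tracks per-action mean rewards and ignores the context when choosing actions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, context_dim, n_rounds = 3, 5, 1000
true_weights = rng.normal(size=(n_actions, context_dim))  # hidden (assumed linear) reward model

q_estimates = np.zeros(n_actions)  # running mean reward per action
counts = np.zeros(n_actions)
epsilon = 0.1

for _ in range(n_rounds):
    context = rng.normal(size=context_dim)                    # see a context
    if rng.random() < epsilon:                                # epsilon-greedy action choice
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(q_estimates))
    reward = true_weights[action] @ context + rng.normal()    # reward sampled given (context, action)
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]  # incremental mean update
```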
(references: https://julien-vitay.net/deeprl/ActorCritic.html) Advantage actor-critic: the advantage function is a 'centered' version of…
Modified: June 22, 2022.
In reinforcement learning, the advantage $A^{\pi}(s, a)$ of a state-action pair under a policy $\pi$ is the improvement in value from taking action $a$ (and…
Modified: July 05, 2022.
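Concretely, with the usual action-value and state-value functions, the definition being described is

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).$$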
References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…
Modified: April 05, 2023.
paper: Chen, Lu, et al. 2021, https://arxiv.org/abs/2106.01345. Trajectories are represented as sequences $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots)$, where $\hat{R}_t$ is the return-to-go, i.e…
Modified: April 15, 2022.
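The return-to-go at step $t$, as defined in the Decision Transformer paper, is the sum of rewards from that step to the end of the trajectory:

$$\hat{R}_t = \sum_{t'=t}^{T} r_{t'}.$$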
Deep deterministic policy gradient (DDPG) is an interesting RL algorithm with a somewhat misleading name. Although its name indicates that…
Modified: July 27, 2022.
Notes from John Schulman's Berkeley course on deep [ reinforcement learning ], Spring 2016. Value- vs. policy-based learning: Value-based…
Modified: February 22, 2022.
References: Direct Preference Optimization: Your Language Model is Secretly a Reward Model. This seems like a compelling reframing of…
Modified: May 31, 2023.
Maybe a stupid idea, but I wonder if the idea behind differentiable physics simulators (like Brax) can be extended more broadly to rich…
Modified: July 04, 2022.
A few ways to think about eligibility traces: an explicit accounting of credit assignment; a [ sufficient statistic ] for the history of the…
Modified: March 29, 2022.
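For concreteness, the accumulating-trace form of TD(λ) for state-value prediction with function approximation (a standard presentation, not necessarily the exact one this note builds on) maintains a trace vector and weight update

$$\mathbf{z}_t = \gamma \lambda\, \mathbf{z}_{t-1} + \nabla_{\mathbf{w}} \hat{v}(s_t, \mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t.$$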
The state transitions we observe in [ reinforcement learning ] are typically correlated over time, both within a trajectory (obviously) and…
Modified: September 04, 2022.
For any reward function $r$ and policy $\pi$, consider the entropy-regularized reward. Taking as our objective the (expected, discounted…
Modified: July 28, 2022.
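A common form of the entropy-regularized reward (the temperature $\alpha$ is my own notation, not necessarily the note's) is

$$\tilde{r}(s, a) = r(s, a) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s)\right), \qquad \mathcal{H}\!\left(\pi(\cdot \mid s)\right) = -\sum_{a'} \pi(a' \mid s) \log \pi(a' \mid s).$$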
References: Risks from Learned Optimization in Advanced Machine Learning Systems. A [ reinforcement learning ] algorithm attempts to find the…
Modified: March 28, 2023.
References: Gu et al., Continuous Deep Q-Learning with Model-based Acceleration (2016). Instead of modeling $Q(s, a)$ directly, we build a network…
Modified: July 19, 2022.
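In the normalized advantage function (NAF) construction from Gu et al., the network outputs a value term plus a quadratic advantage term, roughly

$$Q(s, a) = V(s) + A(s, a), \qquad A(s, a) = -\tfrac{1}{2}\,(a - \mu(s))^{\top} P(s)\,(a - \mu(s)),$$

where $P(s)$ is a state-dependent positive-definite matrix (parameterized via a Cholesky factor in the paper), so the greedy action is just $a = \mu(s)$.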
A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…
Modified: March 31, 2023.
A few (relatively uninformed) thoughts about on- vs off-policy [ reinforcement learning ]. Advantages of on-policy learning: On-policy…
Modified: April 23, 2022.
(see also my [ deep RL notes ] from John Schulman's class several years ago, which cover much of the same material) We can approach…
Modified: March 14, 2024.
The policy gradient theorem says that $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$. For simplicity we'll assume a fixed initial state and fixed-length finite trajectories, but the…
Modified: April 02, 2022.
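The key step in the usual derivation is the log-derivative (score-function) trick: writing $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$,

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right],$$

and since the dynamics don't depend on $\theta$, $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.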
references: paper: https://arxiv.org/abs/1707.06347; great blog post on implementation details: https://iclr-blog-track.github.io/2022/0…
Modified: July 21, 2022.
Note: see [ reinforcement learning notation ] for a guide to the notation I'm attempting to use throughout my RL notes. Three paradigmatic…
Modified: April 23, 2022.
There tends to be a lot going on in RL algorithms, with a whole mess of different quantities defined across timesteps. It's useful to try to…
Modified: April 23, 2022.
When thinking about the [ reward ] function for a real-world AI system, there is always some causal process that determines reward. For…
Modified: April 12, 2023.
Silver, Singh, Precup, and Sutton argue that Reward is enough: maximizing a reward signal implies, on its own, a very broad range of…
Modified: March 02, 2022.
Suppose we have a [ Markov decision process ] in which we get reward only at the very end of a long trajectory. Until that point, we have no…
Modified: March 03, 2022.
Stray thoughts about reward functions (probably related to the [ agent ] abstraction and the [ intentional stance ]): one can make a…
Modified: April 06, 2023.
See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]
Modified: June 12, 2021.
Things that might be useful to log in a [ reinforcement learning ] algorithm: Return of each trajectory. (summarize as mean/std/min/max…
Modified: April 11, 2022.
Suppose we want to maximize reward, but we only get a couple of bits of reward data every few hundred or thousand actions, whereas we get…
Modified: March 03, 2022.
A common pattern in [ reinforcement learning ] pedagogy is to develop some idea first in the context of estimating state values, and then…
Modified: March 29, 2022.
A general issue with [ temporal difference ] learning methods, which 'update a guess towards a guess', is that they can end up 'chasing…
Modified: April 23, 2022.
From David Silver's slides: TD-learning 'updates a guess towards a guess'. Sutton and Barto define the temporal difference error as the…
Modified: April 04, 2022.
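For state-value prediction (in my notation), that definition is

$$\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t).$$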
(notes loosely based on the Berkeley deep RL course lecture) Setup: RL with policy gradients. The basic setup is that we want to optimize…
Modified: July 06, 2022.
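A minimal version of that setup, with a baseline written in explicitly (the notation is mine rather than the lecture's): the objective is $J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$, and the Monte Carlo gradient estimate from $N$ sampled trajectories is

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \left( R(\tau^{(i)}) - b \right),$$

where the baseline $b$ reduces variance without biasing the estimate.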
The standard [ Markov decision process ] formalism includes a reward function; the total (discounted) reward across a trajectory is its…
Modified: October 16, 2022.
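Presumably the quantity meant is the return; for a length-$T$ trajectory with discount $\gamma$, a standard convention is

$$R(\tau) = \sum_{t=0}^{T-1} \gamma^{t}\, r_{t+1}.$$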
Reference: Mahmood et al., 2014. Weighted importance sampling for off-policy learning with linear function approximation. Here's a situation…
Modified: April 23, 2022.
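For context, with per-trajectory importance ratios $\rho^{(i)} = \prod_t \pi(a_t^{(i)} \mid s_t^{(i)}) / \mu(a_t^{(i)} \mid s_t^{(i)})$ (target policy $\pi$, behavior policy $\mu$, my notation) and returns $G^{(i)}$, the ordinary and weighted importance-sampling estimates are

$$\hat{V}_{\mathrm{OIS}} = \frac{1}{n} \sum_{i=1}^{n} \rho^{(i)} G^{(i)}, \qquad \hat{V}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} \rho^{(i)} G^{(i)}}{\sum_{i=1}^{n} \rho^{(i)}}.$$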