Created: January 16, 2022
Modified: January 23, 2022
Modified: January 23, 2022
hard attention
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.- Closely related to discrete latent variables and to reinforcement learning with discrete actions.
- If I do a thing and it goes well, I should try to do the thing more often.
- When the action is a thought action, this corresponds to 'if I think about a thing and it goes well, I should try to think about the thing more often'.
- How do we measure 'goes well'? We need a baseline. We need an estimate of how well you'd expect things to have gone if you hadn't done the thing. This baseline is also sometimes called a control variate.
- If we don't have a baseline (or equivalently, a bad baseline like "always zero"), the best we can do is to estimate the policy gradient by multiplying the actual reward by the score function, i.e., the parameter update vector that would increase the probability of the given outcome.