Modified: April 23, 2022
weighted importance sampling
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
Reference: Mahmood et al., 2014. Weighted importance sampling for off-policy learning with linear function approximation.
Here's a situation where ordinary importance sampling performs badly. Suppose that the target distribution $p$ and the sampling distribution $q$ are very different, so that the importance weights $w_i = p(x_i)/q(x_i)$ vary wildly, but that $f$ is constant, say $f(x) = c$ for all $x$. Clearly the distribution of $x$ doesn't matter at all to the expectation of a constant function, but ordinary importance sampling in this setting will return $c$ times the average weight, $\frac{c}{n}\sum_i w_i$, which is a high-variance random quantity. We would be better off using the weighted importance sampling (WIS) estimate
$$\hat{f}_{\text{WIS}} = \frac{\sum_i w_i f(x_i)}{\sum_i w_i},$$
which 'divides out' the randomness in the weights. Note that the denominator has the same expectation as the OIS normalizer, $\mathbb{E}_q\!\left[\sum_i w_i\right] = n$. This estimator is biased, but is consistent as $n \to \infty$ and can have much lower variance than the OIS estimator.
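To make this concrete, here is a minimal numpy sketch comparing the two estimators on a constant function. The particular Gaussians and the constant $c$ below are illustrative choices of mine, not from any source.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):  # target distribution: N(2, 1)
    return np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):  # sampling distribution: N(0, 1), quite different from p
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

c = 3.0          # f(x) = c for all x, so E_p[f] = c exactly
n, trials = 100, 1000

ois, wis = [], []
for _ in range(trials):
    x = rng.normal(0.0, 1.0, size=n)        # samples from q
    w = p_pdf(x) / q_pdf(x)                 # importance weights
    f = np.full(n, c)
    ois.append(np.mean(w * f))              # OIS: c times the average weight
    wis.append(np.sum(w * f) / np.sum(w))   # WIS: exactly c every time

print("OIS mean/std:", np.mean(ois), np.std(ois))  # unbiased but high variance
print("WIS mean/std:", np.mean(wis), np.std(wis))  # std is 0 for a constant f
```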
In supervised learning: suppose we are minimizing the squared loss
$$\mathcal{L}(\theta) = \mathbb{E}_{x, y}\!\left[\left(y - f_\theta(x)\right)^2\right].$$
Here we have two different random quantities: given a sampled dataset $\{(x_i, y_i)\}_{i=1}^n$, we can define an empirical loss $\hat{\mathcal{L}}(\theta) = \sum_i \left(y_i - f_\theta(x_i)\right)^2$ for any $\theta$, and we can also define the optimal parameters $\theta^* = \arg\min_\theta \hat{\mathcal{L}}(\theta)$.
To simplify things enormously, let's use the 'constant predictor' $f_\theta(x) = \theta$ that just predicts a single value $\theta$ for all inputs. We can then solve the minimization problem trivially as the mean $\theta^* = \frac{1}{n}\sum_i y_i$.
Now suppose that we have labels sampled from a conditional distribution $q(y \mid x)$ different from our target distribution $p(y \mid x)$, so we have importance weights $w_i = p(y_i \mid x_i) / q(y_i \mid x_i)$ (for current purposes we'll assume that the marginal distribution of inputs is the same in both cases and so cancels out). How should we estimate the quantity $\theta^*_p = \arg\min_\theta \mathbb{E}_p\!\left[(y - \theta)^2\right] = \mathbb{E}_p[y]$?
One approach is to apply ordinary importance sampling directly to estimate $\mathbb{E}_p[y]$, giving $\hat\theta_{\text{OIS}} = \frac{1}{n}\sum_i w_i y_i$. Working backwards, we can view this as implicitly minimizing a loss in which each target is scaled by the importance weight:
$$\hat\theta_{\text{OIS}} = \arg\min_\theta \sum_i \left(w_i y_i - \theta\right)^2.$$
Another approach is to define and minimize the importance-sampled loss:
$$\hat{\mathcal{L}}_w(\theta) = \sum_i w_i \left(y_i - \theta\right)^2.$$
Surprisingly, we see that this recovers the weighted (WIS) solution for the parameter $\theta$:
$$\hat\theta = \arg\min_\theta \sum_i w_i \left(y_i - \theta\right)^2 = \frac{\sum_i w_i y_i}{\sum_i w_i}.$$
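As a sanity check, here is a small sketch of the constant-predictor case. The labels and the stand-in importance weights are made up for illustration; the point is just that the two closed-form minimizers are the mean of $w_i y_i$ versus the weighted mean of $y_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
y = rng.normal(1.0, 1.0, size=n)       # labels sampled from the behavior distribution q
w = rng.lognormal(0.0, 1.0, size=n)    # stand-in importance weights p/q, for illustration

# Approach 1 (OIS): minimizer of sum_i (w_i * y_i - theta)^2
theta_ois = np.mean(w * y)

# Approach 2: minimizer of the importance-weighted loss sum_i w_i * (y_i - theta)^2
theta_weighted_loss = np.sum(w * y) / np.sum(w)

# The second coincides with the WIS estimate of E_p[y]
theta_wis = np.sum(w * y) / np.sum(w)
assert np.isclose(theta_weighted_loss, theta_wis)
print("OIS:", theta_ois, " weighted-loss / WIS:", theta_weighted_loss)
```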
Similar math goes through for the case of linear function approximation, where $f_\theta(x) = \theta^\top \phi(x)$ and minimizing the importance-weighted loss gives the weighted least-squares solution $\hat\theta = \left(\sum_i w_i\, \phi(x_i)\phi(x_i)^\top\right)^{-1}\sum_i w_i\, \phi(x_i)\, y_i$ (Mahmood et al., 2014), which also specializes to the per-state WIS estimate in the tabular case $\phi(x) = e_x$. Since WIS solutions are often preferred, this suggests a general strategy of addressing shifts in the output distribution (e.g., in off-policy reinforcement learning) by minimizing an importance-weighted loss.
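Here is a sketch of the linear case under my reconstruction of the closed form above; the feature matrix, targets, and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
Phi = rng.normal(size=(n, d))                 # feature vectors phi(x_i) as rows
y = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)
w = rng.lognormal(0.0, 0.5, size=n)           # stand-in importance weights

# Minimizing sum_i w_i * (y_i - theta^T phi(x_i))^2 gives weighted least squares:
A = Phi.T @ (w[:, None] * Phi)                # sum_i w_i phi_i phi_i^T
b = Phi.T @ (w * y)                           # sum_i w_i phi_i y_i
theta_hat = np.linalg.solve(A, b)

# With one-hot (tabular) features this reduces to a per-state WIS estimate:
#   theta_s = sum_{i: x_i = s} w_i y_i / sum_{i: x_i = s} w_i
print(theta_hat)
```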
Generalizing to nonlinear cases, one can still define the two objectives
$$\hat{\mathcal{L}}_{\text{OIS}}(\theta) = \sum_i \left(w_i y_i - f_\theta(x_i)\right)^2 \quad\text{and}\quad \hat{\mathcal{L}}_{\text{WIS}}(\theta) = \sum_i w_i \left(y_i - f_\theta(x_i)\right)^2,$$
and the argument here is that the latter should in general be preferred. I'm not sure I would have ever even thought of doing the former, but considering the simpler linear/constant cases at least gives some intuition for what's going on.
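For what it's worth, here is a toy sketch of the two nonlinear objectives using a one-parameter $\tanh$ model; the model, data, and weights are all made up for illustration, and each objective is minimized by a crude grid search just to compare the resulting fits.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = np.tanh(1.5 * x) + rng.normal(scale=0.1, size=n)
w = rng.lognormal(0.0, 0.5, size=n)       # stand-in importance weights

def f(theta, x):
    return np.tanh(theta * x)             # toy nonlinear predictor

def loss_scaled_targets(theta):           # "OIS-style": sum_i (w_i y_i - f_theta(x_i))^2
    return np.sum((w * y - f(theta, x)) ** 2)

def loss_weighted(theta):                 # "WIS-style": sum_i w_i (y_i - f_theta(x_i))^2
    return np.sum(w * (y - f(theta, x)) ** 2)

def minimize_1d(loss, grid=np.linspace(-5, 5, 2001)):
    # crude grid search, good enough for a one-parameter illustration
    vals = np.array([loss(t) for t in grid])
    return grid[np.argmin(vals)]

print("scaled-target minimizer:", minimize_1d(loss_scaled_targets))
print("weighted-loss minimizer:", minimize_1d(loss_weighted))
```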