Created: July 20, 2022
Modified: July 21, 2022
limitations of autodiff
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In principle we can apply automatic differentiation through any composition of differentiable operations. This lets us get gradients of simulators, optimization problems, and other complex computations. So when are autodiff gradients not good?
Note that autodiff gives an exact analytic gradient of the program as written, up to numerical issues. So any problems with autodiff gradients must come from one or more of the following:
- Not differentiating the full calculation: for example, when estimating the expectation of $f(x)$ using samples from $p_\theta$, it's not enough to differentiate through $f$; we also need to differentiate through the sampling process for $x$ (the 'reparameterization trick') or otherwise rewrite the objective to internalize all dependence on the parameters $\theta$ (see the reparameterization sketch after this list).
- The chain of gradient computations is ill-conditioned: e.g., we're multiplying a sequence of Jacobians, some with very large eigenvalues and others with very small ones (some mix of vanishing and exploding gradients), or summing terms with exploding gradients in different directions that should cancel analytically but don't reliably do so in floating point (see the Jacobian-product toy example after this list).
- Expressing a differentiable computation as the composition of nondifferentiable computations: for example, $\mathrm{sign}(x)\,\lvert x \rvert = x$ has derivative $1$ everywhere, including at $x = 0$, but the absolute value and sign functions are nondifferentiable at that point, so autodiff on that representation will return an incorrect or undefined result there (see the sign/abs sketch after this list).
- A special case of the previous: autodiff through a gradient-based optimization procedure requires second-order gradient information about the inner objective, which may not be available or even defined in principle, whereas the derivative of the optimum with respect to the parameters may still be perfectly well defined (cf. implicit differentiation in the alternatives below).
- Sometimes the objective is not really smooth, so its true gradients are useless, but implicitly differentiating a 'smoothed' version of the objective, via finite-difference gradients or similar, gives good ascent directions (see the staircase example after this list).
- Sometimes it's memory-intensive to store the activations needed for reverse-mode autodiff (gradient checkpointing, sketched after this list, trades recomputation for memory).
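A minimal JAX sketch of the reparameterization point. The Gaussian $\mathcal{N}(\theta, 1)$ and the objective $f(x) = x^2$ are arbitrary toy choices of mine (the true gradient of $\mathbb{E}[x^2]$ is $2\theta$), but they show how sampling outside the differentiated computation silently drops the dependence on $\theta$:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
theta = 1.5
eps = jax.random.normal(key, (100_000,))  # noise that does not depend on theta

def naive_objective(theta, x):
    # x was sampled "outside" and passed in as data; theta never appears,
    # so the gradient w.r.t. theta is identically zero.
    return jnp.mean(x ** 2)

def reparam_objective(theta, eps):
    # x = theta + eps internalizes the dependence on theta.
    x = theta + eps
    return jnp.mean(x ** 2)

x_samples = theta + eps
print(jax.grad(naive_objective)(theta, x_samples))  # 0.0 -- wrong
print(jax.grad(reparam_objective)(theta, eps))      # ~3.0 = 2 * theta
```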
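A toy example of the conditioning issue, with a hand-picked linear recurrence (eigenvalues 2 and 0.5) rather than anything realistic:

```python
import jax
import jax.numpy as jnp

W = jnp.diag(jnp.array([2.0, 0.5]))  # eigenvalues 2 and 0.5

def rollout(x0):
    x = x0
    for _ in range(60):
        x = W @ x
    return x

J = jax.jacobian(rollout)(jnp.array([1.0, 1.0]))
print(J)
# diag(2**60, 0.5**60): ~1e18 in one direction, ~1e-18 in the other. The
# end-to-end Jacobian has condition number ~1e36, so any downstream sum that
# mixes these scales loses the small contribution entirely in float32.
```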
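The sign/abs example above, in JAX. The value returned at 0 is a framework convention, but it is not the correct derivative:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sign(x) * jnp.abs(x)   # equals x everywhere

print(jax.grad(f)(1.0))   # 1.0 -- correct away from 0
print(jax.grad(f)(0.0))   # 0.0 on the JAX versions I've tried; the true derivative is 1.0
```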
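A staircase example of the implicit-smoothing point. The quantized $x^2$ is just something I made up that is piecewise constant with an obvious smooth envelope; the exact autodiff value at the flat regions depends on how the framework defines the gradient of `floor`, but it carries no useful signal either way:

```python
import jax
import jax.numpy as jnp

def staircase(x):
    return (jnp.floor(4.0 * x) / 4.0) ** 2   # quantized version of x**2

x = 1.3
print(jax.grad(staircase)(x))    # 0.0 -- piecewise constant, so autodiff is useless here

h = 0.5   # step much wider than the stair width
fd = (staircase(x + h) - staircase(x - h)) / (2 * h)
print(fd)                        # ~2.5, close to the smooth derivative 2x = 2.6
```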
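A sketch of trading compute for memory with gradient checkpointing: `jax.checkpoint` (a.k.a. `jax.remat`) tells reverse mode to recompute a block's internal activations during the backward pass instead of storing them. The deep tanh chain here is just a stand-in for an expensive model:

```python
import jax
import jax.numpy as jnp

def block(x, w):
    # stand-in for an expensive layer with many intermediate activations
    for _ in range(10):
        x = jnp.tanh(w * x)
    return x

def loss(w, x):
    for _ in range(20):
        # don't store this block's internals for the backward pass;
        # recompute them instead (compute traded for memory)
        x = jax.checkpoint(block)(x, w)
    return jnp.sum(x ** 2)

print(jax.grad(loss)(0.5, jnp.ones(3)))
```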
alternatives:
- adjoint gradients for ODEs: differentiate the continuous dynamics rather than the solver's discrete steps (see the sketch below)
- fixed-point gradients by implicit differentiation: apply the implicit function theorem at the converged solution (see the sketch below)
- save memory through reversible computation (RevNets, etc.)
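A sketch of the ODE-adjoint route. As far as I know, `jax.experimental.ode.odeint` computes its reverse-mode gradients with the continuous adjoint method rather than by differentiating through the solver's steps; the linear ODE below is my choice, picked so the gradient has a closed form to check against:

```python
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

def dynamics(y, t, a):
    return -a * y                        # dy/dt = -a*y, so y(t) = y0 * exp(-a*t)

def final_state(a):
    ts = jnp.array([0.0, 1.0])
    ys = odeint(dynamics, jnp.array([1.0]), ts, a)   # y0 = 1.0
    return ys[-1, 0]                     # y(1) = exp(-a)

a = 2.0
print(jax.grad(final_state)(a))          # analytic answer: -exp(-2) ~ -0.135
```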
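A sketch of implicit differentiation at a fixed point, with a made-up contraction $T(x, \theta) = \tanh(Wx + \theta)$ standing in for a real solver. At the fixed point $x^* = T(x^*, \theta)$, the implicit function theorem gives $dx^*/d\theta = (I - \partial T/\partial x)^{-1}\,\partial T/\partial \theta$, so no gradients need to flow through the (possibly long) iteration itself:

```python
import jax
import jax.numpy as jnp

W = 0.3 * jnp.eye(2)   # small norm, so T is a contraction

def T(x, theta):
    return jnp.tanh(W @ x + theta)

def solve(theta, iters=200):
    # naive fixed-point iteration; stands in for any solver
    x = jnp.zeros(2)
    for _ in range(iters):
        x = T(x, theta)
    return x

def loss(x):
    return jnp.sum(x ** 2)

theta = jnp.array([0.5, -0.2])
x_star = solve(theta)

# Implicit gradient: dL/dtheta = (dx*/dtheta)^T dL/dx*, using only
# Jacobians of T evaluated at the converged solution.
dT_dx = jax.jacobian(T, argnums=0)(x_star, theta)
dT_dtheta = jax.jacobian(T, argnums=1)(x_star, theta)
dx_dtheta = jnp.linalg.solve(jnp.eye(2) - dT_dx, dT_dtheta)
implicit_grad = dx_dtheta.T @ jax.grad(loss)(x_star)

# Reference: brute-force autodiff through the unrolled iteration.
unrolled_grad = jax.grad(lambda th: loss(solve(th)))(theta)
print(implicit_grad, unrolled_grad)   # should agree closely
```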