Modified: April 08, 2023
Goodhart's law
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The law states: when a measure becomes a target, it ceases to be a good measure.
One can distinguish four types of Goodhart problems (originally from Manheim and Garrabrant, Categorizing Variants of Goodhart's Law):
- Regressional Goodhart is when the target can be modeled a priori as target = proxy + uncorrelated noise, for example, basketball ability = height + other factors. If we choose a basketball team strictly by their height, we will be predictably disappointed, because this is equivalent to selecting on ability - other factors, and maximizing this quantity will tend to select cases where the 'other factors' are strongly negative. This is the "optimizer's curse". (A small simulation of this appears after the list.)
- Extremal Goodhart occurs when the optimization pushes us outside the regime in which the proxy was trained or is valid. For example, basketball ability may generally correlate with height, up to a point, but the tallest people in the world actually have health problems that make them poor basketball players.
- Causal Goodhart occurs when the act of optimizing a proxy breaks the connection with the target. For example, basketball ability is correlated with height, but training at basketball will not cause you to become taller.
- Adversarial Goodhart occurs when agents specifically manipulate the proxy. For example, if it becomes known you're selecting a basketball team based on height, someone may show up to the tryouts on stilts, which increases their height while likely decreasing their actual ability.
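To make the regressional case concrete, here is a minimal simulation sketch; the standard-normal distributions, population size, and team size are all invented for illustration. Following the wording above, ability and the 'other factors' are drawn independently, so that height = ability - other factors and picking the tallest candidates is literally selecting on ability - other factors.

```python
# Minimal sketch of regressional Goodhart / the optimizer's curse (made-up numbers).
import numpy as np

rng = np.random.default_rng(0)
n_candidates, team_size = 100_000, 10

ability = rng.normal(size=n_candidates)        # the target we actually care about
other_factors = rng.normal(size=n_candidates)  # everything else ability depends on
height = ability - other_factors               # the proxy we can observe

team = np.argsort(height)[-team_size:]         # select strictly by the proxy

print("team mean height:        ", height[team].mean())         # very large
print("team mean ability:       ", ability[team].mean())        # only about half of that
print("team mean other factors: ", other_factors[team].mean())  # strongly negative
print("best achievable ability: ", np.sort(ability)[-team_size:].mean())
```

The team chosen by height looks spectacular on the proxy, but its actual ability is only about half as large, and well below the team we could have chosen with full information: the extreme heights are reached partly by having strongly negative other factors.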
The law was originally proposed in economics (Charles Goodhart was a macroeconomist), as
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
which specifically considers the adversarial setting. But the generalization above is now used more broadly, including in AI.
Jascha Sohl-Dickstein notes that we can view overfitting in machine learning as an example of Goodhart's law, where training loss is the proxy for test loss. He proposes a "strong version of Goodhart's law":
When a measure becomes a target, if it is effectively optimized, then the thing it is designed to measure will grow worse.
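To make the proxy/target framing concrete, here is a minimal overfitting sketch; the data-generating process, noise level, and polynomial degrees are all invented for illustration.

```python
# Minimal sketch: training loss is the proxy, test loss is the target it stands in for.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # true signal plus noise
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(1000)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

for degree in (1, 3, 5, 9, 14):
    coeffs = np.polyfit(x_train, y_train, degree)   # drive the training loss down
    print(f"degree {degree:2d}: train {mse(coeffs, x_train, y_train):9.3g}"
          f"  test {mse(coeffs, x_test, y_test):9.3g}")
# Training loss keeps falling as the fit gets more flexible; past some point the
# test loss typically grows worse, not merely less good than hoped.
```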
He doesn't propose a specific mechanism for the strong version (noting that the mechanisms of overfitting are themselves an active area of research). But I think it roughly corresponds to the regressional Goodhart case, where we're implicitly optimizing for ability - other factors: if the impact of ability is bounded, but the other factors are unbounded (or at least way larger), then eventually the optimization will tend to be dominated by the other factors, even at the cost of pushing down ability.
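Here is a toy sketch of that dynamic, with invented functional forms: the proxy rewards a bounded 'ability' term plus an unbounded 'other factors' direction, and pushing the unbounded direction slightly degrades ability.

```python
# Toy sketch: gradient ascent on a proxy with a bounded ability term and an
# unbounded nuisance term (all functional forms invented for illustration).
import numpy as np

def ability(a, n):
    return np.tanh(a) - 0.02 * n ** 2   # bounded above by 1; dragged down by large n

def proxy(a, n):
    return np.tanh(a) + n               # what we actually optimize; unbounded in n

a, n, lr = 0.0, 0.0, 0.05
for step in range(1, 401):
    a += lr / np.cosh(a) ** 2           # d proxy / d a  (gradient ascent on the proxy)
    n += lr * 1.0                       # d proxy / d n
    if step in (50, 100, 200, 400):
        print(f"step {step:3d}: proxy {proxy(a, n):6.2f}  ability {ability(a, n):6.2f}")
# The proxy rises monotonically; once tanh(a) saturates, all further gains come
# from n, and ability itself is pushed down: proxy up, target worse.
```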