On the strength of Goodhart's law

Published: 10 Jun 2025, Last Modified: 30 Jun 2025, MoFA Poster, CC BY 4.0
Keywords: causality, Goodhart's law, alignment, AI safety
TL;DR: We assess the strength of Goodhart's law in several causal contexts.
Abstract: Goodhart's law is an adage in policy-making stating that ``\emph{when a measure becomes a target, it ceases to be a good measure}''. In the past few years, efforts have been made to formalise the law and assess its validity in the context of machine learning. Specifically, formalisms were proposed to distinguish cases where optimising a proxy metric is useful for an (unknown) intended goal from those where doing so harms the true goal. In the broader effort to formalise Goodhart's law, one central question is that of causality. Namely, does learning on a causal structure without being aware of it (and taking it into account) lead to a misalignment between the true goal and the proxy metric being optimised? This paper provides a positive answer to this question and proposes a causal formalism that separates three different causal relationships: (1) the classic case of a confounding factor, (2) a new causal structure we call the ``\emph{mirror confounding}'', and (3) the cascading structure that we adapt from previous work. Each causal structure involves a true goal, a proxy metric, the covariates on which the model learns and, when applicable, hidden variables.
Submission Number: 75