Explanation Design in Strategic Learning: Sufficient Explanations that Induce Non-harmful Responses
Abstract: We study the design of explanations in algorithmic decision-making with strategic agents---individuals who may modify their inputs in response to explanations of a decision maker's (DM's) predictive model. While the demand for algorithmic transparency has led much prior work to assume full model disclosure, in practice DMs typically provide only partial information via explanations, which can cause agents to misinterpret the model and take actions that unintentionally reduce their own utility. A central open question is therefore how DMs should communicate explanations that avoid harming strategic agents while still supporting their own goals, e.g., minimising predictive error. In this work, we analyse widely used explanation methods and establish a necessary condition to prevent explanations from inducing self-harming responses. Furthermore, we show that action recommendation-based explanations (ARexes), which encompass counterfactual explanations, are sufficient to induce all non-harmful responses. Under a conditional homogeneity assumption, this sufficiency extends to ARex-generating methods, echoing the revelation principle in information design. To demonstrate their practical utility, we introduce a simple learning procedure that jointly optimises the predictive model and the explanation-generating policy. Experiments on both synthetic and real-world tasks show that ARexes enable DMs to achieve high predictive performance while preserving agents' utility, offering a principled strategy for safe and effective partial model disclosure.
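To make the joint optimisation idea concrete, below is a minimal, hypothetical sketch (not the authors' procedure) of training a predictive model together with an action-recommendation explanation policy. It assumes a linear model, a quadratic feature-modification cost, and a soft acceptance rule standing in for agents who follow a recommendation only when it is non-harmful; all names and parameters are illustrative.

```python
# Illustrative sketch only: joint training of a predictive model and an
# action-recommendation (ARex) policy under a simulated strategic response.
# Assumptions: linear model, quadratic modification cost, soft acceptance.
import torch

torch.manual_seed(0)

n, d = 512, 5
w_true = torch.randn(d)
X = torch.randn(n, d)
y = X @ w_true + 0.1 * torch.randn(n)   # labels fixed by original features (gaming setting)

theta = torch.zeros(d, requires_grad=True)   # DM's predictive model
policy = torch.nn.Linear(d, d)               # explanation policy: recommended feature change

opt = torch.optim.Adam([theta] + list(policy.parameters()), lr=1e-2)
cost = 0.5   # assumed agent cost per unit of squared feature change

for step in range(500):
    rec = policy(X)                           # recommended action (feature modification)
    gain = rec @ theta - cost * rec.pow(2).sum(dim=1)   # agent's utility gain if followed
    accept = torch.sigmoid(5.0 * gain)        # soft stand-in for "follow only if non-harmful"
    X_resp = X + accept.unsqueeze(1) * rec    # induced response
    pred_loss = ((X_resp @ theta - y) ** 2).mean()
    harm_penalty = torch.relu(-gain).mean()   # discourage recommendations that would harm agents
    loss = pred_loss + 1.0 * harm_penalty
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final predictive loss: {pred_loss.item():.3f}")
```

The soft acceptance and harm penalty are stand-ins for the non-harmfulness condition described in the abstract; a faithful implementation would follow the paper's own agent-response and constraint formulation.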
Submission Number: 266