Zero Attribution Is Not Zero Influence: Feature Lock Attacks and the Limits of Post-Hoc Fairness Auditing
Abstract: Post-hoc explainability methods such as SHAP have become the de facto standard tools for auditing whether protected features influence a machine learning model's predictions. The reliability of this auditing paradigm rests on the assumption that these methods accurately report a feature's influence. We demonstrate that the paradigm is fundamentally vulnerable to a class of input-layer manipulation attacks. We introduce the Feature Lock Attack, a post-hoc adversarial wrapper that allows a model trained with a protected feature to evade detection by any perturbation-based post-hoc explainability audit, that is, any audit whose attributions depend on observing output variation when a feature is perturbed. The attack guarantees zero Shapley attribution by construction, because it triggers the Dummy Player axiom of cooperative game theory. We then extend this guarantee to LIME and formalize the theoretical boundary of the attack. We evaluate the attack across 40 distinct experimental configurations. It suppresses SHAP and LIME attributions to the noise floor of genuine non-use, at zero accuracy cost. Furthermore, the attack becomes proportionally more effective as the model's dependence on the protected feature grows. Our results show that, under adversarial deployment, relying on post-hoc explainability tools for fairness auditing is fundamentally brittle: zero attribution is not evidence of equity but an artifact of non-detection.
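The Dummy Player argument in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's construction: the toy linear model `base_model`, the hypothetical wrapper `locked_model` (which overwrites the protected input with a value inferred from a proxy feature), the background data, and the use of an interventional value function with exact enumeration over coalitions. The point is only that if the wrapped model's output never varies when the protected input is perturbed, every marginal contribution of that feature is zero, so its Shapley value is zero.

```python
from itertools import combinations
from math import factorial

PROTECTED = 2  # index of the protected feature (hypothetical)

def base_model(x):
    # Toy model that depends heavily on the protected feature.
    return 1.0 * x[0] + 0.5 * x[1] + 3.0 * x[PROTECTED]

def locked_model(x):
    # Hypothetical wrapper: ignore whatever the auditor supplies for the
    # protected feature and overwrite it with a value inferred from a
    # proxy (here, the sign of x[0]) before calling the base model.
    x_locked = list(x)
    x_locked[PROTECTED] = 1.0 if x[0] > 0 else 0.0
    return base_model(x_locked)

def shapley(model, x, background):
    # Exact Shapley values with an interventional value function:
    # features in the coalition take their value from x, the rest are
    # filled in from each background row, and outputs are averaged.
    n = len(x)

    def value(coalition):
        total = 0.0
        for b in background:
            z = [x[j] if j in coalition else b[j] for j in range(n)]
            total += model(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = set(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi

# Toy data in which the protected feature is fully determined by x[0].
background = [[-1.0, 0.0, 0.0], [1.0, 1.0, 1.0],
              [0.5, -1.0, 1.0], [-0.5, 2.0, 0.0]]
x = [1.0, 0.2, 1.0]

phi_base = shapley(base_model, x, background)
phi_locked = shapley(locked_model, x, background)
print("protected attribution, base model:  ", phi_base[PROTECTED])
print("protected attribution, locked model:", phi_locked[PROTECTED])
```

In this toy setting the two models agree on every unperturbed row, so the wrapper costs nothing in accuracy, yet the audit of the wrapped model attributes nothing to the protected feature. The influence has not disappeared; it has been shifted onto the proxy feature.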
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Meisam_Razaviyayn1
Submission Number: 8987