TL;DR: We show that a model owner can artificially introduce uncertainty into their model to unjustly abstain on targeted inputs, and we provide a corresponding detection mechanism.
Abstract: Cautious predictions—where a machine learning model abstains when uncertain—are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model’s proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.
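The detection idea described in the abstract rests on a standard calibration observation: on a trusted reference dataset, artificially suppressed confidence shows up as under-confidence, i.e., regions where the model's empirical accuracy substantially exceeds its reported confidence. The sketch below illustrates this idea in plain NumPy; the function name, binning scheme, and threshold are illustrative assumptions, not the paper's actual Confidential Guardian implementation (which additionally verifies that the reported scores come from the deployed model via zero-knowledge proofs of inference).

```python
# Hedged sketch: flag confidence bins on a reference dataset where accuracy
# greatly exceeds reported confidence (under-confidence), the signature of
# artificially suppressed confidence. Names and thresholds are illustrative.
import numpy as np

def underconfidence_check(confidences, correct, n_bins=10, gap_threshold=0.1):
    """Return the bins whose accuracy exceeds mean confidence by more than
    `gap_threshold` on a reference dataset.

    confidences: array of reported top-class confidences in [0, 1]
    correct:     boolean array, True where the top-class prediction was correct
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    flagged = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        avg_conf = confidences[in_bin].mean()
        avg_acc = correct[in_bin].mean()
        gap = avg_acc - avg_conf  # positive gap = under-confidence
        if gap > gap_threshold:
            flagged.append({"bin": (lo, hi), "confidence": avg_conf,
                            "accuracy": avg_acc, "gap": gap})
    return flagged
```

A well-calibrated honest model should produce only small gaps in every bin; large positive gaps concentrated in particular input regions are exactly what an uncertainty-inducing attack like Mirage would leave behind on the reference data.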
Lay Summary: When artificial intelligence (AI) systems are unsure, they often choose to “abstain” from making a prediction. This cautious behavior helps avoid harmful mistakes in high-stakes settings like medicine, finance, or criminal justice. But what if that very mechanism—meant to promote safety—could be twisted into a tool for harm? In our work, we reveal a troubling possibility: an organization could deliberately make its AI system appear uncertain for certain people—not because the task is genuinely hard, but to quietly deny them services like loans or benefits. We call this deceptive strategy Mirage, an attack that reduces the AI's confidence in specific cases while still performing well overall. This makes it hard for outside observers to notice anything suspicious. To stop such misuse, we introduce Confidential Guardian, a new system that allows independent auditors to check whether an AI’s cautious behavior is real or artificially manufactured. It does this by analyzing how the AI behaves on trusted test cases, and verifying its behavior using a technique that ensures honesty—without revealing the model’s inner workings. Our findings highlight a hidden danger in today’s AI systems, and offer a path toward greater transparency and fairness—ensuring that caution is used for safety, not for discrimination.
Link To Code: https://github.com/cleverhans-lab/confidential-guardian
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: uncertainty, abstention, rejection, selective prediction, zero knowledge, confidence, auditing
Submission Number: 1628