Know When to Say No: Improving the Safety of Machine Learning Models Through Refusal

Authors: Anonymous (MathAI 2025 Conference Submission 16)

30 Jan 2025 (modified: 20 Feb 2025) · MathAI 2025 Conference Withdrawn Submission · License: CC BY 4.0
Keywords: Trusted Machine Learning, Model Robustness, Security, Adversarial Attacks, Out-of-Distribution (OOD), Outliers, Interpretability
Abstract: Machine learning models are vulnerable to adversarial attacks and to errors caused by natural factors such as out-of-distribution (OOD) inputs, which compromises their reliability in critical applications. Addressing this challenge requires effective techniques for identifying and handling compromised inputs. In this work, we propose a methodology based on selective classification with the option to abstain from decision-making, enabling the model to defer predictions when uncertainty is detected. This framework is applied to various attack scenarios, enhancing the model's ability to detect adversarial perturbations and naturally occurring anomalies. Experimental results demonstrate that the proposed approach significantly improves the detection of compromised inputs and increases overall model robustness across diverse conditions.
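To illustrate the general idea of selective classification with a reject option described in the abstract, the following is a minimal sketch (not the authors' method) in which a classifier refuses to answer whenever its top-class confidence falls below a threshold; the function name, the threshold value, and the use of softmax probabilities are illustrative assumptions.

```python
import numpy as np

def predict_with_rejection(probs: np.ndarray, threshold: float = 0.9):
    """Selective classification sketch: answer only when the model's
    top-class confidence reaches `threshold`; otherwise refuse (-1).

    probs: (n_samples, n_classes) array of predicted class probabilities.
    """
    preds = probs.argmax(axis=1)       # most likely class per sample
    confidence = probs.max(axis=1)     # confidence of that class
    accept = confidence >= threshold   # mask of inputs the model answers
    # Low-confidence inputs (possibly OOD or adversarial) are refused.
    return np.where(accept, preds, -1), accept

# Usage: probs can come from any probabilistic classifier (e.g. softmax outputs).
probs = np.array([[0.97, 0.02, 0.01],   # confident -> answered
                  [0.40, 0.35, 0.25]])  # uncertain -> refused
labels, accepted = predict_with_rejection(probs, threshold=0.9)
print(labels)    # [ 0 -1]
print(accepted)  # [ True False]
```

Raising the threshold trades coverage (the fraction of inputs answered) for higher accuracy on the inputs that are answered, which is the core trade-off in refusal-based robustness.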
Submission Number: 16
