MIRA: Quantifying Neural Network Monitorability via Feature Space Analysis

ICLR 2026 Conference Submission 24804 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Neural Networks, Monitorability, Out-of-Distribution Detection, Anomaly Detection, Runtime Monitoring, Activation Patterns
TL;DR: We propose the MIRA Score, a metric that quantifies a neural network’s ability to expose its own failures by perturbing inputs and measuring feature separability.
Abstract: Monitoring neural networks is increasingly important for detecting potential failures in safety-critical applications. Although out-of-distribution (OoD) detection and uncertainty estimation have been widely studied, they often rely on the assumption that neural networks learn high-quality features. However, this assumption may not hold in practice, potentially leading to undetected failures. In this work, we introduce the concept of monitorability, which captures the intrinsic ability of a model to reveal potential inference errors through its internal activations. We provide a formal definition of monitorability and propose the Monitorability via Input peRturbAtion (MIRA) Score, a practical measure that quantifies this property without requiring access to external OoD data. Our method probes the model's behavior near the decision boundary by applying norm-bounded input perturbations and evaluates how distinguishable the resulting internal representations are using the Mahalanobis distance. Since no established baseline exists for monitorability, we validate MIRA by comparing it against the best achievable OoD detection performance across three representative methods. Through experiments across multiple architectures and application domains, we show that the MIRA Score correlates with the best achievable detection performance, providing a tool for evaluating and comparing monitorability across models. To the best of our knowledge, this is the first formalization and quantitative measure of monitorability. Our findings offer both theoretical grounding and empirical insight into the conditions under which model failures become detectable.
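
To make the pipeline in the abstract concrete, below is a minimal Python sketch of a MIRA-style score. The abstract does not pin down the perturbation scheme or the separability statistic, so this sketch assumes a single FGSM-style L-inf step as the "norm-bounded input perturbation" near the decision boundary and the AUROC between Mahalanobis-distance distributions of clean and perturbed features as the separability measure; the function name mira_score, the feature_layer hook, and the eps budget are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a MIRA-style monitorability score (assumptions noted
# above: FGSM-style L-inf perturbation, AUROC over Mahalanobis distances).
# Only in-distribution data (x, y) is required, matching the abstract's claim
# that no external OoD data is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score


def mira_score(model: nn.Module, feature_layer: nn.Module,
               x: torch.Tensor, y: torch.Tensor, eps: float = 0.03) -> float:
    model.eval()
    feats = []  # filled by the forward hook below
    hook = feature_layer.register_forward_hook(
        lambda _mod, _inp, out: feats.append(out.flatten(1)))

    # Norm-bounded (L-inf) input perturbation via a single FGSM step.
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    (grad,) = torch.autograd.grad(loss, x_adv)
    x_adv = (x + eps * grad.sign()).detach()

    # Collect internal activations for clean and perturbed inputs.
    feats.clear()  # drop features captured during the FGSM forward pass
    with torch.no_grad():
        model(x)
        model(x_adv)
    hook.remove()
    f_clean, f_pert = feats

    # Fit a Gaussian to clean features; squared Mahalanobis distance to it.
    mu = f_clean.mean(dim=0)
    cov = torch.cov(f_clean.T) + 1e-4 * torch.eye(f_clean.shape[1])  # ridge
    prec = torch.linalg.inv(cov)

    def sq_mahalanobis(f: torch.Tensor) -> torch.Tensor:
        d = f - mu
        return torch.einsum("bi,ij,bj->b", d, prec, d)

    # Separability of clean vs. perturbed distance distributions (AUROC).
    scores = torch.cat([sq_mahalanobis(f_clean), sq_mahalanobis(f_pert)])
    labels = [0] * len(f_clean) + [1] * len(f_pert)
    return roc_auc_score(labels, scores.numpy())


# Toy usage (untrained model, random data; illustrates the API only):
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
x, y = torch.randn(256, 20), torch.randint(0, 3, (256,))
print(mira_score(model, model[1], x, y))
```

Consistent with the abstract, the sketch fits its reference Gaussian on the clean in-distribution features themselves, so no external OoD data enters the computation.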
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 24804