Red Pill or Blue Pill? Thresholding Strategies for Neural Network Monitoring

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Neural Network Runtime Monitoring, Machine Learning Safety, Threshold Optimization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We experimentally compare different strategies for tuning the threshold of neural network runtime monitors.
Abstract: With the increasing deployment of neural networks in critical systems, runtime monitoring has become essential for rejecting unsafe predictions during inference. Various techniques have emerged to establish rejection scores that aim to maximize the separability between the distributions of safe and unsafe predictions. In most works, the efficacy of these approaches is evaluated using threshold-agnostic metrics, such as the area under the receiver operating characteristic curve. However, in real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions. Despite its pivotal importance in practice, threshold optimization has received little to no attention in the literature. In this work, we address this question by comparing four strategies for constructing threshold optimization datasets, each reflecting a different assumption about the data available for threshold tuning. We present rigorous experiments on various image datasets and conclude that: 1. Knowledge about the runtime threats actually impacting the system helps greatly in identifying an optimal threshold. 2. Without this information, relying solely on in-distribution data is advisable, as adding unrelated generic threat data produces worse thresholds.
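Illustration (not part of the submission form): the sketch below shows, under illustrative assumptions, what "threshold optimization" means in this setting. The scores are synthetic, and the tuning objective (Youden's J on a held-out set) is just one possible choice; the paper compares different ways of building the tuning dataset, not this particular objective.

```python
# A minimal sketch, assuming a monitor that emits a scalar rejection score
# (higher = more likely unsafe). The data, the tuning-set composition, and
# the objective (maximize TPR - FPR) are illustrative assumptions, not the
# authors' exact protocol.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic rejection scores for a held-out "threshold optimization" set.
scores_safe = rng.normal(loc=0.0, scale=1.0, size=1000)    # label 0: safe predictions
scores_unsafe = rng.normal(loc=2.0, scale=1.0, size=1000)  # label 1: unsafe predictions

tune_scores = np.concatenate([scores_safe, scores_unsafe])
tune_labels = np.concatenate([np.zeros(1000), np.ones(1000)])

# Threshold-agnostic metrics (e.g. AUROC) summarize the whole ROC curve but
# do not tell us which cut-off to deploy; we still have to pick one point.
fpr, tpr, thresholds = roc_curve(tune_labels, tune_scores)

# Example criterion: pick the threshold maximizing Youden's J = TPR - FPR.
best = np.argmax(tpr - fpr)
threshold = thresholds[best]
print(f"chosen threshold: {threshold:.3f} (TPR={tpr[best]:.3f}, FPR={fpr[best]:.3f})")

# At inference time, the monitor rejects a prediction whose score exceeds the threshold.
def reject(score: float) -> bool:
    return score > threshold
```

The quality of the chosen threshold depends on how representative the tuning set is of the threats actually seen at runtime, which is the question the four dataset-construction strategies in the paper address.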
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7444