Online Detection for Black-Box Large Language Models with Adaptive Prompt Selection

ICLR 2025 Conference Submission 13317 Authors

28 Sept 2024 (modified: 25 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: online change detection, LLM security, active prompt selection
TL;DR: This paper proposes a novel online detection algorithm for detecting changes in black-box LLMs, with an adaptive prompt selection strategy to enhance detection power.
Abstract: The widespread success of large language models (LLMs) has made them integral to various applications, yet security and reliability concerns are growing. It is now critical to safeguard LLMs from unintended changes caused by tampering, malicious prompt injection, unauthorized parameter updates, and similar interventions. Early detection of these changes is essential to maintain the performance, fairness, and trustworthiness of LLM-powered applications. However, in black-box settings, where access to model parameters and output probabilities is unavailable, few detection methods exist. In this paper, we propose a novel online change-point detection method for quickly detecting changes in black-box LLMs. Our method features several key innovations: 1) we derive a CUSUM-type detection statistic based on the entropy and the Gini coefficient of the response distribution, and 2) we employ a UCB-based adaptive prompt selection strategy to identify change-sensitive prompts and enhance detection. We evaluate the effectiveness of the proposed method on synthetic data, where changes are simulated through watermarking and model version updates. The proposed method detects changes quickly while keeping the false alarm rate well controlled. Moreover, on real-world data, our method accurately detects announced changes in LLM APIs via daily online interactions with the APIs. We also find strong evidence of unreported changes in these APIs, which may be of independent interest.
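The abstract does not give the exact form of the detection statistic or the bandit reward, so the sketch below is only a rough illustration of the two ingredients it names: a CUSUM-style recursion over entropy/Gini summaries of the response distribution, and a standard UCB1 rule for choosing which prompt to query next. All names (CusumMonitor, UcbPromptSelector, query_llm_api) and all constants are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch (assumed, not the paper's exact method): monitor a black-box
# LLM by repeatedly querying it, summarizing the response distribution with
# entropy / Gini impurity, and feeding the summary into a CUSUM-type statistic,
# while a UCB1 rule picks the prompt that currently looks most change-sensitive.
import math
import random
from collections import Counter

def entropy(counts):
    """Shannon entropy of an empirical response distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def gini(counts):
    """Gini impurity of an empirical response distribution."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

class CusumMonitor:
    """Generic CUSUM-style recursion on a scalar summary; drift and threshold are tuning knobs."""
    def __init__(self, reference, drift=0.05, threshold=5.0):
        self.reference = reference  # pre-change mean of the summary, assumed estimated offline
        self.drift = drift
        self.threshold = threshold
        self.stat = 0.0

    def update(self, value):
        # Accumulate deviations from the reference beyond the drift allowance.
        self.stat = max(0.0, self.stat + abs(value - self.reference) - self.drift)
        return self.stat > self.threshold  # True -> raise a change alarm

class UcbPromptSelector:
    """Standard UCB1 over a fixed prompt pool; the 'reward' is how change-sensitive a prompt appears."""
    def __init__(self, n_prompts, c=1.0):
        self.counts = [0] * n_prompts
        self.means = [0.0] * n_prompts
        self.c = c
        self.t = 0

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):  # query each prompt at least once first
            if n == 0:
                return i
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + self.c * math.sqrt(math.log(self.t) / self.counts[i]))

    def update(self, i, reward):
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]

def query_llm_api(prompt, n=50):
    """Placeholder for n repeated black-box API calls; replace with real API queries."""
    return [random.choice(["yes", "no", "maybe"]) for _ in range(n)]

# Example monitoring loop: one round of interaction per day.
prompts = ["prompt A", "prompt B", "prompt C"]
selector = UcbPromptSelector(len(prompts))
monitor = CusumMonitor(reference=1.0)  # pre-change entropy level, assumed estimated offline
for day in range(30):
    i = selector.select()
    value = entropy(Counter(query_llm_api(prompts[i])))
    selector.update(i, abs(value - monitor.reference))  # reward: apparent sensitivity to change
    if monitor.update(value):
        print(f"Change alarm on day {day} using prompt {i}")
        break
```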
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13317