Benchmarking Mitigations For Covert Misuse

ICLR 2026 Conference Submission 20877 Authors

19 Sept 2025 (modified: 08 Oct 2025)
License: CC BY 4.0
Keywords: AI safety, LLM misuse, adversarial attacks, evaluation, adversarial training
TL;DR: An evaluation of realistic covert/decomposition misuse attacks, and benchmarking of stateful defenses.
Abstract: Existing language model safety evaluations focus on overt attacks and low-stakes tasks. In reality, an attacker can easily subvert existing safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments *uplift misuse* by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop *Benchmarks for Stateful Defenses* (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets of tasks that frontier models consistently refuse and that are too difficult for weaker open-weight models. This enables us to evaluate decomposition attacks, which we find to be effective misuse enablers, and to highlight stateful defenses as both a promising and necessary countermeasure.
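The abstract contrasts per-query moderation, which decomposition attacks evade, with stateful defenses that consider a user's accumulated query history. The following is a minimal, hypothetical Python sketch of that contrast, not the paper's BSD pipeline: `moderation_score`, `StatelessDefense`, and `StatefulDefense` are illustrative names introduced here, and the harm classifier is left as a placeholder.

```python
# Hypothetical sketch: stateless vs. stateful query moderation.
# A stateless filter scores each query in isolation, so benign-seeming
# fragments of a decomposed harmful task tend to pass. A stateful defense
# scores the user's accumulated history, letting combined intent surface.
from collections import defaultdict


def moderation_score(text: str) -> float:
    """Placeholder harm classifier; returns a score in [0, 1]."""
    raise NotImplementedError("plug in a real classifier here")


class StatelessDefense:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def should_refuse(self, user_id: str, query: str) -> bool:
        # Each query is judged alone, independent of prior queries.
        return moderation_score(query) >= self.threshold


class StatefulDefense:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.history = defaultdict(list)  # user_id -> past queries

    def should_refuse(self, user_id: str, query: str) -> bool:
        # Judge the new query in the context of the user's prior queries.
        self.history[user_id].append(query)
        combined = "\n".join(self.history[user_id])
        return moderation_score(combined) >= self.threshold
```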
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20877