Keywords: Large Language Models, jailbreak attacks, secondary risks
TL;DR: We propose SecLens, a novel method to uncover subtle secondary risks in LLMs through a black-box search strategy, supported by SecondBench, a benchmark demonstrating their prevalence and transferability across LLMs, MLLMs, and interactive agents.
Abstract: Ensuring the safety and alignment of Large Language Models (LLMs) is a significant challenge as they become increasingly integrated into critical applications and societal functions. While prior research has focused primarily on jailbreak attacks, less attention has been given to non-adversarial failures that emerge subtly during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors in response to benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we define two risk primitives, excessive response and speculative advice, through illustrative examples and an information-theoretic analysis, capturing the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary-risk behaviors by jointly optimizing task relevance, risk activation, and linguistic plausibility. With SecLens, secondary-risk vulnerabilities that are prevalent across models can be uncovered automatically. Extensive experiments on 16 popular LLMs demonstrate that secondary risks are not only widespread but also transferable across models. Moreover, our exploration of common agent environments reveals that such risks are pervasive in practice. These findings underscore the urgent need for strengthened safety mechanisms to address harmful behaviors that arise from benign interactions with LLMs in real-world deployments.
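For intuition only, the sketch below illustrates one way a black-box, multi-objective search over benign prompts could balance task relevance, risk activation, and linguistic plausibility. All names (score_task_relevance, score_risk_activation, score_plausibility, mutate_prompt, seclens_style_search) are hypothetical placeholders and do not reflect the paper's actual SecLens implementation or scoring models.

```python
# Hypothetical sketch of a black-box, multi-objective prompt search.
# The scorers below are random stand-ins; a real system would query the
# target model and judge models instead.
import random
from typing import List

def score_task_relevance(prompt: str) -> float:
    # Placeholder: e.g., embedding similarity to the original benign task.
    return random.random()

def score_risk_activation(prompt: str) -> float:
    # Placeholder: e.g., a judge model checking the target model's response
    # for secondary-risk behaviors (excessive response, speculative advice).
    return random.random()

def score_plausibility(prompt: str) -> float:
    # Placeholder: e.g., inverse perplexity, so the prompt stays natural and benign.
    return random.random()

def mutate_prompt(prompt: str) -> str:
    # Placeholder mutation: a real search would paraphrase or rephrase
    # while keeping the request benign.
    return prompt + random.choice([" briefly.", " in detail.", " if possible."])

def seclens_style_search(seed_prompts: List[str],
                         iterations: int = 20,
                         population: int = 8,
                         weights=(1.0, 1.0, 1.0)) -> List[str]:
    """Keep prompts that score well on a weighted sum of the three objectives;
    a simplification of any real multi-objective (e.g., Pareto-based) search."""
    pool = list(seed_prompts)
    for _ in range(iterations):
        candidates = pool + [mutate_prompt(p) for p in pool]
        candidates.sort(
            key=lambda p: (weights[0] * score_task_relevance(p)
                           + weights[1] * score_risk_activation(p)
                           + weights[2] * score_plausibility(p)),
            reverse=True,
        )
        pool = candidates[:population]
    return pool

if __name__ == "__main__":
    print(seclens_style_search(["How should I store leftover rice?"]))
```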
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15572