Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

TMLR Paper7905 Authors

13 Mar 2026 (modified: 29 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it typically suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space search), we show that safety failures can be systematically exposed through diverse response generation (output-space search) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses monotonically raises the jailbreak success rate. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which replaces naive large-scale IID sampling with a multi-stage expansion-and-selection strategy that generates a compact, semantically diverse set of responses at substantially lower computational cost. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates (ASR) comparable to large-scale IID sampling while using only $8\%$ to $29\%$ of the computational cost, and outperforms IID sampling and Diverse Beam Search by $26\%$ to $40\%$ under limited-response budgets, while uncovering a broader and more semantically diverse range of failure modes. Critically, this diversity translates directly into more effective safety hardening: when integrated into an RLHF-based safety-tuning pipeline, PDPS-generated unsafe responses yield $33\%$ and $41\%$ greater reductions in ASR than those generated by IID sampling and Diverse Beam Search, respectively. Finally, we show that while input-space prompt optimization methods fall short of output-space exploration when used in isolation, combining input-space perturbation with diversity-driven output-space exploration covers a wider range of failure modes more efficiently than either paradigm alone.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Sirisha_Rambhatla1
Submission Number: 7905