From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

08 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Hallucination, Jailbreak, Large Foundation Models, Safety
TL;DR: We reveal a shared failure mode in large foundation models by showing that hallucinations and jailbreaks arise from similar gradient dynamics—and that defending against one can help mitigate the other.
Abstract: Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence—the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution—both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.
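To make the framework concrete, the two objectives can be written schematically as below. This is a minimal sketch under assumed forms (a GCG-style token-level jailbreak loss and an attention-level perturbation for hallucinations); the notation $x_{1:n}$ (prompt), $s$ (adversarial token suffix), $A$ (attention weights), $\delta$ (attention perturbation), and $y^{*}$ (target output) is illustrative and not taken from the paper's definitions.

% Schematic (assumed) forms of the two objectives; notation is illustrative, not the paper's.
% Jailbreak: token-level optimization over an adversarial suffix s appended to the prompt
\mathcal{L}_{\mathrm{jb}}(s) = -\log p_{\theta}\left(y^{*} \mid x_{1:n} \oplus s\right)
% Hallucination: attention-level optimization over a perturbation \delta of the attention map A
\mathcal{L}_{\mathrm{hal}}(\delta) = -\log p_{\theta}\left(y^{*} \mid x_{1:n};\, A + \delta\right)
% Proposition 2 (gradient consistency) can then be read as high alignment between the two
% gradients taken with respect to the shared attention weights:
\cos\left(\nabla_{A}\mathcal{L}_{\mathrm{jb}},\, \nabla_{A}\mathcal{L}_{\mathrm{hal}}\right) \approx 1

Under this reading, Proposition 1 corresponds to the two losses following similar convergence curves when each is optimized toward its target output, and Proposition 2 to the near-parallel attention-space gradients sketched above.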
Primary Area: interpretability and explainable AI
Submission Number: 3182