Keywords: Large Language Models, Jailbreaking, Adversarial Attacks, AI Safety, Red Teaming
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities; however, ensuring their safety remains a significant challenge. While recent adversarial attacks have optimized prompts for effectiveness, they predominantly treat attacks as static artifacts, overlooking the critical sociolinguistic dimension of the attacker's persona. In this paper, we bridge this gap with PRISM (Personality-Reified Iterative Strategic Modulation), a framework that formalizes attacker personality as a controllable variable within an automated optimization loop. Using PRISM to refract the attack surface, we investigate two distinct agent personas: the compliance-oriented Echo and the diversity-seeking Nexus. Experiments across state-of-the-art LLMs (e.g., GPT-4o, GPT-5, o1) reveal a striking dichotomy: Echo agents achieve state-of-the-art attack success rates (>93%) by weaponizing confirmation bias, whereas target models show unexpected resilience against Nexus agents, suggesting that information diversity acts as a natural "cognitive buffer." Our findings demonstrate that attack success is dictated not just by optimization intensity but by the social dynamics of the interaction, establishing PRISM as a vital tool for uncovering persona-driven vulnerabilities.
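To make the abstract's description concrete, the sketch below illustrates one plausible shape of a persona-conditioned iterative attack loop as described: attacker personality is held fixed as a controllable variable while the prompt is refined against the target's responses. This is a minimal illustration, not the authors' implementation; `Persona`, `query_target`, `judge_score`, and the modulation heuristic are all hypothetical placeholders.

```python
# Minimal sketch (assumed, not the PRISM codebase) of a persona-conditioned
# iterative attack-optimization loop. The persona's framing stays fixed; only
# the prompt is modulated between iterations.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Persona:
    name: str          # e.g., "Echo" (compliance-oriented) or "Nexus" (diversity-seeking)
    system_style: str  # sociolinguistic framing injected into every attack prompt


ECHO = Persona("Echo", "Mirror and reinforce the target's own statements.")
NEXUS = Persona("Nexus", "Introduce diverse, conflicting perspectives each turn.")


def prism_style_attack(
    goal: str,
    persona: Persona,
    query_target: Callable[[str], str],   # hypothetical: sends a prompt to the target LLM
    judge_score: Callable[[str], float],  # hypothetical: 1.0 = fully successful attack
    max_iters: int = 10,
) -> tuple[str, float]:
    """Iteratively refine the attack prompt under a fixed persona."""
    prompt = f"[{persona.name}] {persona.system_style}\nRequest: {goal}"
    best_prompt, best_score = prompt, 0.0
    for _ in range(max_iters):
        response = query_target(prompt)
        score = judge_score(response)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if best_score >= 0.99:
            break
        # Strategic modulation step: rewrite the prompt using the target's
        # last response while keeping the persona's framing constant.
        prompt = (
            f"[{persona.name}] {persona.system_style}\n"
            f"Previous reply: {response[:200]}\n"
            f"Rephrase the request to address that reply: {goal}"
        )
    return best_prompt, best_score
```

Under this reading, comparing Echo and Nexus reduces to swapping the `persona` argument while holding the optimization loop constant, which is what lets persona be isolated as the experimental variable.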
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Ethics and NLP
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3308