Keywords: Large Language Models, Jailbreaking, Adversarial Attacks, AI Safety, Red Teaming
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities; however, ensuring their safety remains a significant challenge. While recent adversarial attacks have optimized prompts for effectiveness, they predominantly treat attacks as static artifacts, overlooking the critical sociolinguistic dimension of the attacker's persona. In this paper, we bridge this gap with PRISM (Personality-Reified Iterative Strategic Modulation), a framework that formalizes attacker personality as a controllable variable within an automated optimization loop. Using PRISM to refract the attack surface, we investigate two distinct agent personas: the compliance-oriented Echo and the diversity-seeking Nexus. Experiments across state-of-the-art LLMs (e.g., GPT-4o, GPT-5, o1) reveal a striking dichotomy: Echo agents achieve state-of-the-art attack success rates (>93%) by weaponizing confirmation bias, whereas target models show unexpected resilience against Nexus agents, suggesting that information diversity acts as a natural "cognitive buffer." Our findings demonstrate that attack success is dictated not just by optimization intensity but by the social dynamics of the interaction, establishing PRISM as a vital tool for uncovering persona-driven vulnerabilities.
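To make the abstract's description concrete, the sketch below illustrates one plausible shape of a persona-conditioned iterative attack loop as described: attacker personality is held fixed as a controllable variable while the prompt is refined against the target's responses. This is a minimal illustration, not the authors' implementation; `Persona`, `query_target`, `judge_score`, and the modulation heuristic are all hypothetical placeholders.

```python
# Minimal sketch (assumed, not the PRISM codebase) of a persona-conditioned
# iterative attack-optimization loop. The persona's framing stays fixed; only
# the prompt is modulated between iterations.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Persona:
    name: str          # e.g., "Echo" (compliance-oriented) or "Nexus" (diversity-seeking)
    system_style: str  # sociolinguistic framing injected into every attack prompt


ECHO = Persona("Echo", "Mirror and reinforce the target's own statements.")
NEXUS = Persona("Nexus", "Introduce diverse, conflicting perspectives each turn.")


def prism_style_attack(
    goal: str,
    persona: Persona,
    query_target: Callable[[str], str],   # hypothetical: sends a prompt to the target LLM
    judge_score: Callable[[str], float],  # hypothetical: 1.0 = fully successful attack
    max_iters: int = 10,
) -> tuple[str, float]:
    """Iteratively refine the attack prompt under a fixed persona."""
    prompt = f"[{persona.name}] {persona.system_style}\nRequest: {goal}"
    best_prompt, best_score = prompt, 0.0
    for _ in range(max_iters):
        response = query_target(prompt)
        score = judge_score(response)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if best_score >= 0.99:
            break
        # Strategic modulation step: rewrite the prompt using the target's
        # last response while keeping the persona's framing constant.
        prompt = (
            f"[{persona.name}] {persona.system_style}\n"
            f"Previous reply: {response[:200]}\n"
            f"Rephrase the request to address that reply: {goal}"
        )
    return best_prompt, best_score
```

Under this reading, comparing Echo and Nexus reduces to swapping the `persona` argument while holding the optimization loop constant, which is what lets persona be isolated as the experimental variable.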
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Ethics and NLP
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3308