PRISON: Unmasking the Criminal Potential of Large Language Models

ICLR 2026 Conference Submission 16169 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models (LLMs), Criminal Potential, Behavioral Safety, Social Impact
TL;DR: To assess harmful behaviors in adversarial contexts, we propose PRISON, a framework for measuring LLMs' criminal potential and anti-crime ability; we find that advanced models often display criminal traits yet fail to reliably detect them.
Abstract: As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic assessment of LLMs' criminal potential in realistic interactions, where criminal potential is defined as the risk of producing harmful behaviors, such as deception and blame-shifting, under adversarial settings that could facilitate unlawful activities. We therefore propose PRISON, a unified framework that quantifies LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios grounded in reality, we evaluate both the criminal potential and the anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between expressing and detecting criminal traits. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
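To make the evaluation described above concrete, the sketch below shows one plausible way to aggregate annotated model responses into per-trait expression rates and a detective-role detection accuracy. This is an illustrative assumption, not the paper's released code: the record fields (`traits`, `detector_flagged`, `is_deceptive`) and both helper functions are hypothetical names introduced here for clarity.

```python
# Hypothetical sketch (not the PRISON authors' implementation):
# tally how often annotated responses express each of the five traits,
# and how accurately a detective-role model flags deceptive responses.
from collections import Counter

PRISON_TRAITS = [
    "False Statements",
    "Frame-Up",
    "Psychological Manipulation",
    "Emotional Disguise",
    "Moral Disengagement",
]

def trait_expression_rates(annotations: list[dict]) -> dict[str, float]:
    """Fraction of responses expressing each trait.
    Each record is assumed to look like:
    {"traits": ["False Statements"], "detector_flagged": True, "is_deceptive": True}."""
    counts = Counter()
    for record in annotations:
        for trait in record["traits"]:
            counts[trait] += 1
    total = len(annotations) or 1
    return {trait: counts[trait] / total for trait in PRISON_TRAITS}

def detection_accuracy(annotations: list[dict]) -> float:
    """Fraction of responses where the detective-role flag matches the ground-truth label."""
    correct = sum(r["detector_flagged"] == r["is_deceptive"] for r in annotations)
    return correct / (len(annotations) or 1)

if __name__ == "__main__":
    sample = [
        {"traits": ["False Statements"], "detector_flagged": False, "is_deceptive": True},
        {"traits": [], "detector_flagged": False, "is_deceptive": False},
        {"traits": ["Emotional Disguise", "Moral Disengagement"],
         "detector_flagged": True, "is_deceptive": True},
    ]
    print(trait_expression_rates(sample))
    print(f"detection accuracy: {detection_accuracy(sample):.0%}")
```

Under this framing, the paper's reported 44% average corresponds to the `detection_accuracy` quantity computed over the full evaluation set, whatever the authors' actual annotation pipeline looks like.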
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16169