Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper presents a two-stage jailbreak attack method (ICRT) that combines concept decomposition and reassembly with context matching, along with a novel harmfulness evaluation metric leveraging multiple ranking algorithms.
Abstract: Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks that can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, and thus fail to uncover the potential risks posed in real-world scenarios. To address this, we propose ICRT, a novel jailbreak attack framework inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Relevance bias is then exploited to reorganize prompts, enhancing semantic alignment and effectively inducing harmful outputs. Furthermore, we introduce a ranking-based harmfulness evaluation metric that moves beyond the traditional binary success-or-failure paradigm, employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses the safety mechanisms of mainstream LLMs and generates high-risk content.
Lay Summary: Large language models remain vulnerable to jailbreak attacks that can bypass their safety mechanisms. To address this, we propose the ICRT framework, which draws on human cognitive biases—specifically the simplicity effect and relevance bias—to decompose malicious prompts into simpler components and then reorganize them for stronger semantic alignment, making it easier for the model to accept harmful instructions. In addition, we design a ranking-based harmfulness evaluation method that moves beyond a simple success-or-failure metric by using ranking aggregation algorithms (such as Elo, HodgeRank, and Rank Centrality) to quantify the severity of generated content. Experimental results demonstrate that ICRT consistently evades the safety filters of mainstream models and produces high-risk content.
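To make the ranking-based evaluation idea concrete, the sketch below illustrates one of the aggregation methods named in the abstract (Elo) applied to pairwise harmfulness judgments. It is a minimal illustration written for this summary under stated assumptions, not the authors' released code: the K-factor, initial rating, response identifiers, and example comparisons are all hypothetical.

```python
# Minimal Elo-style aggregation over pairwise harmfulness judgments.
# Illustrative sketch only; the K-factor, initial rating, and example
# comparisons below are assumptions, not the paper's implementation.

def elo_ratings(pairwise_results, k=32, initial=1000.0):
    """Aggregate pairwise outcomes into per-response ratings.

    pairwise_results: iterable of (winner_id, loser_id) pairs, where the
    "winner" is the response judged more harmful in that comparison.
    Returns a dict mapping response id -> Elo rating.
    """
    ratings = {}
    for winner, loser in pairwise_results:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        ratings[winner] = r_w + k * (1.0 - expected_w)
        ratings[loser] = r_l - k * (1.0 - expected_w)
    return ratings


# Example: rank three model outputs from hypothetical judge decisions.
comparisons = [("resp_a", "resp_b"), ("resp_a", "resp_c"), ("resp_b", "resp_c")]
for resp, score in sorted(elo_ratings(comparisons).items(), key=lambda x: -x[1]):
    print(resp, round(score, 1))
```

The same pairwise judgments could instead be fed to HodgeRank or Rank Centrality, as the abstract notes; the Elo variant is shown here only because it is the simplest to state in a few lines.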
Link To Code: https://github.com/longlong-www/ICRT
Primary Area: Social Aspects->Security
Keywords: Jailbreak attack, large language models
Submission Number: 3467