Keywords: Large Reasoning Models, Prompt Injection Attack
TL;DR: We reveal a vulnerability in large reasoning models and propose the Reasoning Injection Attack (RIA), which exploits their susceptibility to injected content that is consistent with their own reasoning chain.
Abstract: Large reasoning models (LRMs) exhibit stronger logical coherence than standard language models, producing more consistent reasoning chains. While this makes them powerful, it also introduces new security concerns. Prior work on prompt injection has primarily focused on attacks that use explicit instructions to override the original task, but we find these methods increasingly ineffective against LRMs because explicit overrides disrupt the model’s reasoning flow. In this work, we propose the *Reasoning Injection Attack (RIA)*, a new attack paradigm that integrates injected objectives into the model’s reasoning process instead of forcefully interrupting it. By presenting malicious information as a logically consistent component of the reasoning chain, RIA achieves higher success rates while maintaining coherence. To enable systematic evaluation, we further establish a *Reasoning Prompt Injection Benchmark* that spans five model families and 14 diverse reasoning domains. Experiments show that RIA improves the average attack success rate from 0.63 to 0.76, significantly outperforming explicit injection methods. These results reveal a key vulnerability of LRMs and underscore the need for more robust defenses against reasoning-aware prompt injection.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5493