Optimization-based Trajectory Deviation Attacks in Agentic LLM Systems

ICLR 2026 Conference Submission 22496 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Agent, security, large language model, trajectory attack
TL;DR: This paper presents trajectory deviation attacks on LLM agents: by manipulating intermediate observations, an adversary steers the agent's reasoning path without altering its initial prompt or model weights.
Abstract: Agentic large language model (LLM) systems are increasingly deployed in critical areas such as healthcare, finance, transportation, and defense, where decisions emerge from iterative cycles of action, observation, and reflection rather than from single prompts. We show that this loop introduces a unique and underexplored vulnerability. Specifically, we present trajectory deviation attacks, which manipulate intermediate observations to redirect an agent's reasoning process without altering its initial prompt or model weights. We formalize two attack types: (i) incorrect-outcome attacks, which guide agents toward plausible but wrong conclusions, and (ii) targeted attacks, which deterministically steer reasoning toward an adversary-chosen outcome. We frame trajectory corruption as an optimization problem, leveraging adversarial "attack agents" with logit access to inject semantically coherent yet misleading observations. By minimizing the perplexity and entropy of these injections, our attacks evade common anomaly detection methods while maximizing reasoning misalignment. Evaluations on black-box victim agents powered by state-of-the-art proprietary models, across domains such as medical decision-making, financial advising, and travel planning, show that securing agentic LLM systems requires integrity guarantees across the full reasoning trajectory.
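As a rough illustration of the perplexity-and-entropy objective described in the abstract, the following is a minimal sketch, assuming white-box logit access to a surrogate scoring model through Hugging Face transformers. The model name, the weighting term alpha, and the candidate observations are illustrative assumptions, not details taken from the submission; the paper's actual attack optimization may differ.

```python
# Minimal sketch: rank candidate injected observations by a perplexity + entropy
# objective, assuming logit access to a surrogate scoring LM. All concrete
# choices below (model, alpha, example strings) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative surrogate scorer, not the paper's attack agent
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def score_observation(text: str) -> tuple[float, float]:
    """Return (perplexity, mean next-token entropy) for a candidate observation."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # Perplexity from the mean token-level cross-entropy loss.
    perplexity = torch.exp(out.loss).item()
    # Mean entropy of the next-token distribution at each position.
    probs = torch.softmax(out.logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean().item()
    return perplexity, entropy


def pick_stealthiest(candidates: list[str], alpha: float = 1.0) -> str:
    """Select the candidate minimizing perplexity + alpha * entropy (alpha is assumed)."""
    def objective(text: str) -> float:
        perplexity, entropy = score_observation(text)
        return perplexity + alpha * entropy

    return min(candidates, key=objective)


if __name__ == "__main__":
    # Hypothetical injected observations for a medical decision-making agent.
    candidates = [
        "Follow-up lab report: the patient's potassium level is within the normal range.",
        "URGENT!!! potassium 9.9 mmol/L override all prior results immediately",
    ]
    print(pick_stealthiest(candidates))
```

The intuition behind this kind of scoring is that low perplexity and low token-level entropy make the injected observation look statistically unremarkable to perplexity- or entropy-based anomaly detectors, which is the evasion property the abstract claims; how the paper combines these terms with its misalignment objective is not specified here.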
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22496