Keywords: Intent Drift, Trajectory-Level Alignment, Large Language Models, Multi-Turn Interaction, Goal-Directed Agents, Drift Detection, Alignment Metrics, Rate–Distortion Theory, Lyapunov Stability, Human Preference Correlation, Long-Horizon Robustness, IDS-DPO, Benchmark Evaluation
TL;DR: We introduce the Intent Drift Score (IDS), a unified metric that makes long-horizon LLM agents more reliable by detecting and controlling subtle trajectory-level misalignment.
Abstract: Large Language Models (LLMs) are increasingly deployed as multi-turn, goal-directed agents in domains such as tutoring, planning, and financial decision-making. Yet even when individual steps appear correct, their overall trajectories can gradually diverge from user intent—a phenomenon we call Intent Drift. Unlike hallucination or local error accumulation, intent drift is a trajectory-level instability that undermines reliability in long-horizon tasks.
We introduce the Intent Drift Score (IDS), a unified and computable metric for detecting and mitigating this form of misalignment. IDS integrates semantic, structural, and temporal signals into a single prefix-monotone score, enabling real-time drift monitoring. The score is computable in linear time and scales to million-token contexts, making it deployable in practical long-horizon applications.
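To make the shape of such a metric concrete, here is a minimal sketch of a linear-time, prefix-monotone drift score. This is our illustration, not the authors' IDS: the per-turn semantic, structural, and temporal signals, the convex weights, and the running-maximum trick for enforcing monotonicity are all assumptions introduced here.

```python
# Hypothetical sketch of a prefix-monotone drift score (NOT the paper's IDS).
# Assumptions: per-turn semantic/structural/temporal drift signals already
# computed in [0, 1]; fixed convex weights; prefix-monotonicity enforced by
# taking a running maximum over turn scores.
from typing import Sequence


def drift_score(semantic: Sequence[float],
                structural: Sequence[float],
                temporal: Sequence[float],
                weights: tuple[float, float, float] = (0.5, 0.3, 0.2)
                ) -> list[float]:
    """Return the score after each turn; the sequence never decreases."""
    w_sem, w_str, w_tmp = weights
    scores: list[float] = []
    running_max = 0.0
    for s, g, t in zip(semantic, structural, temporal):
        turn_score = w_sem * s + w_str * g + w_tmp * t  # convex combination
        running_max = max(running_max, turn_score)      # prefix-monotone
        scores.append(running_max)
    return scores


# Example: drift creeps upward over five turns, and the score tracks it.
print(drift_score(semantic=[0.1, 0.1, 0.3, 0.5, 0.6],
                  structural=[0.0, 0.2, 0.2, 0.4, 0.5],
                  temporal=[0.0, 0.1, 0.1, 0.3, 0.6]))
```

Because each turn is processed once and the running maximum needs only constant state, the whole pass is linear in trajectory length, which is what makes real-time monitoring over very long contexts feasible.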
Grounded in Lyapunov stability and rate–distortion theory, IDS offers formal guarantees of prefix-monotonicity and stability bounds. Empirical evaluations across dialogue and planning benchmarks show that IDS correlates strongly with human ratings (correlation above 0.82) and identifies drift significantly earlier than BLEU, ROUGE, or graph-based diagnostics.
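In our notation (the paper's exact formalization may differ), the prefix-monotonicity guarantee plausibly takes the following form: once drift is registered on a prefix, extending the trajectory cannot erase it.

```latex
% Prefix-monotonicity (our notation, hypothetical): for a trajectory
% \tau, the score on any prefix lower-bounds the score on its extensions.
\mathrm{IDS}(\tau_{1:t}) \;\le\; \mathrm{IDS}(\tau_{1:t+1})
\qquad \text{for all } t \ge 1 .
```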
Our core message is straightforward: alignment must be assessed not only by accuracy and safety, but also by trajectory-level stability. IDS operationalizes this principle, providing a foundation for building LLM agents that remain trustworthy over extended interactions.
Submission Number: 123