Evaluation of Multi-Turn Consistency in LLM Agents: Survival Analysis and Failure-Rationale Taxonomy
Track: long paper (up to 10 pages)
Keywords: LLM reasoning evaluation, multi-turn consistency, logical contradiction detection, self-contradiction in chain-of-thought, reasoning trace faithfulness, deliberation-inconsistency, temporal reasoning reliability, agent reasoning benchmark, failure rationale taxonomy, consistency benchmarking
TL;DR: We evaluate multi-turn reasoning consistency in LLMs via survival analysis and failure-rationale taxonomy, finding that denser deliberation traces produce more logical self-contradictions, not fewer
Abstract: Large language model (LLM) agents may perform well on isolated tasks yet drift into inconsistency over extended interaction. We evaluate temporal consistency in a controlled 20-step multi-agent setting inspired by studies on delayed gratification. At each step, an agent chooses between delaying a reward or claiming it immediately (terminating the episode). Across a full-factorial manipulation of social visibility (private vs public), persona stressors, and deliberation policy, we run 84,540 trajectories spanning 8 model families. Treating the first reward-claim as a time-to-event outcome, we estimate Kaplan-Meier survival curves and fit discrete-time hazard regressions to quantify how experimental factors shift failure risk over time. Then, to analyze the rationales and language patterns associated with failure, we build a seven-category taxonomy from 13,780 deliberation traces produced by agents that chose to terminate the episode, using LLM-assisted labelling paired with a human audit ($\kappa=0.83$). Rationale profiles change systematically with time and context: early failures are more impulse-driven, later failures more fatigue- and cost–benefit-framed, while public settings increase norm-oriented justifications. We also find a deliberation-inconsistency association: among failures, longer deliberation correlates with higher rates of intra-rationale contradiction (simultaneous pro-delay and pro-claim statements), challenging the assumption that more reasoning text implies greater consistency. Together, the survival and rationale analyses reveal distinct temporal reliability regimes and model-specific "failure fingerprints", offering an evaluation lens for diagnosing inconsistency in multi-turn agent behavior.
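The time-to-event framing in the abstract can be illustrated with a minimal Kaplan-Meier estimator over episode lengths. This is a hypothetical sketch, not the authors' code: `km_survival` is an assumed helper, and the treatment of step-20 completions as right-censored observations is an assumption consistent with the 20-step design.

```python
from collections import Counter

def km_survival(durations, observed):
    """Kaplan-Meier estimate of the survival function S(t).

    durations: step at which each trajectory ended (1..20 in this setup).
    observed:  1 if the agent claimed the reward at that step (the
               "failure" event), 0 if the episode reached step 20
               without a claim (right-censored).
    Returns {step: estimated probability of surviving past that step}.
    """
    event_counts = Counter(t for t, e in zip(durations, observed) if e)
    ended_counts = Counter(durations)
    at_risk = len(durations)
    surv, curve = 1.0, {}
    for t in sorted(ended_counts):
        d = event_counts.get(t, 0)
        if d:  # multiply in the conditional survival at this event time
            surv *= 1.0 - d / at_risk
        curve[t] = surv
        at_risk -= ended_counts[t]  # events and censorings both leave the risk set
    return curve

# Toy example: three claims (steps 1, 2, 2) and two censored full episodes.
curve = km_survival([1, 2, 2, 20, 20], [1, 1, 1, 0, 0])
```

Here `curve[1]` is 0.8 (one claim among five at risk) and `curve[2]` is 0.4, which then stays flat through step 20 since the remaining episodes are censored rather than failed.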
Presenter: ~Olga_Manakina1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 127