Long‑Horizon Reliability of LLM Agents: Social Exposure, Personas, and Metacognitive Policy on a Delay‑of‑Gratification Survival Benchmark

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Agentic LLMs, Multi-turn interaction, Long-horizon reasoning, ReAct, Multi-agent systems, Social influence, Persona prompting, Metacognition, Time-resolved evaluation, Survival analysis, Kaplan–Meier, Discrete-time hazard models, RMST, Benchmarking and evaluation, Robustness, Safety, Generalization, Reproducibility, Open-source framework
TL;DR: We show through survival analysis that LLMs' multi-turn decision-making is systematically influenced by peer visibility, internal personas, and metacognitive scaffolding
Abstract: Large language models (LLMs) are increasingly deployed as long-horizon, multi-turn agents that must reason, plan, utilize tools, and interact with peers. Yet, most evaluations lack auditable, multi-factorial experiments with time-resolved statistics that reveal how behavior unfolds under explicit constraints. Inspired by the Stanford marshmallow experiment, we introduce a compact, multi-agent microbenchmark that reframes the delay of gratification as a discrete-time survival task. ReAct-style agents operate at minute‑level granularity with an internal "raise_a_question" tool, subject to a per-step budget. We factorially manipulate social visibility (broadcast vs. isolated), persona prompts (hedonic drive, age), and metacognitive policy (mandatory vs. optional self-questioning). From complete step-level traces, we estimate Kaplan–Meier (KM) survival and discrete‑time hazards, enabling transparent inspection of social influence and tool-use dynamics. We extend the study to 8 model families (open- and closed-weight), totaling 84,540 trajectories across 512 cells, with $\approx$100\% valid runs. Aggregate behavior exhibits a sharp early impulse (initial eat 0.062) followed by a long low‑hazard tail; completion is 0.824, with median time‑to‑eat $\approx$17 and restricted mean survival time (RMST) $\approx$16.47. In pooled hazards, mandatory self‑questioning increases per‑minute risk ($\beta \approx 0.093$; odds ratio (OR) $\approx$1.10), while persona factors strongly modulate hazard (vs. crave: like OR $\approx$0.45, neutral $\approx$0.26, none $\approx$0.24; vs. adult: child $\approx$8.65, senior $\approx$5.60). The broadcast vs. isolated main effect is near zero on average ($\beta \approx -0.009$; OR $\approx$0.99), but we uncover three hazard-shape regimes (near-flat, early-spike, and bi-modal) that vary by model family and mediate when social exposure matters.
Ablations that remove hedonic and/or age instructions flatten hazards and raise completion toward 1.0. We release code, prompts, logs, and analysis artifacts to facilitate replication and future work on causal social exposure, networked interaction, and other long‑horizon agent tasks.
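
As a reading aid for the statistics named in the abstract, the following is a minimal Python sketch of how per-minute hazards, a Kaplan–Meier curve, RMST, and odds ratios relate to one another. It is not the released analysis code. The trajectory format `(time, ate)`, the function names, the toy data, and the 3-minute horizon are all illustrative assumptions. The RMST convention used here (area under the step curve with S(0) = 1) and the logit link behind OR = exp(beta) are likewise assumed rather than taken from the paper.

```python
import math

def discrete_hazard(events, horizon):
    """Per-minute hazard: P(eat at minute t | still waiting at start of t).

    events: list of (time, ate) pairs (hypothetical format). `time` is the
    1-based minute at which the agent ate, or the last minute observed if
    it never ate (ate=False, i.e. censored).
    """
    hazards = []
    for t in range(1, horizon + 1):
        at_risk = sum(1 for time, _ in events if time >= t)
        ate_now = sum(1 for time, ate in events if ate and time == t)
        hazards.append(ate_now / at_risk if at_risk else 0.0)
    return hazards

def kaplan_meier(hazards):
    """Survival S(t) as the running product of (1 - hazard) up to minute t."""
    surv, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        surv.append(s)
    return surv

def rmst(surv):
    """Area under the step survival curve on [0, horizon], taking S(0) = 1."""
    return 1.0 + sum(surv[:-1])

def odds_ratio(beta):
    """Odds ratio implied by a coefficient on the log-odds (logit) scale."""
    return math.exp(beta)

# Toy data: two agents eat (minutes 1 and 2), two wait out a 3-minute horizon.
toy = [(1, True), (2, True), (3, False), (3, False)]
hz = discrete_hazard(toy, horizon=3)
km = kaplan_meier(hz)
print(hz, km, rmst(km))
print(round(odds_ratio(0.093), 2), round(odds_ratio(-0.009), 2))
```

Under the logit assumption, exponentiating the two coefficients quoted in the abstract gives values that round to the reported odds ratios of 1.10 and 0.99.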
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 24039