Delay-of-Gratification as a Multi-Agent Survival Micro-benchmark for Long-Horizon LLMs: Social Exposure, Personas, and Tool Use Budgets
Keywords: multi-turn LLM agents, long-horizon decision making, time-to-event evaluation, survival analysis, discrete-time hazard models, Kaplan–Meier, POMDPs, Markov decision processes (MDPs), multi-agent interaction, peer influence, tool-augmented reasoning (ReAct), metacognitive scaffolding, question budget, delayed gratification benchmark, auditable traces, factorial experimental design
TL;DR: Using survival analysis, we show that LLMs' multi-turn decision-making is systematically influenced by peer visibility, internal personas, and metacognitive scaffolding.
Abstract: Large language models (LLMs) are increasingly deployed as multi-turn agents that must sustain goals, use tools, and adapt to other agents over extended interactions. However, existing research lacks auditable, multi-turn, multi-factorial experiments
that quantify LLM behavior under explicit constraints, with time-resolved statistics that reveal how behavior unfolds over long horizons. To address this gap, we develop a multi-agent micro-benchmark inspired by the Stanford marshmallow experiment: ReAct agents operate minute-by-minute with a "raise a question" tool under a per-step budget, while we factorially manipulate social context (broadcast
vs. isolated), personas (age, hedonic drive), and metacognitive policy (must vs. may follow instructions). We analyze outcomes with Kaplan–Meier (KM) survival curves and discrete-time hazard models over a long risk horizon. Across 19,200 agent trajectories in 64 cells (horizon T = 19), 99.9% of runs were valid. Behavior exhibits a sharp early "eat" impulse (initial eat rate = 0.125) and a total eat rate of 0.241, with 75.9% of agents persisting to the end. The waiting profile is summarized by a median time-to-eat of approximately 14.8 minutes and an RMST of approximately 14.8 minutes. In a discrete-time hazard model, isolation reduces per-minute risk relative to broadcast (OR = 0.78, 95% CI [0.73, 0.83], p < .001), whereas a MUST-use self-questioning policy increases risk (OR = 1.42, [1.35, 1.50], p < .001). Hedonic and age personas strongly modulate risk: relative to crave, the like (OR = 0.28), none (OR = 0.19), and neutral (OR = 0.03) conditions reduce hazard; relative to adult, child increases hazard (OR = 66.3) and senior is elevated (OR = 7.55) (all p < .001). On average, agents ask approximately 7.12 questions and hit the per-step budget in approximately 6% of minutes; question-asking declines faster under broadcast than under isolation. Further ablation experiments demonstrate that removing hedonic drive and/or persona age systematically increases survival and completion, narrows the broadcast/isolated gap, and leaves the must vs. may ordering intact (must remains riskier); the combined ablation (no hedonic drive + no persona age) yields the highest completion (approaching 1.0) and distinct tool-usage dynamics, with higher initial questioning rates that gradually decrease over time. These results establish delay-of-gratification as a compact multi-turn interaction benchmark that captures social contagion and tool-use dynamics in LLM agents, offering a reproducible testbed and statistics for analyzing long-horizon, multi-agent behavior.
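The Kaplan–Meier survival estimate summarizing time-to-eat can be sketched in a few lines. This is a minimal illustrative implementation, not the authors' analysis code: the function name and data layout (per-agent event minutes with end-of-horizon censoring) are assumptions for the sake of the example.

```python
def kaplan_meier(times, events, horizon):
    """Discrete-time Kaplan-Meier survival estimate over minutes 1..horizon.

    times[i]:  minute at which agent i ate (event) or left the risk set.
    events[i]: 1 if agent i ate at times[i], 0 if censored (persisted).
    Returns S(t) for t = 1..horizon.
    """
    surv = []
    s = 1.0
    at_risk = len(times)
    for t in range(1, horizon + 1):
        # Number of "eat" events at minute t among agents still at risk.
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if at_risk > 0:
            s *= 1.0 - d / at_risk  # KM product-limit update
        surv.append(s)
        # Both events and censorings at minute t leave the risk set.
        at_risk -= sum(1 for ti in times if ti == t)
    return surv

# Toy cohort of 4 agents: two eat (minutes 1 and 3), two are censored.
print(kaplan_meier([1, 2, 3, 3], [1, 0, 1, 0], 3))  # → [0.75, 0.75, 0.375]
```

The per-minute hazard terms d / at_risk in the product are the same quantities a discrete-time hazard model (logistic regression on the agent-minute expansion) regresses on the social-context, persona, and policy factors.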
Submission Number: 233