Keywords: Benchmarking, Proactive Agent
Abstract: Proactive agents are expected to anticipate user needs and provide autonomous assistance by perceiving environmental context, without explicit instructions. A fundamental capability of such agents is to identify and track users’ upcoming events, enabling continuous and event-specific assistance. For example, by recording the time and location of a planned hike, an agent can deliver weather reminders in advance or provide navigation support before departure. However, existing work on proactive agents largely overlooks event-centric assistance, and the open-ended nature of proactive assistance poses challenges for reliable evaluation.
To bridge these gaps, we introduce \textsc{ProEvent}, the first event-centric benchmark designed to assess an agent’s ability to proactively maintain a user’s timetable based on ongoing instant-messaging chats. \textsc{ProEvent} provides realistic chats that capture dynamic interactions among users, concurrent chat threads, and real-world noise, and it evaluates proactive agents along three dimensions: response timing, single-step response correctness, and multi-step response correctness. Experiments on eight LLMs and pipelines reveal that current agents frequently overact, offering assistance when none is needed, and struggle with event cancellation. Notably, even the state-of-the-art GPT-5.1 provides redundant assistance in $30\%$ of cases and achieves only $26.7\%$ recall in event-cancellation scenarios. Further qualitative analysis reveals fundamental limitations of current LLMs as proactive agents, particularly in detecting implicit events and reasoning from the user’s first-person perspective.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, agent evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7549