Keywords: AI Agent, LLMs, Preference Alignment, Evaluation
Abstract: Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating this process is crucial yet challenging: scheduling logistics drain hours, and human delegation often fails at scale, raising the question of whether large language model (LLM) agents can reliably learn and apply user preferences to manage time. To enable systematic study, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. Conflicts are presented sequentially and agents receive feedback after each round, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly, with high error rates; e.g., Qwen-3-30B-Think incurs an average error rate of 0.35. To address this gap, we propose PEARL, a reinforcement-learning framework that augments a language agent with an external memory module and an optimized round-wise reward design, enabling the agent to progressively infer and adapt to user preferences on the fly. Experiments on CalConflictBench show that PEARL achieves an error reduction rate of 0.76 and a 55% improvement in average error rate over the strongest baseline.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Autonomous agents, LLM agents, agent evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1689