Keywords: temporal reasoning, reinforcement learning, memory selection, multi-session dialogue
Abstract: Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. As dialogue histories grow in length and accumulate noise, existing long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce **Memory-T1**, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy: the dialogue history is first pruned into a candidate set with temporal and retriever filters, and an RL agent then selects the precise evidence. RL training is guided by a multi-level reward function optimizing (i) accuracy, (ii) evidence grounding, and (iii) temporal consistency. The temporal consistency reward provides a dense signal by evaluating alignment at both the session level (range proximity) and the utterance level (evidence density), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing new state-of-the-art performance among open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 remains robust up to 128k tokens, where baseline models collapse, demonstrating its effectiveness against noise in extensive dialogue histories.
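To make the multi-level reward structure concrete, the sketch below shows one way the three components could be combined. All names, weights, and scoring formulas here (e.g. `range_proximity`, `evidence_density`, the F1-style grounding term) are illustrative assumptions based only on the abstract, not the paper's exact formulation.

```python
# Hypothetical sketch of a multi-level reward combining (i) answer accuracy,
# (ii) evidence grounding, and (iii) temporal consistency. All component
# definitions and weights are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Selection:
    """Output of the RL memory-selection agent for one question."""
    answer: str
    utterance_ids: set[int]   # utterances chosen as evidence
    session_ids: set[int]     # sessions those utterances come from


@dataclass
class Reference:
    """Gold annotations for the same question."""
    answer: str
    utterance_ids: set[int]
    session_ids: set[int]


def grounding_f1(pred: set[int], gold: set[int]) -> float:
    """Evidence grounding: F1 overlap between selected and gold utterances."""
    tp = len(pred & gold)
    if tp == 0 or not pred or not gold:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)


def range_proximity(pred_sessions: set[int], gold_sessions: set[int]) -> float:
    """Session-level signal: how close the selected session range is to the
    gold session range (1.0 when the ranges coincide)."""
    if not pred_sessions or not gold_sessions:
        return 0.0
    gap = abs(min(pred_sessions) - min(gold_sessions)) \
        + abs(max(pred_sessions) - max(gold_sessions))
    return 1.0 / (1.0 + gap)


def evidence_density(pred_ids: set[int], gold_ids: set[int]) -> float:
    """Utterance-level signal: fraction of selected utterances that are gold
    evidence (penalizes padding the selection with irrelevant turns)."""
    if not pred_ids:
        return 0.0
    return len(pred_ids & gold_ids) / len(pred_ids)


def multi_level_reward(sel: Selection, ref: Reference,
                       w_acc: float = 1.0,
                       w_ground: float = 0.5,
                       w_time: float = 0.5) -> float:
    """Weighted sum of the three reward components (weights are assumed)."""
    r_acc = float(sel.answer.strip().lower() == ref.answer.strip().lower())
    r_ground = grounding_f1(sel.utterance_ids, ref.utterance_ids)
    r_time = 0.5 * (range_proximity(sel.session_ids, ref.session_ids)
                    + evidence_density(sel.utterance_ids, ref.utterance_ids))
    return w_acc * r_acc + w_ground * r_ground + w_time * r_time
```

Under these assumptions, the session-level and utterance-level terms keep the temporal reward non-zero even when the final answer is wrong, which is what makes the signal dense enough to guide the selection policy during RL training.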
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 5997