Keywords: Neuro-Symbolic Framework, Multi-Agent, LLMs, Temporal Reasoning
Abstract: Despite the strong performance of large language models (LLMs) on temporal reasoning tasks, recent studies using the UnseenTimeQA benchmark indicate that this proficiency may stem largely from data contamination, specifically the memorization of test-set facts during pre-training. Notably, even advanced LLMs such as DeepSeek-V3 struggle on this benchmark. To address this limitation, we propose NSMATeR, a neuro-symbolic multi-agent framework that formalizes error-prone reasoning steps into executable code. By integrating this symbolic component, our approach effectively mitigates LLM failures such as event forgetting, rule misunderstanding, and hallucination. To ensure code correctness, we further introduce a case-verification mechanism that iteratively refines the code using a small set of annotated cases. Extensive experiments on UnseenTimeQA show that our framework significantly boosts accuracy, improving even DeepSeek-V3's performance by 27.89% in the most challenging scenarios.
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 10042