Abstract: Temporal question answering (TQA) remains a persistent challenge for large language models (LLMs), particularly in retrieval-augmented generation (RAG) settings where retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications such as clinical event ordering, policy tracking, and real-time decision-making, which require reliable temporal reasoning even under noisy or misleading context. To address this challenge, we introduce RASTeR: Robust, Agentic, and Structured Temporal Reasoning, an agentic prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of retrieved context, then constructs a structured temporal knowledge graph (TKG) to facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness, defined here as the model's ability to generate correct predictions despite suboptimal context. We further validate our approach through a ``needle-in-the-haystack'' study, in which relevant context is buried among irrelevant distractors. Even with forty distractors, RASTeR achieves 75% accuracy, compared to the runner-up model, which reaches only 62%.
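For orientation, a minimal sketch of the pipeline the abstract describes (context evaluation, TKG construction, filtering, then answering). All function names, prompt wording, and the `call_llm` stub are hypothetical placeholders, not the authors' implementation:

```python
# Hypothetical sketch of a RASTeR-style pipeline as described in the abstract.
# Prompts, names, and the `call_llm` stub are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class TemporalFact:
    subject: str
    relation: str
    obj: str
    start: str          # e.g. "2019-01"
    end: str | None = None


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; plug in any chat-completion client."""
    raise NotImplementedError("Provide an LLM backend here.")


def evaluate_context(question: str, passage: str) -> str:
    """Step 1: judge relevance and temporal coherence of a retrieved passage."""
    verdict = call_llm(
        f"Question: {question}\nPassage: {passage}\n"
        "Is this passage relevant and temporally consistent? "
        "Answer RELEVANT, OUTDATED, or IRRELEVANT."
    )
    return verdict.strip().upper()


def build_tkg(passages: list[str]) -> list[TemporalFact]:
    """Step 2: extract (subject, relation, object, time) tuples into a TKG."""
    facts: list[TemporalFact] = []
    for p in passages:
        raw = call_llm(
            f"Extract temporal facts from:\n{p}\n"
            "Return one 'subject | relation | object | start | end' per line."
        )
        for line in raw.splitlines():
            parts = [x.strip() for x in line.split("|")]
            if len(parts) == 5:
                facts.append(TemporalFact(*parts[:4], end=parts[4] or None))
    return facts


def answer(question: str, retrieved: list[str]) -> str:
    """Steps 3-4: keep only coherent context, then reason over the TKG."""
    kept = [p for p in retrieved if evaluate_context(question, p) == "RELEVANT"]
    tkg = build_tkg(kept)
    fact_block = "\n".join(
        f"{f.subject} {f.relation} {f.obj} ({f.start}-{f.end})" for f in tkg
    )
    return call_llm(
        f"Temporal facts:\n{fact_block}\n\nQuestion: {question}\n"
        "If the facts are insufficient, answer from parametric knowledge."
    )
```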
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: temporal question answering, retrieval-augmented generation, temporal robustness, temporal knowledge graph, structured reasoning, parametric knowledge, context evaluation, model robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=WaGrn3IF98
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The paper has been substantially revised since the original submission last year, and we would prefer a fresh set of reviewers and area chair.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: See Results section
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Yes, we describe all standard datasets in the Results section with their intended use.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We used open-access datasets, described in the Results section.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Yes, we provide basic stats for the datasets in the Appendix.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: We discuss the sizes of all models in the Method and Results sections.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: See Results and Appendix
C3 Descriptive Statistics: Yes
C3 Elaboration: See Appendix
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants only to check and correct writing mistakes.
Author Submission Checklist: yes
Submission Number: 1026