Keywords: Benchmark, Logical Reasoning, LLMs, Logic QA, Evaluation
Abstract: Large Language Models (LLMs) are increasingly used in conversational applications that require interactive reasoning, such as tutoring systems and legal assistants. While these models perform well on static QA tasks, it remains unclear whether they can consistently revise their beliefs and reason logically when prior information is retracted or new information is introduced over multiple conversational turns. To address this, we introduce ReviseQA, a benchmark for belief revision in multi-turn logical reasoning. Each turn incrementally modifies the previous context by removing or adding facts and rules, requiring the model to reassess its conclusion. This dynamic setting reflects real-world reasoning, where agents must update their conclusions as information evolves. Our experiments show that current LLMs often fail to maintain logical consistency when updating beliefs, highlighting ReviseQA as a necessary benchmark for evaluating and improving multi-turn reasoning in LLMs.
Submission Number: 5