Abstract: Retrieval-Augmented Generation (RAG) models are increasingly employed in Multihop Question Answering (MHQA). However, we identify a critical limitation: existing methods exhibit suboptimal performance on comparison-type questions, with a performance decline notably greater than that observed for bridge-type questions. Empirical analysis reveals that existing methods consistently underperform relative to LLM-only baselines, particularly as the number of hops increases. Moreover, they require significantly more inference and retriever calls without delivering commensurate performance gains. To demonstrate this, we introduce the CompQA dataset, which includes questions with a higher number of hops, and evaluate on it alongside the MuSiQue benchmark. Finally, we discuss our findings, examine potential underlying causes, and highlight the limitations of RAG strategies in reasoning over complex question types.
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: retrieval-augmented generation, benchmark dataset, multihop question answering, comparison question
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7141