Abstract: Retrieval-Augmented Generation (RAG) models are increasingly employed in Multihop Question Answering (MHQA). However, we identify a critical limitation: existing methods exhibit suboptimal performance on comparison-type questions, with a performance decline notably greater than that observed for bridge-type questions. Empirical analysis reveals that existing methods consistently underperform relative to LLM-only baselines, particularly as the number of hops increases. Moreover, they require significantly more inference and retriever calls without delivering commensurate performance gains. To demonstrate this, we introduce the CompQA dataset, which includes questions with a higher number of hops, and evaluate on it alongside the MuSiQue benchmark. Finally, we discuss our findings, examine potential underlying causes, and highlight the limitations of RAG strategies in reasoning over complex question types.
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: retrieval-augmented generation, benchmark dataset, multihop question answering, comparison question
Contribution Types: Data resources
Languages Studied: English
Submission Number: 7141