Keywords: Dense Retrieval, Query Generation, Data Augmentation, Multi-hop Reasoning, Query Diversity
Abstract: Prior work reports conflicting results on the value of query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics that quantify the impact of diversity, making the problem measurable. In experiments on four benchmark types (31 datasets), we find that query diversity especially benefits multi-hop retrieval. Further analysis of multi-hop data reveals that the benefit of diversity correlates strongly with query complexity, measured by content words (CW) (r≥0.95, p<0.05 in 12 of 14 conditions). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines the optimal level of diversity. CDP provides actionable thresholds (CW>10: use diversity; CW<7: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, which achieves state-of-the-art performance.
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Dense Retrieval, Query Generation, Multi-hop Question Answering, Data Augmentation, Synthetic Data, Contrastive Learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Data analysis, Theory
Languages Studied: English
Submission Number: 10489
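The CDP thresholds stated in the abstract (CW>10: use diversity; CW<7: avoid it) can be read as a simple decision rule. The sketch below is illustrative only and is not the authors' implementation. The function names, the small stopword list, and the regex tokenizer are assumptions, since the abstract does not say how content words are identified. The abstract also gives no recommendation for CW between 7 and 10, so the sketch returns "unspecified" for that range.

```python
import re

# Hypothetical, deliberately small stopword list. The abstract does not define
# how content words are identified, so this list is an assumption.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "and", "or", "is", "are", "was", "were", "be", "been", "do", "does",
    "did", "who", "what", "which", "when", "where", "how", "that", "this",
}


def count_content_words(query: str) -> int:
    """Approximate CW as the number of tokens that are not stopwords."""
    tokens = re.findall(r"[A-Za-z0-9']+", query.lower())
    return sum(1 for token in tokens if token not in STOPWORDS)


def cdp_recommendation(cw: int) -> str:
    """Apply the two thresholds given in the abstract to a CW value."""
    if cw > 10:
        return "use diversity"
    if cw < 7:
        return "avoid diversity"
    # The abstract states no guidance for 7 <= CW <= 10.
    return "unspecified"


if __name__ == "__main__":
    query = "Who is the director of the film?"
    cw = count_content_words(query)
    print(cw, cdp_recommendation(cw))
```

A different definition of content words (for example, one based on part-of-speech tags) would change the CW values and therefore which side of the thresholds a query falls on.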