Abstract: Multi-Hop Question Answering (MHQA) is crucial for evaluating a model's capability to integrate information from diverse sources. However, creating extensive, high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first automatic framework for synthesizing authentic multi-hop questions from unstructured text corpora without human intervention. HopWeaver synthesizes two types of multi-hop questions (bridge and comparison) using an innovative approach that identifies complementary documents across corpora. Its coherent pipeline constructs authentic reasoning paths that integrate information from multiple documents, ensuring that the synthesized questions necessitate genuine multi-hop reasoning. We further present a comprehensive system for evaluating synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions are of comparable or superior quality to human-annotated datasets, at a lower cost. Our approach is valuable for developing MHQA datasets in specialized domains where annotated resources are scarce.
Paper Type: Long
Research Area: Generation
Research Area Keywords: retrieval-augmented generation, interactive and collaborative generation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=RYuFmE130I
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request a new AC and reviewers. The previous review cycle revealed a fundamental misalignment regarding the paper's scope. One reviewer's feedback focused on experiments beyond our stated contributions, while another key concern remained unaddressed despite our detailed rebuttal. The AC's meta-review largely repeated these initial views, assigning an aggregate score without addressing significant parts of our rebuttal. Given the limited feedback and potentially biased perspectives from the previous cycle, we believe a fresh evaluation is necessary for a fair assessment.
Software: zip
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Yes, we cite the creators of all artifacts used. Citations for baseline datasets and models are provided in Section 5 and Appendix G, with full details in the References section.
B2 Discuss The License For Artifacts: No
B2 Elaboration: No, we did not discuss licenses as the artifacts used are standard academic benchmarks and models intended for research purposes.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Our use of baseline datasets for benchmarking, as shown in Section 5.1, is consistent with their intended use. The intended purpose of our created artifact, HopWeaver, is detailed in the Abstract and Introduction.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: No, this was not discussed as our work relies on the public English Wikipedia corpus (as stated in Appendix G), which primarily contains information about public figures and entities.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: The paper itself serves as documentation for our framework: Section 3 details the design, Section 2 defines the question types, and Appendix G specifies the English Wikipedia data source.
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 5, Appendix C and G.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix F and G.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5, Appendix B and E.
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix G.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: No
D1 Elaboration: The human validation was a small-scale study conducted with three Master's students in Computer Science who served as expert evaluators. The instructions consisted of a direct request to perform pairwise comparisons based on the detailed evaluation criteria already presented in the paper's appendix (Appendix E of the provided PDF). Given the evaluators' expertise and the straightforward nature of the task, a separate, lengthy instruction document was not created.
D2 Recruitment And Payment: No
D2 Elaboration: The three human evaluators were Master's students from our research group who participated as part of their academic research activities. They were not recruited through a formal process or crowdsourcing platform, and no monetary payment was provided for this specific validation task.
D3 Data Consent: No
D3 Elaboration: The participants were student co-authors and collaborators on this research project. They provided verbal consent to participate in the validation study. The purpose and use of their evaluation data within this paper were fully understood and agreed upon as part of the collaborative research process.
D4 Ethics Review Board Approval: No
D4 Elaboration: Formal ethics review board approval was not sought for this part of the study. The protocol involved a small number of expert evaluators (student collaborators) performing a low-risk, non-sensitive task of evaluating text quality. This type of internal validation activity generally does not require formal ethics board approval at our institution.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Yes, in Appendix B.2, we specify that the evaluators were "Three Master's students in Computer Science."
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 206