Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Linqing Chen; Hanmeng Zhong; Wentao Wu; weilei wang

Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Linqing Chen, Hanmeng Zhong, Wentao Wu, weilei wang

17 Sept 2025 (modified: 26 Sept 2025)ICLR 2026 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Data Synthesis, LLM, Multi-Hop Question Generation

Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding—essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving—three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains)—that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine) It yields consistent 18.3%–25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.

Primary Area: foundation or frontier models, including LLMs

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 9069

Loading