Learning from Synthetic Data Improves Multi-hop Reasoning

ICLR 2026 Conference Submission21366 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: multi-hop reasoning, large language models, reinforcement learning, synthetic data

TL;DR: RL fine-tuning LLMs on synthetic data improves real-world multi-hop reasoning by teaching knowledge composition skills

Abstract: Reinforcement Learning (RL) has been shown to significantly boost the reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often obtained through human-annotated datasets and LLM-as-verifier loops. Both of these data sources have considerable limitations: human-annotated datasets are small and expensive to curate, while LLM verifiers have high scoring latency and are costly to operate. In this work, we investigate the use of synthetic datasets in RL fine-tuning for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, even though the synthetic data contains only fictional knowledge. On stratifying model performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge, which we consider to be a fundamental and generalizable reasoning skill. Our work thus highlights the utility of synthetic reasoning datasets in improving LLM reasoning capabilities.
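
To make the idea of verifiable synthetic multi-hop data concrete, here is a minimal sketch, not the authors' pipeline. It assumes a toy world of invented people and towns, a hypothetical two-relation schema ("mentor" and "born in"), and an exact-match reward. All names, relations, and function names below are illustrative assumptions that do not come from the submission.

```python
import random

# Invented (fictional) entities: an assumption for illustration only.
PEOPLE = ["Zorvath", "Quilmera", "Brantik", "Olvessa"]
TOWNS = ["Nimbar", "Kest Hollow", "Ardune", "Vel Torrin"]


def build_facts(seed=0):
    """Create a small fictional knowledge base with two relations."""
    rng = random.Random(seed)
    towns = TOWNS[:]
    rng.shuffle(towns)
    hometown = dict(zip(PEOPLE, towns))
    # Arrange people in a shuffled cycle so nobody is their own mentor.
    order = PEOPLE[:]
    rng.shuffle(order)
    mentor = {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}
    return mentor, hometown


def make_two_hop_example(mentor, hometown, person):
    """Build a 2-hop question whose answer requires composing two facts."""
    context = [f"The mentor of {p} is {m}." for p, m in mentor.items()]
    context += [f"{p} was born in {t}." for p, t in hometown.items()]
    question = f"In which town was the mentor of {person} born?"
    answer = hometown[mentor[person]]  # hop 1: mentor, hop 2: hometown
    return {"context": " ".join(context), "question": question, "answer": answer}


def exact_match_reward(prediction, answer):
    """Rule-based reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


if __name__ == "__main__":
    mentor, hometown = build_facts(seed=0)
    example = make_two_hop_example(mentor, hometown, "Zorvath")
    print(example["question"])
    print("reward for the gold answer:", exact_match_reward(example["answer"], example["answer"]))
```

Because the answer is derived programmatically from the generated facts, the reward can be computed by a string comparison, with no human annotation and no LLM verifier. Whether the facts are supplied in the prompt or learned beforehand is a design choice this sketch does not settle.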

Primary Area: foundation or frontier models, including LLMs

Submission Number: 21366
