Learning from Examples and Self-Exploration: A New Paradigm for Dynamic Fusion

ICLR 2026 Conference Submission 24956 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Supervised Fine-Tuning, Large Language Models, Reinforcement Learning, Mathematical Reasoning, Dynamic Fusion
Abstract: Alignment of Large Language Models with human preferences is dominated by two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), exemplified by methods such as Group Relative Policy Optimization (GRPO). The two paradigms face a trade-off: SFT excels at incorporating external knowledge but often fails to foster deep comprehension, whereas RL internalizes knowledge effectively but struggles to expand the model's knowledge frontier. To resolve this, we propose **LESE** (**L**earning from **E**xamples and **S**elf-**E**xploration), a framework that dynamically interpolates between SFT and RL. LESE introduces an instance-adaptive mechanism that assesses the model's real-time task proficiency and exploration diversity, and allocates a dynamic weight between the SFT and RL objectives for each training instance. This instance-level adaptation addresses the limitations of static mixing strategies. Empirically, LESE improves performance on mathematical reasoning benchmarks and enhances training stability while maintaining consistency with human-preferred outputs.
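The abstract describes the weighting mechanism only at a high level; the paper's exact proficiency and diversity measures are not given here. The snippet below is a minimal sketch of what an instance-adaptive SFT/RL interpolation can look like, assuming a group of sampled responses per prompt (as in GRPO), binary correctness rewards as the proficiency signal, and the spread of response log-probabilities as the diversity signal. The function names and the specific weighting formula are illustrative assumptions, not the authors' definition.

```python
import torch


def lese_instance_weight(rewards: torch.Tensor, logprobs: torch.Tensor) -> float:
    """Illustrative instance-adaptive weight on the SFT term (assumed form, not the paper's).

    rewards:  (G,) scalar rewards for G sampled responses to one prompt (e.g., 0/1 correctness).
    logprobs: (G,) mean token log-probabilities of those responses.

    Low proficiency (low mean reward) and low exploration diversity (responses
    clustered in probability) push the weight toward SFT; high proficiency and
    diverse exploration push it toward RL.
    """
    proficiency = rewards.mean().item()                    # in [0, 1] for binary rewards
    spread = logprobs.std(unbiased=False).item()
    diversity = spread / (spread + 1.0)                    # squash to [0, 1)
    alpha = 1.0 - 0.5 * (proficiency + diversity)          # weight on the SFT objective
    return min(max(alpha, 0.0), 1.0)


def combined_loss(sft_loss: torch.Tensor, rl_loss: torch.Tensor, alpha: float) -> torch.Tensor:
    """Per-instance interpolation: alpha * SFT cross-entropy + (1 - alpha) * RL (e.g., GRPO) loss."""
    return alpha * sft_loss + (1.0 - alpha) * rl_loss
```

In this sketch, an instance the model already solves reliably and explores diversely is trained mostly with the RL term, while a hard, low-diversity instance leans on supervised imitation of the reference solution; the paper's actual signals and schedule may differ.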
Primary Area: reinforcement learning
Submission Number: 24956