MENTOR: A Reinforcement Learning Framework for Distilling Tool Use in Small Models via Teacher-Optimized Rewards

ACL ARR 2026 January Submission 1319 Authors

29 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Knowledge Distillation, Reinforcement Learning, Tool-Calling, Agent, Small Language Model
Abstract: Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for practical deployment. The predominant approach, supervised fine-tuning (SFT), generalizes poorly because it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, standard RL methods with a simple outcome-based reward fail to guide SLMs effectively, leaving them to explore inefficiently and settle on suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to compensate for insufficient process guidance, it uses a teacher's reference trajectory to construct a composite teacher-guided reward that provides fine-grained supervision. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard RL baselines.
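To make the abstract's core idea concrete, below is a minimal sketch of what a composite teacher-guided reward could look like. The paper's actual reward formulation is not given on this page, so every name here (`composite_reward`, `step_overlap`, the mixing weight `alpha`, and the string-match process signal) is a hypothetical illustration of blending a sparse outcome reward with dense guidance from a teacher's reference trajectory, not MENTOR's published method.

```python
# Illustrative sketch only: the reward formula, weighting, and process-matching
# signal below are assumptions based on the abstract, not the paper's method.

def step_overlap(student_steps: list[str], teacher_steps: list[str]) -> float:
    """Fraction of the teacher's tool-call steps that also appear in the
    student's trajectory -- a crude stand-in for whatever fine-grained
    process signal is derived from the teacher's reference trajectory."""
    if not teacher_steps:
        return 0.0
    student_set = set(student_steps)
    matched = sum(1 for step in teacher_steps if step in student_set)
    return matched / len(teacher_steps)

def composite_reward(task_solved: bool,
                     student_steps: list[str],
                     teacher_steps: list[str],
                     alpha: float = 0.5) -> float:
    """Blend a sparse outcome reward with a dense teacher-guided process
    reward; `alpha` (hypothetical) trades off process imitation against
    the task's own success signal."""
    outcome = 1.0 if task_solved else 0.0
    process = step_overlap(student_steps, teacher_steps)
    return (1.0 - alpha) * outcome + alpha * process

# Example: the student solves the task but skips one of the teacher's tool calls.
r = composite_reward(
    True,
    student_steps=["search(query)", "calculator(2+2)"],
    teacher_steps=["search(query)", "read_page(url)", "calculator(2+2)"],
)
print(f"reward = {r:.2f}")  # 0.83 with alpha=0.5
```

Unlike a purely outcome-based reward, this kind of composite signal still rewards partially correct tool-use behavior, which is the abstract's stated remedy for inefficient exploration in SLMs.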
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Machine Learning for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 1319