CM2: Reinforcement Learning via Checklist Reward for Multi-Turn Multi-Step Agentic Tool Use

ACL ARR 2026 January Submission7409 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Tool Use, LLM Agent, Reinforcement Learning, Reinforcement Learning with Checklist Reward, Multi-turn
Abstract: AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method assigns sparse rewards computed from dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from Qwen3-8B-Base and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by **8** points on $\tau^2$-Bench, by **10** points on BFCL-v4, and by **12** points on ToolSandbox. These results match or even outperform similarly sized open-source baselines. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards.
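To make the checklist-reward idea concrete, here is a minimal sketch of how fine-grained binary criteria might be aggregated into a single sparse scalar reward per trajectory. The `ChecklistItem` schema, field names, and `checklist_reward` function are illustrative assumptions, not the paper's actual implementation; the paper's judge, evidence-grounding format, and metadata are richer than shown here.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    """One binary criterion with evidence grounding (hypothetical schema)."""
    criterion: str  # e.g. "Agent asks for the order ID before acting"
    evidence: str   # trajectory span the judge cites for its decision
    passed: bool    # classification-style yes/no judgment

def checklist_reward(items: list[ChecklistItem]) -> float:
    """Sparse reward: one scalar per trajectory, computed from dense
    binary criteria as the mean pass rate."""
    if not items:
        return 0.0
    return sum(item.passed for item in items) / len(items)

# Example: three criteria for one turn, two satisfied.
items = [
    ChecklistItem("Asks for the order ID", "turn 1", True),
    ChecklistItem("Calls the refund tool with correct args", "turn 2", True),
    ChecklistItem("Confirms the refund amount to the user", "turn 3", False),
]
print(round(checklist_reward(items), 3))  # 0.667
```

The point of the decomposition is that each judge decision is a narrow yes/no classification rather than an open-ended quality rating, which the abstract argues is more stable; the aggregation into a single scalar keeps the reward signal sparse for RL.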
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM agents, tool use, function calling, reinforcement learning in agents
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7409