Keywords: Large Language Models (LLMs), Autonomous Agents, Tool-Use
Abstract: Compact open-source language models lag behind their larger counterparts in agentic tool-use reliability, yet the standard remedies face fundamental obstacles: supervised fine-tuning (SFT) suffers from exposure bias, while reinforcement learning is hampered by sparse credit assignment over long tool-interaction trajectories. We introduce Skill-CDPO, a progressive framework that first acquires tool-use skills at inference time through static tool analysis and dynamic strategy refinement, then distills the resulting error-correction signals into parameter updates via Critical Step DPO (CDPO). CDPO identifies the specific trajectory steps where model capability is the bottleneck, using rollout divergence between a local policy and an expert model, and constructs distributional preference pairs from all cross-group rollouts at those steps, weighted by both step-level criticality and pair-level score gaps. This provides dense, fine-grained supervision without requiring a process reward model. We evaluate Skill-CDPO on three medical agent benchmarks (PubMed Search, a new PubMed-based deep research benchmark we contribute; CureBench; and MedBrowseComp) using an 8B-parameter deep research model. Skill-CDPO substantially outperforms SFT and trajectory-level DPO baselines and achieves competitive or superior performance compared to GPT-5.2 on retrieval-intensive tasks. Our code and data are available at https://github.com/Adam135792468/CDPO
Paper Type: Long
Research Area: LLM agents
Research Area Keywords: LLM agents, Tool-Use, Direct Preference Optimization, Reinforcement Learning, Information Retrieval
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1557
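Illustrative sketch: the abstract describes CDPO as building preference pairs from all cross-group rollouts at a critical step and weighting them by step-level criticality and pair-level score gaps. The Python below is one possible reading of that description, not the authors' implementation. The `Rollout` fields, the criticality definition (expert mean score minus local mean score, clipped at zero), the product weighting, and `beta` are all assumptions introduced here for illustration; the abstract does not specify them.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Rollout:
    """Hypothetical record for one rollout continued from a given step."""
    score: float     # task score reached by this rollout
    logp: float      # log-probability under the policy being trained
    ref_logp: float  # log-probability under the frozen reference policy


def step_criticality(local, expert):
    # ASSUMPTION: criticality is the gap between the expert's and the
    # local policy's mean rollout score at this step, clipped at zero.
    mean = lambda group: sum(r.score for r in group) / len(group)
    return max(0.0, mean(expert) - mean(local))


def cross_group_pairs(local, expert):
    # Pair every expert rollout with every local rollout; the higher-scoring
    # one is "chosen". Ties carry no preference signal and are skipped.
    pairs = []
    for a, b in product(expert, local):
        if a.score == b.score:
            continue
        chosen, rejected = (a, b) if a.score > b.score else (b, a)
        pairs.append((chosen, rejected, chosen.score - rejected.score))
    return pairs


def weighted_dpo_loss(local, expert, beta=0.1):
    # Standard DPO pair loss, -log(sigmoid(margin)), scaled by
    # ASSUMED weight = step criticality * pair score gap.
    crit = step_criticality(local, expert)
    pairs = cross_group_pairs(local, expert)
    if crit == 0.0 or not pairs:
        return 0.0
    total = 0.0
    for chosen, rejected, gap in pairs:
        margin = beta * ((chosen.logp - chosen.ref_logp)
                         - (rejected.logp - rejected.ref_logp))
        total += crit * gap * math.log1p(math.exp(-margin))
    return total / len(pairs)
```

Under these assumptions, a step where the expert and the local policy score equally on average contributes zero loss, so training signal concentrates on the steps where the two groups diverge.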