Skill-CDPO: Evolving Agent Tool-Use via Critical Step Preference Optimization

ACL ARR 2026 March Submission 1557 Authors

17 Mar 2026 (modified: 07 Jun 2026), ACL ARR 2026 March Submission, CC BY 4.0

Keywords: Large Language Models (LLMs), Autonomous Agents, Tool-Use

Abstract: Compact open-source language models lag behind their larger counterparts in agentic tool-use reliability, yet standard remedies face fundamental obstacles: supervised fine-tuning suffers from exposure bias, while reinforcement learning is hampered by sparse credit assignment over long tool-interaction trajectories. We introduce Skill-CDPO, a progressive framework that first acquires tool-use skills at inference time through static tool analysis and dynamic strategy refinement, then distills the resulting error-correction signals into parameter updates via Critical Step DPO (CDPO). CDPO identifies the specific trajectory steps where model capability is the bottleneck—through rollout divergence between a local policy and an expert model—and constructs distributional preference pairs from all cross-group rollouts at those steps, weighted by both step-level criticality and pair-level score gaps. This provides dense, fine-grained supervision without requiring a process reward model. We evaluate Skill-CDPO on three medical agent benchmarks—PubMed Search (a new PubMed-based deep research benchmark we contribute), CureBench, and MedBrowseComp—using an 8B-parameter deep research model. Skill-CDPO substantially outperforms SFT and trajectory-level DPO baselines and achieves competitive or superior performance compared to GPT-5.2 on retrieval-intensive tasks. Our code and data are available at https://github.com/Adam135792468/CDPO
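The pair-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the criticality measure (mean expert score minus mean local-policy score, floored at zero), and the multiplicative weighting are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import product
from statistics import mean


def step_criticality(expert_scores, local_scores):
    """Hypothetical criticality: how much expert rollouts outscore
    local-policy rollouts launched from the same trajectory step."""
    return max(0.0, mean(expert_scores) - mean(local_scores))


def build_weighted_pairs(expert_rollouts, local_rollouts):
    """Build preference pairs from all cross-group rollouts at one step.

    Each rollout is a (text, score) tuple. The weighting rule
    (criticality * score gap) is an assumption, not taken from the paper.
    """
    crit = step_criticality(
        [score for _, score in expert_rollouts],
        [score for _, score in local_rollouts],
    )
    pairs = []
    for (e_text, e_score), (l_text, l_score) in product(expert_rollouts, local_rollouts):
        gap = e_score - l_score
        if gap == 0:
            continue  # tied rollouts carry no preference signal
        # The higher-scoring rollout is preferred, whichever group it came from.
        chosen, rejected = (e_text, l_text) if gap > 0 else (l_text, e_text)
        pairs.append({"chosen": chosen, "rejected": rejected, "weight": crit * abs(gap)})
    return pairs


expert = [("expert_a", 1.0), ("expert_b", 0.5)]
local = [("local_c", 0.0), ("local_d", 0.5)]
pairs = build_weighted_pairs(expert, local)
```

Under these assumptions, a step where both groups score alike gets zero criticality and contributes no weighted signal, while steps where only the expert succeeds dominate the preference data.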

Paper Type: Long

Research Area: LLM agents

Research Area Keywords: LLM agents, Tool-Use, Direct Preference Optimization, Reinforcement Learning, Information Retrieval

Contribution Types: NLP engineering experiment, Data resources

Languages Studied: English

Submission Number: 1557
