Keywords: Chain of thought, Data efficient fine-tuning, Preference Learning, Mathematical Reasoning, Small language models, Direct Preference Optimization
TL;DR: AWDPO is a self-distillation method that trains small language models by using their own few-shot outputs as teachers for zero-shot behavior, yielding accuracy close to supervised chain-of-thought fine-tuning with far lower data requirements
Abstract: Small language models (0.5B–3B) typically lack mathematical reasoning ability, often scoring near 0% on tasks they can solve with few-shot demonstrations. Existing approaches rely on thousands of supervised chain-of-thought (CoT) traces or complex multi-round self-distillation pipelines. We introduce Advantage-Weighted Direct Preference Optimization (AWDPO), a lightweight alignment method that bridges the gap between few-shot and zero-shot reasoning. Unlike prior approaches, AWDPO formulates training as a single-pass preference optimization objective that aligns a model’s zero-shot distribution with its own few-shot behavior. Our loss combines an advantage-weighted preference term with a dynamic MLE anchor, yielding stable training and implicit trust-region regularization.
On GSM8K, AWDPO transforms Qwen-2.5 base models (0.5B–3B) from 0% to 39%–77% accuracy, recovering over 90% of the performance of a supervised fine-tune that uses 7,473 CoT traces, a 1,750× reduction in CoT data. The method generalizes to SVAMP, ASDiv, and MATH500, where AWDPO recovers up to 90% of supervised CoT performance. Our analysis shows that AWDPO is equivalent to a Kullback-Leibler (KL)-constrained policy improvement step under projected DPO. These results demonstrate that small base models can substantially improve their mathematical reasoning ability from minimal supervision, providing a principled and data-efficient alternative to supervised CoT and Reinforcement Learning (RL)-based methods.
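To make the stated objective concrete, below is a minimal sketch of a loss combining an advantage-weighted preference term with an MLE anchor on the chosen traces. The exact AWDPO formulation is not specified in this abstract, so the weighting scheme, the anchor term, and all function and argument names here (e.g., `awdpo_loss`, `advantages`, `anchor_weight`) are illustrative assumptions layered on top of the standard DPO objective.

```python
import torch
import torch.nn.functional as F

def awdpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               advantages, beta=0.1, anchor_weight=1.0):
    """Hypothetical advantage-weighted DPO loss with an MLE anchor.

    Assumes per-pair sequence log-probabilities under the policy and a
    frozen reference model, plus a per-pair advantage estimate; the real
    AWDPO objective may differ in how it weights pairs and anchors MLE.
    """
    # Standard DPO logits: policy log-ratio minus reference log-ratio
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios

    # Preference term, weighted per pair by the (assumed) advantage estimate
    preference = -F.logsigmoid(beta * logits) * advantages

    # MLE anchor: negative log-likelihood of the chosen (few-shot-derived) traces
    mle_anchor = -policy_chosen_logps

    return (preference + anchor_weight * mle_anchor).mean()
```

In this reading, the preference term pushes zero-shot behavior toward the model's own few-shot outputs, while the anchor acts as the trust-region-like regularizer mentioned in the abstract.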
Supplementary Material: zip
Primary Area: generative models
Submission Number: 14202