ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: Knowledge Distillation, Contrastive Learning, LLMs, Efficient Reasoning, SLMs, model compression, Chain-of-thought distillation, preference optimization
TL;DR: ORPO-Distill is a cross-architecture LLM distillation technique that frames distillation as a preference optimization problem over contrastive reasoning traces generated by the teacher and student LLMs.
Abstract: We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard chain-of-thought (CoT) distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for leveraging student-generated outputs that outperforms both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box knowledge distillation baselines.
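To make the objective concrete, below is a minimal sketch of how the standard ORPO odds-ratio loss could be applied to distillation, with the teacher's reasoning trace treated as the preferred response and a student-generated trace as the dispreferred one. This assumes the usual ORPO formulation (SFT term plus a weighted odds-ratio term); function and variable names are illustrative, not taken from the paper's released code.

```python
# Hypothetical sketch of an ORPO-style distillation loss (PyTorch).
# Teacher trace = preferred ("chosen"), student trace = dispreferred ("rejected").
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, labels, mask):
    """Mean per-token log-probability of `labels` under `logits`, ignoring masked positions."""
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, -1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1)

def orpo_distill_loss(chosen_logits, chosen_labels, chosen_mask,
                      rejected_logits, rejected_labels, rejected_mask,
                      lam=0.1):
    """SFT loss on the teacher trace plus a weighted odds-ratio penalty
    that pushes the student to prefer teacher traces over its own."""
    logp_w = sequence_log_prob(chosen_logits, chosen_labels, chosen_mask)        # teacher trace
    logp_l = sequence_log_prob(rejected_logits, rejected_labels, rejected_mask)  # student trace

    # log-odds of each trace: log(p / (1 - p)), computed stably in log space
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w).clamp(max=1 - 1e-6))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l).clamp(max=1 - 1e-6))

    or_loss = -F.logsigmoid(log_odds_w - log_odds_l).mean()  # odds-ratio preference term
    sft_loss = -logp_w.mean()                                 # NLL on the preferred (teacher) trace
    return sft_loss + lam * or_loss
```

In a mixed-policy setup, the dispreferred traces would be drawn partly from the student's own rollouts rather than a fixed off-policy pool; the exact mixing schedule is specific to the paper and not reproduced here.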
Submission Number: 163