Proxy-KD: A Proxy-Based Distillation Framework for Black-Box Large Language Models

ACL ARR 2025 May Submission 2790 Authors

19 May 2025 (modified: 04 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Given the exceptional performance of proprietary large language models (LLMs) like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers. While leveraging the high-quality outputs of these teachers is advantageous, the inaccessibility of their internal states often limits effective knowledge transfer. To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models. The white-box proxy is first aligned with the black-box teacher through supervised fine-tuning and preference optimization. Subsequently, the student model is trained using the black-box teacher's hard labels and weighted soft logits from the aligned proxy, where the weights are based on the proxy's alignment quality. Experimental results on multiple benchmarks demonstrate that Proxy-KD significantly outperforms existing white-box and black-box knowledge distillation methods. This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.
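The abstract describes a student objective that combines the black-box teacher's hard labels with soft logits from the aligned proxy, weighted by the proxy's alignment quality. The sketch below is an illustrative assumption of how such a loss could look in PyTorch; the function name proxy_kd_loss, the alignment_weight parameter, the temperature smoothing, and the exact combination of cross-entropy and KL terms are not taken from the paper.

import torch
import torch.nn.functional as F

def proxy_kd_loss(student_logits: torch.Tensor,
                  proxy_logits: torch.Tensor,
                  teacher_hard_labels: torch.Tensor,
                  alignment_weight: float,
                  temperature: float = 2.0) -> torch.Tensor:
    """Illustrative combination of teacher hard-label CE and proxy soft-logit KL."""
    # Cross-entropy against tokens produced by the black-box teacher (hard labels).
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         teacher_hard_labels.view(-1),
                         ignore_index=-100)
    # KL divergence toward the white-box proxy's temperature-smoothed distribution.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(proxy_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Scale the soft-logit term by a weight reflecting the proxy's alignment quality.
    return ce + alignment_weight * kl

# Example usage with random tensors (batch of 2, sequence length 4, vocabulary of 10).
if __name__ == "__main__":
    student = torch.randn(2, 4, 10, requires_grad=True)
    proxy = torch.randn(2, 4, 10)
    labels = torch.randint(0, 10, (2, 4))
    loss = proxy_kd_loss(student, proxy, labels, alignment_weight=0.5)
    loss.backward()
    print(float(loss))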
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Distillation
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 2790