Proxy-KD: A Proxy-Based Distillation Framework for Black-Box Large Language Models

ACL ARR 2025 May Submission 2790 Authors

19 May 2025 (modified: 04 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Given the exceptional performance of proprietary large language models (LLMs) like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers. While leveraging the high-quality outputs of these teachers is advantageous, the inaccessibility of their internal states often limits effective knowledge transfer. To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models. The white-box proxy is first aligned with the black-box teacher through supervised fine-tuning and preference optimization. Subsequently, the student model is trained using the black-box teacher's hard labels and weighted soft logits from the aligned proxy, where the weights are based on the proxy's alignment quality. Experimental results on multiple benchmarks demonstrate that Proxy-KD significantly outperforms existing white-box and black-box knowledge distillation methods. This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.
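The abstract describes a student objective that combines the black-box teacher's hard labels with soft logits from the aligned proxy, weighted by the proxy's alignment quality. The sketch below is an illustrative assumption of how such a loss could look in PyTorch; the function name proxy_kd_loss, the alignment_weight parameter, the temperature smoothing, and the exact combination of cross-entropy and KL terms are not taken from the paper.

import torch
import torch.nn.functional as F

def proxy_kd_loss(student_logits: torch.Tensor,
                  proxy_logits: torch.Tensor,
                  teacher_hard_labels: torch.Tensor,
                  alignment_weight: float,
                  temperature: float = 2.0) -> torch.Tensor:
    """Illustrative combination of teacher hard-label CE and proxy soft-logit KL."""
    # Cross-entropy against tokens produced by the black-box teacher (hard labels).
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         teacher_hard_labels.view(-1),
                         ignore_index=-100)
    # KL divergence toward the white-box proxy's temperature-smoothed distribution.
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(proxy_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Scale the soft-logit term by a weight reflecting the proxy's alignment quality.
    return ce + alignment_weight * kl

# Example usage with random tensors (batch of 2, sequence length 4, vocabulary of 10).
if __name__ == "__main__":
    student = torch.randn(2, 4, 10, requires_grad=True)
    proxy = torch.randn(2, 4, 10)
    labels = torch.randint(0, 10, (2, 4))
    loss = proxy_kd_loss(student, proxy, labels, alignment_weight=0.5)
    loss.backward()
    print(float(loss))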
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Distillation
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 2790