Under review as a conference paper at ICLR 2024

KNOWLEDGE DISTILLATION FOR CLOSED-SOURCE LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

ABSTRACT

Closed-source language models such as GPT-4 have achieved remarkable performance. Recently, many studies have focused on enhancing the capabilities of smaller models through knowledge distillation (KD) from these closed-source language models. However, because the output distribution of a closed-source language model cannot be accessed directly, KD can currently only be performed using one-hot labels, which hinders its effectiveness. To address this limitation, we propose a Bayesian estimation-based knowledge distillation method. Specifically, our method comprises prior estimation and posterior estimation. The prior estimation obtains a prior distribution by leveraging the corpus generated by the closed-source language model. The posterior estimation updates the prior distribution to obtain a posterior distribution, based on continued sampling results. We then utilize the prior and posterior distributions for distillation. Experimental results show that, in the context of KD for closed-source language models, our method outperforms current KD methods that directly fine-tune on one-hot labels.

1 INTRODUCTION

While closed-source large language models (LLMs) such as GPT-3.5 and GPT-4 have shown great superiority over open-source counterparts like LLaMA (Touvron et al., 2023) and Falcon (Penedo et al., 2023), they can only be accessed via API calls and allow limited customization and transparency. One way to address this problem is to transfer their capabilities to open-source language models, typically smaller in size, by prompting closed-source LLMs to generate samples that reflect their capabilities and fine-tuning open-source language models on the generated one-hot labels.

Knowledge distillation (KD) (Hinton et al., 2015) is an effective technique that aims to obtain a small but strong student model by distilling knowledge from a large teacher model. The objective function in Hinton et al. (2015) involves calculating the Kullback-Leibler (KL) divergence between the output distributions of the teacher model and the student model. By minimizing the KL divergence, the student model is able to mimic the behavior and learn the intrinsic knowledge of the teacher model.

However, many current methods (Hsieh et al., 2023; Jiang et al., 2023; Ho et al., 2022) that perform KD on closed-source LLMs involve solely fine-tuning the student model on one-hot labels generated by the teacher model, as illustrated in Figure 1. Compared with using the output distribution (soft labels) to compute the KL divergence, fine-tuning solely on one-hot labels constrains the transfer of deeper and more fundamental knowledge from the teacher model to the student model. This represents a limitation in current KD methods for closed-source LLMs.

To address this limitation, we propose Bayesian estimation-based knowledge distillation to perform effective knowledge distillation on closed-source language models (LMs). Our method first estimates the inaccessible output distribution (referred to as the latent distribution) of the closed-source LM, and then performs KD on the estimated distribution. Our approach comprises two main components: prior estimation and posterior estimation. (1) The prior estimation is designed to estimate the latent distribution by leveraging the corpus generated by the closed-source LM. Our hypothesis is that within the generated corpus, there are underlying patterns that characterize the latent distribution. Through prior estimation, a prior distribution that approximates the latent distribution can be obtained. (2) By continuously sampling from a proxy of the closed-source LM, posterior estimation derives a posterior distribution to approximate the latent distribution.
Then we perform KD on these estimated distributions. Using the estimated distributions enables the student model to tap into more profound and essential aspects of the closed-source teacher model's knowledge during the distillation process. It fosters a more comprehensive and insightful learning experience compared to the previous closed-source KD paradigm relying solely on one-hot labels.

[Figure 1: diagram with panels (a)-(d). Recoverable labels: Hard Label, Soft Label, Estimated Soft Label, Closed-Source Model, Open-Source Model, Proxy Model. Example prompts: "Countries in Europe include __" (candidate tokens: France, German, Moon, Japan, Ocean) and "Artificial intelligence is __" (candidate tokens: powerful, writing, tropic, evolving, transforming).]

Figure 1: (a) In knowledge distillation of closed-source models, only one-hot labels (hard labels) can be obtained. (b) In knowledge distillation of open-source models, output distributions (soft labels) can be obtained. (c) Our method obtains estimated soft labels from closed-source models by leveraging a proxy model. (d) Compared to hard labels, soft labels allow students to learn more profound knowledge by guiding them to learn from multiple valid targets.
We conduct extensive experiments with LLaMA (Touvron et al., 2023) across various representative benchmarks, such as BBH (Suzgun et al., 2022), AGIEval (Zhong et al., 2023), ARC (Clark et al., 2018), MMLU (Hendrycks et al., 2021), CSQA (Talmor et al., 2019) and GSM8K (Cobbe et al., 2021). In the context of KD for closed-source LMs, the empirical results demonstrate the effectiveness of our method over directly fine-tuning on one-hot labels. For example, with LLaMA-7B, our method improves the average accuracy across the six benchmarks from 36.31% to 39.43% over methods that solely fine-tune on one-hot labels. These findings support the effectiveness of the proposed method.

2 RELATED WORK

The concept of knowledge distillation (KD) was originally introduced by Hinton et al. (2015) with the aim of transferring the knowledge from a teacher model to a smaller student model. This knowledge transfer is achieved by minimizing the KL divergence between the output distributions of the teacher model and the student model. Current KD methods can be categorized into two primary types: knowledge distillation for open-source models and knowledge distillation for closed-source models.
2.1 OPEN-SOURCE KNOWLEDGE DISTILLATION

Knowledge distillation was first applied to distilling open-source models. For instance, Sanh et al. (2019) apply KD to the pre-training process of BERT (Devlin et al., 2019), yielding smaller models with minor performance drops. Jiao et al. (2020) allow the student model's intermediate features to mimic the teacher model's intermediate features by minimizing the Mean Squared Error (MSE) loss. Other approaches, such as the one proposed by Gu et al. (2023), focus on distilling open-source generative language models like LLaMA. Additionally, Park et al. (2019) leverage sample-wise relative information within the teacher model to perform knowledge distillation on ResNet (He et al., 2016). Mirzadeh et al. (2019) introduce an intermediate network to bridge the parameter size gap between the CNN teacher model and the CNN student model. However, in all these methods, the student model needs access to the internal features or parameters of the teacher model, which is not feasible when distilling closed-source LMs.
2.2 CLOSED-SOURCE KNOWLEDGE DISTILLATION

Given the outstanding performance of current SOTA closed-source LLMs like GPT-3.5 and GPT-4, many studies have shifted their focus towards transferring knowledge from these closed-source LLMs into smaller models. Some approaches, such as Hsieh et al. (2023), Ho et al. (2022), and Mukherjee et al. (2023), utilize rationales generated by closed-source LLMs as training data.
They then perform fine-tuning on these generated rationales to transfer the teacher model's reasoning abilities into the student model.

Notation                            Description
$C$                                 Corpus generated by the closed-source language model
$V$                                 Vocabulary of language model
$M$                                 Proxy model
$I$                                 Input instruction
$w_t$                               The $t$-th response token, $w_t \in V$
$Q_{w_t}$                           Probability $\Pr(w_t \mid w_{t-1}, \dots, w_1, I)$ in the student model
$P^*_{w_t}$                         Probability $\Pr(w_t \mid w_{t-1}, \dots, w_1, I)$ in the closed-source model
$P_{w_t}$                           Random variable associated with the value of $P^*_{w_t}$
$Y$                                 Discrete random event, $Y \in \{0, 1\}$
$f_{W_t}(P_{w_t})$                  Probability density function of $P_{w_t}$
$f_{W_t|Y}(P_{w_t} \mid Y)$         Conditional probability density function of $P_{w_t}$ given event $Y$
$E(P_{w_t})$                        Prior probability
$E(P_{w_t} \mid M)$                 Posterior probability

Table 1: Notations and descriptions.
To enhance the student's capabilities, Jiang et al. (2023), for instance, identify challenging samples and have the closed-source teacher generate more of them to fine-tune the student. However, in the context of knowledge distillation for closed-source LMs, most existing methods stop at fine-tuning on the teacher-generated one-hot labels. Our work, on the other hand, focuses on distilling knowledge from the closed-source LM more efficiently by estimating the latent distribution. We achieve this by introducing Bayesian estimation-based methods to soften the one-hot labels provided by the closed-source teacher. We enhance the effectiveness of knowledge transfer from the closed-source teacher model to the student model by minimizing the KL divergence between the output distribution of the student model and the estimated output distribution.

3 METHOD

We present Bayesian estimation-based knowledge distillation to enhance the efficiency of knowledge distillation for closed-source LMs.

3.1 PROBLEM STATEMENT

In this section, we first provide notations in Table 1. We consider a language model with vocabulary $V$ that takes an instruction $I$ as input and generates response tokens $w_1, w_2, w_3, \dots$ as output. At time $t$, the probability of generating token $w_t$ can be represented as $\Pr(w_t \mid w_{t-1}, \dots, w_1, I)$. We refer to the set of probabilities $\Pr(w_t \mid w_{t-1}, \dots, w_1, I)$ over all tokens in vocabulary $V$ as the distribution. Let $P^*_{w_t}$ be the probability $\Pr(w_t \mid w_{t-1}, \dots, w_1, I)$ in the closed-source LM; then the token-level objective function of KD for the closed-source LM at time $t$ can be derived as follows:

$$\mathcal{L}^{kl}_t = \sum_{w_t \in V} P^*_{w_t} \log \frac{P^*_{w_t}}{Q_{w_t}} \tag{1}$$

where $Q_{w_t}$ is the probability $\Pr(w_t \mid w_{t-1}, \dots, w_1, I)$ in the student model. Because $P^*_{w_t}$ is inaccessible, this objective function degrades to computing the cross entropy with one-hot labels, which might limit the performance of KD. To this end, our goal is to estimate a probability that approximates $P^*_{w_t}$ (referred to as the latent probability). Subsequently, we perform KD on the estimated probabilities. The overall architecture of our method is shown in Figure 2.
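As a minimal illustration of why the objective collapses to cross entropy, the sketch below evaluates Equation 1 on a toy three-token vocabulary. The probabilities are made-up values and the helper names `kl_divergence` and `cross_entropy` are hypothetical, not part of the method's implementation. With a one-hot teacher target the two quantities coincide, whereas a soft target also accounts for probability mass on the other tokens.

```python
import math


def kl_divergence(p, q):
    """Token-level KL(p || q) over a shared vocabulary.

    Terms with p_i == 0 contribute nothing, following the usual
    convention 0 * log(0 / q) = 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(one_hot, q):
    """Cross entropy between a one-hot label and the student distribution q."""
    return -sum(yi * math.log(qi) for yi, qi in zip(one_hot, q) if yi > 0.0)


# Toy values (assumptions for illustration only).
student = [0.6, 0.3, 0.1]    # Q_{w_t}: student distribution
one_hot = [1.0, 0.0, 0.0]    # hard label from the closed-source teacher
soft = [0.7, 0.25, 0.05]     # a soft target, as an estimated P*_{w_t} might look

kl_hard = kl_divergence(one_hot, student)
ce_hard = cross_entropy(one_hot, student)
kl_soft = kl_divergence(soft, student)

print(f"KL with one-hot target: {kl_hard:.4f}")
print(f"Cross entropy:          {ce_hard:.4f}")
print(f"KL with soft target:    {kl_soft:.4f}")
```

The first two printed values are identical because the entropy of a one-hot target is zero, so only the term for the ground-truth token survives.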
[Figure 2: diagram. Recoverable labels: Closed-Source Language Model, Corpus $C$ (generated by closed-source model), Prior Estimation, Prior Distribution, Proxy Model (Fine-Tuning, Sampling), Posterior Estimation, Posterior Distribution, One-Hot Label, Student Model, and the losses $\mathcal{L}^{ce}_t$, $\mathcal{L}^{kl}_t$, $\mathcal{L}^{kl}_{t|M}$.]

Figure 2: Overview of our method. We first obtain the prior distribution through the prior estimation. Then, in the posterior estimation, the prior distribution is updated through iterative sampling from a proxy of the closed-source LM. The final objective function involves three targets: the one-hot label, the prior distribution, and the posterior distribution.

3.2 ESTIMATION METHODS

3.2.1 PRIOR ESTIMATION

In this section, we elaborate on the proposed prior estimation method. The prior estimation aims to estimate a probability to approximate the latent probability $P^*_{w_t}$ at each time step $t$. Given a sequence $(w'_t, w'_{t-1}, \dots, w'_1, I)$, the prior estimation aims to assign, at time step $t$, a high probability to the ground-truth token $w'_t$ for the student model to learn from, while still allowing some probability for other valid tokens. Given corpus $C$ generated by the closed-source LM, for a specific sequence $(w'_t, w'_{t-1}, \dots, w'_1, I) \in C$, and for $\forall w_t \in V$, if $w_t = w'_t$, then the value of $\Pr(w_t \mid w'_{t-1}, \dots, w'_1, I)$ can be computed as:

$$p_{w_t} = \frac{\#(w_t, w'_{t-1}, \dots, w'_{t-n})}{\gamma\, \#(w'_{t-1}, \dots, w'_{t-n})} + \frac{\gamma - 1}{\gamma} \tag{2}$$

If $w_t \neq w'_t$, then the value of $\Pr(w_t \mid w'_{t-1}, \dots, w'_1, I)$ can be computed as:

$$p_{w_t} = \frac{\#(w_t, w'_{t-1}, \dots, w'_{t-n})}{\gamma\, \#(w'_{t-1}, \dots, w'_{t-n})} \tag{3}$$

where $\#(\cdot)$ denotes the number of times a particular response token sequence appears in $C$, and $n$ is the window size. $\gamma \in \mathbb{Z}^+$ is a hyperparameter used to adjust the dominant probability contribution of the ground-truth token $w'_t$. For instance, when $\gamma = 2$, the term $\frac{\gamma - 1}{\gamma}$ ensures that the probability $\Pr(w'_t \mid w'_{t-1}, \dots, w'_1, I)$ of generating the ground-truth token $w'_t$ is greater than 50%. An assumption behind the prior estimation is that language models typically generate the next token with a strong association to the most recent preceding tokens.

Through Equation 2 and Equation 3, we obtain a scalar probability value $p_{w_t}$. We consider the value of $P^*_{w_t}$ as a continuous random variable denoted as $P_{w_t}$, $P_{w_t} \in [0, 1]$, with probability density function $f_{W_t}(P_{w_t})$. $f_{W_t}(P_{w_t})$ can be predefined in such a way that the expected value of $P_{w_t}$ equals the previously computed scalar $p_{w_t}$. Then a prior probability for approximating the latent probability $P^*_{w_t}$ can be obtained by calculating the expectation of $P_{w_t}$ (replacing $P_{w_t}$ with $x$):

$$E(P_{w_t}) = \int_0^1 x f_{W_t}(x)\,dx = p_{w_t} \tag{4}$$
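The count-based rule in Equations 2 and 3 can be sketched in a few lines. This is only an illustrative reading, under stated assumptions: the corpus is a list of already tokenized responses, positions with fewer than `n` preceding tokens are skipped, and a context is counted only where it is followed by a token so that the probabilities sum to one. The function names `ngram_counts` and `prior_probs` and the toy corpus are hypothetical.

```python
from collections import Counter


def ngram_counts(corpus, n):
    """Count n-token contexts and (context, next-token) pairs.

    Assumption: `corpus` is a list of token lists; only positions with
    at least n preceding tokens are counted.
    """
    ctx_counts, pair_counts = Counter(), Counter()
    for tokens in corpus:
        for t in range(n, len(tokens)):
            ctx = tuple(tokens[t - n:t])
            ctx_counts[ctx] += 1
            pair_counts[(ctx, tokens[t])] += 1
    return ctx_counts, pair_counts


def prior_probs(context, ground_truth, vocab, ctx_counts, pair_counts, gamma=2):
    """Scalar prior p_{w_t} for every token in `vocab` (Equations 2 and 3)."""
    ctx = tuple(context)
    if ctx_counts[ctx] == 0:
        raise ValueError("context does not occur in the corpus")
    denom = gamma * ctx_counts[ctx]
    probs = {}
    for w in vocab:
        p = pair_counts[(ctx, w)] / denom      # Equation 3 term
        if w == ground_truth:
            p += (gamma - 1) / gamma           # extra mass for w'_t, Equation 2
        probs[w] = p
    return probs


# Toy corpus (made-up tokens for illustration).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
ctx_counts, pair_counts = ngram_counts(corpus, n=2)
probs = prior_probs(["a", "b"], "c", ["a", "b", "c", "d"],
                    ctx_counts, pair_counts, gamma=2)
print(probs)
```

In this toy case the ground-truth token "c" keeps more than half of the mass while "d", which also follows the same context in the corpus, receives a nonzero share.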

3.2.2 POSTERIOR ESTIMATION

The posterior estimation builds on the prior estimation to estimate $P^*_{w_t}$. Specifically, the posterior estimation involves continued sampling from the closed-source LM. An intuitive idea is that, given a sampled token $\hat{w}_t$ and a target token $w_t$, if the sampling results in $\hat{w}_t = w_t$, the probability of generating $w_t$ should be increased; on the other hand, if the sampling results in $\hat{w}_t \neq w_t$, then the probability of generating $w_t$ should be decreased.

A discrete random event $Y$ is defined as follows: in a sampling round of the closed-source LM, given an input sequence $(w_{t-1}, \dots, w_1, I)$ and a target token $w_t$, if the sampled token $\hat{w}_t = w_t$, then $Y = 1$; otherwise, $Y = 0$. In practice, we achieve this by introducing an open-source language model $M$ as a proxy of the closed-source model. $M$ is first fine-tuned on the corpus $C$ for preliminary alignment. We feed the sequence $(w_{t-1}, \dots, w_1, I)$ into $M$ to sample a generated token $\hat{w}_t$ at time $t$. In a sampling round, we update the prior probability density function $f_{W_t}(P_{w_t})$ based on the event $Y$. If $Y = 1$ occurs, according to Bayes' theorem:
$$f_{W_t|Y}(P_{w_t} \mid Y = 1) \propto \Pr(Y = 1 \mid P_{w_t})\, f_{W_t}(P_{w_t}) = P_{w_t}\, f_{W_t}(P_{w_t}) \tag{5}$$

where $f_{W_t|Y}(P_{w_t} \mid Y = 1)$ is the posterior probability density function conditioned on event $Y = 1$. Then, we integrate $P_{w_t} f_{W_t}(P_{w_t})$ over $[0, 1]$ to get a normalization factor $\eta$:

$$\eta = \int_0^1 x f_{W_t}(x)\,dx \tag{6}$$

Then the value of $f_{W_t|Y}(P_{w_t} \mid Y = 1)$ can be calculated as $f_{W_t|Y}(P_{w_t} \mid Y = 1) = \frac{1}{\eta} P_{w_t} f_{W_t}(P_{w_t})$. In a sampling round, if event $Y = 0$ occurs instead, according to Bayes' theorem:

$$f_{W_t|Y}(P_{w_t} \mid Y = 0) \propto \Pr(Y = 0 \mid P_{w_t})\, f_{W_t}(P_{w_t}) = (1 - P_{w_t})\, f_{W_t}(P_{w_t}) \tag{7}$$

where $f_{W_t|Y}(P_{w_t} \mid Y = 0)$ is the posterior probability density function conditioned on event $Y = 0$. Similarly, we integrate $(1 - P_{w_t}) f_{W_t}(P_{w_t})$ over $[0, 1]$ to get the normalization factor $\eta$:

$$\eta = \int_0^1 (1 - x) f_{W_t}(x)\,dx \tag{8}$$

Then the value of $f_{W_t|Y}(P_{w_t} \mid Y = 0)$ can be calculated as $f_{W_t|Y}(P_{w_t} \mid Y = 0) = \frac{1}{\eta} (1 - P_{w_t}) f_{W_t}(P_{w_t})$.
The sampling process for $M$ typically involves multiple rounds, where the posterior probability density function $f_{W_t|Y}(P_{w_t} \mid Y)$ of each round becomes the prior probability density function $f_{W_t}(P_{w_t})$ for the next round. We define $f_{W_t}(P_{w_t})$ in the first round as the probability density function obtained through prior estimation. We denote the final posterior probability density function as $f_{W_t|M}(P_{w_t} \mid M)$. Then a posterior probability for approximating the latent probability $P^*_{w_t}$ can be obtained by calculating the conditional expectation:
$$E(P_{w_t} \mid M) = \int_0^1 x f_{W_t|M}(x \mid M)\,dx \tag{9}$$
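The round-by-round update in Equations 5 to 9 has a closed form if one assumes a Beta prior, which is an illustrative choice and not something the text specifies. Under that assumption, multiplying the density by $x$ (event $Y = 1$) or by $1 - x$ (event $Y = 0$) simply increments one of the two Beta parameters. In the sketch below, `strength` (the sum of the Beta parameters) and `eps` are hypothetical knobs, and the list of events stands in for the proxy model's sampling results.

```python
def beta_prior_from_mean(p, strength=10.0, eps=1e-6):
    """Hypothetical prior: Beta(a, b) with mean p and a + b = strength.

    p is clamped away from 0 and 1 because a Beta density needs a, b > 0.
    """
    p = min(max(p, eps), 1.0 - eps)
    return p * strength, (1.0 - p) * strength


def update(a, b, y):
    """One sampling round.

    Y = 1 multiplies the density by x      -> Beta(a + 1, b)   (Equation 5)
    Y = 0 multiplies the density by 1 - x  -> Beta(a, b + 1)   (Equation 7)
    """
    return (a + 1.0, b) if y == 1 else (a, b + 1.0)


def posterior_mean(p, events, strength=10.0):
    """Posterior expectation (Equation 9) after a list of Y outcomes."""
    a, b = beta_prior_from_mean(p, strength)
    for y in events:
        a, b = update(a, b, y)
    return a / (a + b)


# Toy run: prior mean 0.8 for the target token; the proxy's sampled token
# matched the target in three of five rounds (made-up outcomes).
events = [1, 0, 1, 1, 0]
mean_after = posterior_mean(0.8, events)
print(f"prior mean 0.8 -> posterior mean {mean_after:.4f}")
```

With no sampling rounds the function returns the prior mean, and each mismatch pulls the estimate down while each match pushes it up, in line with the intuition stated at the start of this subsection.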

3.3 OVERALL OBJECTIVE

The overall objective function at time step $t$ comprises three objectives. Let $\mathbb{1}_{w_t}$ be the one-hot label; the first objective at time step $t$ can be derived by calculating the cross entropy as

$$\mathcal{L}^{ce}_t = -\sum_{w_t \in V} \mathbb{1}_{w_t} \log Q_{w_t}.$$

The second objective at time step $t$ can be derived based on the prior estimation as

$$\mathcal{L}^{kl}_t = \sum_{w_t \in V} E(P_{w_t}) \log \frac{E(P_{w_t})}{Q_{w_t}}.$$

We first normalize $E(P_{w_t} \mid M) = \frac{E(P_{w_t} \mid M)}{\sum_{w'_t \in V} E(P_{w'_t} \mid M)}$; then the third objective at time step $t$ can be derived based on the posterior estimation as

$$\mathcal{L}^{kl}_{t|M} = \sum_{w_t \in V} E(P_{w_t} \mid M) \log \frac{E(P_{w_t} \mid M)}{Q_{w_t}}.$$

Given a sequence with length $T$, the overall objective function can be derived as follows:

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \left( \mathcal{L}^{ce}_t + \alpha \mathcal{L}^{kl}_t + \beta \mathcal{L}^{kl}_{t|M} \right) \tag{10}$$

where $\alpha$ and $\beta$ are hyperparameters used to adjust the contributions of $\mathcal{L}^{kl}_t$ and $\mathcal{L}^{kl}_{t|M}$ in the total loss. When $\alpha = 0$ and $\beta > 0$, the student model does not learn from the prior distribution. The student model does not learn from the posterior distribution when $\alpha > 0$ and $\beta = 0$.
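A minimal sketch of Equation 10 on plain Python lists is given below. It is not the training code: a real implementation would operate on batched logits in a deep learning framework. The function names, the toy distributions, and the default weights (taken from the values reported for the combined ablation setting, 0.5 and 1) are assumptions for illustration.

```python
import math


def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]


def kl(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_loss(q, target_index, prior, posterior, alpha=0.5, beta=1.0):
    """Loss at one time step: cross entropy + alpha * KL(prior) + beta * KL(posterior).

    `posterior` holds the per-token posterior expectations, which are
    normalized over the vocabulary before the KL term is computed.
    """
    ce = -math.log(q[target_index])
    return ce + alpha * kl(prior, q) + beta * kl(normalize(posterior), q)


def sequence_loss(steps, alpha=0.5, beta=1.0):
    """Average of the token-level losses over a sequence of length T."""
    return sum(token_loss(*s, alpha=alpha, beta=beta) for s in steps) / len(steps)


# Toy three-token vocabulary (made-up numbers).
q = [0.6, 0.3, 0.1]              # student distribution Q
prior = [0.8, 0.15, 0.05]        # prior expectations E(P)
posterior = [0.7, 0.3, 0.1]      # unnormalized posterior expectations E(P | M)

loss_full = sequence_loss([(q, 0, prior, posterior)])
loss_ce_only = sequence_loss([(q, 0, prior, posterior)], alpha=0.0, beta=0.0)
print(f"full objective: {loss_full:.4f}, cross entropy only: {loss_ce_only:.4f}")
```

Setting both weights to zero recovers plain fine-tuning on one-hot labels, which is the IFT baseline's objective.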

4 EXPERIMENTAL SETUP

In this section, we set up a series of experiments to test the distilled models' capabilities on various benchmarks. These benchmarks assess the model across a wide range of capabilities, including reading comprehension, commonsense knowledge, mathematical skills and logical reasoning.
4.1 DATASETS

We utilize the OpenOrca (Mukherjee et al., 2023) dataset as our training corpus. The OpenOrca dataset is a collection of FLAN (Longpre et al., 2023) data augmented by closed-source LLMs like GPT-4 and GPT-3.5. Following the settings of OpenOrca-Preview1-13B$^1$ from Mukherjee et al. (2023), and considering time efficiency, we conduct training on a subset of the original corpus containing 200k instances. We also utilize the Alpaca (Taori et al., 2023) dataset as an additional experimental configuration.

We utilize benchmarks including BBH (Suzgun et al., 2022), AGIEval (Zhong et al., 2023), ARC (Challenge) (Clark et al., 2018), MMLU (Hendrycks et al., 2021), CSQA (Talmor et al., 2019) and GSM8K (Cobbe et al., 2021) for evaluation. Following the settings of Mukherjee et al. (2023), we focus on datasets that involve multiple-choice questions. For all datasets, we conduct evaluation under the zero-shot setting without any exemplars and without any CoT (Wei et al., 2022).
4.2 BACKBONE MODELS

We employ the current state-of-the-art closed-source LLM GPT-4 as well as text-davinci-003 as the closed-source teacher models. We utilize LLaMA-7B and LLaMA-13B as student models, which are initialized with pre-trained weights obtained from Hugging Face$^2$. We choose LLaMA-33B as the proxy model. We employ top-p sampling for decoding. We train our models on eight 32GB V100 GPUs. To accelerate training, we leverage LoRA (Hu et al., 2021). Additional details can be found in Appendix A.

4.3 BASELINES

We consider the instruction fine-tuning (IFT) approach as our baseline. To ensure a fair comparison, we only consider baseline models whose original fine-tuning datasets are accessible. Therefore, we select OpenOrca-Preview1-13B from Mukherjee et al. (2023) and Alpaca (Taori et al., 2023) as our baseline models. In addition, we also train our own version of the baseline models.

5 RESULT AND ANALYSIS

In this section, we present the main results, ablation studies and additional experiments. All corpora used for proxy model fine-tuning, prior estimation, posterior estimation, and student distillation are identical. Unless otherwise specified, "IFT" represents the baseline model that we have implemented ourselves, and the default training corpus we utilize is OpenOrca.

Models                  #Params  BBH    AGIEval  ARC    MMLU   CSQA   GSM8K  Average
GPT-4                   -        67.4   56.4     -      86.4   -      92.0   -
LLaMA-7B (IFT)          7B       36.08  24.14    47.49  38.81  58.71  12.65  36.31
LLaMA-7B (ours)         7B       38.52  26.92    52.40  41.18  62.52  14.97  39.43
OpenOrca-Preview1-13B   13B      41.47  30.12    59.77  48.10  69.77  18.22  44.58
LLaMA-13B (IFT)         13B      42.77  26.74    58.2   45.3   66.27  20.93  43.37
LLaMA-13B (ours)        13B      44.83  29.35    61.84  48.17  68.94  23.36  46.08

Table 2: The results of the LLaMA models with different sizes on six benchmarks. We compare our approach to methods that directly perform instruction fine-tuning on the one-hot labels. The performance of OpenOrca-Preview1-13B is assessed through our own evaluation. All student models are trained on the OpenOrca dataset.

Models                  #Params  BBH    AGIEval  ARC    MMLU   CSQA   GSM8K  Average
text-davinci-003        -        70.7   41.9     -      64.6   -      -      -
Alpaca-7B               7B       34.19  24.16    39.35  33.66  36.16  13.99  30.25
LLaMA-7B (ours)         7B       34.92  24.32    40.3   34.14  38.32  14.33  31.06
Alpaca-13B              13B      38.1   26.9     52.57  41.41  55.27  19.27  38.92
LLaMA-13B (ours)        13B      40.82  28.35    53.84  42.17  56.78  19.83  40.3

Table 3: The results of the LLaMA models with different sizes on six benchmarks. We compare our method with Alpaca. All student models are trained on the Alpaca dataset.

$^1$ https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B
$^2$ https://huggingface.co/models
5.1 MAIN RESULTS

Table 2 shows the performance comparison of our method against baseline models on the six benchmarks. Detailed experimental results can be found in Appendix C. The training corpus used in this table is the OpenOrca dataset. Our method outperforms OpenOrca-Preview1-13B from Mukherjee et al. (2023) as well as our own IFT models at both the 7B and 13B parameter scales. Table 3 shows the performance comparison between our method and Alpaca. The training corpus used in this table is the Alpaca dataset. It shows a similar trend, with our method outperforming Alpaca. A case study in Table 4 shows that our model exhibits better comprehension and reasoning in its generated answer than the baseline IFT model. The experimental results demonstrate that, in the context of KD for closed-source LMs, distilling knowledge using the soft labels estimated by our method yields superior results compared to directly using one-hot labels.
5.2 ABLATION STUDY

This ablation study examines the impact of components within our method. While retaining the standard cross-entropy loss $\mathcal{L}^{ce}_t$, we evaluate the effect of using only the prior estimation ($\alpha > 0$, $\beta = 0$), using only the posterior estimation ($\alpha = 0$, $\beta > 0$), and using both ($\alpha > 0$, $\beta > 0$). We select five representative benchmarks. All results are presented in Figure 3.

[Figure 3: bar charts with panels BBH, AGIEval, MMLU, CSQA and GSM8K; y-axis: Accuracy (%); legend: Pretrain, IFT, Distilling+Prior, Distilling+Posterior, Distilling+Prior+Posterior.]

Figure 3: Ablation results of LLaMA-7B across five representative benchmarks. In "Distilling+Prior" we set $\alpha = 0.5$, $\beta = 0$ to investigate the effect of the prior estimation. In "Distilling+Posterior" we set $\alpha = 0$, $\beta = 1$ to investigate the effect of the posterior estimation. In "Distilling+Prior+Posterior" we set $\alpha = 0.5$, $\beta = 1$ to study the combined effect of both.

Instruction: The fastest car was 1.25 times faster than the 2nd fastest car. The 2nd fastest car was 5 times faster than the average car. The average car can go 120 Miles per hour. How fast does the faster car go?

Ours: The 2nd fastest car is 5 times faster than the average car, so it can go 5 x 120 = 600 Miles per hour. The faster car is 1.25 times faster than the 2nd fastest car, so it can go 1.25 x 600 = 750 Miles per hour. Therefore, the faster car goes 750 Miles per hour. ✓

IFT: The faster car can go 1.25 x 5 = 60 Miles per hour. Therefore, the faster car goes 60 Miles per hour. ×

Table 4: Examples from GSM8K dataset. Our method enhances the student model's comprehension and leads to more detailed responses compared to the IFT method.

[Figure 4: line charts with panels BBH, AGIEval, MMLU, CSQA and GSM8K; x-axis: Sampling Rounds (5, 10, 20, 50); y-axis: Accuracy (%); legend: IFT, Distilling+Posterior.]

Figure 4: The comparison of knowledge distillation performance using the posterior distribution under different sampling round settings, as well as the comparison with IFT, with LLaMA-7B as the student model.
Effect of the prior estimation. Compared to IFT, distilling on the prior distribution (Distilling+Prior) can enhance the model performance. The results indicate that, in addition to guiding the student towards learning from the ground-truth token, informing the student model about other valid tokens benefits the distillation. The consistent improvement over IFT suggests that the prior estimation can capture these valid tokens that represent the capabilities of the teacher model.

Effect of the posterior estimation. Compared to IFT, distilling on the posterior distribution (Distilling+Posterior) significantly boosts the performance. The improvement over "Distilling+Prior" indicates that the sampling results from the proxy model further refine the prior distribution. The posterior distribution can provide more comprehensive information that is beneficial for distillation.

Combined effect of both. As shown in Figure 3, we incorporate both the prior distribution and the posterior distribution into the distillation process (Distilling+Prior+Posterior). We observe that the effect is similar to "Distilling+Posterior", with limited improvements seen on only a subset of the benchmarks. We attribute this to the fact that the posterior distribution already contains the information from the prior distribution, so the improvement gained from additionally incorporating the prior distribution is limited.

[Figure 5: line charts with panels BBH, AGIEval, MMLU, CSQA and GSM8K; x-axis: Dataset size (K) (10, 30, 50, 100, 200); y-axis: Accuracy (%); legend: IFT, Distilling+Prior, Distilling+Prior+Posterior.]

Figure 5: Comparison of three methods under different dataset sizes: IFT, distilling on the prior distribution (Distilling+Prior), and distilling on both the prior and posterior distributions (Distilling+Prior+Posterior), with LLaMA-7B as the student model.

5.3 IMPACT OF SAMPLING ROUNDS

In this section, we discuss the impact of the number of sampling rounds on posterior estimation. The results are presented in Figure 4. We observe that the best performance is achieved on most benchmarks when the number of sampling rounds falls within the range [10, 20]. Furthermore, excessive sampling, such as 50 rounds, leads to a decline in performance. We attribute this phenomenon to distribution discrepancy and prior distribution vanishing.

Distribution discrepancy. We observe discrepancies between the ground-truth one-hot labels provided by the closed-source LM and the output distribution of the proxy model. Although the proxy model has been aligned by fine-tuning on the corpus $C$ generated by the closed-source LM, the token with the highest probability given by the proxy model at some positions differs from the ground-truth token. For example, when the ground-truth label at the current position is "\n", the proxy model assigns a high probability (e.g., 0.99) to "<\s>", while the probability of "\n" becomes close to 0, as elaborated in Appendix B.2. In this case, the inconsistency in distributions may negatively impact the performance of the distillation.

Prior distribution vanishing. In Bayesian estimation, the prior distribution vanishes as the posterior estimation undergoes excessive iterations. In other words, the impact of the prior distribution weakens with each successive iteration. We believe that, in Figure 4, excessive sampling (e.g., 50 rounds) causes the posterior distribution to degenerate into the proxy model's output distribution, which negatively affects the performance of knowledge distillation. Therefore, it is important to keep the number of sampling rounds within a reasonable range. Based on our experimental results, we find that a sampling count between 10 and 20 works well.
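The weakening of the prior can be made concrete under one illustrative assumption that is not stated in the text: suppose $f_{W_t}$ is a Beta density with mean $p_{w_t}$ and total concentration $s$ (a hypothetical parameter). If $k$ of $m$ sampling rounds yield $Y = 1$, repeatedly applying the updates of Equations 5 and 7 gives:

```latex
E(P_{w_t} \mid M)
  = \frac{s\,p_{w_t} + k}{s + m}
  = \frac{s}{s + m}\,p_{w_t} + \frac{m}{s + m}\cdot\frac{k}{m}
```

The weight on the prior, $s/(s+m)$, shrinks as $m$ grows. With $s = 10$, for instance, it is $1/2$ after 10 rounds and $1/6$ after 50 rounds, so the posterior mean moves toward the proxy model's empirical frequency $k/m$.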
5.4 IMPACT OF CORPUS SIZE

We investigate the effect of the size of the training corpus $C$, as shown in Figure 5. We observe that as the size of the training corpus $C$ increases, the method "Distilling+Prior+Posterior" consistently outperforms IFT across benchmarks. A similar trend can also be observed for "Distilling+Prior". This suggests that our method benefits from a larger corpus: as the corpus size increases, the prior estimation can estimate a more accurate and information-rich distribution, which in turn influences the posterior estimation.

6 CONCLUSION

In this work, we address the challenge of knowledge distillation for closed-source language models, where direct access to the teacher's output distribution is not available. We propose Bayesian estimation-based knowledge distillation to estimate the output distribution of closed-source language models, enabling effective knowledge distillation. Our approach comprises two main components: prior estimation and posterior estimation. The prior estimation obtains a prior distribution by leveraging the corpus generated by the closed-source language model. The posterior estimation updates the prior distribution based on continued sampling results from a proxy model. Extensive experiments are conducted based on LLaMA. The results across various benchmarks consistently show that our method outperforms directly fine-tuning on one-hot labels for knowledge distillation of closed-source language models.

REFERENCES

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge, 2018.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models, 2023.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071, 2022.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.

Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of closed-source large language model. arXiv preprint arXiv:2305.12870, 2023.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.372. URL https://aclanthology.org/2020.findings-emnlp.372.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning, 2023.

Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant, 2019.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4, 2023.

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging bigbench tasks and whether chain-of-thought can solve them, 2022.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023.


Models      Batch Size  Max Length  Lora Rank  #GPUs  Precision  Dimension  #Heads  #Layers
LLaMA-33B   1           512         96         8      float16    6656       52      60
LLaMA-13B   4           512         16         8      float16    5120       40      40
LLaMA-7B    6           512         16         4      float16    4096       32      32

Table 5: Model configurations.

[Figure 6 omitted: bar charts of Probability (y-axis) against Token Indices (x-axis).]

Figure 6: The issue of probability sparsity in the output distribution.
A significant portion of probability values concentrates on a few tokens, while the probabilities for other tokens are close to zero.

A EXPERIMENTAL DETAILS

The model configurations are provided in Table 5.
We train the student models for three epochs, experimenting with learning rates of 1e-5, 3e-5, and 5e-5.
For knowledge distillation, we use the following hyperparameters: α = 0.5 and β = 1 for the total loss; γ = 3 and n = 5 for prior estimation; and 10 rounds of sampling for posterior estimation.
We evaluate the models on the benchmarks using the final checkpoint.
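
For reference, the settings above can be collected into a small configuration object. The following is only a sketch: the class name `DistillConfig`, the dictionary `MODEL_SETTINGS`, and all field names are hypothetical and chosen for illustration; only the numeric values come from this appendix and Table 5.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DistillConfig:
    """Hyperparameters reported in Appendix A (field names are illustrative)."""
    epochs: int = 3
    # Learning rates that were tried during training.
    learning_rates: Tuple[float, ...] = (1e-5, 3e-5, 5e-5)
    alpha: float = 0.5          # weight in the total loss
    beta: float = 1.0           # weight in the total loss
    gamma: float = 3.0          # prior-estimation hyperparameter
    n: int = 5                  # prior-estimation hyperparameter
    sampling_rounds: int = 10   # posterior-estimation sampling rounds


# Per-model settings copied from Table 5 (key names are illustrative).
MODEL_SETTINGS = {
    "LLaMA-33B": {"batch_size": 1, "max_length": 512, "lora_rank": 96, "gpus": 8},
    "LLaMA-13B": {"batch_size": 4, "max_length": 512, "lora_rank": 16, "gpus": 8},
    "LLaMA-7B": {"batch_size": 6, "max_length": 512, "lora_rank": 16, "gpus": 4},
}

cfg = DistillConfig()
print(cfg)
print(MODEL_SETTINGS["LLaMA-7B"])
```

Freezing the dataclass keeps a run's settings immutable, which makes it harder to change a hyperparameter by accident between the learning-rate sweeps.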

B DISTRIBUTION ANALYSIS

B.1 PROBABILITY SPARSITY

During the distillation process, we observed probability sparsity in the output distribution of the proxy model.
Typically, only a few tokens have high probabilities, while the probabilities of the other tokens are close to zero, as shown in Figure 6.
In our distillation process, we retained only the probabilities of the ten most probable tokens and set the probabilities of the remaining tokens to zero.
This phenomenon indicates that, when sampling from the proxy model, we do not need a large number of samples to cover all tokens with non-zero probabilities.
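
The top-ten truncation described above can be sketched with NumPy as follows. The function name `keep_top_k`, the synthetic distribution, and the optional `renormalize` flag are assumptions made for illustration; the text only states that the remaining probabilities are set to zero and does not say whether the kept values are renormalized.

```python
import numpy as np


def keep_top_k(probs, k=10, renormalize=False):
    """Keep the k largest probabilities and set all others to zero.

    `renormalize` is an illustrative option that is not described in the paper.
    """
    probs = np.asarray(probs, dtype=float)
    if k >= probs.size:
        return probs.copy()
    # Indices of the k largest entries (in no particular order).
    top_idx = np.argpartition(probs, -k)[-k:]
    sparse = np.zeros_like(probs)
    sparse[top_idx] = probs[top_idx]
    if renormalize:
        sparse /= sparse.sum()
    return sparse


# Synthetic, sharply peaked distribution over a 32000-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=32000)
peaked = [13, 2, 910]            # arbitrary token indices chosen for the demo
logits[peaked] += 15.0
probs = np.exp(logits - logits.max())
probs /= probs.sum()

sparse = keep_top_k(probs, k=10)
print("non-zero entries:", np.count_nonzero(sparse))
print("retained probability mass:", sparse.sum())
```

When the distribution is as concentrated as in Figure 6, the retained mass stays close to one, so little information is lost by the truncation.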

B.2 DISTRIBUTION DISCREPANCY

We observe that as the number of sampling rounds increases, the model's performance improves on most benchmarks.
However, when the number of sampling rounds becomes excessive (e.g., 50 rounds), performance starts to decrease, as shown in Figure 4.
Our analysis is that, with too many sampling rounds, the posterior distribution tends to degenerate into the proxy distribution.
When the proxy distribution is used directly for knowledge distillation, we observe discrepancies between it and the ground-truth labels, which can harm distillation.
For example, when the ground-truth label at the current position is "\n", the proxy distribution assigns a high probability (e.g., 0.99) to "</s>", while the probability of "\n" becomes close to 0.
More cases are shown in Figure 7.
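
The degeneration effect can be illustrated with a generic Dirichlet-multinomial update. This is a textbook Bayesian form and not necessarily the exact estimator used in our method: the function `posterior_mean`, the `prior_strength` parameter, and all numbers below are assumptions made up for the illustration. Expected counts are used in place of random samples so that the example is deterministic.

```python
import numpy as np


def posterior_mean(prior, counts, prior_strength=20.0):
    """Posterior mean of a Dirichlet prior after observing token counts.

    The prior acts like `prior_strength` pseudo-observations distributed
    according to `prior` (a hypothetical parameterization).
    """
    prior = np.asarray(prior, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return (prior_strength * prior + counts) / (prior_strength + counts.sum())


# Toy three-token vocabulary; index 0 plays the role of the ground-truth token.
prior = np.array([0.90, 0.05, 0.05])   # prior concentrated on the label
proxy = np.array([0.01, 0.98, 0.01])   # proxy model prefers another token

for rounds in (0, 10, 50, 1000):
    counts = rounds * proxy            # expected counts after `rounds` samples
    post = posterior_mean(prior, counts)
    print(rounds, np.round(post, 3), "argmax:", int(post.argmax()))
```

In this toy setting the posterior stays aligned with the label for a small number of rounds and then drifts toward the proxy distribution as the counts dominate the prior, which is the qualitative behavior described above.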

[Figure 7 omitted: panels labeled One-Hot Label, Proxy Distribution, and Posterior Distribution.]

Figure 7: Discrepancies between the ground-truth distribution and the output distribution of the proxy model (proxy distribution) in terms of the top-4 tokens, while the posterior distribution stays consistent with the ground-truth distribution.

[Figure 8 omitted: Accuracy (%) against Training Epoch (1 to 4) for Distilling+Posterior on MMLU, BBH, and CSQA.]

Figure 8: The change in performance of distilling on the posterior distribution (Distilling+Posterior) with the fine-tuning epochs of the proxy model.
We utilize LLaMA-7B as the student model, and LLaMA-33B as the proxy model.

C EXPERIMENTAL RESULTS

The detailed experimental results for the LLaMA model on the BBH, AGIEval, and MMLU benchmarks are presented in Table 8, Table 7, and Table 9, respectively.
We also conducted experiments on the FlanT5 (Longpre et al., 2023) model using the OpenOrca dataset, and the results are shown in Table 6.
Compared to the IFT method, our approach does lead to some improvement, although the improvement is limited.
We speculate that this is because FlanT5 has already been fine-tuned with instructions, so the original model already has some basic capabilities for these tasks.
Therefore, the additional training yields limited improvement.
We also investigated the impact of continued fine-tuning of the proxy model on the OpenOrca corpus, as shown in Figure 8.
As the number of fine-tuning epochs for the proxy model increases, the performance of posterior estimation decreases.
We speculate that the proxy model overfits to the corpus, which reduces the effectiveness of the posterior estimation.
During training, we therefore avoid excessive fine-tuning epochs for the proxy model.


Models               #Params  BBH    AGIEval  ARC    MMLU   CSQA   GSM8K  Average
GPT-4                -        -      56.4     -      86.4   -      92.0   -
FlanT5-large (IFT)   780M     34.63  28.12    46.44  39.41  76.78  4.54   38.32
FlanT5-large (ours)  780M     35.22  28.84    46.61  39.34  76.93  4.71   38.61
FlanT5-xl (IFT)      3B       38.47  28.34    59.6   46.91  84.79  6.12   44.04
FlanT5-xl (ours)     3B       39.51  30.1     60.12  46.78  85.38  7.1    44.83

Table 6: The results of the FlanT5 models with different parameter sizes on the six benchmarks.
We compare our method with IFT.

Models            #Params  AQuA-RAT  LogiQA  LSAT-AR  LSAT-LR  SAT-English (w/o Psg.)  SAT-Math  Average
LLaMA-7B (IFT)    7B       19.71     26.81   18.22    27.44    30.35                   22.29     24.14
LLaMA-7B (ours)   7B       22.39     29.68   19.46    33.33    30.46                   26.19     26.92
LLaMA-13B (IFT)   13B      18.61     27.59   17.7     34.58    36.27                   25.7      26.74
LLaMA-13B (ours)  13B      25.22     29.63   19.65    36.67    33.5                    31.43     29.35

Table 7: Performance comparison on the AGIEval benchmark on the selected multiple-choice English questions.
We use the OpenOrca dataset as the training corpus.
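
As a quick check of how the Average column relates to the per-task scores, the following sketch recomputes it as an unweighted mean over the six AGIEval tasks. The variable names are illustrative; the numbers are copied from Table 7, and the unweighted mean is an assumption that happens to reproduce the reported values up to rounding.

```python
# Per-task scores from Table 7, in the order:
# AQuA-RAT, LogiQA, LSAT-AR, LSAT-LR, SAT-English (w/o Psg.), SAT-Math
scores = {
    "LLaMA-7B (IFT)": [19.71, 26.81, 18.22, 27.44, 30.35, 22.29],
    "LLaMA-7B (ours)": [22.39, 29.68, 19.46, 33.33, 30.46, 26.19],
    "LLaMA-13B (IFT)": [18.61, 27.59, 17.7, 34.58, 36.27, 25.7],
    "LLaMA-13B (ours)": [25.22, 29.63, 19.65, 36.67, 33.5, 31.43],
}

# Average column as reported in Table 7.
reported = {
    "LLaMA-7B (IFT)": 24.14,
    "LLaMA-7B (ours)": 26.92,
    "LLaMA-13B (IFT)": 26.74,
    "LLaMA-13B (ours)": 29.35,
}

means = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, mean in means.items():
    print(f"{model}: recomputed {mean:.2f}, reported {reported[model]:.2f}")

# Gain of our method over IFT for each student size.
for size in ("7B", "13B"):
    gain = means[f"LLaMA-{size} (ours)"] - means[f"LLaMA-{size} (IFT)"]
    print(f"LLaMA-{size}: gain over IFT = {gain:.2f}")
```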

Tasks                                    LLaMA-13B (IFT)   LLaMA-13B (ours)  LLaMA-7B (IFT)    LLaMA-7B (ours)
Boolean Expressions                      58.8              62.4              65.06             66.4
Causal Judgement                         61.27             63.01             56.98             61.85
Date Understanding                       50.0              54.02             49.3              49.26
Disambiguation QA                        56.8              60.0              49.4              54.8
Formal Fallacies                         56.4              54.4              54.0              54.0
Geometric Shapes                         25.2              23.6              12.42             22.4
Hyperbaton                               63.6              66.8              49.2              54.8
Logical Deduction (5 objects)            33.8              36.14             26.51             30.96
Logical Deduction (3 objects)            23.39             30.12             18.7              18.11
Logical Deduction (7 objects)            44.2              51.6              42.17             42.8
Movie Recommendation                     77.59             79.32             50.78             53.42
Navigate                                 51.6              56.8              45.6              55.2
Penguins in a Table                      32.61             36.11             30.58             34.91
Reasoning about Colored Objects          39.6              42.8              27.54             30.33
Ruin Names                               36.4              33.8              15.2              14.8
Salient Translation Error Detection      31.6              37.2              24.0              28.4
Snarks                                   48.31             52.25             43.82             45.7
Sports Understanding                     60.8              60.4              56.0              55.6
Temporal Sequences                       17.28             11.2              13.49             9.68
Tracking Shuffled Objects (5 objects)    19.46             21.1              17.2              17.74
Tracking Shuffled Objects (7 objects)    14.63             17.17             11.98             14.8
Tracking Shuffled Objects (3 objects)    37.5              36.02             33.9              32.52
Average                                  42.77             44.83             36.08             38.52

Table 8: Zero-shot performance comparison on the Big-Bench Hard benchmark on multiple-choice questions.

Models            #Params  Humanities  Other  Social Sciences  STEM   Average
LLaMA-7B (IFT)    7B       38.49       44.63  40.24            31.87  38.81
LLaMA-7B (ours)   7B       41.4        47.32  42.17            33.82  41.18
LLaMA-13B (IFT)   13B      46.02       53.19  48.24            33.91  45.34
LLaMA-13B (ours)  13B      47.81       56.7   51.36            36.79  48.17

Table 9: Performance comparison on the Massive Multitask Language Understanding benchmark.


