Under review as a conference paper at ICLR 2024

KNOWLEDGE DISTILLATION FOR CLOSED-SOURCE LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

ABSTRACT

Closed-source language models such as GPT-4 have achieved remarkable performance. Many recent studies focus on enhancing the capabilities of smaller models through knowledge distillation from closed-source language models. However, because the weights, hidden states, and output distributions of these closed-source models cannot be accessed directly, distillation can only be performed by fine-tuning smaller models on samples generated by closed-source language models, which constrains the effectiveness of knowledge distillation. In this paper, we propose to estimate the output distributions of closed-source language models within a Bayesian estimation framework, involving both prior and posterior estimation. The prior estimation aims to derive a prior distribution by utilizing the corpus generated by closed-source language models, while the posterior estimation employs a proxy model to update the prior distribution and derive a posterior distribution. By leveraging the estimated output distribution of closed-source language models, traditional knowledge distillation can be executed. Experimental results demonstrate that our method surpasses the performance of current models directly fine-tuned on data generated by closed-source language models.

1 INTRODUCTION

While closed-source large language models (LLMs) such as GPT-3.5 and GPT-4 have shown great superiority over open-source counterparts like LLaMA (Touvron et al., 2023) and Falcon (Penedo et al., 2023), they can only be accessed via API calls and allow limited customization and transparency. One way to address this problem is to transfer their capabilities to open-source language models, typically smaller in size, by prompting closed-source LLMs to generate samples that reflect their capabilities and fine-tuning open-source language models on these samples (Hsieh et al., 2023; Jiang et al., 2023; Ho et al., 2022). However, this approach only enables open-source language models to emulate the input-output behavior of closed-source LLMs without acquiring their intrinsic knowledge related to logits, weights, activations, and so forth.
Knowledge distillation (KD) (Hinton et al., 2015) is a popular compression technique that aims to train a small but strong student model by distilling knowledge from a large teacher model. Among various sources of knowledge, the logits of the teacher model are typically utilized as an essential part of the objective function, implemented by minimizing the Kullback-Leibler (KL) divergence between the output distribution (soft labels) of the teacher model and the output distribution of the student model. This approach enables the student model to mimic the predictive behavior and acquire the knowledge of the teacher model. However, such approaches are not readily applicable to closed-source LLMs because the soft labels are not accessible.
[Figure 1: panels (a)-(d) with labels Hard Label, Soft Label, Estimated Soft Label, Proxy Model, Closed-Source Model, and Open-Source Model; example prompts "Countries in Europe include __" (candidate tokens: France, German, Moon, Japan, Ocean) and "Artificial intelligence is __" (candidate tokens: powerful, writing, tropic, evolving, transforming).]

Figure 1: (a) In current knowledge distillation of closed-source models, only hard labels can be obtained. (b) In traditional knowledge distillation of open-source models, soft labels can be obtained. (c) Our method obtains estimated soft labels from closed-source models by leveraging a proxy model. (d) Compared to hard labels, soft labels allow students to learn more profound knowledge by guiding them to learn from multiple valid targets during distillation.

To tackle this challenge, we propose to estimate the output distributions of closed-source LLMs within a Bayesian estimation framework, including both prior and posterior estimation. The aim of prior estimation is to derive a prior distribution by leveraging the corpus generated by closed-source language models. The rationale is that the corpus may contain coarse-grained information regarding the output distributions of closed-source LLMs. Meanwhile, the posterior estimation utilizes a proxy model, another open-source LLM typically larger than the student model, to calibrate the results of the prior estimation. This proxy model is initially aligned with the closed-source teacher model and then functions as a bridge between the teacher and the student, as illustrated in Figure 1. By leveraging the estimated output distribution of closed-source LLMs, traditional knowledge distillation can be carried out.
Compared to previous approaches addressing this objective, our method enables the student model to learn from both the samples generated by the closed-source teacher and the soft labels provided by the proxy model, allowing the distillation of more intrinsic knowledge.

To validate our approach, we performed comprehensive experiments on a range of well-established benchmarks, including complex reasoning datasets BBH (Suzgun et al., 2022) and ARC (Clark et al., 2018), knowledge-based datasets AGIEval (Zhong et al., 2023) and MMLU (Hendrycks et al., 2021), commonsense reasoning dataset CSQA (Talmor et al., 2019), and mathematical reasoning dataset GSM8K (Cobbe et al., 2021). We used GPT-4 as the closed-source teacher model, LLaMA-33B as the proxy model, and LLaMA-13B/7B as the student model. The empirical results demonstrate the superiority of our method over directly fine-tuning the student model on samples generated by GPT-4, with the average score across the six benchmarks improving from 36.31 to 39.43 points for the LLaMA-7B student. The experimental results show that the proxy model can serve as an intermediary bridge through which the student model learns from the closed-source teacher model. Because the proxy model aligns better with the teacher model, it facilitates the transfer of more profound knowledge from the closed-source teacher model to the student model.

2 RELATED WORK

The concept of knowledge distillation (KD) was originally introduced by Hinton et al. (2015) with the aim of transferring the knowledge from a teacher model to a smaller student model. Current KD methods can be organized into two primary categories: knowledge distillation for open-source models and knowledge distillation for closed-source models.

2.1 OPEN-SOURCE KNOWLEDGE DISTILLATION

KD can be applied to open-source models for natural language understanding. For instance, Sanh et al. (2019) applied KD to the pre-training process of BERT (Devlin et al., 2019), yielding smaller models with minor performance drops. Jiao et al. (2020) allowed the student model’s intermediate features to mimic the teacher model’s intermediate features by minimizing the Mean Squared Error (MSE) loss function.

KD can also be applied to open-source models for natural language generation. Lin et al. (2020) investigated the exposure bias problem in the process of distillation for open-source language models. Similarly, Agarwal et al. (2023) studied the distribution mismatch between output sequences during training and the sequences generated by the open-source student during its deployment. Other approaches, such as the one proposed by Gu et al. (2023), focused on distilling open-source LLMs like LLaMA (Touvron et al., 2023). However, in all these methods, the student model needs access to the internal weights and features of the teacher model, which is not feasible in the context of distilling closed-source LLMs.

Most similar to our work, Mirzadeh et al. (2019) introduced an intermediate network to bridge the parameter size gap between the CNN teacher model and the CNN student model. In contrast to their approach, we introduce an intermediate network with the specific purpose of estimating output distributions of closed-source LLMs and achieving enhanced knowledge distillation.
[Figure 2: diagram connecting the Closed-Source LLM, the Corpus $\mathcal{C}$ generated by the closed-source LLM, Prior Estimation, the Prior Distribution, the Proxy Model (Open-Source LLM) with Fine-Tuning and Sampling, Posterior Estimation, the Posterior Distribution, and the Student Model, with losses $\mathcal{L}_t^{ce}$ (hard label), $\mathcal{L}_t^{kl}$ (prior distribution), and $\mathcal{L}_{t|\mathcal{M}}^{kl}$ (posterior distribution).]

Figure 2: Overview of our method. The output distributions of closed-source LLMs are estimated within a Bayesian estimation framework, including both prior and posterior estimation. The prior estimation leverages the corpus generated by closed-source language models to derive a prior distribution, while the posterior estimation utilizes a proxy model to calibrate the results of the prior estimation. Traditional knowledge distillation is applied using the estimated output distributions.

2.2 CLOSED-SOURCE KNOWLEDGE DISTILLATION

In light of the remarkable performance of closed-source LLMs such as GPT-3.5 and GPT-4, numerous studies have shifted their attention toward transferring the diverse capabilities of these proprietary LLMs into smaller open-source models. For instance, Liang et al. (2023) improved the mathematical capability of a small model by training it with tailored exercise samples generated by GPT-3 (Brown et al., 2020). To transfer the code generation capability, Azerbayev et al. (2023) prompted Codex (Chen et al., 2021) to create natural language-code pairs and fine-tuned a smaller model on those samples. To transfer the tool usage capability, Gou et al. (2023) utilized GPT-4 to generate interactive tool-use trajectories as training samples for the target model. Other approaches, such as Hsieh et al. (2023), Ho et al. (2022), and Mukherjee et al. (2023), utilized rationales generated by closed-source LLMs as training data to transfer their general reasoning capabilities.

To sum up, these works typically transfer the capabilities of closed-source LLMs by prompting them to generate samples, which are then utilized to train a smaller open-source model. Essentially, these approaches mainly capture the input-output patterns of closed-source LLMs without delving into more nuanced knowledge, as traditional knowledge distillation methods do. In contrast, our approach aims to estimate the output distribution of closed-source LLMs to train the student model within the traditional knowledge distillation framework.

3 METHOD

To enable knowledge distillation in the traditional manner, we propose to estimate the output distributions of closed-source LLMs within a Bayesian estimation framework, which includes both prior and posterior estimation. For a specific text input, prior estimation leverages the corpus generated by closed-source language models to derive an initial approximation of the output distribution. Meanwhile, posterior estimation relies on another open-source LLM as a proxy to refine the results of prior estimation. This proxy model serves as a bridge between the teacher (closed-source) and the student (open-source) models, as illustrated in Figure 2. Therefore, the proxy model is selected to be a larger language model than the student model and is initially aligned with the closed-source teacher model using the aforementioned corpus. Finally, we perform knowledge distillation using the estimated output distributions of the closed-source teacher LLM.
Notations        Descriptions
$\mathcal{T}$    Closed-source teacher model
$\mathcal{S}$    Open-source student model
$\mathcal{M}$    Open-source proxy model
$Y$              Output token sequence
$X$              Input token sequence
$p_{Y_t}$        Probability $\Pr(Y_t \mid X, Y_{<t})$ given by $\mathcal{T}$
$q_{Y_t}$        Probability $\Pr(Y_t \mid X, Y_{<t})$ given by $\mathcal{S}$
$P_{Y_t}$        Discrete random variable associated with the value of $p_{Y_t}$

Table 1: Main notations and descriptions.
3.1 PROBLEM STATEMENT

In this section, we first introduce the objective function in traditional knowledge distillation for language models. We use $\mathcal{T}$ and $\mathcal{S}$ to represent the closed-source teacher model and open-source student model, respectively. Let $X$ denote the input sequence of tokens and $Y$ denote the output sequence of tokens. At time $t$, the probability of generating an output token $Y_t$ can be represented as $\Pr(Y_t \mid X, Y_{<t})$. Let $p_{Y_t}$ be the probability $\Pr(Y_t \mid X, Y_{<t})$ given by $\mathcal{T}$, and let $q_{Y_t}$ be the probability $\Pr(Y_t \mid X, Y_{<t})$ given by $\mathcal{S}$. Let $\mathbb{1}_{Y_t}$ be the one-hot encoded label at time $t$ provided by $\mathcal{T}$. The traditional token-level objective function of knowledge distillation at time $t$ can be derived as follows:
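$$\mathcal{L}_t^{\text{traditional}} = -\sum_{w \in \mathcal{V}} \mathbb{1}_{Y_t = w} \log q_{Y_t = w} + \sum_{w \in \mathcal{V}} p_{Y_t = w} \log \frac{p_{Y_t = w}}{q_{Y_t = w}}, \tag{1}$$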

where $\mathcal{V}$ is the vocabulary and $w$ is a token in the vocabulary. $\mathcal{L}_t^{\text{traditional}}$ consists of two terms: the first term computes the cross-entropy loss with hard labels, and the second term computes the KL loss with soft labels. When the teacher $\mathcal{T}$ is closed-source, the second term is typically omitted because $p_{Y_t}$ cannot be obtained directly.
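To make the two terms of Equation 1 concrete, the following is a minimal sketch for a single time step over a toy vocabulary. The function name `traditional_kd_loss`, the smoothing constant `eps`, and the example probabilities are illustrative assumptions, not part of the original method.

```python
import math


def traditional_kd_loss(hard_label, teacher_probs, student_probs, eps=1e-12):
    """Token-level loss of Equation 1 for one time step.

    hard_label: index of the token emitted by the teacher (one-hot label).
    teacher_probs, student_probs: probabilities over the vocabulary.
    eps is an assumed smoothing constant that avoids log(0).
    """
    # First term: cross-entropy between the one-hot label and the student.
    ce = -math.log(student_probs[hard_label] + eps)
    # Second term: KL divergence from the teacher's soft labels to the student.
    kl = sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    return ce + kl


# Toy three-token vocabulary; the numbers are made up for illustration.
p = [0.7, 0.2, 0.1]  # teacher soft labels (unavailable for closed-source models)
q = [0.5, 0.3, 0.2]  # student distribution
loss = traditional_kd_loss(0, p, q)
```

When the teacher is closed-source, `teacher_probs` is exactly the quantity that is missing, which is why only the first term can be computed without an estimate of the soft labels.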
3.2 ESTIMATION METHODS

In this section, we elaborate on the proposed estimation methods: prior estimation and posterior estimation. Both methods are designed to estimate the soft labels (i.e., $p_{Y_t}$) of $\mathcal{T}$.

3.2.1 PRIOR ESTIMATION

The prior estimation aims to obtain a coarse-grained $\hat{p}_{Y_t}$ to approximate $p_{Y_t}$ at each time step $t$. The method achieves this by leveraging a corpus $\mathcal{C}$ generated by $\mathcal{T}$, through an optimized n-gram algorithm. Given a specific output token sequence $Y_{\le t} \in \mathcal{C}$, assume $Y_t = w_t$, where $w_t$ is a specific token in $\mathcal{V}$. For each token $w \in \mathcal{V}$, if $w = w_t$:

$$\hat{p}_{Y_t = w} = \frac{\#(Y_t = w, Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}{\gamma \, \#(Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})} + \frac{\gamma - 1}{\gamma}, \tag{2}$$

otherwise:

$$\hat{p}_{Y_t = w} = \frac{\#(Y_t = w, Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}{\gamma \, \#(Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}, \tag{3}$$

where $\#$ denotes the number of times a specific output token sequence appears in $\mathcal{C}$, $n$ is the window size, and $\gamma \in \mathbb{Z}^+$ is a hyperparameter that adjusts how strongly the probability mass is concentrated on the token $w_t$. For instance, when $\gamma = 2$, the term $\frac{\gamma - 1}{\gamma}$ ensures that the probability $\hat{p}_{Y_t = w_t}$ is greater than 50%. An assumption behind the prior estimation is that $\mathcal{T}$ typically generates the next token with a strong association to the most recent preceding tokens. Through Equations 2 and 3, we obtain an initial estimate $\hat{p}_{Y_t}$ for the soft labels $p_{Y_t}$. We refer to $\hat{p}_{Y_t}$ as the prior distribution.
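The counting scheme in Equations 2 and 3 can be sketched as follows. This is a simplified illustration, not the optimized n-gram algorithm itself: the helper names `build_counts` and `prior_distribution`, the toy corpus, and the choice to skip positions with fewer than `n` preceding tokens are all assumptions made here.

```python
from collections import Counter


def build_counts(corpus, n):
    """Count n-token contexts and (context, next-token) pairs in the corpus."""
    joint, context = Counter(), Counter()
    for seq in corpus:
        # Assumption: only positions with a full window of n preceding tokens count.
        for t in range(n, len(seq)):
            ctx = tuple(seq[t - n:t])
            joint[(ctx, seq[t])] += 1
            context[ctx] += 1
    return joint, context


def prior_distribution(ctx, observed, vocab, joint, context, gamma=2):
    """Prior over the vocabulary at one time step (Equations 2 and 3)."""
    ctx = tuple(ctx)
    if context[ctx] == 0:
        raise ValueError("context never appears in the corpus")
    denom = gamma * context[ctx]
    probs = {}
    for w in vocab:
        p = joint[(ctx, w)] / denom       # shared count ratio (Equation 3)
        if w == observed:
            p += (gamma - 1) / gamma      # extra mass for the observed token (Equation 2)
        probs[w] = p
    return probs


# Toy corpus of tokenized teacher outputs (illustrative only).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
vocab = ["a", "b", "c", "d"]
joint, context = build_counts(corpus, n=2)
probs = prior_distribution(["a", "b"], "c", vocab, joint, context, gamma=2)
```

In this toy run the observed token "c" receives 2/6 + 1/2 = 5/6 and "d" receives 1/6, so the prior sums to one while still assigning mass to an alternative continuation seen elsewhere in the corpus.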
3.2.2 POSTERIOR ESTIMATION

The prior distribution $\hat{p}_{Y_t}$ serves as a coarse-grained approximation for $p_{Y_t}$. To further refine the prior distribution and get a better approximation for $p_{Y_t}$, we introduce posterior estimation. The posterior estimation is primarily achieved by introducing a proxy $\mathcal{M}$ of $\mathcal{T}$ (typically an open-source LLM with a larger size than $\mathcal{S}$) under the Bayesian estimation framework. This estimation involves repeatedly sampling from $\mathcal{M}$ to refine the prior distribution. $\mathcal{M}$ is fine-tuned beforehand on the corpus $\mathcal{C}$ generated by $\mathcal{T}$ for preliminary alignment with $\mathcal{T}$. The motivation behind introducing $\mathcal{M}$ is to leverage it as a bridge between the closed-source teacher $\mathcal{T}$ and the open-source student $\mathcal{S}$, so as to better estimate the soft labels $p_{Y_t}$ of $\mathcal{T}$.

We treat the value of $p_{Y_t}$ as a discrete random variable, denoted $P_{Y_t}$ (the extension to the continuous case is straightforward, but we discuss the discrete case for ease of understanding). We define $P_{Y_t}$ with $m$ possible discrete values $p_1, p_2, \ldots, p_m$, which form an arithmetic sequence starting at 0 with step $1/m$ (e.g., 0.00, 0.01, 0.02, ..., 0.99, with $m = 100$). According to the prior distribution $\hat{p}_{Y_t}$, the probability mass function (PMF) $\Pr(P_{Y_t} = p_i)$ of $P_{Y_t}$ can be predefined in a way that satisfies the following constraint:

$$E(P_{Y_t}) = \sum_{i=1}^{m} p_i \Pr(P_{Y_t} = p_i) = \hat{p}_{Y_t} \tag{4}$$

Equation 4 implies that the PMF can vary, as long as the expectation $E(P_{Y_t})$ equals $\hat{p}_{Y_t}$. In practice, $m$ should be sufficiently large (e.g., $m = 100$).

Calibrating the prior distribution involves updating the PMF through sampling from $\mathcal{M}$. We feed $X$ and $Y_{<t}$ into $\mathcal{M}$ and sample a token $\hat{w} \in \mathcal{V}$ at time $t$. Given $\hat{w}$ and a token $w \in \mathcal{V}$, event $A$ is defined as follows: if $w = \hat{w}$, $A = 1$; otherwise, $A = 0$. In a sampling round, we update the PMF $\Pr(P_{Y_t} = p_i)$ based on the event $A$. If event $A = 1$ occurs, according to Bayes’ theorem:

$$\Pr(P_{Y_t = w} = p_i \mid A = 1) \propto \Pr(A = 1 \mid P_{Y_t = w} = p_i) \Pr(P_{Y_t = w} = p_i) = p_i \Pr(P_{Y_t = w} = p_i), \tag{5}$$

where $w \in \mathcal{V}$, $i \in \{1, 2, \ldots, m\}$. We get a normalization factor $\eta$ by:

$$\eta = \sum_{i=1}^{m} p_i \Pr(P_{Y_t = w} = p_i) \tag{6}$$

Then the value of $\Pr(P_{Y_t = w} = p_i \mid A = 1)$ can be calculated as $\frac{1}{\eta} p_i \Pr(P_{Y_t = w} = p_i)$.
If event $A = 0$ occurs instead, according to Bayes’ theorem:

$$\Pr(P_{Y_t = w} = p_i \mid A = 0) \propto \Pr(A = 0 \mid P_{Y_t = w} = p_i) \Pr(P_{Y_t = w} = p_i) = (1 - p_i) \Pr(P_{Y_t = w} = p_i), \tag{7}$$

where $w \in \mathcal{V}$, $i \in \{1, 2, \ldots, m\}$. We get a normalization factor $\eta$ by:

$$\eta = \sum_{i=1}^{m} (1 - p_i) \Pr(P_{Y_t = w} = p_i) \tag{8}$$

Then the value of $\Pr(P_{Y_t = w} = p_i \mid A = 0)$ can be calculated as $\frac{1}{\eta} (1 - p_i) \Pr(P_{Y_t = w} = p_i)$.
At this point, one sampling iteration concludes. The prior $\Pr(P_{Y_t} = p_i)$ is replaced by the posterior $\Pr(P_{Y_t} = p_i \mid A = 1)$ or $\Pr(P_{Y_t} = p_i \mid A = 0)$ in the next iteration. After multiple rounds of sampling from $\mathcal{M}$, we denote the final PMF as $\Pr(P_{Y_t} = p_i \mid \mathcal{M})$. $p_{Y_t}$ can then be approximated by the conditional expectation:

$$E(P_{Y_t} \mid \mathcal{M}) = \sum_{i=1}^{m} p_i \Pr(P_{Y_t} = p_i \mid \mathcal{M}) \tag{9}$$

We refer to $E(P_{Y_t} \mid \mathcal{M})$ as the posterior distribution.
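The update loop of Equations 4 to 9 can be sketched for a single token $w$ as follows. Several pieces are assumptions made for illustration: the text leaves the initial PMF free as long as Equation 4 holds, so the hypothetical helper `initial_pmf` mixes a uniform PMF with a point mass at one end of the grid; sampling from the proxy is replaced by a fixed list of sampled tokens; and the function names are invented here.

```python
def initial_pmf(prior_mean, grid):
    """Hypothetical PMF on the grid whose expectation equals prior_mean (Equation 4)."""
    m = len(grid)
    uniform_mean = sum(grid) / m
    # Keep the mean positive and inside the grid so the normalizer never vanishes.
    prior_mean = min(max(prior_mean, 1e-6), grid[-1])
    # Mix a uniform PMF with a point mass at the grid end on the same side as the mean.
    anchor = 0 if prior_mean <= uniform_mean else m - 1
    lam = (prior_mean - grid[anchor]) / (uniform_mean - grid[anchor])
    pmf = [lam / m] * m
    pmf[anchor] += 1.0 - lam
    return pmf


def bayes_update(pmf, grid, event):
    """One sampling round: Equations 5-6 if event is True, Equations 7-8 otherwise."""
    likelihood = [p if event else 1.0 - p for p in grid]
    unnormalized = [l * pr for l, pr in zip(likelihood, pmf)]
    eta = sum(unnormalized)  # normalization factor
    return [u / eta for u in unnormalized]


def posterior_mean(prior_mean, sampled_tokens, token, m=100):
    """Conditional expectation of Equation 9 after all sampling rounds."""
    grid = [i / m for i in range(m)]  # 0.00, 0.01, ..., 0.99 for m = 100
    pmf = initial_pmf(prior_mean, grid)
    for w_hat in sampled_tokens:
        pmf = bayes_update(pmf, grid, w_hat == token)
    return sum(p * pr for p, pr in zip(grid, pmf))


# Stand-in for ten tokens sampled from the proxy model (illustrative only).
sampled = ["France"] * 8 + ["Germany"] * 2
estimate = posterior_mean(0.2, sampled, "France")
```

In this toy run the estimate for "France" starts at the prior mean of 0.2 and moves toward the proxy's sampling frequency of 0.8. The values obtained for different tokens do not necessarily sum to one, which is why a normalization step is applied before they are used as soft labels.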

3.3 OVERALL OBJECTIVE

The overall objective function at time step $t$ comprises three objectives. Let $\mathbb{1}_{Y_t}$ be the one-hot encoded label provided by $\mathcal{T}$. The first objective at time step $t$ is the cross-entropy loss:

$$\mathcal{L}_t^{ce} = -\sum_{w \in \mathcal{V}} \mathbb{1}_{Y_t = w} \log q_{Y_t = w}.$$

The second objective at time step $t$ is derived from the prior distribution:

$$\mathcal{L}_t^{kl} = \sum_{w \in \mathcal{V}} \hat{p}_{Y_t = w} \log \frac{\hat{p}_{Y_t = w}}{q_{Y_t = w}}.$$

We first normalize

$$E(P_{Y_t} \mid \mathcal{M}) = \frac{E(P_{Y_t} \mid \mathcal{M})}{\sum_{w \in \mathcal{V}} E(P_{Y_t = w} \mid \mathcal{M})},$$

then the third objective at time step $t$ is derived from the posterior distribution:

$$\mathcal{L}_{t|\mathcal{M}}^{kl} = \sum_{w \in \mathcal{V}} E(P_{Y_t = w} \mid \mathcal{M}) \log \frac{E(P_{Y_t = w} \mid \mathcal{M})}{q_{Y_t = w}}.$$

Given an output token sequence of length $T$, the overall objective function can be derived as follows:

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \left( \mathcal{L}_t^{ce} + \alpha \mathcal{L}_t^{kl} + \beta \mathcal{L}_{t|\mathcal{M}}^{kl} \right) \tag{10}$$

where $\alpha$ and $\beta$ are hyperparameters used to adjust the contributions of $\mathcal{L}_t^{kl}$ and $\mathcal{L}_{t|\mathcal{M}}^{kl}$ to the total loss. When $\alpha > 0$ and $\beta = 0$, $\mathcal{L}$ becomes the loss for prior distillation. When $\alpha = 0$ and $\beta > 0$, $\mathcal{L}$ becomes the loss for posterior distillation.
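A minimal sketch of Equation 10 over a short sequence is given below. The function names, the smoothing constant `eps`, and the toy numbers are assumptions for illustration; the default weights 0.5 and 1.0 mirror the values reported for the ablation settings, and a real implementation would operate on model logits in a tensor library rather than on Python lists.

```python
import math


def kl_term(target, student, eps=1e-12):
    """KL divergence from a target distribution to the student distribution."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(target, student)
        if p > 0.0
    )


def normalize(values):
    """Rescale per-token posterior expectations so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def overall_loss(hard_labels, priors, posteriors, students,
                 alpha=0.5, beta=1.0, eps=1e-12):
    """Equation 10: average over time steps of CE + alpha*KL(prior) + beta*KL(posterior)."""
    total = 0.0
    for y, p_hat, post, q in zip(hard_labels, priors, posteriors, students):
        ce = -math.log(q[y] + eps)                  # hard-label term
        kl_prior = kl_term(p_hat, q, eps)           # prior-distribution term
        kl_post = kl_term(normalize(post), q, eps)  # posterior-distribution term
        total += ce + alpha * kl_prior + beta * kl_post
    return total / len(hard_labels)


# One time step over a three-token vocabulary (numbers are illustrative only).
hard_labels = [0]
priors = [[5 / 6, 1 / 6, 0.0]]
posteriors = [[0.75, 0.25, 0.05]]  # unnormalized posterior expectations
students = [[0.6, 0.3, 0.1]]
loss = overall_loss(hard_labels, priors, posteriors, students)
```

Setting `beta=0.0` gives the prior-distillation loss and setting `alpha=0.0` gives the posterior-distillation loss, matching the two special cases described above.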

4 EXPERIMENTAL SETUP

In this section, we conduct a series of experiments to validate the effectiveness of our method.
4.1 DATASETS

We mainly utilize the OpenOrca (Mukherjee et al., 2023) dataset as our training corpus. The OpenOrca dataset was created by prompting closed-source LLMs, such as GPT-4, with diverse inputs and collecting the corresponding output sequences. We follow the settings of OpenOrca-Preview1-13B (footnote 1) from Mukherjee et al. (2023). We also utilize the Alpaca (Taori et al., 2023) dataset as a training corpus. The Alpaca dataset was generated by prompting the closed-source LLM text-davinci-003 with diverse inputs and collecting the corresponding output sequences.

For evaluation, we utilize benchmarks including complex reasoning datasets BBH (Suzgun et al., 2022) and ARC (Clark et al., 2018), knowledge-based datasets AGIEval (Zhong et al., 2023) and MMLU (Hendrycks et al., 2021), commonsense reasoning dataset CSQA (Talmor et al., 2019), and mathematical reasoning dataset GSM8K (Cobbe et al., 2021). These benchmarks assess the model across a wide range of capabilities, including reading comprehension, commonsense knowledge, mathematical skills, and logical reasoning. Following the settings of Mukherjee et al. (2023), aside from GSM8K, we focus on tasks that involve multiple-choice questions.
4.2 BACKBONE MODELS

We employ the current state-of-the-art closed-source LLM GPT-4, as well as text-davinci-003, as the closed-source teacher models. We utilize LLaMA-7B and LLaMA-13B as student models, which are initialized with pre-trained weights obtained from Hugging Face (footnote 2). We choose LLaMA-33B as the proxy model. We employ top-p sampling for decoding. We train our models on eight 32GB V100 GPUs. Additional details can be found in Appendix A.

4.3 BASELINES

We consider the instruction fine-tuning (IFT) approach as our baseline. IFT involves fine-tuning the student model on the samples generated by the teacher model without using soft labels. We implement our own version of the baseline models. To ensure a fair comparison with other baseline models, we exclusively include models whose original fine-tuning datasets are accessible. As a result, our chosen baseline models are OpenOrca-Preview1-13B from Mukherjee et al. (2023) and Alpaca (Taori et al., 2023), which have been fine-tuned on the samples generated by the teacher model.

1 https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B
2 https://huggingface.co/models

Models                  #Params  BBH    AGIEval  ARC    MMLU   CSQA   GSM8K  Average
GPT-4                   -        67.4   56.4     -      86.4   -      92.0   -
LLaMA-7B (IFT)          7B       36.08  24.14    47.49  38.81  58.71  12.65  36.31
LLaMA-7B (ours)         7B       38.52  26.92    52.40  41.18  62.52  14.97  39.43
OpenOrca-Preview1-13B   13B      41.47  30.12    59.77  48.10  69.77  18.22  44.58
LLaMA-13B (IFT)         13B      42.77  26.74    58.2   45.3   66.27  20.93  43.37
LLaMA-13B (ours)        13B      44.83  29.35    61.84  48.17  68.94  23.36  46.08

Table 2: The results of the LLaMA models with different sizes on six benchmarks. We compare our approach with direct instruction fine-tuning on the hard labels. The performance of OpenOrca-Preview1-13B is assessed through our own evaluation. All student models are trained on the OpenOrca dataset.

Models                  #Params  BBH    AGIEval  ARC    MMLU   CSQA   GSM8K  Average
text-davinci-003        -        70.7   41.9     -      64.6   -      -      -
Alpaca-7B               7B       34.19  24.16    39.35  33.66  36.16  13.99  30.25
LLaMA-7B (ours)         7B       34.92  24.32    40.3   34.14  38.32  14.33  31.06
Alpaca-13B              13B      38.1   26.9     52.57  41.41  55.27  19.27  38.92
LLaMA-13B (ours)        13B      40.82  28.35    53.84  42.17  56.78  19.83  40.3

Table 3: The results of the LLaMA models with different sizes on six benchmarks. We compare our method with Alpaca. All student models are trained on the Alpaca dataset.

5 RESULT AND ANALYSIS

In this section, we present the main results, ablation studies, and additional experiments. The same corpus is used for proxy model fine-tuning, prior estimation, posterior estimation, and student distillation. Unless otherwise specified, the default training corpus is OpenOrca.

5.1 MAIN RESULTS

Table 2 shows the performance comparison of our method against baseline models. Detailed experimental results can be found in Appendix C. The training corpus used in Table 2 is the OpenOrca dataset. On average, our method outperforms OpenOrca-Preview1-13B from Mukherjee et al. (2023) as well as our own IFT models at both the 7B and 13B parameter scales. Table 3 shows the performance comparison between our method and Alpaca. The training corpus used in Table 3 is the Alpaca dataset. It shows a similar trend, with our method outperforming Alpaca. A case study in Table 4 demonstrates that our model exhibits better comprehension and answer generation capabilities in terms of reasoning ability compared to the baseline IFT. The experimental results not only demonstrate the effectiveness of our approach for both 7B and 13B student model scales but also validate the effectiveness of using estimated soft labels.
[Figure 3: panels BBH, AGIEval, MMLU, CSQA, and GSM8K; y-axis Accuracy (%); series Pretrain, IFT, Distilling+Prior, Distilling+Posterior, and Distilling+Prior+Posterior.]

Figure 3: Ablation results of LLaMA-7B across five representative benchmarks. In Distilling+Prior we set α = 0.5, β = 0 to investigate the effect of the prior estimation. In Distilling+Posterior we set α = 0, β = 1 to investigate the effect of the posterior estimation. In Distilling+Prior+Posterior we set α = 0.5, β = 1 to study the combined effect of both.
Instruction: The fastest car was 1.25 times faster than the 2nd fastest car. The 2nd fastest car was 5 times faster than the average car. The average car can go 120 Miles per hour. How fast does the faster car go?

Ours: The 2nd fastest car is 5 times faster than the average car, so it can go 5 x 120 = 600 Miles per hour. The faster car is 1.25 times faster than the 2nd fastest car, so it can go 1.25 x 600 = 750 Miles per hour. Therefore, the faster car goes 750 Miles per hour. ✓

IFT: The faster car can go 1.25 x 5 = 60 Miles per hour. Therefore, the faster car goes 60 Miles per hour. ×

Table 4: Examples from GSM8K dataset. Our method enhances the student model’s comprehension and leads to more detailed responses compared to the IFT method.

[Figure 4: panels BBH, AGIEval, MMLU, CSQA, and GSM8K; x-axis Sampling Rounds (5, 10, 20, 50); y-axis Accuracy (%); series Distilling+Posterior and IFT.]

Figure 4: Performance of knowledge distillation with the posterior distribution under different numbers of sampling rounds, compared with IFT, using LLaMA-7B.

5.2 ABLATION STUDY

This ablation study examines the impact of the components within our method. While retaining the standard cross-entropy loss, we evaluate the effect of the prior estimation and the posterior estimation. All results are presented in Figure 3.

Effect of the prior estimation. Retaining the cross-entropy loss, we incorporate the KL loss involving the prior distribution for training. This training method is denoted as Distilling+Prior. As shown in Figure 3, Distilling+Prior consistently outperforms IFT on all benchmarks, demonstrating the advantages of the coarse-grained knowledge obtained through the prior estimation.

Effect of the posterior estimation. Retaining the cross-entropy loss, we incorporate the KL loss involving the posterior distribution for training. This training method is denoted as Distilling+Posterior. As shown in Figure 3, compared to IFT as well as Distilling+Prior, Distilling+Posterior further boosts the performance. The improvement comes from the posterior distribution capturing more fine-grained knowledge of the closed-source teacher model.

Combined effect of both. We consider whether explicitly combining the KL losses of the prior distribution and the posterior distribution can improve the performance. Retaining the cross-entropy loss, we directly add the KL loss involving the prior distribution and the KL loss involving the posterior distribution to the total loss. This training method is denoted as Distilling+Prior+Posterior. As shown in Figure 3, the performance gain over Distilling+Posterior is marginal, with limited improvements on only a subset of the benchmarks. The reason is that the posterior distribution has already effectively integrated the knowledge from the prior distribution, so explicitly combining the KL loss terms brings limited improvement.

[Figure 5: panels BBH, AGIEval, MMLU, CSQA, and GSM8K; x-axis Dataset size (K) (10, 30, 50, 100, 200); y-axis Accuracy (%); series IFT, Distilling+Prior, and Distilling+Prior+Posterior.]

Figure 5: Comparison of three methods (IFT, Distilling+Prior, and Distilling+Prior+Posterior) under different dataset sizes, with LLaMA-7B as the student model.

Models               BBH    AGIEval  MMLU   GSM8K  Average
GPT-4 (teacher)      67.4   56.4     86.4   92.0   75.5
LLaMA-33B (proxy)    51.4   33.5     55.7   42.2   45.7
LLaMA-13B (proxy)    42.8   26.7     45.3   20.9   33.93

Table 5: The performance of closed-source teacher model and aligned proxy models.
Student Models   Proxy Models   BBH    AGIEval  MMLU   GSM8K  Average
LLaMA-7B         LLaMA-33B      38.52  26.92    41.18  14.97  30.4
LLaMA-7B         LLaMA-13B      37.41  25.67    39.56  13.83  29.12

Table 6: Performance of student model with different proxy models.

5.3 IMPACT OF SAMPLING ROUNDS

In this section, we discuss the impact of the number of sampling rounds on the posterior estimation. The results are presented in Figure 4. We observe that the best performance is achieved on most benchmarks when the number of sampling rounds falls within the range [10, 20]. We also find that excessive sampling (e.g., 50 rounds) has a negative impact on the performance of knowledge distillation.
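The update rules in Equations 5 to 8 can be unrolled to see how the number of sampling rounds enters the estimate. The following is a sketch derived from those equations, where $K$ (the number of sampling rounds) and $s_w$ (the number of rounds in which the proxy sampled token $w$) are symbols introduced here for illustration and are not part of the original notation.

```latex
\Pr(P_{Y_t = w} = p_i \mid \mathcal{M})
  \;\propto\; p_i^{\,s_w} \, (1 - p_i)^{\,K - s_w} \, \Pr(P_{Y_t = w} = p_i),
\qquad i \in \{1, 2, \ldots, m\}.
```

Under this reading, the likelihood factor sharpens around $s_w / K$ as $K$ grows, so the estimate moves toward the proxy's empirical sampling frequency and the corpus-derived prior carries less weight. This may be one reason why very large $K$ does not help, although that interpretation is an assumption rather than a result stated here.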
More discussions can be found in Appendix B.2.

5.4 IMPACT OF CORPUS SIZE

We investigate the effect of the size of the training corpus C, as shown in Figure 5. We observe that as the size of the training corpus C increases, the method “Distilling+Prior+Posterior” consistently outperforms IFT across benchmarks. A similar trend can also be observed for the method “Distilling+Prior”. Our interpretation is that our method benefits from a larger corpus: as the corpus size increases, the prior estimation can estimate a more accurate and information-rich distribution, which in turn influences the posterior estimation.
5.5 PROXY MODEL SELECTION

The proxy model serves as a bridge between the closed-source teacher model and the open-source student model. It is first fine-tuned on the corpus generated by the closed-source teacher for preliminary alignment. We believe that opting for a larger and more capable proxy model is advantageous, as it enhances the model’s ability to capture the capabilities of the closed-source teacher. Table 5 presents the performance of the proxy models compared to the closed-source teacher. The student’s performance with different proxy models is shown in Table 6. The results validate the advantage of choosing a more powerful proxy model.

6 CONCLUSION

In this work, we address the challenge of knowledge distillation for closed-source language models, where direct access to the teacher’s output distribution is not available. We propose Bayesian estimation-based knowledge distillation to estimate the output distribution of closed-source language models, achieving superior distillation performance. Our method comprises two main components: prior estimation and posterior estimation. The prior estimation obtains a coarse-grained prior distribution by leveraging the corpus generated by the closed-source language model. The posterior estimation updates the prior distribution based on repeated sampling from a proxy model to obtain a fine-grained posterior distribution. The results across various benchmarks consistently show that our method outperforms direct fine-tuning on hard labels for knowledge distillation of closed-source language models.

REFERENCES

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. Generalized knowledge distillation for auto-regressive language models, 2023.

Zhangir Azerbayev, Ansong Ni, Hailey Schoelkopf, and Dragomir Radev. Explicit knowledge transfer for weakly-supervised code generation, 2023.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving, 2023.

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models, 2023.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071, 2022.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.

Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of closed-source large language model. arXiv preprint arXiv:2305.12870, 2023.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.372. URL https://aclanthology.org/2020.findings-emnlp.372.

Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, and Ashwin Kalyan. Let gpt be a math tutor: Teaching math word problem solvers with customized exercise generation, 2023.

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6121–6133, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.494. URL https://aclanthology.org/2020.emnlp-main.494.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning, 2023.

Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant, 2019.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging bigbench tasks and whether chain-of-thought can solve them, 2022.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023.


Models      Batch Size  Max Length  LoRA Rank  #GPUs  Precision  Dimension  #Heads  #Layers
LLaMA-33B   1           512         96         8      float16    6656       52      60
LLaMA-13B   4           512         16         8      float16    5120       40      40
LLaMA-7B    6           512         16         4      float16    4096       32      32

Table 7: Model configurations.

A EXPERIMENTAL CONFIGURATIONS

A.1 TRAINING CONFIGURATIONS

The model configurations are provided in Table 7. We train the student models for three epochs, experimenting with learning rates of 1e-5, 3e-5, and 5e-5. In the knowledge distillation process, we use the following hyperparameters: for the total loss, α = 0.5 and β = 1; for prior estimation, γ = 3 and n = 5; for posterior estimation, we conduct 10 rounds of sampling. We evaluate the models on the benchmarks using the final checkpoint. To save time and memory, we employ LoRA (Hu et al., 2021) for training.
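
For readers who want to reproduce the setup, the settings above can be gathered into a single configuration object. The following is only an illustrative sketch: the class name, field names, and dictionary are hypothetical, and the values are copied from Table 7 and this subsection for the two student models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillConfig:
    """Hypothetical container for the hyperparameters reported in Appendix A.1."""
    model: str
    batch_size: int
    lora_rank: int
    num_gpus: int
    max_length: int = 512
    precision: str = "float16"
    epochs: int = 3                               # student models are trained for three epochs
    learning_rates: tuple = (1e-5, 3e-5, 5e-5)    # values tried during training
    alpha: float = 0.5                            # total-loss hyperparameter
    beta: float = 1.0                             # total-loss hyperparameter
    gamma: int = 3                                # prior-estimation hyperparameter
    ngram_order: int = 5                          # n in prior estimation
    sampling_rounds: int = 10                     # posterior estimation


# Student-model settings taken from Table 7.
STUDENT_CONFIGS = {
    "LLaMA-7B": DistillConfig("LLaMA-7B", batch_size=6, lora_rank=16, num_gpus=4),
    "LLaMA-13B": DistillConfig("LLaMA-13B", batch_size=4, lora_rank=16, num_gpus=8),
}

for name, cfg in STUDENT_CONFIGS.items():
    print(name, cfg)
```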

A.2 TRAINING COST

We conducted all model training on NVIDIA V100 GPUs with 32GB of memory. Table 8 presents the number of GPUs and the time cost per epoch for the models trained on the OpenOrca dataset. All student models are trained on the dataset for 3 epochs.

Models      #GPUs  Hours/Epoch
LLaMA-7B    4      17.0
LLaMA-13B   8      15.5
LLaMA-33B   8      40.0

Table 8: GPU and time costs for various models trained on the 200K OpenOrca dataset.
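
The figures in Table 8 can be converted into GPU-hours with simple arithmetic. The sketch below assumes that Hours/Epoch is wall-clock time with all listed GPUs in use, which the table does not state explicitly; the variable and function names are illustrative. The three-epoch total is computed only for the student models, since the epoch count is reported for students only.

```python
# Values copied from Table 8: (number of GPUs, hours per epoch).
TRAINING_COST = {
    "LLaMA-7B": (4, 17.0),
    "LLaMA-13B": (8, 15.5),
    "LLaMA-33B": (8, 40.0),
}
STUDENT_EPOCHS = 3  # stated for student models only


def gpu_hours_per_epoch(model: str) -> float:
    """GPU-hours for one epoch, assuming all GPUs run for the full epoch."""
    num_gpus, hours = TRAINING_COST[model]
    return num_gpus * hours


for name in TRAINING_COST:
    print(f"{name}: {gpu_hours_per_epoch(name):.0f} GPU-hours per epoch")

for name in ("LLaMA-7B", "LLaMA-13B"):
    total = gpu_hours_per_epoch(name) * STUDENT_EPOCHS
    print(f"{name}: {total:.0f} GPU-hours for {STUDENT_EPOCHS} epochs")
```

Under this assumption, LLaMA-7B costs 68 GPU-hours per epoch (204 over three epochs) and LLaMA-13B costs 124 GPU-hours per epoch (372 over three epochs).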

A.3 DATA USAGE PER STAGE

Table 9 summarizes the training data used for each model at every stage. Specifically, Orca200K denotes the OpenOrca corpus (Mukherjee et al., 2023) with 200K samples, while Alpaca52K denotes the Alpaca corpus (Taori et al., 2023) with 52K samples.

Models                  Prior Estimation Stage  Posterior Estimation Stage  Training Stage
LLaMA-7B (IFT)          -                       -                           Orca200K
LLaMA-7B (ours)         Orca200K                Orca200K                    Orca200K
OpenOrca-Preview1-13B   -                       -                           Orca200K
LLaMA-13B (IFT)         -                       -                           Orca200K
LLaMA-13B (ours)        Orca200K                Orca200K                    Orca200K
LLaMA-33B (Proxy)       -                       -                           Orca200K
Alpaca-7B               -                       -                           Alpaca52K
LLaMA-7B (ours)         Alpaca52K               Alpaca52K                   Alpaca52K
Alpaca-13B              -                       -                           Alpaca52K
LLaMA-13B (ours)        Alpaca52K               Alpaca52K                   Alpaca52K
LLaMA-33B (Proxy)       -                       -                           Alpaca52K

Table 9: Summary of training data for each model at each stage.


[Figures 6 and 7: bar-chart panels showing Probability (y-axis) over Token Indices (x-axis).]

Figure 6: The issue of probability sparsity in the output distribution. A significant portion of the probability mass concentrates on a few tokens, while the probabilities of the other tokens are close to zero.

[Figure 7 legend: Hard Label, Proxy Distribution, Posterior Distribution.]

Figure 7: Discrepancies between the ground-truth distribution and the output distribution of the proxy model (proxy distribution) in terms of the top-4 tokens, while the posterior distribution stays consistent with the ground-truth distribution.

B DISTRIBUTION ANALYSIS

B.1 PROBABILITY SPARSITY

During the distillation process, we observed probability sparsity in the output distribution of the proxy model. Typically, only a few tokens have high probabilities, while the probabilities of the other tokens are close to zero, as shown in Figure 6. In our distillation process, we retained only the probabilities of the ten most probable tokens and set the probabilities of the remaining tokens to zero. This phenomenon indicates that, when sampling from the proxy model, we do not need a large number of samples to cover all tokens with non-zero probabilities.
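
The top-ten truncation described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function name is hypothetical, the toy distribution is invented for the example, and the `renormalize` option is an assumption, since the text does not say whether the retained probabilities are rescaled to sum to one.

```python
import heapq


def keep_top_k(probs, k=10, renormalize=False):
    """Keep the k largest entries of a probability list and zero out the rest."""
    top = heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)
    truncated = [0.0] * len(probs)
    for i in top:
        truncated[i] = probs[i]
    if renormalize:
        total = sum(truncated)
        if total > 0:
            truncated = [p / total for p in truncated]
    return truncated


# Toy sparse distribution over a 20-token vocabulary: three tokens hold
# almost all of the mass, mirroring the pattern in Figure 6.
vocab_size = 20
probs = [0.001] * vocab_size
probs[3], probs[7], probs[11] = 0.60, 0.30, 0.083

truncated = keep_top_k(probs, k=10)
print("non-zero entries:", sum(p > 0 for p in truncated))
print("retained mass:", round(sum(truncated), 3))
```

In this toy case the ten retained tokens keep about 99% of the probability mass, which illustrates why little information is lost when the distribution is this sparse.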

B.2 DISTRIBUTION DISCREPANCY

We observe that as the number of sampling rounds increases, the model’s performance improves on most benchmarks. However, when the number of sampling rounds becomes excessive, such as 50 rounds, the model’s performance starts to decrease, as shown in Figure 4. Our analysis is that, with excessive sampling, the posterior distribution tends to degenerate into the proxy distribution. When the proxy distribution is used directly for knowledge distillation, we observe discrepancies between the proxy distribution and the labels generated by the teacher. For example, when the label generated by the teacher at the current position is “\n”, the proxy distribution assigns a high probability (e.g., 0.99) to “</s>”, while the probability of “\n” is close to 0. Such discrepancies can lead to issues in distillation. More cases are shown in Figure 7.
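
The degeneration described above can be illustrated with a toy Bayesian update. The sketch below assumes a Dirichlet-multinomial model in which the prior acts as pseudo-counts and each sampling round draws one token from the proxy; this is a stand-in for illustration and may differ from the exact update rule defined in the method section. The function names, the `prior_strength` value, and the toy distributions (modeled on the “\n” versus “</s>” example) are all hypothetical.

```python
import random


def posterior_mean(prior, counts, prior_strength):
    """Posterior mean under a Dirichlet-multinomial model.

    prior: dict mapping token -> prior probability (sums to 1)
    counts: dict mapping token -> number of times the proxy sampled it
    prior_strength: pseudo-count weight given to the prior (hypothetical knob)
    """
    total = prior_strength + sum(counts.values())
    tokens = sorted(set(prior) | set(counts))
    return {
        tok: (prior_strength * prior.get(tok, 0.0) + counts.get(tok, 0)) / total
        for tok in tokens
    }


def sample_counts(proxy, rounds, rng):
    """Draw one token per round from the proxy distribution and count them."""
    tokens = list(proxy)
    weights = [proxy[t] for t in tokens]
    counts = {t: 0 for t in tokens}
    for tok in rng.choices(tokens, weights=weights, k=rounds):
        counts[tok] += 1
    return counts


# Toy case: the teacher's label is "\n", but the proxy strongly prefers "</s>".
prior = {"\\n": 0.90, "</s>": 0.05, "the": 0.05}
proxy = {"\\n": 0.005, "</s>": 0.99, "the": 0.005}

rng = random.Random(0)
for rounds in (1, 10, 50, 500):
    post = posterior_mean(prior, sample_counts(proxy, rounds, rng), prior_strength=10.0)
    print(rounds, {tok: round(p, 3) for tok, p in post.items()})
```

In this toy model the weight of the prior is `prior_strength / (prior_strength + rounds)`, so its influence shrinks as the number of rounds grows and the posterior drifts toward the proxy's preferred token, matching the qualitative behavior reported for 50 rounds.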

Tasks                                    LLaMA-13B (IFT)  LLaMA-13B (ours)  LLaMA-7B (IFT)  LLaMA-7B (ours)
Boolean Expressions                      58.8             62.4              65.06           66.4
Causal Judgement                         61.27            63.01             56.98           61.85
Date Understanding                       50.0             54.02             49.3            49.26
Disambiguation QA                        56.8             60.0              49.4            54.8
Formal Fallacies                         56.4             54.4              54.0            54.0
Geometric Shapes                         25.2             23.6              12.42           22.4
Hyperbaton                               63.6             66.8              49.2            54.8
Logical Deduction (5 objects)            33.8             36.14             26.51           30.96
Logical Deduction (3 objects)            23.39            30.12             18.7            18.11
Logical Deduction (7 objects)            44.2             51.6              42.17           42.8
Movie Recommendation                     77.59            79.32             50.78           53.42
Navigate                                 51.6             56.8              45.6            55.2
Penguins in a Table                      32.61            36.11             30.58           34.91
Reasoning about Colored Objects          39.6             42.8              27.54           30.33
Ruin Names                               36.4             33.8              15.2            14.8
Salient Translation Error Detection      31.6             37.2              24.0            28.4
Snarks                                   48.31            52.25             43.82           45.7
Sports Understanding                     60.8             60.4              56.0            55.6
Temporal Sequences                       17.28            11.2              13.49           9.68
Tracking Shuffled Objects (5 objects)    19.46            21.1              17.2            17.74
Tracking Shuffled Objects (7 objects)    14.63            17.17             11.98           14.8
Tracking Shuffled Objects (3 objects)    37.5             36.02             33.9            32.52
Average                                  42.77            44.83             36.08           38.52

Table 10: Zero-shot performance comparison on the multiple-choice questions of the Big-Bench Hard benchmark.

C EXPERIMENTAL RESULTS

C.1 DETAILED RESULTS

Following the settings of OpenOrca-Preview1-13B (see footnote 3) from Mukherjee et al. (2023), and for time efficiency, we train on a subset of the original corpus containing 200K instances. The detailed experimental results for the LLaMA models on the BBH, AGIEval, and MMLU benchmarks are presented in Table 10, Table 11, and Table 12, respectively.

C.2 RESULTS OF FLANT5

We also conducted experiments on the FlanT5 (Longpre et al., 2023) models using the OpenOrca dataset; the results are shown in Table 13. Compared to the IFT method, our approach yields some improvement, although the improvement is limited. We speculate that this is because FlanT5 has already been fine-tuned with instructions, so the original model already has basic capabilities for these tasks and additional training brings only limited gains.

C.3 CONTINUOUS TRAINING OF PROXY MODEL

We also investigated the impact of continued fine-tuning of the proxy model on the OpenOrca corpus, as shown in Figure 8. As the number of fine-tuning epochs for the proxy model increases, the performance of posterior estimation decreases. We speculate that this is due to the proxy model overfitting to the current corpus, which reduces its effectiveness. During training, we therefore avoid excessive fine-tuning epochs for the proxy model.

C.4 ORDER OF N

We investigate the impact of the order n. Intuitively, n should be selected within a limited range. We conduct experiments distilling on the prior distribution with LLaMA-7B under different values of n, as shown in Table 14.
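
To give some intuition for why n matters, the sketch below builds a count-based next-token distribution from a toy corpus. It assumes, purely for illustration, that the prior is estimated from how often each token follows the preceding n tokens in the corpus; the exact prior formula (including the role of γ) is defined in the method section and may differ. The function name and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_prior(corpus, n):
    """Estimate P(next token | previous n tokens) by counting over a corpus.

    corpus: list of token lists. Near the start of a sequence, where fewer
    than n tokens precede the current position, all available tokens are used.
    """
    counts = defaultdict(Counter)
    for seq in corpus:
        for t, tok in enumerate(seq):
            context = tuple(seq[max(0, t - n):t])
            counts[context][tok] += 1
    return {
        ctx: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for ctx, ctr in counts.items()
    }


corpus = [
    "the answer is 4".split(),
    "the answer is 7".split(),
    "so the answer is 4".split(),
]

short_prior = ngram_prior(corpus, n=2)
long_prior = ngram_prior(corpus, n=100)

# With a short context, evidence from all three sequences is pooled.
print(short_prior[("answer", "is")])
# With a very long context, this context occurs once, so the estimate is one-hot.
print(long_prior[("so", "the", "answer", "is")])
```

In this toy setting, a very large n makes most contexts unique, so the count-based estimate collapses toward the hard label, whereas a small n pools evidence across similar contexts.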

3 https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B


Models            #Params  AQuA-RAT  LogiQA  LSAT-AR  LSAT-LR  SAT-English (w/o Psg.)  SAT-Math  Average
LLaMA-7B (IFT)    7B       19.71     26.81   18.22    27.44    30.35                   22.29     24.14
LLaMA-7B (ours)   7B       22.39     29.68   19.46    33.33    30.46                   26.19     26.92
LLaMA-13B (IFT)   13B      18.61     27.59   17.7     34.58    36.27                   25.7      26.74
LLaMA-13B (ours)  13B      25.22     29.63   19.65    36.67    33.5                    31.43     29.35

Table 11: Performance comparison on the selected English multiple-choice questions of the AGIEval benchmark. We use the OpenOrca dataset as the training corpus.

Models            #Params  Humanities  Other  Social Sciences  STEM   Average
LLaMA-7B (IFT)    7B       38.49       44.63  40.24            31.87  38.81
LLaMA-7B (ours)   7B       41.4        47.32  42.17            33.82  41.18
LLaMA-13B (IFT)   13B      46.02       53.19  48.24            33.91  45.34
LLaMA-13B (ours)  13B      47.81       56.7   51.36            36.79  48.17

Table 12: Performance comparison on the Massive Multitask Language Understanding benchmark.

[Figure 8 panels: MMLU, BBH, CSQA; x-axis: Training Epoch (1 to 4); y-axis: Accuracy (%); series: Distilling+Posterior.]

Figure 8: Performance of distilling on the posterior distribution (Distilling+Posterior) as a function of the number of fine-tuning epochs of the proxy model. We use LLaMA-7B as the student model and LLaMA-33B as the proxy model.

Models               #Params  BBH    AGIEval  ARC    MMLU   CSQA   GSM8K  Average
GPT-4                -        -      56.4     -      86.4   -      92.0   -
FlanT5-large (IFT)   780M     34.63  28.12    46.44  39.41  76.78  4.54   38.32
FlanT5-large (ours)  780M     35.22  28.84    46.61  39.34  76.93  4.71   38.61
FlanT5-xl (IFT)      3B       38.47  28.34    59.6   46.91  84.79  6.12   44.04
FlanT5-xl (ours)     3B       39.51  30.1     60.12  46.78  85.38  7.1    44.83

Table 13: Results of the FlanT5 models with different parameter sizes on the six benchmarks. We compare our method with IFT.

Models           Order of n  BBH   AGIEval  MMLU   GSM8K
GPT-4            -           67.4  56.4     86.4   92.0
LLaMA-7B (IFT)   -           36.8  24.14    38.81  12.65
LLaMA-7B (ours)  3           37.3  25.53    40.1   13.1
LLaMA-7B (ours)  5           37.3  25.7     40.0   13.2
LLaMA-7B (ours)  8           37.3  24.84    39.6   13.0
LLaMA-7B (ours)  100         36.2  24.3     38.7   12.7

Table 14: Results of LLaMA-7B distilled on the prior distribution with different orders of n.


