13,14c13
< Recently, many studies have focused on enhancing the capabilities of smaller models, through knowledge distillation (KD) on those closed-
---
> Many recent studies focus on enhancing the capabilities of smaller models through knowledge distillation from closed-
16,29c15,28
< However, due to the inability to directly access the closed-source language model's output distribution, KD methods can currently only be performed using one-hot labels, which hinders the effectiveness of KD.
< To address this limitation, we propose a Bayesian estimation-based knowledge distillation method.
< Specifically, our method comprises prior estimation and posterior estimation.
< The prior estimation obtains a prior distribution by leveraging the corpus generated by the closed-source language model.
< The posterior estimation updates the prior distribution to obtain a posterior distribution, based on continued sampling results.
< Then we utilize the prior and posterior distributions for distillation.
< Experimental results showcase that, in the context of KD for closed-source language model, our method outperforms the current KD methods that directly fine-tune on the one-hot labels.
---
> However, due to the inability to directly access the weights, hidden states, and output distributions of these closed-source models, distillation can only be performed by fine-tuning smaller models on samples generated by closed-source language models, which constrains the effectiveness of knowledge distillation.
> In this paper, we propose to estimate the output distributions of closed-source language models within a Bayesian estimation framework, involving both prior and posterior estimation.
> The prior estimation derives a prior distribution by utilizing the corpus generated by closed-source language models, while the posterior estimation employs a proxy model to update the prior distribution and derive a posterior distribution.
> By leveraging the estimated output distribution of closed-source language models, traditional knowledge distillation can be executed.
> Experimental results demonstrate that our method surpasses the performance of current models directly fine-tuned on data generated by closed-source language models.
38,39c37,38
< source counterparts like LLaMA(Touvron et al., 2023) and Falcon(Penedo et al.
---
> source counterparts like LLaMA (Touvron et al., 2023) and Falcon (Penedo et al.
44,52c43,44
< tuning open-source language models on the generated one-hot labels.
< Knowledge distillation (KD) (Hinton et al., 2015) is an effective technology that aims to obtain a small but strong student model by distilling knowledge from a large teacher model.
< The objective function in Hinton et al. (2015) involves calculating the Kullback-Leibler (KL) divergence between the output distributions of the teacher model and the student model.
< By minimizing the KL divergence, the student model is able to mimic the behavior and learn the intrinsic knowledge of the teacher model.
< However, many current methods (Hsieh et al.
> tuning open-source language models on these samples (Hsieh et al.
55,61c47,66
< , 2022) that perform KD on the closed-source LLMs involves solely fine-tuning student model on one-hot labels generated by the teacher model, as illustrated in Figure 1.
< In contrast to using output distribution (soft labels) to compute KL divergence, transferring deeper and more fundamental knowledge from teacher model to student model is constrained when relying solely on fine-tuning with one-hot labels.
< This represents a limitation in current KD methods for closed-
> , 2022).
> However, this approach only enables open-source language models to emulate the input-output behavior of closed-source LLMs without acquiring their intrinsic knowledge related to logits, weights, activations, and so forth.
> Knowledge distillation (KD) (Hinton et al., 2015) is a popular compression technology that aims to train a small but strong student model by distilling knowledge from a large teacher model.
> Among various sources of knowledge, the logits of the teacher model are typically utilized as an essential part of the objective function, implemented by minimizing the Kullback-Leibler (KL) divergence between the output distribution (soft labels) of the teacher model and the output distribution of the student model.
> This approach enables the student model to mimic the predictive behavior and acquire the knowledge of the teacher model.
> However, such approaches are not readily applicable to closed-source LLMs, as the soft labels are not available.
> To tackle this challenge, we propose to estimate the output distributions of closed-source LLMs within a Bayesian estimation framework, including both prior and posterior estimation.
> The aim of prior estimation is to derive a prior distribution by leveraging the corpus generated by closed-source language models.
> The rationale is that the corpus may contain coarse-grained information regarding the output distributions of closed-
63,77c68,75
< To address this limitation, we propose Bayesian estimation-based knowledge distillation to perform effective knowledge distillation on closed-source language model (LM).
< Our method first estimates the inaccessible output distribution (referred to as the latent distribution) of the closed-source LM, and then performs KD on the estimated distribution.
< Our approach comprises two main components: prior estimation and posterior estimation.
< (1) The prior estimation is designed to estimate the latent distribution by leveraging corpus generated by the closed-source LM.
< Our hypothesis is that within the generated corpus, there are underlying patterns that characterize the latent distribution.
< Through prior estimation, a prior distribution that approximates the latent distribution can be obtained.
< (2) By continuously sampling from a proxy of the closed-source LM, posterior estimation derives a posterior distribution to approximate the latent distribution.
< Then we perform KD on these esti-
---
> Meanwhile, the posterior estimation utilizes a proxy model, another open-source LLM typically larger than the student model, to calibrate the results of the prior estimation.
> This proxy model is initially aligned with the closed-source teacher model and then functions as a bridge between the teacher and the student, as illustrated in Figure 1.
> By leveraging the estimated output distribution of closed-source LLMs, traditional knowledge distillation can
105,108c103,106
< Figure 1: (a) In knowledge distillation of closed-source models, only one-hot labels (hard labels) can be obtained.
< (b) In knowledge distillation of open-source models, output distributions (soft labels) can be obtained.
---
> Figure 1: (a) In current knowledge distillation of closed-source models, only hard labels can be obtained.
> (b) In traditional knowledge distillation of open-source models, soft labels can be obtained.
111,118c109,117
< (d) Compared to hard labels, soft labels allow students to learn more profound knowledge by guiding them to learn from multiple valid targets.
< mated distributions.
< The utilization of the estimated distributions enables student model to tap into more profound and essential aspects of the closed-source teacher model's knowledge during distillation process.
< It fosters a more comprehensive and insightful learning experience compared to the previous closed-source KD paradigm relying solely on one-hot labels.
< We conduct extensive experiments with LLaMA (Touvron et al., 2023) across various representative benchmarks, such as BBH(
> (d) Compared to hard labels, soft labels allow students to learn more profound knowledge by guiding them to learn from multiple valid targets during distillation.
> 
> be carried out.
> Compared to previous approaches addressing this objective, our method enables the student model to learn from both the samples generated by the closed-source teacher and the soft labels provided by the proxy model, allowing the distillation of more intrinsic knowledge.
> To validate our approach, we performed comprehensive experiments on a range of well-established benchmarks, including complex reasoning datasets BBH (
120,124c119,123
< , 2022), AGIEval(Zhong et al., 2023), ARC Clark et al. (2018), MMLU(Hendrycks et al., 2021), CSQA(Talmor et al., 2019) and GSM8K(Cobbe et al.
---
> , 2022) and ARC (Clark et al., 2018), knowledge-based datasets AGIEval (Zhong et al., 2023) and MMLU (Hendrycks et al., 2021), commonsense reasoning dataset CSQA (Talmor et al., 2019), and mathematical reasoning dataset GSM8K (Cobbe et al.
126,132c125,136
< In the context of KD for closed-source LM, the empirical results demonstrate the effectiveness of our method over directly fine-tuning on the one-hot labels.
< For example, our method achieves an average accuracy improvement across the six benchmarks from 36.31% to 39.43% with LLaMA-7B, over methods that solely fine-tune on one-hot labels.
< These findings provide compelling evidence of the effectiveness of the proposed method.
---
> We used GPT-4 as the closed-source teacher model, LLaMA-33B as the proxy model, and LLaMA-13B/7B as the student model.
> The empirical results demonstrate the superiority of our method over directly fine-tuning the student model on samples generated by GPT-4, with an average improvement from 36.31 to 39.43 points across the six benchmarks.
> The experimental results show that the introduction of a proxy model can serve as an intermediary bridge for the student model to learn knowledge from the closed-source teacher model.
> The student benefits because the proxy model aligns more closely with the teacher model.
> This facilitates a more effective transfer of profound knowledge from the closed-source teacher model to the student model.
140,141c144
< This knowledge transfer is achieved by minimizing KL divergence between output distributions of the teacher model and the student model.
< Current KD methods can be categorized into two primary types:
---
> Current KD methods can be organized into two primary categories:
149,150c152
< Knowledge distillation was first applied to distilling open-source models.
---
> KD can be applied to open-source models for natural language understanding.
152c154
< (2019) applies KD to the pre-training process of BERT (
---
> (2019) applied KD to the pre-training process of BERT (
156c158
< (2020) allows the student model’s intermediate features to mimic the teacher model’s intermediate features,
---
> (2020) allowed the student model’s intermediate features to mimic the teacher model’s intermediate features,
157a160,165
> KD can also be applied to open-source models for natural language generation.
> Lin et al. (2020) investigated the exposure bias problem in the process of distillation for open-source language models.
> Similarly, Agarwal et al. (2023) studied the distribution mismatch between output sequences during training and the sequences generated by the open-source student during its deployment.
159,167c167,170
< (2023), focus on distilling open-source generative language model like LLaMA.
< Additionally, Park et al. (2019) leverages sample-wise relative information within the teacher model to perform knowledge distillation on ResNet (He et al., 2016).
< Mirzadeh et al. (2019) introduces an intermediate network to bridge the parameter size gap between the CNN teacher model and the CNN student model.
< However, it's important to note that in all these methods, the student model needs access to the internal features or parameters of the teacher model,
> (2023), focused on distilling open-source LLMs like LLaMA (Touvron et al., 2023).
> However, in all these methods, the student model needs access to the internal weights and features of the teacher model,
169,170c172,176
< source LM.
< 2.
---
> source LLMs.
> Most similar to our work, Mirzadeh et al. (2019) introduced an intermediate network to bridge the parameter size gap between the CNN teacher model and the CNN student model.
> In contrast to their approach, we introduce an intermediate network with the specific purpose of estimating output distributions of closed-source LLMs and achieving enhanced knowledge distillation.
173,184d178
< CLOSED-SOURCE KNOWLEDGE DISTILLATION
< 
< Given the outstanding performance of current SOTA closed-source LLMs like GPT-3.5 and GPT-4, many studies have shifted their focus towards transferring knowledge from these closed-source LLMs into smaller models.
< Some approaches, such as Hsieh et al. (2023); Ho et al. (2022); Mukherjee et al. (2023), utilize rationales generated by closed-source LLMs as training data.
< They then perform fine-tuning on these generated rationales to transfer the teacher model's reasoning abilities
< 
187,274c181,223
< Notations and descriptions (Table 1):
< C : Corpus generated by the closed-source language model
< V : Vocabulary of the language model
< M : Proxy model
< I : Input instruction
< wt : The t-th response token, wt ∈ V
< Qwt : Probability Pr(wt | wt−1, ..., w1, I) in the student model
< Pw∗t : Probability Pr(wt | wt−1, ..., w1, I) in the closed-source model
< Pwt : Random variable associated with the value of Pw∗t
< Y : Discrete random event, Y ∈ {0, 1}
< fWt(Pwt) : Probability density function of Pwt
< fWt|Y(Pwt | Y) : Conditional probability density function of Pwt given event Y
< E(Pwt) : Prior probability
< E(Pwt | M) : Posterior probability
< Table 1: Notations and descriptions.
< into the student model.
< To enhance the student's capabilities, Jiang et al. (2023), for instance, identifies challenging samples and has the closed-source teacher generate more to fine-tune the student.
< However, in the context of knowledge distillation for closed-source LM, most existing methods stop at fine-tuning on the teacher-generated one-hot labels.
< Our work, on the other hand, focuses on distilling knowledge from the closed-source LM more efficiently by estimating the latent distribution.
< We achieve this by introducing Bayesian estimation-based methods to soften the one-hot labels provided by the closed-source teacher.
< We enhance the effectiveness of knowledge transferring from the closed-source teacher model to the student model, by minimizing the KL divergence between the output distribution of the student model and the estimated output distribution.
< 3 METHOD
< We present Bayesian estimation-based knowledge distillation to enhance the efficiency of knowledge distillation for closed-source LM.
< 3.1 PROBLEM STATEMENT
< In this section, we first provide notations in Table 1.
< We consider a language model with vocabulary V, which takes an instruction I as input and generates response tokens w1, w2, w3, ... as output.
< At time t, the probability of generating token wt can be represented as Pr(wt | wt−1, ..., w1, I). We refer to the distribution as the probabilities Pr(wt | wt−1, ..., w1, I) encompassing all words within vocabulary V. Let Pw∗t be the probability Pr(wt | wt−1, ..., w1, I) in the closed-source LM; then the token-level objective function of KD for the closed-source LM at time t can be derived as follows:
< $\mathcal{L}^{kl}_t = \sum_{w_t \in \mathcal{V}} P^{*}_{w_t} \log \frac{P^{*}_{w_t}}{Q_{w_t}}$   (1)
< Where the Qwt is the probability Pr(wt | wt−1, ..., w1, I) in the student model.
< Due to the inaccessibility of Pw∗t, this objective function degrades to computing cross entropy with one-hot labels, which might limit the performance of KD.
< To this end, our goal is to estimate a probability to approximate the Pw∗t (referred to as the latent probability).
< Subsequently, we perform KD on the estimated probabilities.
< The overall architecture of our method is shown in Figure 2.
< Under review as a conference paper at ICLR 2024
---
> [Figure 2 graphic: the closed-source LLM generates corpus C; prior estimation over C yields a prior distribution; a proxy model (open-source LLM) fine-tuned on C is sampled for posterior estimation, yielding a posterior distribution; the student model is trained with L_ce (hard label), L_kl (prior distribution), and L_kl|M (posterior distribution).]
> Figure 2: Overview of our method.
> The output distributions of closed-source LLMs are estimated within a Bayesian estimation framework, including both prior and posterior estimation.
> The prior estimation leverages the corpus generated by closed-source language models to derive a prior distribution, while the posterior estimation utilizes a proxy model to calibrate the results of the prior estimation.
> Traditional knowledge distillation is applied using the estimated output distributions.
276,318c225,319
< [Figure 2 graphic in the earlier revision: the closed-source language model produces one-hot labels and the corpus C; prior estimation yields a prior distribution; the proxy model supports posterior estimation of a posterior distribution; the student model is trained with L_ce, L_kl, and L_kl|M.]
< Figure 2: Overview of our method.
< We first obtain the prior distribution through the prior estimation.
< Then in the posterior estimation, the prior distribution is updated through iterative sampling from a proxy of the closed-source LM.
< The final objective function involves three targets: one-hot label, prior distribution, and posterior distribution.
---
> 2.2 CLOSED-SOURCE KNOWLEDGE DISTILLATION
> In light of the remarkable performance of closed-source LLMs such as GPT-3.5 and GPT-4, numerous studies have shifted their attention toward transferring the diverse capabilities of these proprietary LLMs into smaller open-source models.
> For instance, Liang et al. (2023) improved the mathematical capability of a small model by training it with tailored exercise samples generated by GPT-3 (Brown et al., 2020).
> To transfer the code generation capability, Azerbayev et al. (2023) prompted Codex (Chen et al., 2021) to create natural language-code pairs and fine-tuned a smaller model on those samples.
> To transfer the tool usage capability, Gou et al. (2023) utilized GPT-4 to generate interactive tool-use trajectories as training samples for the target model.
> Other approaches, such as Hsieh et al. (2023); Ho et al. (2022); Mukherjee et al. (2023), utilized rationales generated by closed-source LLMs as training data to transfer their general reasoning capabilities.
> To sum up, these works typically transfer the capabilities of closed-source LLMs by prompting them to generate samples, which are then utilized to train a smaller open-source model.
> Essentially, these approaches mainly capture the input-output patterns of closed-source LLMs without delving into more nuanced knowledge as traditional knowledge distillation methods do.
> In contrast, our approach aims to estimate the output distribution of closed-source LLMs to train the student model within the traditional knowledge distillation framework.
> 3 METHOD
> To perform knowledge distillation as in traditional approaches, we propose to estimate the output distributions of closed-source LLMs within a Bayesian estimation framework, which includes both prior and posterior estimation.
> For a specific text input, prior estimation leverages the corpus generated by closed-source language models to derive an initial approximation of the output distribution.
> Meanwhile, posterior estimation relies on another open-source LLM as a proxy to refine the results of prior estimation.
> This proxy model serves as a bridge between the teacher (closed-source) and the student (open-source) models, as illustrated in Figure 2.
> Therefore, the proxy model is selected to be a larger language model than the student model and is initially aligned with the closed-source teacher model using the aforementioned corpus.
> Finally, we perform knowledge distillation using the estimated output distributions of the closed-source teacher LLM.
> Main notations and descriptions (Table 1):
> T : Closed-source teacher model
> S : Open-source student model
> M : Open-source proxy model
> Y : Output token sequence
> X : Input token sequence
> pYt : Probability Pr(Yt | X, Y<t) given by T
> qYt : Probability Pr(Yt | X, Y<t) given by S
> PYt : Discrete random variable associated with the value of pYt
> Table 1: Main notations and descriptions.
> 3.1 PROBLEM STATEMENT
> In this section, we first introduce the objective function in traditional knowledge distillation for language models.
> We use T and S to represent the closed-source teacher model and the open-source student model, respectively.
> Let X denote the input sequence of tokens and Y denote the output sequence of tokens.
> At time t, the probability of generating an output token Yt can be represented as Pr(Yt | X, Y<t). Let pYt be the probability Pr(Yt | X, Y<t) given by T, and let qYt be the probability Pr(Yt | X, Y<t) given by S. Let 1Yt be the one-hot encoded label at time t provided by T. The traditional token-level objective function of knowledge distillation at time t can be derived as follows:
> $\mathcal{L}^{traditional}_t = -\sum_{w \in \mathcal{V}} 1_{Y_t=w} \log q_{Y_t=w} + \sum_{w \in \mathcal{V}} p_{Y_t=w} \log \frac{p_{Y_t=w}}{q_{Y_t=w}}$   (1)
> where V is the vocabulary and w is a token in the vocabulary.
> $\mathcal{L}^{traditional}_t$ consists of two terms: the first term involves computing the cross-entropy loss with hard labels, and the second term involves computing the KL loss with soft labels.
> In the context of knowledge distillation of T, the second term is typically omitted because obtaining pYt is not directly feasible.
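To make the two terms of Eq. 1 concrete, here is a minimal sketch of the token-level objective (cross-entropy with the hard label plus KL divergence to the teacher's soft labels) over a toy three-token vocabulary. All names and numbers are illustrative, not from the paper.

```python
import math

def traditional_kd_loss(p_teacher, q_student, hard_label):
    """Token-level KD loss of Eq. 1: cross-entropy with the one-hot
    (hard) label plus KL divergence to the teacher's soft labels.
    p_teacher, q_student: dicts mapping vocabulary tokens to probabilities.
    hard_label: the token the teacher actually emitted at this step."""
    ce = -math.log(q_student[hard_label])            # -sum_w 1[Yt=w] log q_w
    kl = sum(p * math.log(p / q_student[w])          #  sum_w p_w log(p_w / q_w)
             for w, p in p_teacher.items() if p > 0)
    return ce + kl

# Toy vocabulary: teacher's soft labels vs. the student's current distribution.
p = {"cat": 0.7, "dog": 0.2, "fox": 0.1}
q = {"cat": 0.5, "dog": 0.3, "fox": 0.2}
loss = traditional_kd_loss(p, q, hard_label="cat")
```

With soft labels unavailable, only the `ce` term survives, which is exactly the degraded closed-source setting the paper describes.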
320,385c321,366
< 3.2 ESTIMATION METHODS
< 3.2.1 PRIOR ESTIMATION
< In this section, we elaborate on the proposed prior estimation method.
< The prior estimation aims to estimate a probability to approximate the latent probability Pw∗t at each time step t. Given a sequence (w′t, w′t−1, ..., w′1, I), the prior estimation aims to assign, at time step t, a high probability for the student model to generate the ground-truth token w′t while still allowing for some probability of other valid tokens.
< Given corpus C generated by the closed-source LM, for a specific sequence (w′t, w′t−1, ..., w′1, I) ∈ C, and for ∀wt ∈ V, if wt = w′t, then the value of Pr(wt | w′t−1, ..., w′1, I) can be computed as:
< $p_{w_t} = \frac{\gamma - 1}{\gamma} + \frac{\#(w_t, w'_{t-1}, \ldots, w'_{t-n})}{\gamma \, \#(w'_{t-1}, \ldots, w'_{t-n})}$   (2)
< If wt ≠ w′t, then the value of Pr(wt | w′t−1, ..., w′1, I) can be computed as:
< $p_{w_t} = \frac{\#(w_t, w'_{t-1}, \ldots, w'_{t-n})}{\gamma \, \#(w'_{t-1}, \ldots, w'_{t-n})}$   (3)
< Where # represents the count of a particular response token sequence appearing in C. The n is the window size.
< The γ is a hyperparameter, γ ∈ Z+, used to adjust the dominant probability contribution of the ground-truth token w′t.
< For instance, when γ = 2, the term (γ − 1)/γ ensures that the probability Pr(w′t | w′t−1, ..., w′1, I) of generating the ground-truth token w′t is greater than 50%.
< An assumption behind the prior estimation is that language models typically generate the next token with a strong association to the most recent preceding tokens.
< Through Equation 2 and Equation 3, we obtain a scalar probability value pwt.
< We consider the value of Pw∗t as a continuous random variable denoted as Pwt, Pwt ∈ [0, 1], with probability density function fWt(Pwt).
< The fWt(Pwt) can be predefined in a way that the expected value of Pwt is equal to the previously computed scalar pwt.
< Then a prior probability for approximating the latent probability Pw∗t can be obtained by calculating the expectation of Pwt (replace Pwt with x):
---
> 3.2 ESTIMATION METHODS
> In this section, we elaborate on the proposed estimation methods: prior estimation and posterior estimation.
> Both methods are designed to estimate the soft labels (i.e., pYt) of T.
> 3.2.1 PRIOR ESTIMATION
> The prior estimation aims to obtain a coarse-grained p̂Yt to approximate pYt at each time step t. The method achieves this by leveraging a corpus C generated by T, through an optimized n-gram algorithm.
> Given a specific output token sequence Y≤t ∈ C, assume Yt = wt, where wt is a specific token in V. For those tokens w ∈ V, if w = wt:
> $\hat{p}_{Y_t=w} = \frac{\gamma - 1}{\gamma} + \frac{\#(Y_t = w, Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}{\gamma \, \#(Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}$   (2)
> otherwise:
> $\hat{p}_{Y_t=w} = \frac{\#(Y_t = w, Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}{\gamma \, \#(Y_{t-1} = w_{t-1}, \ldots, Y_{t-n} = w_{t-n})}$   (3)
> where # represents the count of a specific output token sequence appearing in C. The n is the window size.
> The γ is a hyperparameter, γ ∈ Z+, used to adjust the dominant probability contribution of the token wt.
> For instance, when γ = 2, the term (γ − 1)/γ ensures that the probability p̂Yt=wt is greater than 50%.
> An assumption behind the prior estimation is that T typically generates the next token with a strong association to the most recent preceding tokens.
> Through Equations 2 and 3, we obtain an initial estimate p̂Yt for the soft labels pYt.
> We refer to p̂Yt as the prior distribution.
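The n-gram prior of Eqs. 2 and 3 can be sketched as follows over a toy teacher-generated corpus. The function name, the toy corpus, and the explicit `truth` argument are illustrative assumptions; the paper derives the ground-truth token from the sequence itself.

```python
from collections import Counter

def prior_distribution(corpus, context, truth, vocab, gamma=2, n=2):
    """Sketch of Eqs. 2-3: an n-gram prior over the teacher-generated corpus.
    The ground-truth token `truth` receives (gamma-1)/gamma plus its scaled
    n-gram count; every other token receives only its scaled n-gram count."""
    ngram, ctx = Counter(), Counter()
    for seq in corpus:
        for i in range(n, len(seq)):
            c = tuple(seq[i - n:i])
            ctx[c] += 1                   # count of the n-token context
            ngram[(c, seq[i])] += 1       # count of context followed by token
    p = {}
    for w in vocab:
        scaled = ngram[(context, w)] / (gamma * ctx[context])
        p[w] = (gamma - 1) / gamma + scaled if w == truth else scaled
    return p

# Toy corpus: context ("a", "b") was followed by "c" twice and "d" once.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
p_hat = prior_distribution(corpus, context=("a", "b"), truth="c", vocab=["c", "d"])
```

Because the counts of all next tokens sum to the context count, the estimates over the vocabulary sum to (γ−1)/γ + 1/γ = 1, so the prior is a valid distribution, and γ = 2 guarantees the ground-truth token exceeds 50%.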
389,403c370,400
< $E(P_{w_t}) = \int_0^1 x f_{W_t}(x)\,dx = p_{w_t}$   (4)
< 3.2.2 POSTERIOR ESTIMATION
---
> 3.2.2 POSTERIOR ESTIMATION
> The prior distribution p̂Yt serves as a coarse-grained approximation of pYt.
> To further refine the prior distribution and obtain a better approximation of pYt, we introduce posterior estimation.
> The posterior estimation is primarily achieved by introducing a proxy M of T (typically an open-source LLM with a larger size than S) under the Bayesian estimation framework.
> This estimation involves continuously sampling from M to refine the prior distribution.
> The M is previously fine-tuned on the corpus C generated by T for preliminary alignment with T. The motivation behind introducing M is to leverage it as a bridge between the closed-source teacher T and the open-source student S, serving the purpose of better estimating the soft labels pYt of T. We consider that the value of pYt can be described by a discrete random variable denoted as PYt (the transformation to the continuous case is straightforward, but we discuss the discrete case for better understanding).
> We define PYt with m possible discrete values p1, p2, ..., pm, where p1, p2, ..., pm form a number sequence increasing by 1/m from 0 to 1 (e.g., 0.00, 0.01, 0.02, ..., 0.99, with m = 100). According to the prior distribution p̂Yt, the probability mass function (PMF) Pr(PYt = pi) of PYt can be predefined in a way that satisfies the following constraint:
> $E(P_{Y_t}) = \sum_{i=1}^{m} p_i \Pr(P_{Y_t} = p_i) = \hat{p}_{Y_t}$   (4)
405,427c402,410
< The posterior estimation is based on the prior estimation to estimate Pw∗t . Specifically,
< the posterior estimation involves continued sampling from the closed-
< source LM.
< An intuitive idea is that, given a sampled token ŵt and a target token wt ,
< if the sampling results in ŵt = wt , the probability of generating wt should be increased;
< on the other hand, if the sampling results in ŵt ̸= wt ,
< then the probability of generating wt should be decreased.
< A discrete random event Y is defined as follows: In a sampling round of the closed-
< source LM, given input sequence (wt−1 , .
< .
< .
< , w1 , I) and a target token wt , if the sampled token ŵt = wt ,
< then Y = 1; otherwise, Y = 0. In practice, we achieve this by introducing an open-
< source language model M as a proxy of the closed-source model.
< The M is first fine-tuned on the corpus C for preliminary alignment.
< We feed the sequence (wt−1 , .
< .
< .
< , w1 , I) into M to sample a generated token ŵt at time t. In a sampling round,
< we update the prior probability dense function fWt (
< Pwt ) based on the event Y . If Y = 1 occurs, according to Bayes’ theorem:
< fWt |Y (Pwt |Y = 1) ∝ Pr(Y = 1|Pwt )fWt (Pwt ) = Pwt fWt (
< Pwt )
---
> Equation 4 implies that the PMF can vary, as long as the expectation E(PYt )
> equals p̂Yt .
> In practice, m should be sufficiently large (e.g., m = 100). Calibrating the prior distribution involves updating the PMF through sampling from M. We feed X and Y<t into M, a token ŵ ∈ V is sampled at time t.
> Given ŵ, and a token w ∈ V, event A is defined as follows:
> if w = ŵ, A = 1; otherwise, A = 0. In a sampling round,
> we update the PMF Pr(PYt = pi ) based on the event A. If event A = 1 occurs,
> according to Bayes’ theorem: Pr(PYt =w = pi |A = 1) ∝ Pr(
> A = 1|PYt =w = pi ) Pr(PYt =w = pi ) = pi Pr(PYt =
> w = pi ),
431,433c414,415
< Where fWt |Y (Pwt |Y = 1) is the posterior probability dense function conditioned on event Y = 1. Then,
< we integrating over Pwt fWt (Pwt ) to get a normalization factor η:
< 1
---
> where w ∈ V, i ∈ {1, 2, . . . , m}. We get a normalization factor η by:
> η=
435c417
< Z η=
---
> m X
437c419
< xfWt (x)dx
---
> pi Pr(PYt =w = pi )
441c423
< 0
---
> i=1
443,445c425,429
< Then the value of f_{W_t|Y}(P_{w_t} | Y = 1) can be calculated as f_{W_t|Y}(P_{w_t} | Y = 1) = (1/η) P_{w_t} f_{W_t}(P_{w_t}).
< In a sampling round, if event Y = 0 occurs instead, according to Bayes' theorem:
< f_{W_t|Y}(P_{w_t} | Y = 0) ∝ Pr(Y = 0 | P_{w_t}) f_{W_t}(P_{w_t}) = (1 − P_{w_t}) f_{W_t}(P_{w_t})   (7)
< where f_{W_t|Y}(P_{w_t} | Y = 0) is the posterior probability density function conditioned on event Y = 0.
< Similarly, we integrate over (1 − P_{w_t}) f_{W_t}(P_{w_t}) to get the normalization factor η:
< η = ∫_0^1 (1 − x) f_{W_t}(x) dx   (8)
< Then the value of f_{W_t|Y}(P_{w_t} | Y = 0) can be calculated as f_{W_t|Y}(P_{w_t} | Y = 0) = (1/η)(1 − P_{w_t}) f_{W_t}(P_{w_t}).
< The sampling process for M typically involves multiple iterations, where the posterior probability density function f_{W_t|Y}(P_{w_t} | Y) of each round updates the prior probability density function f_{W_t}(P_{w_t}) for the next round.
< We define f_{W_t}(P_{w_t}) in the first round as the probability density function obtained through prior estimation.
< We denote the final posterior probability density function as f_{W_t|M}(P_{w_t} | M). Then a posterior probability for approximating the latent probability P*_{w_t} can be obtained by calculating the conditional expectation:
< E(P_{w_t} | M) = ∫_0^1 x f_{W_t|M}(x | M) dx
---
> Then the value of Pr(P_{Y_t=w} = p_i | A = 1) can be calculated as (1/η) p_i Pr(P_{Y_t=w} = p_i).
> If event A = 0 occurs instead, according to Bayes' theorem:
> Pr(P_{Y_t=w} = p_i | A = 0) ∝ Pr(A = 0 | P_{Y_t=w} = p_i) Pr(P_{Y_t=w} = p_i) = (1 − p_i) Pr(P_{Y_t=w} = p_i),   (7)
> where w ∈ V, i ∈ {1, 2, . . . , m}. We get the normalization factor η by:
> η = Σ_{i=1}^{m} (1 − p_i) Pr(P_{Y_t=w} = p_i)   (8)
> Then the value of Pr(P_{Y_t=w} = p_i | A = 0) can be calculated as (1/η)(1 − p_i) Pr(P_{Y_t=w} = p_i).
> At this point, one sampling iteration concludes.
> The prior Pr(P_{Y_t} = p_i) will be replaced by the posterior Pr(P_{Y_t} = p_i | A = 1) or Pr(P_{Y_t} = p_i | A = 0) in the next iteration.
> After multiple rounds of sampling from M, we denote the final PMF as Pr(P_{Y_t} = p_i | M). The p_{Y_t} can then be approximated by calculating the conditional expectation:
> E(P_{Y_t} | M) = Σ_{i=1}^{m} p_i Pr(P_{Y_t} = p_i | M)
> We refer to E(P_{Y_t} | M) as the posterior distribution.
> Under review as a conference paper at ICLR 2024
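In code, one estimation round and the final expectation look as follows. This is a minimal illustrative sketch with a made-up support and a uniform prior; the variable names are ours, not from the paper's implementation:

```python
import numpy as np

def bayes_update(pmf, support, sampled):
    """One sampling round: if the token was drawn (event A = 1), the
    likelihood of each candidate value p_i is p_i; otherwise (A = 0)
    it is 1 - p_i. Dividing by the sum is the factor eta."""
    likelihood = support if sampled else 1.0 - support
    posterior = likelihood * pmf
    return posterior / posterior.sum()

def posterior_mean(pmf, support):
    """E(P | M): the conditional expectation approximating the latent probability."""
    return float(support @ pmf)

# m = 5 candidate probability values with a uniform prior over them.
support = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
pmf = np.full(5, 0.2)
for sampled in (True, True, False):  # three observed sampling rounds
    pmf = bayes_update(pmf, support, sampled)
print(round(posterior_mean(pmf, support), 3))  # -> 0.607
```

Two hits and one miss shift the PMF toward larger candidate values, so the expectation ends up above the prior mean of 0.5.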
493,520c473,489
< Let 1_{w_t} be the one-hot label, the first objective at time step t can be derived by calculating the cross entropy as L^ce_t = − Σ_{w_t∈V} 1_{w_t} log Q_{w_t}.
< The second objective at time step t can be derived based on the prior estimation as L^kl_t = Σ_{w_t∈V} E(P_{w_t}) log (E(P_{w_t}) / Q_{w_t}).
< We first normalize E(P_{w_t}|M) = E(P_{w_t}|M) / Σ_{w_t'∈V} E(P_{w_t'}|M),
< then the third objective at time step t can be derived based on the posterior estimation as L^kl_{t|M} = Σ_{w_t∈V} E(P_{w_t}|M) log (E(P_{w_t}|M) / Q_{w_t}).
< Given a sequence with length T, the overall objective function can be derived as follows:
< L = (1/T) Σ_{t=1}^{T} (L^ce_t + α L^kl_t + β L^kl_{t|M})
---
> Let 1_{Y_t} be the one-hot encoded label provided by T, the first objective at time step t can be derived by calculating the cross-entropy loss as L^ce_t = − Σ_{w∈V} 1_{Y_t=w} log q_{Y_t=w}.
> The second objective at time step t can be derived based on the prior distribution as L^kl_t = Σ_{w∈V} p̂_{Y_t=w} log (p̂_{Y_t=w} / q_{Y_t=w}).
> We first normalize E(P_{Y_t}|M) = E(P_{Y_t}|M) / Σ_{w∈V} E(P_{Y_t=w}|M),
> then the third objective at time step t can be derived based on the posterior distribution as L^kl_{t|M} = Σ_{w∈V} E(P_{Y_t=w}|M) log (E(P_{Y_t=w}|M) / q_{Y_t=w}).
> Given an output token sequence with length T, the overall objective function can be derived as follows:
> L = (1/T) Σ_{t=1}^{T} (L^ce_t + α L^kl_t + β L^kl_{t|M})
524c493
< Where the α and β are hyperparameters used to adjust the contributions of the L_t and L_{t|
---
> Where the α and β are hyperparameters used to adjust the contributions of the L^kl_t and L^kl_{t|
526,528c495,496
< When α = 0 and β > 0, the student model does not learn from the prior distribution.
< And the student model does not learn from the posterior distribution when α >
< 0 and β = 0.
---
> When α > 0 and β = 0, L becomes the loss for prior distillation.
> When α = 0 and β > 0, L becomes the loss for posterior distillation.
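The combined objective can be sketched as follows. This is an illustration only, not the released training code: `prior` and `posterior` stand for the estimated distributions over the vocabulary, `q` for the student's predictive distribution, and the function names are our own:

```python
import numpy as np

def token_loss(onehot, prior, posterior, q, alpha=0.5, beta=1.0, eps=1e-12):
    """Cross-entropy on the teacher's hard label, plus KL(prior || q)
    weighted by alpha and KL(posterior || q) weighted by beta."""
    ce = -np.sum(onehot * np.log(q + eps))
    kl_prior = np.sum(prior * np.log((prior + eps) / (q + eps)))
    kl_post = np.sum(posterior * np.log((posterior + eps) / (q + eps)))
    return ce + alpha * kl_prior + beta * kl_post

def sequence_loss(onehots, priors, posteriors, qs, alpha=0.5, beta=1.0):
    """Average the per-token objective over the T time steps."""
    return np.mean([token_loss(o, p, m, q, alpha, beta)
                    for o, p, m, q in zip(onehots, priors, posteriors, qs)])
```

Setting β = 0 recovers distillation on the prior alone, and α = 0 on the posterior alone, matching the two special cases above.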
534,536c502
< In this section, we set up a series of experiments to test the distilled models' capabilities on various benchmarks.
< These benchmarks assess the models across a wide range of capabilities, including reading comprehension, commonsense knowledge, mathematical skills, and logical reasoning.
---
> In this section, we conduct a series of experiments to validate the effectiveness of our method.
542c508
< We utilize the OpenOrca(Mukherjee et al.
---
> We mainly utilize the OpenOrca (Mukherjee et al.
544,558c510,527
< The OpenOrca dataset is a collection of FLAN (Longpre et al., 2023) data augmented by closed-source LLMs like GPT-4 and GPT-3.5.
< Following the settings in OpenOrca-Preview1-13B of paper Mukherjee et al. (2023), and considering time efficiency, we conduct training on a subset of the original corpus containing 200k instances.
< We also utilize the Alpaca (Taori et al., 2023) dataset as an additional experimental configuration.
< We utilize benchmarks including BBH (Suzgun et al., 2022), AGIEval (Zhong et al., 2023), ARC(Challenge) (Clark et al., 2018), MMLU (Hendrycks et al., 2021), CSQA (Talmor et al., 2019), and GSM8K (Cobbe et al., 2021) for evaluation.
---
> The OpenOrca dataset was created by prompting closed-source LLMs, such as GPT-4, with diverse inputs and collecting the corresponding output sequences.
> We follow the settings in OpenOrca-Preview1-13B of paper Mukherjee et al. (2023).
> We also utilize the Alpaca (Taori et al., 2023) dataset as the training corpus.
> The Alpaca dataset was generated by prompting the closed-source LLM text-davinci-003 with diverse inputs and collecting the corresponding output sequences.
> For evaluation, we utilize benchmarks including the complex reasoning datasets BBH (Suzgun et al., 2022) and ARC (Clark et al., 2018), the knowledge-based datasets AGIEval (Zhong et al., 2023) and MMLU (Hendrycks et al., 2021), the commonsense reasoning dataset CSQA (Talmor et al., 2019), and the mathematical reasoning dataset GSM8K (Cobbe et al., 2021).
> These benchmarks assess the model across a wide range of capabilities, including reading comprehension, commonsense knowledge, mathematical skills, and logical reasoning.
560c529
< (2023), we focus on datasets that involve multiple-
---
> (2023), aside from GSM8K, we focus on tasks that involve multiple-
562,565d530
< For all datasets, we conduct evaluation under the zero-shot setting, without any exemplars and without any CoT (Wei et al., 2022).
578,579d542
< To accelerate training, we leverage LoRA (Hu et al.
< , 2021).
585,586c548,552
< For baseline models, to ensure a fair comparison,
< we only consider models that have access to their original fine-
---
> IFT involves fine-tuning the student model on the samples generated by the teacher model without using soft labels.
> We implement our own version of the baseline models.
> To ensure a fair comparison with other baseline models, we exclusively include models that have access to their original fine-
588,599c554
< Therefore we select OpenOrca-Preview1-13B from Mukherjee et al. (2023) and Alpaca (Taori et al., 2023) as our baseline models.
< In addition, we also train our own version of baseline models.
< 
< 5 RESULT AND ANALYSIS
< 
< In this section, we present the main results, ablation studies and additional experiments.
< All corpus for proxy model fine-tuning, prior estimation, posterior estimation, and student distillation are iden
---
> As a result, our chosen baseline models are 1 2
685c640
< tuning on the one-hot labels.
---
> tuning on the hard labels.
759,761c714,725
< tical.
< Unless otherwise specified, "IFT" represents the baseline model that we have implemented ourselves, and the default training corpus we utilize is OpenOrca.
---
> OpenOrca-Preview1-13B from Mukherjee et al. (2023) and Alpaca (Taori et al., 2023), which have been fine-tuned on the samples generated by the teacher model.
> 
> 5 RESULT AND ANALYSIS
> 
> In this section, we present the main results, ablation studies and additional experiments.
> All corpus for proxy model fine-tuning, prior estimation, posterior estimation, and student distillation are identical.
> Unless otherwise specified, the default training corpus we utilize is OpenOrca.
767,768c731,732
< Table 2 shows the performance comparison of our method against baseline models on the six benchmarks.
< Detailed experimental results can be found in Appendix C. The training corpus we utilized in this table is the OpenOrca dataset.
---
> Table 2 shows the performance comparison of our method against baseline models.
> Detailed experimental results can be found in Appendix C. The training corpus we utilized in Table 2 is the OpenOrca dataset.
772c736
< The training corpus we utilized in this table is the Alpaca dataset.
---
> The training corpus we utilized in Table 3 is the Alpaca dataset.
774,786c738,740
< A case study in Table 4 demonstrates that our model exhibits better comprehension and answer generation capabilities in terms of reasoning ability compared to the baseline IFT model.
< The experimental results demonstrate that in the context of KD for closed-source LM, distilling knowledge using the estimated soft labels through our method yields superior results compared to directly using one-hot labels.
< 
< 5.2 ABLATION STUDY
< 
< This ablation study examines the impact of components within our method.
< While retaining the standard cross-entropy loss L^ce_t, we evaluate the effect of using only the prior estimation (α > 0,
---
> A case study in Table 4 demonstrates that our model exhibits better comprehension and answer generation capabilities in terms of reasoning ability compared to the baseline IFT.
> The experimental results not only demonstrate the effectiveness of our approach for both 7B and 13B student model scales but also validate the effectiveness of using estimated soft labels.
825c779
< In ”Distilling+Prior” we adjust α = 0.
---
> In Distilling+Prior we adjust α = 0.
827,829c781,782
< In ”Distilling+Posterior” we adjust α = 0, β = 1,
< to investigate the effect of the posterior estimation.
< In ”Distilling+Prior+Posterior” we adjust α = 0.
---
> In Distilling+Posterior we adjust α = 0, β = 1, to investigate the effect of the posterior estimation.
> In Distilling+Prior+Posterior we adjust α = 0.
833,835c786
< Instruction
---
> Instruction
933,939c884,894
< Figure 4: The comparison of knowledge distillation performance using the posterior distribution under different sampling round settings, as well as the comparison with IFT, with the model utilizing LLaMA-7B.
< 
< β = 0), and using only posterior estimation (α = 0, β > 0), and using both (α > 0, β > 0).
< We select five representative benchmarks.
---
> Figure 4: Comparing the performance of knowledge distillation utilizing the posterior distribution under various sampling round configurations with IFT, employing the model with LLaMA-7B.
> 
> 5.2 ABLATION STUDY
> 
> This ablation study examines the impact of components within our method.
> While retaining the standard cross-entropy loss, we evaluate the effect of the prior estimation and the posterior estimation.
941,957c896,915
< Effect of the prior estimation Compared to IFT, distilling on the prior distribution (Distilling+Prior) can enhance the model performance.
< The results indicate that, in addition to guiding the student towards learning from the ground-truth token, informing the student model about other valid tokens benefits the distillation.
< The consistent improvement over IFT suggests that the prior estimation can capture these valid tokens that represent the capabilities of the teacher model.
< Effect of the posterior estimation Compared to IFT, distilling on the posterior distribution (Distilling+Posterior) significantly boosts the performance.
< The improvement over "Distilling+Prior" indicates that the sampling results from the proxy model further refine the prior distribution.
< The posterior distribution can provide more comprehensive information that is beneficial for distillation.
< Combined effect of both As shown in Figure 3, we incorporate the prior distribution and the posterior distribution into the distillation process (Distilling+Prior+Posterior).
< We observe that the effect is similar to "Distilling+Posterior", with limited improvements seen on only a subset of the benchmarks.
< We attribute this phenomenon to the fact that the posterior distribution already contains the information from the prior distribution, so the improvement gained from incorporating the prior distribution is limited.
---
> Effect of the prior estimation Retaining the cross-
> entropy loss, we incorporate the KL loss involving the prior distribution for training.
> This training method is denoted as Distilling+Prior.
> As shown in Figure 3, Distilling+Prior consistently outperforms IFT on all benchmarks,
> demonstrating the advantages of the coarse-grained knowledge obtained through the prior estimation.
> Effect of the posterior estimation Retaining the cross-
> entropy loss, we incorporate the KL loss involving the posterior distribution for training.
> This training method is denoted as Distilling+Posterior.
> As shown in Figure 3, compared to IFT as well as Distilling+
> Prior, Distilling+Posterior further boosts the performance.
> The improvement in performance comes from the posterior distribution capturing more fine-
> grained knowledge of the closed-source teacher model.
> Combined effect of both We consider whether combining the KL loss of the prior distribution and the posterior distribution explicitly can improve the performance.
> Retaining the cross-entropy loss, we directly add the KL loss involving prior distribution and the KL loss involving posterior distribution into the total loss.
> This training method is denoted as Distilling+Prior+
> Posterior.
> As shown in Figure 3, we observe that the performance gain is marginal compared to Distilling+
> Posterior, with limited improvements seen on only a subset of the benchmarks.
> The reason for this is that the posterior distribution has already effectively integrated the knowledge from the prior distribution,
> and the improvement brought by explicitly combining the KL loss terms is limited.
1016,1019c976,977
< IFT, distilling on prior distribution (Distilling+Prior), and distilling on both prior and posterior distributions (Distilling+Prior+Posterior), with the student model utilizing LLaMA-7B.
---
> IFT, Distilling+Prior, and Distilling+Prior+Posterior, with the student model utilizing LLaMA-7B.
1022a981,1060
> Models              BBH    AGIEval  MMLU   GSM8K  Average
> GPT-4 (teacher)     67.4   56.4     86.4   92.0   75.5
> LLaMA-33B (proxy)   51.4   33.5     55.7   42.2   45.7
> LLaMA-13B (proxy)   42.8   26.7     45.3   20.9   33.93
> 
> Table 5: The performance of closed-source teacher model and aligned proxy models.
> 
> Student Models  Proxy Models  BBH    AGIEval  MMLU   GSM8K  Average
> LLaMA-7B        LLaMA-33B     38.52  26.92    41.18  14.97  30.4
> LLaMA-7B        LLaMA-13B     37.41  25.67    39.56  13.83  29.12
> 
> Table 6: Performance of student model with different proxy models.
1028c1066
< In this section, we discuss the impact of the number of sampling rounds on posterior estimation.
---
> In this section, we discuss the impact of the number of sampling rounds on the posterior estimation.
1032,1053c1070,1072
< Furthermore, excessive sampling, such as 50 times, leads to a decline in performance.
< We attribute this phenomenon to distribution discrepancy and prior distribution vanishing.
< Distribution discrepancy We observe that there exist discrepancies between the ground-truth one-hot labels provided by the closed-source LM and the output distribution of the proxy model.
< Although the proxy model has been aligned by fine-tuning on the corpus C generated by the closed-source LM, the token with the highest probability given by the proxy model at some positions is different from the ground-truth token (for example, when the ground-truth label at the current position is "\n", the proxy model assigns a high probability (e.g., 0.99) to "<\s>", while the probability of "\n" becomes close to 0), as elaborated in Appendix B.2. In this case, the inconsistency in distributions may negatively impact the performance of the distillation.
< Prior distribution vanishing In Bayesian estimation, there exists a phenomenon where the prior distribution vanishes as the posterior estimation undergoes excessive iterations.
< In other words, the impact of the prior distribution weakens with each successive iteration.
< We observe in Figure 4 that excessive sampling (e.g., 50 times) leads to the degeneration of the posterior distribution into the proxy model's output distribution, resulting in a negative impact on the performance of knowledge distillation.
< Therefore, it is important to control the number of samples within a reasonable range.
< Based on our experimental results, we find that choosing a sampling count between 10 and 20 works well.
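The prior-vanishing effect can be reproduced with a toy discretized update (the support, prior, and hit rate below are our own illustrative numbers, not the paper's data): after a few rounds the informative prior still pulls the estimate toward 0.5, while 50 rounds collapse it onto the proxy's empirical frequency.

```python
import numpy as np

support = np.array([0.1, 0.5, 0.9])   # candidate probability values p_i
prior = np.array([0.1, 0.8, 0.1])     # informative prior peaked at 0.5

def estimate_after(prior, n_hits, n_rounds):
    """Run n_rounds Bayesian updates, of which n_hits sample the token."""
    pmf = prior.copy()
    for r in range(n_rounds):
        lik = support if r < n_hits else 1.0 - support
        pmf = lik * pmf / np.sum(lik * pmf)
    return float(support @ pmf)

few = estimate_after(prior, n_hits=2, n_rounds=10)    # prior still visible
many = estimate_after(prior, n_hits=10, n_rounds=50)  # prior washed out
print(round(few, 2), round(many, 2))  # -> 0.36 0.1
```

With a 20% hit rate throughout, the long-run estimate approaches 0.1 regardless of the prior, which mirrors the degeneration toward the proxy's output distribution described above.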
< 5.4
---
> And we find that excessive sampling (e.g., 50 times) results in a negative impact on the performance of knowledge distillation.
> More discussions can be found in Appendix B.2.
> 5.4
1060,1061c1079,1080
< the method ”Distilling+Prior+Posterior” consistently outperforms the performance of IFT across benchmarks.
< A similar trend can also be observed in the method ”Distilling+
---
> the method “Distilling+Prior+Posterior” consistently outperforms the performance of IFT across benchmarks.
> A similar trend can also be observed in the method “Distilling+
1065a1085,1100
> 5.5 PROXY MODEL SELECTION
> 
> The proxy model serves as a bridge between the closed-source teacher model and the open-source student model.
> It is first fine-tuned on the corpus generated by the closed-source teacher for preliminary alignment.
> We believe that opting for a larger and more capable proxy model is advantageous, as it enhances the model's ability to capture the capabilities of the closed-source teacher.
> Table 5 presents the performance of the proxy models compared to the closed-source teacher, and the student's performance with different proxy models is shown in Table 6.
> The results validate the advantage of choosing a more powerful proxy model.
1074,1076c1109,1112
< source language models, enabling effective knowledge distillation.
< Our approach comprises two main components: prior estimation and posterior estimation.
< The prior estimation involves obtaining a prior distribution by leveraging the corpus generated by the closed-
---
> source language models, achieving superior distillation performance.
> Our method comprises two main components: prior estimation and posterior estimation.
> The prior estimation involves obtaining a coarse-
> grained prior distribution by leveraging the corpus generated by the closed-
1078,1079c1114,1116
< The posterior estimation updates the prior distribution based on continued sampling results from a proxy model.
< Extensive experiments are conducted based on LLaMA.
---
> The posterior estimation updates the prior distribution based on continued sampling results from a proxy model to obtain a fine-grained posterior distribution.
> Extensive experiments are conducted.
1081c1118
< tuning on one-hot labels, when it comes to knowledge distillation of closed-
---
> tuning on hard labels, when it comes to knowledge distillation of closed-
1084,1090c1122,1159
< REFERENCES
< 
< Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
< Think you have solved question answering?
< try arc, the ai2 reasoning challenge, 2018.
< Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen,
< Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek,
---
> REFERENCES
> 
> Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem.
> Generalized knowledge distillation for auto-regressive language models, 2023.
> Zhangir Azerbayev, Ansong Ni, Hailey Schoelkopf, and Dragomir Radev.
> Explicit knowledge transfer for weakly-supervised code generation,
> 2023.
> Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah,
> Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
> Pranav Shyam, Girish Sastry, Amanda Askell, et al.
> Language models are few-shot learners.
> Advances in Neural Information Processing Systems,
> 33:1877–1901, 2020.
> Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
> Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph,
> Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger,
> Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin,
> Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov,
> Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,
> Philippe Tillet, Felipe Petroski Such, Dave Cummings,
> Matthias Plappert, Fotios Chantzis, Elizabeth Barnes,
> Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol,
> Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin,
> Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse,
> Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra,
> Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage,
> Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew,
> Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
> Evaluating large language models trained on code,
> 2021.
> Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
> Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
> Think you have solved question answering?
> try arc, the ai2 reasoning challenge, 2018.
> Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen,
> Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek,
1095,1098d1163
< 9
< 
< Under review as a conference paper at ICLR 2024
< 
1108c1173,1177
< org/N19-1423. Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang.
---
> org/N19-1423. Zhibin Gou, Zhihong Shao, Yeyun Gong,
> Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen.
> Tora: A tool-integrated reasoning agent for mathematical problem solving,
> 2023.
> Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang.
1110,1118d1178
< Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
1160a1225,1243
> Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, and Ashwin Kaylan. Let gpt be a math tutor: Teaching math word problem solvers with customized exercise generation, 2023.
> Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6121–6133, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.494. URL https://aclanthology.org/2020.emnlp-main.494.
< Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho.
< Relational knowledge distillation.
< In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (
< CVPR), June 2019.
1188,1191d1266
< 10
< 
< Under review as a conference paper at ICLR 2024
< 
1218,1232d1292
< Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma,
< brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou.
< Chain-of-thought prompting elicits reasoning in large language models.
< In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
< K. Cho, and A. Oh (eds.
< ), Advances in Neural Information Processing Systems,
< volume 35, pp.
< 24824–24837.
< Curran Associates, Inc.
< , 2022.
< URL https://proceedings.
< neurips.
< cc/paper_files/paper/2022/ file/9d5609613524ecf4f15af0f7b31abca4-
< Paper-Conference.
< pdf.
1250c1310,1394
< #GPUs 8 8 4
---
> #GPUs
> 
> Precision
> 
> Dimension
> 
> #Heads
> 
> #Layers
> 
> 8 8 4
> 
> float16 float16 float16
> 
> 6656 5120 4096
> 
> 52 40 32
> 
> 60 40 32
> 
> Table 7: Model configurations.
> 
> A
> 
> E XPERIMENTAL C ONFIGURATIONS
> 
> A.
> 1
> 
> T RAINING C ONFIGURATIONS
> 
> The model configurations are provided in Table 7.
> We train the student models for three epochs, experimenting with learning rates of 1e-5, 3e-5, and 5e-5 during training.
> In the knowledge distillation process, we use the following hyperparameters:
> For the total loss, α = 0.
> 5 and β = 1.
> For prior estimation, we set γ = 3 and n = 5. For posterior estimation,
> we conduct 10 rounds of sampling.
> We evaluate the models on the benchmarks using the final checkpoint.
> For time efficiency and memory saving, we employ LoRA (
> Hu et al.
> , 2021) for more efficient training.
> A.2
> 
> T RAINING C OST
> 
> We conducted all our model training on NVIDIA V100 GPUs equipped with 32GB memory.
> The table 8 presents the GPU and time costs per epoch for various models trained on the OpenOrca dataset.
> For all student models, we train on the dataset for 3 epochs.
> Models LLaMA-7B LLaMA-13B LLaMA-33B
> 
> #GPUs 4 8 8
> 
> Hours/Epoch 17.
> 0 15.
> 5 40.
> 0
> 
> Table 8: The GPU and time costs for various models trained on the 200K OpenOrca dataset.
> A.3
> 
> DATA U SAGE PER S TAGE
> 
> Table 9, summarizes the training data used for each model at every stage.
> Specifically, Orca200K denotes the OpenOrca corpus (
> Mukherjee et al.
> , 2023) with 200K samples, while Alpaca52K represents the Alpaca corpus (
> Taori et al.
> , 2023) with 52K samples.
> 
> LLaMA-7B (IFT)
> 
> Prior Estimation Stage -
> 
> Posterior Estimation Stage -
> 
> Training Stage Orca200K
> 
> LLaMA-7B (ours)
> 
> Orca200K
> 
> Orca200K
> 
> Orca200K
1252c1396
< Precision float16 float16 float16
---
> OpenOrca-Preview1-13B
1254c1398
< Dimension 6656 5120 4096
---
> -
1256c1400
< #Heads 52 40 32
---
> -
1258c1402,1470
< #Layers 60 40 32
---
> Orca200K
> 
> LLaMA-13B (IFT)
> 
> -
> 
> -
> 
> Orca200K
> 
> LLaMA-13B (ours)
> 
> Orca200K
> 
> Orca200K
> 
> Orca200K
> 
> LlaMA-33B (Proxy)
> 
> -
> 
> -
> 
> Orca200K
> 
> Alpaca-7B
> 
> -
> 
> -
> 
> Alpaca52K
> 
> Alpaca52K
> 
> Alpaca52K
> 
> Alpaca52K
> 
> -
> 
> -
> 
> Alpaca52K
> 
> LLaMA-13B (ours)
> 
> Alpaca52K
> 
> Alpaca52K
> 
> Alpaca52K
> 
> LlaMA-33B (Proxy)
> 
> -
> 
> -
> 
> Alpaca52K
> 
> Models
> 
> LLaMA-7B (ours) Alpaca-13B
> 
> Table 9: Summary of training data for each model at each stage.
> 
> 12
1276,1278d1488
< Table 5: Model configurations.
< 
< 5353
1300a1511,1512
> 5353
> 
1350a1563
> Hard Label Proxy Distribution Posterior Distribution
1352,1363c1565,1567
< A EXPERIMENTAL DETAILS
< 
< The model configurations are provided in Table 5.
< We train the student models for three epochs, experimenting with learning rates of 1e-5, 3e-5, and 5e-5 during training.
< In the knowledge distillation process, we use the following hyperparameters: for the total loss, α = 0.5 and β = 1; for prior estimation, we set γ = 3 and n = 5; for posterior estimation, we conduct 10 rounds of sampling.
< We evaluate the models on the benchmarks using the final checkpoint.
---
> Figure 7: Discrepancies between the ground-truth distribution and the output distribution of the proxy model (proxy distribution) in terms of the top-4 tokens, while the posterior distribution can stay consistent with the ground-truth distribution.
1391,1392c1595,1596
< we observe discrepancies between the proxy distribution and ground-
< truth labels (For example, when the ground-truth label at the current position is ”\
---
> we observe discrepancies between the proxy distribution and labels generated by teacher (
> For example, when the label generated by teacher at the current position is “\
1394c1598
< e.g., 0.99) to ”<\s>”, while the probability of ”\
---
> e.g., 0.99) to “<\s>”, while the probability of “\
1398,1407c1602
< Figure 7: Discrepancies between the ground-truth distribution and the output distribution of the proxy model (proxy distribution) in terms of the top-4 tokens, while the posterior distribution can stay consistent with the ground-truth distribution.
1409,1437c1604,1742
< [Figure 8 panels: accuracy (%) vs. training epoch on CSQA, MMLU, and BBH]
---
> Tasks                                   LLaMA-13B (IFT)  LLaMA-13B (ours)  LLaMA-7B (IFT)  LLaMA-7B (ours)
> Boolean Expressions                     58.8             62.4              65.06           66.4
> Causal Judgement                        61.27            63.01             56.98           61.85
> Date Understanding                      50.0             54.02             49.3            49.26
> Disambiguation QA                       56.8             60.0              49.4            54.8
> Formal Fallacies                        56.4             54.4              54.0            54.0
> Geometric Shapes                        25.2             23.6              12.42           22.4
> Hyperbaton                              63.6             66.8              49.2            54.8
> Logical Deduction (5 objects)           33.8             36.14             26.51           30.96
> Logical Deduction (3 objects)           23.39            30.12             18.7            18.11
> Logical Deduction (7 objects)           44.2             51.6              42.17           42.8
> Movie Recommendation                    77.59            79.32             50.78           53.42
> Navigate                                51.6             56.8              45.6            55.2
> Penguins in a Table                     32.61            36.11             30.58           34.91
> Reasoning about Colored Objects         39.6             42.8              27.54           30.33
> Ruin Names                              36.4             33.8              15.2            14.8
> Salient Translation Error Detection     31.6             37.2              24.0            28.4
> Snarks                                  48.31            52.25             43.82           45.7
> Sports Understanding                    60.8             60.4              56.0            55.6
> Temporal Sequences                      17.28            11.2              13.49           9.68
> Tracking Shuffled Objects (5 objects)   19.46            21.1              17.2            17.74
> Tracking Shuffled Objects (7 objects)   14.63            17.17             11.98           14.8
> Tracking Shuffled Objects (3 objects)   37.5             36.02             33.9            32.52
> Average                                 42.77            44.83             36.08           38.52
> 
> Table 10: Zero-shot performance comparison in Big-Bench Hard benchmark on multiple-choice questions.
> 
> C EXPERIMENTAL RESULTS
> 
> C.1 DETAILED RESULTS
> 
> Following the settings in OpenOrca-Preview1-13B of paper Mukherjee et al. (2023), and considering time efficiency, we conduct training on a subset of the original corpus containing 200k instances.
> The detailed experimental results for the LLaMA model on BBH, AGIEval, and MMLU benchmarks are presented in Table 10, Table 11 and Table 12.
> 
> C.2 RESULTS OF FLANT5
> 
> We also conducted experiments on the FlanT5 (Longpre et al., 2023) model using the OpenOrca dataset, and the results are shown in Table 13.
> We find that, compared to the IFT method, our approach does lead to some improvement, although the improvement is limited.
> We speculate that this might be because FlanT5 is a model that has been fine-tuned with instructions, and its original model already had some basic capabilities for these tasks.
> Therefore, the additional training results in limited improvement.
> 
> C.3 CONTINUOUS TRAINING OF PROXY MODEL
1439,1463d1743
< Figure 8: The change in performance of distilling on the posterior distribution (Distilling+Posterior) with the fine-tuning epochs of the proxy model. We utilize LLaMA-7B as the student model, and LLaMA-33B as the proxy model.
< 
< C EXPERIMENTAL RESULTS
< 
< The detailed experimental results for the LLaMA model on BBH, AGIEval, and MMLU benchmarks are presented in Table 8, Table 7 and Table 9.
< We also conducted experiments on the FlanT5 (Longpre et al., 2023) model using the OpenOrca dataset, and the results are shown in Table 6.
< We find that, compared to the IFT method, our approach does lead to some improvement, although the improvement is limited.
< We speculate that this might be because FlanT5 is a model that has been fine-tuned with instructions, and its original model already had some basic capabilities for these tasks.
< Therefore, the additional training results in limited improvement.
1470a1751
> C.4
1472,1477c1753
< 13
< 
< Under review as a conference paper at ICLR 2024
< 
< Models GPT-4 FlanT5-large (IFT) FlanT5-large (ours)
< FlanT5-xl (IFT) FlanT5-xl (ours)
---
> O RDER OF N
1479c1755,1758
< #Params
---
> We investigate the impact of the order of n. Intuitively,
> the order of n should be selected within a limited range.
> We conduct experiments distilling on the prior distribution with LLaMA-7B under different orders of n, as shown in Table 14.
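As an illustrative aside: the prior distribution discussed here is estimated from teacher-generated text, with the order of n controlling how much preceding context each next-token distribution conditions on. The paper's exact estimation procedure is not shown in this excerpt, so the function name, token-level granularity, and Laplace smoothing below are assumptions; this is only a minimal sketch of an order-n prior.

```python
from collections import Counter, defaultdict

def estimate_ngram_prior(corpus, n=3, vocab=None, alpha=1.0):
    """Estimate an order-n prior over next tokens from a teacher-generated corpus.

    corpus: list of token sequences (e.g., samples from the closed-source model).
    n: n-gram order; each prediction conditions on the previous n-1 tokens.
    alpha: Laplace smoothing constant, so unseen tokens keep nonzero mass.
    Returns a function mapping a context (token list) to a {token: prob} dict.
    """
    if vocab is None:
        vocab = sorted({tok for seq in corpus for tok in seq})
    counts = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq)):
            # Context is the (n-1)-token window ending just before position i.
            context = tuple(seq[max(0, i - (n - 1)):i])
            counts[context][seq[i]] += 1

    def prior(context):
        context = tuple(context[-(n - 1):]) if n > 1 else ()
        c = counts[context]
        total = sum(c.values()) + alpha * len(vocab)
        return {tok: (c[tok] + alpha) / total for tok in vocab}

    return prior

# Tiny toy corpus standing in for teacher-generated samples.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
prior = estimate_ngram_prior(corpus, n=2)
p = prior(["cat"])  # smoothed distribution over {cat, dog, ran, sat, the}
```

A larger n sharpens the context but fragments the counts, which is consistent with the observation above that n should stay within a limited range.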
1481,1527c1760
< BBH AGIEval ARC MMLU CSQA GSM8K Average
< 
< 780M 780M 3B 3B
< 
< 34.63 35.22 38.47 39.51
< 56.4 28.12 28.84 28.34 30.1
< 46.44 46.61 59.6 60.12
< 86.4 39.41 39.34 46.91 46.78
< 76.78 76.93 84.79 85.38
---
> 3
1529,1534c1762,1763
< 92.0 4.54 4.71 6.12 7.1
> https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B
1536,1540c1765
< 38.32 38.61 44.04 44.83
---
> 14
1542,1543c1767
< Table 6: The results of the FlanT5 models with different parameter sizes on the six benchmarks.
< We compare our method with IFT.
---
> Under review as a conference paper at ICLR 2024
1545,1546c1769
< Models LLaMA-7B (IFT) LLaMA-7B (ours) LLaMA-13B (IFT) LLaMA-13B (ours)
---
> Models
1609c1832,1835
< Table 7: Performance comparison in AGIEval benchmark on the selected multiple-
---
> LLaMA-7B (IFT) LLaMA-7B (ours) LLaMA-13B (IFT) LLaMA-13B (ours)
> 
> Table 11: Performance comparison in AGIEval benchmark on the selected multiple-
1613,1723c1839
< Tasks | LLaMA-13B (IFT) | LLaMA-13B (ours) | LLaMA-7B (IFT) | LLaMA-7B (ours)
< Boolean Expressions | 58.8 | 62.4 | 65.06 | 66.4
< Causal Judgement | 61.27 | 63.01 | 56.98 | 61.85
< Date Understanding | 50.0 | 54.02 | 49.3 | 49.26
< Disambiguation QA | 56.8 | 60.0 | 49.4 | 54.8
< Formal Fallacies | 56.4 | 54.4 | 54.0 | 54.0
< Geometric Shapes | 25.2 | 23.6 | 12.42 | 22.4
< Hyperbaton | 63.6 | 66.8 | 49.2 | 54.8
< Logical Deduction (5 objects) | 33.8 | 36.14 | 26.51 | 30.96
< Logical Deduction (3 objects) | 23.39 | 30.12 | 18.7 | 18.11
< Logical Deduction (7 objects) | 44.2 | 51.6 | 42.17 | 42.8
< Movie Recommendation | 77.59 | 79.32 | 50.78 | 53.42
< Navigate | 51.6 | 56.8 | 45.6 | 55.2
< Penguins in a Table | 32.61 | 36.11 | 30.58 | 34.91
< Reasoning about Colored Objects | 39.6 | 42.8 | 27.54 | 30.33
< Ruin Names | 36.4 | 33.8 | 15.2 | 14.8
< Salient Translation Error Detection | 31.6 | 37.2 | 24.0 | 28.4
< Snarks | 48.31 | 52.25 | 43.82 | 45.7
< Sports Understanding | 60.8 | 60.4 | 56.0 | 55.6
< Temporal Sequences | 17.28 | 11.2 | 13.49 | 9.68
< Tracking Shuffled Objects (5 objects) | 19.46 | 21.1 | 17.2 | 17.74
< Tracking Shuffled Objects (7 objects) | 14.63 | 17.17 | 11.98 | 14.8
< Tracking Shuffled Objects (3 objects) | 37.5 | 36.02 | 33.9 | 32.52
< Average | 42.77 | 44.83 | 36.08 | 38.52
< 
< Table 8: Zero-shot performance comparison in Big-Bench Hard benchmark on multiple-choice questions.
< 
< Models LLaMA-7B (IFT) LLaMA-7B (ours) LLaMA-13B (IFT) LLaMA-13B (ours)
---
> Models
1769c1885,1886
< Table 9: Performance comparison on the Massive Multitask Language Understanding benchmark.
---
> LLaMA-7B (IFT) LLaMA-7B (ours) LLaMA-13B (IFT) LLaMA-13B (ours)
1771c1888,2035
< 14
---
> Table 12: Performance comparison on the Massive Multitask Language Understanding benchmark.
> 
> [Three panels plot Accuracy (%) against Training Epoch on MMLU, BBH, and CSQA.]
> 
> Figure 8: The change in performance of distilling on the posterior distribution (Distilling+Posterior) with the fine-tuning epochs of the proxy model.
> We utilize LLaMA-7B as the student model, and LLaMA-33B as the proxy model.
> 
> Models | #Params | BBH | AGIEval | ARC | MMLU | CSQA | GSM8K | Average
> GPT-4 | - | - | 56.4 | - | 86.4 | - | 92.0 | -
> FlanT5-large (IFT) | 780M | 34.63 | 28.12 | 46.44 | 39.41 | 76.78 | 4.54 | 38.32
> FlanT5-large (ours) | 780M | 35.22 | 28.84 | 46.61 | 39.34 | 76.93 | 4.71 | 38.61
> FlanT5-xl (IFT) | 3B | 38.47 | 28.34 | 59.6 | 46.91 | 84.79 | 6.12 | 44.04
> FlanT5-xl (ours) | 3B | 39.51 | 30.1 | 60.12 | 46.78 | 85.38 | 7.1 | 44.83
> 
> Table 13: The results of the FlanT5 models with different parameter sizes on the six benchmarks.
> We compare our method with IFT.
> 
> Models | Order of n | BBH | AGIEval | MMLU | GSM8K
> GPT-4 | - | 67.4 | 56.4 | 86.4 | 92.0
> LLaMA-7B (IFT) | - | 36.8 | 24.14 | 38.81 | 12.65
> LLaMA-7B (ours) | 3 | 37.3 | 25.53 | 40.1 | 13.1
> LLaMA-7B (ours) | 5 | 37.3 | 25.7 | 40.0 | 13.2
> LLaMA-7B (ours) | 8 | 37.3 | 24.84 | 39.6 | 13.0
> LLaMA-7B (ours) | 100 | 36.2 | 24.3 | 38.7 | 12.7
> 
> Table 14: The results of LLaMA-7B distilled on the prior distribution with different orders of n.
> 
> 15
