Keywords: language models, NLP, prompting, distillation
TL;DR: We apply context distillation as a general tool for translating language into parameter updates, using it to distill instructions, explanations, examples, knowledge, and reasoning procedures.
Abstract: Language models significantly benefit from context tokens, such as prompts or scratch-pads. They perform better when prompted with concrete training examples and abstract statements about the target task (instructions), and they acquire new capabilities, such as performing complex tasks, by generating step-by-step reasoning (a scratch-pad) before predicting the final answer. However, they do not internalize these performance gains, which disappear once the context tokens are removed. Consequently, we must pay the extra computation for the context at every inference call, and it remains unclear how to transfer the capabilities acquired from context tokens to other tasks, or how to leverage context tokens whose length exceeds the context window. Our work proposes context distillation as a way for a language model to internalize these gains. Concretely, given an input for the target task, we let the model use all relevant context tokens to generate the output, using ``[instructions] + [task-input]'' to predict ``[scratch-pad] + [final answer]''; we then fine-tune the same model to predict the ``[final answer]'' conditioned only on the ``[task-input]'', without seeing the ``[instructions]'' or generating the ``[scratch-pad]''. This incentivizes the model to behave as if the context were present, thereby updating its parameters to internalize the context information. We show that context distillation can serve as a general method for learning. In particular, we demonstrate that it can effectively internalize three types of context: 1) abstract task instructions and natural language explanations of why an output is correct or incorrect on Natural-Instructions-V2; 2) step-by-step reasoning on 8-digit addition questions, where we show the model can apply this newly acquired capability to downstream question answering tasks; and 3) concrete training examples on the SPIDER Text-to-SQL dataset, where context distillation outperforms learning directly with gradient descent by 7%.
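For concreteness, the two-step procedure in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming a Hugging Face causal LM (GPT-2 as a stand-in), a toy addition prompt, and a last-line heuristic for extracting the final answer from the generation; the paper's actual models, datasets, hyperparameters, and answer extraction are not reproduced here.

```python
# Sketch of context distillation: generate with context, then fine-tune without it.
# Assumptions: gpt2 checkpoint, toy prompt, last-line answer extraction, single example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; not the checkpoint used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

instructions = "Add the two numbers step by step, then answer with the sum.\n"
task_input = "Q: 31415926 + 27182818 = ?\nA:"

# Step 1: with the context present, let the model generate
# "[scratch-pad] + [final answer]" conditioned on "[instructions] + [task-input]".
model.eval()
with torch.no_grad():
    ctx = tokenizer(instructions + task_input, return_tensors="pt")
    generated = model.generate(
        **ctx, max_new_tokens=64, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = tokenizer.decode(
        generated[0, ctx.input_ids.shape[1]:], skip_special_tokens=True
    )

# Keep only the final answer; here we assume it is the last line of the
# generation (a simplifying assumption for this sketch).
final_answer = continuation.strip().splitlines()[-1] if continuation.strip() else ""

# Step 2: fine-tune the same model to predict "[final answer]" from
# "[task-input]" alone, without the instructions or the scratch-pad.
model.train()
student_text = task_input + " " + final_answer + tokenizer.eos_token
enc = tokenizer(student_text, return_tensors="pt")
labels = enc.input_ids.clone()
# Mask the task-input tokens so the loss is computed only on the answer tokens
# (prompt length measured by re-tokenizing the task input; approximate at the boundary).
prompt_len = tokenizer(task_input, return_tensors="pt").input_ids.shape[1]
labels[:, :prompt_len] = -100

outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this loop would run over many sampled task inputs, so the parameter updates, rather than the context tokens, carry the instructions and reasoning procedure at inference time.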
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/learning-by-distilling-context/code)