TL;DR: We formalize Context-Enhanced Learning, a paradigm where LLMs leverage context without direct gradient updates, achieving exponential sample-efficiency gains over standard autoregressive training when the model has ICL capabilities.
Abstract: We formalize a new concept for LLMs, **context-enhanced learning**. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works.
Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be **exponentially more sample-efficient** than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal.
We also demonstrate experimentally that learning materials placed only in the context during training appear hard to detect or recover from the trained model. This may have implications for data security as well as copyright.
Lay Summary: Kids use open textbooks for homework. Can LLM training benefit from "helpful textbooks" in context with no gradients computed on these tokens? The answer is yes, and we call it Context-Enhanced Learning (CEL): it can exponentially accelerate training while avoiding verbatim memorization of the "textbooks".
To systematically study context-enhanced learning, we design a multi-step translation (MLT) task as a synthetic testbed, where a token sequence undergoes n translation steps via n phrasebooks plus circular shifts. Without the phrasebooks in context, vanilla SFT from input-output pairs alone provably requires Ω(exp(n)) samples. However, with phrasebooks provided in context and slowly removed via a simple dropout curriculum, a 3B Llama model that is capable of in-context learning can learn the task with 20x less data.
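The MLT task above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact setup: the vocabulary, the random-bijection phrasebooks, and the shift-by-one convention are all assumptions made for the sake of a runnable example.

```python
import random

def make_phrasebook(vocab, rng):
    """A random bijection over the token vocabulary (illustrative
    stand-in for one of the task's n phrasebooks)."""
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    return dict(zip(vocab, shuffled))

def mlt(tokens, phrasebooks):
    """Multi-step translation: at each of the n steps, translate every
    token via that step's phrasebook, then circularly shift the sequence
    (shift amount of one is an assumption)."""
    seq = list(tokens)
    for book in phrasebooks:
        seq = [book[t] for t in seq]  # per-token translation
        seq = seq[1:] + seq[:1]       # circular left shift by one
    return seq

rng = random.Random(0)
vocab = list("abcdefgh")
n = 3  # number of translation steps / phrasebooks
books = [make_phrasebook(vocab, rng) for _ in range(n)]
out = mlt(list("abcd"), books)
```

Each step composes a hidden bijection with a permutation of positions, so recovering all n phrasebooks purely from input-output pairs requires disentangling a deep composition, which is the source of the Ω(exp(n)) lower bound; with the phrasebooks in context, each step is directly checkable.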
Based on our understanding of the CEL-trained Llama model, we construct a mechanistically similar surrogate model to theoretically analyze CEL. We show that if the model has certain ICL capabilities, CEL can reduce the sample complexity from Ω(exp(n)) to just O(n log n)!
We conduct further mechanistic experiments on optimization gradients and show that context enhancement provides a more accurate learning signal in the gradients. We also note that while having phrasebooks in context significantly accelerated learning of the task, since no autoregressive gradients were computed on them, they are not verbatim memorized (we're unable to recover them by querying the model). This may have implications for data security as well as copyright.
Link To Code: https://github.com/princeton-pli/Context-Enhanced-Learning
Primary Area: Deep Learning->Theory
Keywords: In-Context Learning, Sample Efficiency, Knowledge Internalization, Privileged Information, Multi-step reasoning
Submission Number: 12095