Keywords: online learning, large language models, dynamic evaluation, context extension
TL;DR: The paper reframes dynamic evaluation as a context extension mechanism, and studies the trade-off it offers between computational cost and performance when models are faced with different levels of distribution shift.
Abstract: We consider the problem of finetuning the parameters of a language model online at test time, also known as dynamic evaluation. While it is generally known that this approach improves overall predictive performance, especially under distribution shift between training and evaluation data, we emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with _memory in weights_, more in line with the concept of _memory_ in neuroscience.
We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to overall distributional drift, and the computational overhead of gradient computation and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and finetuning blurs: both are methods to condition the model on previously observed tokens.
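As a rough illustration of the mechanism the abstract describes, the following is a minimal sketch of dynamic evaluation on a toy bigram softmax model. It is not the paper's setup: the model, the function name `evaluate`, the learning rate, and the repeating token stream are all assumptions chosen to keep the example self-contained. The point it shows is the ordering: each token is scored first, and only then used for one gradient step, so the weights carry memory of previously observed tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(stream, vocab_size, lr=0.0):
    """Average next-token loss (in nats) of a toy bigram model on `stream`.

    lr == 0.0 gives static evaluation (weights frozen).
    lr > 0.0 gives dynamic evaluation: after scoring each token,
    take one SGD step on that token's cross-entropy loss.
    All names and hyperparameters here are hypothetical.
    """
    # logits[prev][nxt]: zero init means a uniform predictive distribution,
    # standing in for a model that has not seen the test distribution.
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    total_loss = 0.0
    for prev, nxt in zip(stream, stream[1:]):
        probs = softmax(logits[prev])
        # Score the token BEFORE updating, so the loss is a true prediction.
        total_loss += -math.log(probs[nxt])
        if lr > 0.0:
            # Gradient of cross-entropy w.r.t. the logits is probs - one_hot.
            for k in range(vocab_size):
                grad = probs[k] - (1.0 if k == nxt else 0.0)
                logits[prev][k] -= lr * grad
    return total_loss / (len(stream) - 1)


# A repeating stream that the frozen model cannot predict.
stream = [0, 1, 2, 3] * 50
static_loss = evaluate(stream, vocab_size=4, lr=0.0)
dynamic_loss = evaluate(stream, vocab_size=4, lr=0.5)
print(f"static: {static_loss:.3f} nats, dynamic: {dynamic_loss:.3f} nats")
```

In this toy setting the frozen model stays at the uniform loss of log 4 nats, while the online-updated model's loss falls as it sees more of the stream. The extra cost is one gradient computation and parameter update per token, which is the overhead the abstract refers to.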
Submission Number: 64