Measuring the Limits of Continual Learning for LLMs
Keywords: continual learning, knowledge update, self-distillation
TL;DR: We design a live benchmark for continual learning of hard compositional questions (multi-hop, indirect references) that goes beyond direct recall of knowledge updates and exposes several shortcomings of current approaches.
Abstract: Language models are trained in stages but deployed as mostly static artifacts, leaving them poorly matched to a world that continually produces novel information. This mismatch has motivated a broad class of *continual learning* systems that adapt models to new information through weight updates, retrieval, memory, long-context inference, or hybrid mechanisms. Yet existing evaluations do not tell us whether such systems have truly *internalized* new information: whether they can go beyond memorizing it to update stale beliefs, resolve indirect references, compose new facts with prior knowledge, surface facts even when only implicitly relevant, and confidently recognize gaps in their own knowledge. We construct ImprintBench, a benchmark of realistic settings that expose these systematic shortcomings. ImprintBench consists of a refreshable pipeline that automatically constructs evaluations across three domains (news events, open-source API changes, and evolving personalization histories), with queries spanning six capability families that require varying degrees of compositional reasoning: acquisition, temporal update, referential resolution, multi-hop, implicit relevance, and boundary awareness. Across in-the-wild update scenarios, we find common systematic failures in both retrieval-based and training-based methods, showing that current systems still fall short of robustly learning from new experience.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 136