Keywords: Continual Learning, LLM Pre-training
Abstract: Large language models (LLMs) are trained on data crawled over many years from the web. We investigate how quickly LLMs become outdated over time and how best to update them with newer data. Specifically, we simulate a world in which the latest dump of Common Crawl (CC), the most prominent public source of pre-training data, is used every month to *continually* train an LLM. We design various dynamic evaluations from the CC data, Wikipedia, and StackExchange to measure continual learning metrics such as forgetting and forward transfer. We discover that recent DataComp-LM models trained on data before 2023 have already become outdated, incurring up to 45% larger noun-perplexity on 2024 Wikipedia articles compared to pre-2023 articles. Further, we use our setup to evaluate the effectiveness of several large-scale continual learning methods and find that replaying older data is most effective for combating forgetting: for previously seen CC dumps, it can reduce the regret on held-out loss by 60% compared to other optimizer- and loss-based interventions. However, some domains evolve more quickly than others, favoring different trade-offs between mixing old and new data.
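The abstract measures continual learning via forgetting and forward transfer over a sequence of monthly CC dumps. Below is a minimal illustrative sketch, assuming the standard continual-learning convention of a loss matrix indexed by (training checkpoint, evaluation dump); the numbers and the exact metric definitions here are hypothetical and are not taken from the paper.

```python
import numpy as np

# loss[i, j] = held-out loss on the evaluation set built from CC dump j,
# measured after the model has been continually trained up through dump i.
# These values are made up for illustration; in the paper's setup they would
# come from evaluating a checkpoint after each monthly update.
loss = np.array([
    [3.10, 3.40, 3.55],   # checkpoint after training on dump 0
    [3.15, 3.05, 3.30],   # checkpoint after training on dumps 0-1
    [3.25, 3.12, 3.00],   # checkpoint after training on dumps 0-2
])

T = loss.shape[0]

# Forgetting on an earlier dump j: increase in held-out loss on dump j at the
# final checkpoint, relative to the loss right after dump j was trained on.
forgetting = [loss[T - 1, j] - loss[j, j] for j in range(T - 1)]

# Forward transfer proxy for dump j: loss on dump j measured one checkpoint
# before dump j is trained on, i.e. how well older data prepares the model
# for newer, not-yet-seen data.
forward_loss = [loss[j - 1, j] for j in range(1, T)]

print("forgetting per earlier dump:", forgetting)
print("loss on each dump before it is seen:", forward_loss)
```

Under these (assumed) definitions, positive forgetting values indicate that later updates degraded performance on older dumps, which is what replaying older data is reported to mitigate.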
Submission Number: 16